目录

前言数据预处理导入库读取数据将五种癌症数据集合并

进行主成分分析计算样本均值计算样本协方差矩阵计算特征值和特征向量计算累计方差贡献率特征向量选取

结果可视化数据降维降维前降维后

写在最后

前言

是对一个数据挖掘作业的记录,数据集是老师提供的几种癌症的数据,我是直接在Jupyter中写的,中间会输出一些内容验证之类的

主要是参照这位写的:参考的大佬

数据预处理

导入库

import numpy as np

import pandas as pd

import matplotlib.pyplot as plt

读取数据

# 读取数据,查看数据的规模

BLCA_data = pd.read_csv(r'数据集/BLCA/rna.csv')

print('BLCA',BLCA_data.shape)

BRCA_data = pd.read_csv(r'数据集/BRCA/rna.csv')

print('BRCA',BRCA_data.shape)

KIRC_data = pd.read_csv(r'数据集/KIRC/rna.csv')

print('KIRC',KIRC_data.shape)

LUAD_data = pd.read_csv(r'数据集/LUAD/rna.csv')

print('LUAD',LUAD_data.shape)

PAAD_data = pd.read_csv(r'数据集/PAAD/rna.csv')

print('PAAD',PAAD_data.shape)

BLCA (3217, 400)

BRCA (3217, 1032)

KIRC (3217, 489)

LUAD (3217, 491)

PAAD (3217, 177)

# 查看其中一种类型癌症数据集的前6行

BLCA_data.head(6)

gene_idTCGA-HQ-A2OFTCGA-GU-A767TCGA-ZF-AA4RTCGA-DK-A1ACTCGA-DK-A3ITTCGA-GC-A3RDTCGA-BT-A0YXTCGA-FD-A6TETCGA-E7-A5KE...TCGA-BT-A0S7TCGA-K4-A6FZTCGA-E5-A2PCTCGA-DK-AA6QTCGA-XF-AAMYTCGA-CU-A0YOTCGA-E7-A7PWTCGA-S5-A6DXTCGA-GD-A76BTCGA-ZF-AA4V0A2BP1|54715-0.402918-1.236959-1.453230-1.211589-1.288438-1.262086-1.033565-1.183245-1.076254...-1.282740-1.023637-0.366722-1.161285-1.380649-1.205136-0.875586-1.059148-1.393928-1.2156411A2ML1|1445680.7175020.7378911.5846431.4719320.4140151.8210842.5158980.7238300.668703...0.1639031.3182290.8791350.5136031.5715571.6791991.222708-0.0798021.4803931.9473402ACTL6B|51412-1.185185-1.403906-1.453230-1.211589-1.288438-1.262086-1.196075-1.183245-0.937518...-1.282740-1.291842-1.036778-1.161285-1.380649-1.368904-1.567590-1.425863-1.393928-1.2156413ADAM6|87550.3738261.4112792.0922823.1864382.9674571.4371132.7237752.3200120.736963...2.3754131.9643691.2020001.8244192.1850512.1758491.4463523.5918723.2668182.5143854ADAMDEC1|27299-0.489828-0.894004-0.2051691.0151790.0895400.1149030.6644650.138955-1.076254...-0.408256-0.063113-0.1471730.177490-0.2540070.173888-0.9517120.5927500.2603820.7588975ALDH1A3|2200.4201700.5634041.0642570.4608611.5182041.8093201.2309121.4882750.809302...1.1800981.7519190.0348351.5120290.8514591.6452710.3982170.599149-0.0642561.339531

6 rows × 400 columns

# 查看另一种癌症数据集的前6行

BRCA_data.head(6)

gene_idTCGA-D8-A1XJTCGA-EW-A6SATCGA-D8-A27HTCGA-E9-A247TCGA-AC-A2FGTCGA-B6-A0RPTCGA-BH-A0DHTCGA-AN-A0AMTCGA-D8-A1XZ...TCGA-BH-A18GTCGA-E2-A15JTCGA-BH-A1F6TCGA-EW-A1P8TCGA-AO-A1KTTCGA-A2-A0D0TCGA-3C-AALKTCGA-AO-A1KSTCGA-EW-A1OWTCGA-OL-A5RU0A2BP1|54715-1.069946-1.075002-1.377040-1.307689-1.292405-1.231926-1.231272-1.396638-1.229635...-1.079587-1.018372-1.355431-1.156045-1.337455-1.189837-1.467466-1.339815-1.315667-1.4213731A2ML1|1445680.255748-0.8772501.623317-1.307689-0.718106-1.135720-1.2312720.2535690.136528...-0.496468-1.1358401.8704740.995603-0.732074-0.500924-1.075625-0.7083170.481910-0.7818412ACTL6B|51412-1.253764-1.075002-1.377040-1.307689-1.458347-1.231926-1.231272-1.396638-1.229635...-1.079587-1.135840-1.200037-1.277135-1.337455-1.189837-1.467466-1.339815-1.162777-1.4213733ADAM6|87551.9493212.6036302.0166461.9368392.8512062.1593153.4946722.6651963.116796...1.5030651.5235133.2383822.4766802.0394491.6424552.5295461.5635752.5194193.5039254ADAMDEC1|27299-0.219076-0.343607-0.1101690.410611-0.181777-0.254802-0.0005190.2185260.521942...-0.0276350.0736841.2149230.885440-0.1456220.4669230.3306900.4859080.6840710.7238615ALDH1A3|2201.1385330.2906962.2392960.7558491.3271230.7404711.1014511.0915980.884671...2.2519041.1351051.9782011.8279460.8346571.2521291.1592140.9497911.2150540.805760

6 rows × 1032 columns

经过上面的分析可以知道对于不同的数据集,它们的行数和行标签都是相同的,但列数和列标签都是不同的 所以对于这些数据集来说,行标签可以看作是数据的不同特征,而每一列则对应一个样本的数据

保留第一个数据集的第一列,然后将其与后面几个数据集除第一列之外的列进行合并 则:

BLCA:1-400

BRCA:401-1431

KIRC:1432-1919

LUAD:1920-2409

PAAD:2410-2585

将五种癌症数据集合并

# 将5种癌症数据集的除gene_id列之外的列进行合并

DataSet = pd.concat([BLCA_data,BRCA_data.iloc[:,1:],KIRC_data.iloc[:,1:],LUAD_data.iloc[:,1:],PAAD_data.iloc[:,1:]], axis=1)

DataSet.head(6)

gene_idTCGA-HQ-A2OFTCGA-GU-A767TCGA-ZF-AA4RTCGA-DK-A1ACTCGA-DK-A3ITTCGA-GC-A3RDTCGA-BT-A0YXTCGA-FD-A6TETCGA-E7-A5KE...TCGA-FB-A7DRTCGA-H8-A6C1TCGA-LB-A8F3TCGA-IB-7893TCGA-2J-AABFTCGA-US-A77JTCGA-F2-7273TCGA-HZ-A8P0TCGA-3A-A9IHTCGA-2L-AAQA0A2BP1|54715-0.402918-1.236959-1.453230-1.211589-1.288438-1.262086-1.033565-1.183245-1.076254...-1.046787-1.561721-1.256692-1.345364-1.279792-1.621289-1.054519-1.455548-1.464518-1.4719271A2ML1|1445680.7175020.7378911.5846431.4719320.4140151.8210842.5158980.7238300.668703...-1.226226-0.680184-1.0383661.252544-0.752025-1.621289-1.5047790.6809930.666947-0.9154102ACTL6B|51412-1.185185-1.403906-1.453230-1.211589-1.288438-1.262086-1.196075-1.183245-0.937518...-1.126668-1.141486-0.768676-1.013273-0.542150-0.338250-0.368487-1.455548-0.454760-1.4719273ADAM6|87550.3738261.4112792.0922823.1864382.9674571.4371132.7237752.3200120.736963...2.5005392.5343271.8954972.5237382.7714873.5085402.5030073.0311952.0188572.1713314ADAMDEC1|27299-0.489828-0.894004-0.2051691.0151790.0895400.1149030.6644650.138955-1.076254...0.3816450.286921-0.266635-0.1615850.3809470.7720600.254188-0.122049-0.5961580.0988885ALDH1A3|2200.4201700.5634041.0642570.4608611.5182041.8093201.2309121.4882750.809302...1.5305321.1073930.6186891.2324311.0621131.0352721.2203921.0525040.9858690.812195

6 rows × 2585 columns

对数据集进行转置,之后每一行对应一个样本数据,每一列表示一个样本特征

DataSet = DataSet.T

DataSet.head(6)

0123456789...3207320832093210321132123213321432153216gene_idA2BP1|54715A2ML1|144568ACTL6B|51412ADAM6|8755ADAMDEC1|27299ALDH1A3|220ATCAY|85300ATP10B|23120BCAS1|8537C10orf105|414152...SLC38A5|92745SLC44A3|126969SSX4|6759TMEM200B|399474TMEM26|219623TNFRSF10D|8793XAF1|54739YJEFN3|374887ZMAT1|84460ZNF415|55786TCGA-HQ-A2OF-0.4029180.717502-1.1851850.373826-0.4898280.42017-1.1851851.8414921.327818-0.301422...0.2758461.228001-0.959773-0.5305590.6370471.3397060.5203050.235093-0.3839920.795605TCGA-GU-A767-1.2369590.737891-1.4039061.411279-0.8940040.563404-1.2369591.3099582.0158290.416162...0.3643311.484526-1.4039060.689511-0.4742660.3785060.9946470.7765980.3343630.577015TCGA-ZF-AA4R-1.453231.584643-1.453232.092282-0.2051691.064257-1.2679460.2074750.764633-0.786075...1.3500111.03154-1.453230.867536-0.1455150.229991.2100680.3030780.912720.70952TCGA-DK-A1AC-1.2115891.471932-1.2115893.1864381.0151790.460861-0.9958341.5763760.546906-0.855604...0.6376911.190068-0.3486510.0346410.0309730.093971.0895230.5838130.5031980.733853TCGA-DK-A3IT-1.2884380.414015-1.2884382.9674570.089541.518204-1.159105-0.0351451.5758490.220414...0.5981911.479815-1.2884380.921235-0.30770.4860691.3115230.4831390.2625080.589681

6 rows × 3217 columns

# 此时第一行为特征标签,将非数据的这一行去掉

DataSet = DataSet.iloc[1:,:]

DataSet.iloc[:6,:]

0123456789...3207320832093210321132123213321432153216TCGA-HQ-A2OF-0.4029180.717502-1.1851850.373826-0.4898280.420170-1.1851851.8414921.327818-0.301422...0.2758461.228001-0.959773-0.5305590.6370471.3397060.5203050.235093-0.3839920.795605TCGA-GU-A767-1.2369590.737891-1.4039061.411279-0.8940040.563404-1.2369591.3099582.0158290.416162...0.3643311.484526-1.4039060.689511-0.4742660.3785060.9946470.7765980.3343630.577015TCGA-ZF-AA4R-1.4532301.584643-1.4532302.092282-0.2051691.064257-1.2679460.2074750.764633-0.786075...1.3500111.031541-1.4532300.867536-0.1455150.2299901.2100680.3030780.9127200.709520TCGA-DK-A1AC-1.2115891.471932-1.2115893.1864391.0151790.460861-0.9958341.5763760.546906-0.855604...0.6376911.190068-0.3486510.0346410.0309730.0939701.0895230.5838130.5031980.733853TCGA-DK-A3IT-1.2884380.414015-1.2884382.9674570.0895401.518204-1.159105-0.0351451.5758490.220414...0.5981911.479815-1.2884380.921235-0.3077000.4860691.3115220.4831390.2625080.589681TCGA-GC-A3RD-1.2620871.821084-1.2620871.4371130.1149031.809320-0.9936030.8784512.0582530.036055...1.0695561.383669-0.2155580.234100-0.3244770.3796891.323748-0.0261470.6381230.096817

6 rows × 3217 columns

# 重新设置数据类型,此时DataSet为一个2584x3217的浮点数矩阵

DataSet = DataSet.astype('float32')

print(DataSet.shape)

(2584, 3217)

进行主成分分析

计算样本均值

Mean_vec = np.mean(DataSet, axis=0)

print('样本的均值:\n',Mean_vec)

样本的均值:

0 -1.161548

1 -0.474133

2 -1.251174

3 2.412248

4 0.266737

...

3212 0.521786

3213 1.303051

3214 0.133573

3215 0.774802

3216 0.601425

Length: 3217, dtype: float32

计算样本协方差矩阵

'''

# 纯数据运算

Cov_mat = (DataSet-Mean_vec).T.dot(DataSet-Mean_vec)/(DataSet.shape[0]-1)

print('协方差矩阵:\n',Cov_mat)

'''

# 使用numpy中计算协方差的函数

'''

协方差矩阵用于衡量两个变量之间相互依赖的程度

由于有3217个feature,故协方差矩阵的规模为3217x3217

'''

Cov_mat = np.cov(DataSet.T)

print('样本协方差矩阵:\n',Cov_mat)

Cov_mat.shape

样本协方差矩阵:

[[ 0.12676448 -0.02361515 0.02157561 ... 0.00519793 0.00904992

0.00673015]

[-0.02361515 1.00118482 -0.0053748 ... 0.07527039 -0.17714864

0.0090866 ]

[ 0.02157561 -0.0053748 0.13757093 ... 0.00533166 -0.00134267

0.00309703]

...

[ 0.00519793 0.07527039 0.00533166 ... 0.22596547 0.02969978

0.01703536]

[ 0.00904992 -0.17714864 -0.00134267 ... 0.02969978 0.20364439

0.04410373]

[ 0.00673015 0.0090866 0.00309703 ... 0.01703536 0.04410373

0.13098872]]

(3217, 3217)

计算特征值和特征向量

EigenValues, EigenVector = np.linalg.eig(Cov_mat)

# 数据特征量较大,这里计算结果为复数,统一取其实部

EigenValues = EigenValues.real

EigenVector = EigenVector.real

print('特征值:\n',EigenValues)

print('特征向量:\n',EigenVector)

特征值:

[ 2.31279350e+02 1.46770122e+02 8.81143724e+01 ... 2.86122668e-17

2.84073038e-17 -3.12152010e-17]

特征向量:

[[ 0.00030864 -0.00113702 -0.00123991 ... -0.00242919 -0.00277321

0.00190791]

[-0.02456387 0.02533867 -0.05603828 ... 0.00124872 0.00140361

-0.00075202]

[-0.00023562 -0.00033786 -0.00340553 ... -0.00034539 -0.00039286

0.00087827]

...

[-0.00238575 0.00450426 -0.01746374 ... -0.00422026 -0.00422903

0.00272333]

[ 0.00781593 -0.02092816 0.00775456 ... 0.01314091 0.01272775

0.01905482]

[-0.0030166 -0.01263667 -0.0111954 ... 0.02344654 0.03139416

0.01099676]]

# EigenPairs[i][0]表示一个特征值,EigenPairs[i][1]表示该特征值所对应的特征向量

EigenPairs = [(np.abs(EigenValues[i]), EigenVector[:,i]) for i in range(len(EigenValues))]

EigenPairs.sort(key=lambda x:x[0], reverse=True)

print('特征值降序排列:')

for i in EigenPairs:

print(i[0])

特征值降序排列:

231.2793503404339

146.77012152572306

88.11437236590861

......

1.6398199526831266e-18

1.4276145590349296e-18

1.4276145590349296e-18

计算累计方差贡献率

# 计算方差贡献率

tot = sum(EigenValues)

var_exp = [(i/tot)*100 for i in sorted(EigenValues, reverse=True)]

print(var_exp)

# 累计方差贡献率

Cum_var_exp = np.cumsum(var_exp)

Cum_var_exp

[18.501009256397158, 11.740760136661512, ... , -1.3125279499634018e-16, -2.0754934203007648e-16]

array([ 18.50100926, 30.24176939, 37.29040913, ..., 100., 100. , 100.])

特征向量选取

## 这里为方便作图,选取特征值较大的前两维的特征向量

Matrix = np.hstack((EigenPairs[0][1].reshape(DataSet.shape[1], 1),

EigenPairs[1][1].reshape(DataSet.shape[1], 1)))

print('Matrix:\n',Matrix)

Matrix:

[[ 0.00030864 -0.00113702]

[-0.02456387 0.02533867]

[-0.00023562 -0.00033786]

...

[-0.00238575 0.00450426]

[ 0.00781593 -0.02092816]

[-0.0030166 -0.01263667]]

结果可视化

数据降维

# tmp主要是将DaraSet转成array类型

HighDimSet = DataSet.iloc[:,:].values

# 将原数据矩阵乘以选取的特征向量进行降维

LowDimSet = HighDimSet.dot(Matrix)

LowDimSet

array([[ -2.78179211, 7.81840477],

[-13.39504631, 7.30085955],

[-16.8183728 , 3.87839756],

...,

[ -5.81835218, 8.30323684],

[ -8.52738732, 8.17363152],

[ -6.97177986, 11.89669388]])

降维前

# LabelSet标识哪一行样本属于哪种癌症

LabelSet = ['BLCA']*399+['BRCA']*1031+['KIRC']*488+['LUAD']*490+['PAAD']*176

LabelSet = np.array(LabelSet)

LabelSet

array(['BLCA', 'BLCA', 'BLCA', ..., 'PAAD', 'PAAD', 'PAAD'], dtype='

# 选取两种特征对不同的癌症种类进行区分

plt.figure(figsize=(10,6), dpi=80)

plt.xlim(-3,3)

plt.ylim(-3,3)

for lab,color in zip(('BLCA', 'BRCA', 'KIRC', 'LUAD', 'PAAD'),

('red','yellow','green','blue','purple')):

plt.scatter(HighDimSet[LabelSet==lab,0],

HighDimSet[LabelSet==lab,1],

label=lab,c=color)

plt.xlabel('feature 1')

plt.ylabel('feature 2')

plt.legend(loc='upper right')

plt.show()

降维后

plt.figure(figsize=(10,6), dpi=80)

plt.xlim(-25,40)

plt.ylim(-30,25)

for lab,color in zip(('BLCA', 'BRCA', 'KIRC', 'LUAD', 'PAAD'),

('red','yellow','green','blue','purple')):

plt.scatter(LowDimSet[LabelSet==lab,0],

LowDimSet[LabelSet==lab,1],

label=lab,c=color)

plt.xlabel('principal component 1')

plt.ylabel('principal component 2')

plt.legend(loc='upper right')

plt.show()

写在最后

文章仅作记录,至于原理还有很多不懂的地方,结果我也不知道该是什么样的,把用这么多维特征来区分的事物降到两维来进行区分,我自己感觉已经很神奇了哈哈!

参考文章

评论可见,请评论后查看内容,谢谢!!!
 您阅读本篇文章共花了: