
Data Analysis: A Summary of pandas and NumPy Data-Processing Basics in Python (Practical Tips and Common Methods, Using the Iris Dataset)

大数据技术与应用 (Big Data Technology and Applications) | Backend | 2024-01-02

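This walkthrough uses the iris dataset, which is available online. After downloading it, place it where the code can find it (the examples below read ./data/iris.csv, relative to the directory the code runs from).

The first four columns are the flower's features. The last column, species, gives the variety: Iris-setosa, Iris-versicolor, or Iris-virginica.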

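1. Importing the libraries and reading the data

import numpy as np

import pandas as pd

df=pd.read_csv("./data/iris.csv",header=0) # read the CSV data; header=0 uses the first row as the column names (header)

#df=pd.read_excel("./data/test.xlsx",header=None) # read Excel data; header=None means the file has no header row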
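To make the effect of the header argument concrete, here is a small sketch that reads the same text twice. The three-row sample and the column names sepal_length and petal_width are assumptions standing in for the real file; only sepal_width, petal_length and species appear in this article.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for ./data/iris.csv
csv_text = (
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)

# header=0: the first line supplies the column names
with_header = pd.read_csv(io.StringIO(csv_text), header=0)

# header=None: every line is treated as data and columns are numbered 0..4
no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(with_header.columns.tolist())
print(with_header.shape, no_header.shape)
```

With header=None the header line is counted as a data row, so the second frame has one more row than the first.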

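2. Selecting data from the DataFrame. The main tools are df.loc[], df.iloc[], df.colname and df['column name']

data1=df.iloc[1,2] # iloc selects by integer position, counting from 0: this is the value in the second row, third column

data2=df.loc[:6,['sepal_width','petal_length']] # loc selects by label; the slice :6 includes label 6, so with the default index data2 is a DataFrame holding the first 7 rows of sepal_width and petal_length

data3=df.species # all values of the column named species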
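The difference between position-based and label-based selection is easy to get wrong, so the sketch below checks it on a tiny frame. The four rows are made-up values chosen for illustration, not the real dataset.

```python
import pandas as pd

# Made-up four-row frame with the default 0..3 index
df = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.7, 4.6],
    "sepal_width": [3.5, 3.0, 3.2, 3.1],
    "petal_length": [1.4, 1.4, 1.3, 1.5],
    "species": ["Iris-setosa"] * 4,
})

# iloc is zero-based: row position 1, column position 2 (petal_length)
cell = df.iloc[1, 2]

# iloc slices exclude the end position, so this is 2 rows
by_position = df.iloc[:2]

# loc slices include the end label, so this is 3 rows (labels 0, 1, 2)
by_label = df.loc[:2, ["sepal_width", "petal_length"]]

# Attribute access and bracket access return the same column
col_attr = df.species
col_key = df["species"]

print(cell, len(by_position), len(by_label))
```

Bracket access (df["species"]) is the safer habit when a column name contains spaces or clashes with a DataFrame method name.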

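3. Shuffling and reordering the data

np.random.permutation(df.index): returns a new, shuffled array; df.index is the row index of the table.

np.random.shuffle(x): returns nothing and shuffles x in place. It needs a mutable array, so it cannot be applied to df.index directly, because a pandas Index is immutable.

df.sample(frac=1): returns a new DataFrame. frac is the fraction of rows to return, and 1 means all of them. For example, if df has 100 rows and only 50% are wanted, use frac=0.5.

df.sample(frac=1).reset_index(drop=True): shuffles the rows, then renumbers them 0, 1, 2, ... and discards the original row index.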

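# Shuffling, method 1

np.random.seed(10) # set the random seed so the shuffle is reproducible

# np.random.shuffle(df.index) # not usable here: a pandas Index cannot be modified in place

shuffled_rows = np.random.permutation(df.index)

# rebuild the data in the shuffled order (df has the default 0..n-1 index here, so the labels double as positions for iloc)

df2 = df.iloc[shuffled_rows]

# Shuffling, method 2

df2=df.sample(frac=1).reset_index(drop=True)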
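The two shuffling methods can be compared on a toy frame. This is a sketch under assumptions: the ten-row frame is made up, np.random.default_rng is swapped in for np.random.seed as the seeding mechanism, and random_state is passed to sample so the result is repeatable.

```python
import numpy as np
import pandas as pd

# Made-up ten-row frame with the default 0..9 index
df = pd.DataFrame({"x": range(10)})

# Method 1: permute the row positions, then select in that order.
# The original index labels travel with the rows.
rng = np.random.default_rng(10)
order = rng.permutation(len(df))
shuffled = df.iloc[order]

# Method 2: sample every row in random order, then renumber from 0
shuffled_reset = df.sample(frac=1, random_state=10).reset_index(drop=True)

# frac=0.5 returns half of the rows
half = df.sample(frac=0.5, random_state=10)

print(shuffled.index.tolist())
print(shuffled_reset.index.tolist())
print(len(half))
```

Method 1 keeps the original labels attached to each row, which helps when rows must be traced back to the source file; method 2 is shorter when a clean 0..n-1 index is all that is needed.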

4. Conditional selection and assignment

# Conditional selection and assignment

# Set the class label to 0 for Iris-setosa, 1 for Iris-versicolor and 2 for Iris-virginica

df2.loc[df2.species=='Iris-setosa' ,'species' ] =0

df2.loc[df2.species=='Iris-versicolor' ,'species' ] =1

df2.loc[df2.species == 'Iris-virginica', 'species'] = 2

y=df2['species'].values # .values returns all values of the species column as a NumPy array
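Writing integers into a text column one class at a time usually leaves that column with a non-numeric dtype. As a hedged alternative, not the method used above, the same encoding can be done in one step with Series.map and a dictionary. The four-row frame and the new column name label are assumptions for illustration.

```python
import pandas as pd

# Made-up frame with only the species column
df2 = pd.DataFrame({
    "species": ["Iris-virginica", "Iris-setosa", "Iris-versicolor", "Iris-setosa"],
})

# Same encoding as in the text: setosa -> 0, versicolor -> 1, virginica -> 2
label_map = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}

# map looks each value up in the dictionary; the result goes in a new
# column (hypothetical name "label") so the original text is kept
df2["label"] = df2["species"].map(label_map)

# Integer NumPy array of class labels
y = df2["label"].to_numpy()

print(y)
```

Any species value missing from the dictionary would become NaN, which makes typos in the class names easy to spot.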

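5. Grouping the DataFrame

df.groupby('column name'): groups the rows by the values in that column.

df.species.unique(): returns the unique values of the column.

get_group('Iris-setosa'): after grouping, returns the rows whose group key is 'Iris-setosa'.

flower=df.species.unique() # unique values of the column

# flower is: ['Iris-setosa' 'Iris-versicolor' 'Iris-virginica']

groups=df.groupby('species')

setosa=groups.get_group('Iris-setosa') # rows belonging to 'Iris-setosa'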
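The grouping steps can be tried on a small frame, with a per-group mean added to show what a grouped object is typically used for. The six rows are made-up values, not the real measurements.

```python
import pandas as pd

# Made-up six-row frame, two rows per species
df = pd.DataFrame({
    "petal_length": [1.4, 1.3, 4.7, 4.5, 6.0, 5.1],
    "species": [
        "Iris-setosa", "Iris-setosa",
        "Iris-versicolor", "Iris-versicolor",
        "Iris-virginica", "Iris-virginica",
    ],
})

# Unique values, in order of first appearance
flower = df.species.unique()

# Group once, then pull out one group or aggregate over all of them
groups = df.groupby("species")
setosa = groups.get_group("Iris-setosa")
mean_petal = groups["petal_length"].mean()

print(list(flower))
print(len(setosa))
print(mean_petal)
```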

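6. Deleting data: df.drop(labels, axis=)

axis=0 deletes rows and axis=1 deletes columns. df.drop returns a new DataFrame; df itself is unchanged unless the result is assigned back.

df.drop(1,axis=0) # delete the row whose index label is 1

# df3 is a different table that is not shown here. Delete the rows whose 考试性质 (exam type) column is '重修补考' (retake) or '正常考试(缓)' (regular exam, deferred); '|' means OR and '&' means AND

data=df3.drop(df3[ (df3['考试性质']=='重修补考')|(df3['考试性质']=='正常考试(缓)') ].index,axis=0)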
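Since df3 is not defined in this article, the sketch below uses a made-up stand-in with an English column name (exam_type) and English values, all of which are assumptions. It shows the conditional drop and an equivalent filter written with isin.

```python
import pandas as pd

# Hypothetical stand-in for df3
df3 = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "exam_type": ["regular", "retake", "deferred", "regular"],
})

# Drop a single row by index label; df3 itself is not modified
dropped_one = df3.drop(1, axis=0)

# Drop every row that matches either condition (| is element-wise OR)
mask = (df3["exam_type"] == "retake") | (df3["exam_type"] == "deferred")
data = df3.drop(df3[mask].index, axis=0)

# Equivalent filter: keep the rows that do NOT match
data_alt = df3[~df3["exam_type"].isin(["retake", "deferred"])]

print(dropped_one.index.tolist())
print(data["name"].tolist())
```

Each comparison needs its own parentheses because | and & bind more tightly than == in Python.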

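7. Converting data and creating NumPy arrays (building a one-hot encoding of a multi-class y)

df['species'].tolist(): converts the column to a Python list.

np.array([[1,2,3],[3,4,5]]): builds a NumPy array from a (nested) list.

np.zeros((150,3),dtype=int): creates an array of shape (150,3) filled with 0.

np.ones((150,3),dtype=float): creates an array of shape (150,3) filled with 1.0.

np.random.randint(low, high=None, size=None): returns random integers in [low, high); size is the output shape, for example 4 or (2,6).

np.random.rand(4,2): returns an array of shape (4,2) drawn from the uniform distribution on [0, 1). For samples from the standard normal distribution, use np.random.randn(4,2).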
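A quick sketch of the array-creation calls listed above, mainly to make the half-open integer range and the uniform versus normal distinction visible. The seed value and the randint bounds are arbitrary choices for illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed, only for repeatability

arr = np.array([[1, 2, 3], [3, 4, 5]])    # from a nested list, shape (2, 3)
zeros = np.zeros((150, 3), dtype=int)     # all 0
ones = np.ones((150, 3), dtype=float)     # all 1.0

# Integers in [0, 5): 5 itself is never produced
ints = np.random.randint(0, 5, size=(2, 6))

# rand: uniform on [0, 1); randn: standard normal (may be negative)
uniform = np.random.rand(4, 2)
normal = np.random.randn(4, 2)

print(arr.shape, zeros.shape, ones.dtype)
print(ints.min(), ints.max())
print(uniform.min(), uniform.max())
```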

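# Build the one-hot encoding of the multi-class y

y1=df2['species'].tolist() # convert to a list of integer labels, e.g. [0,1,1,2,1,2,0,1,0, ...]

# The iris dataset has three classes and 150 rows in total

y=np.zeros((150,3),dtype=int) # create an all-zero array of shape (150, 3)

y[range(len(y1)),y1]=1 # in row i, set the column given by y1[i] to 1

data_df = pd.DataFrame(y) # convert the NumPy array to a pandas DataFrame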
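The indexing trick used for the one-hot array is compact, so here it is on five made-up labels, next to an alternative that indexes an identity matrix. The variable n_classes and the np.eye variant are additions for illustration, not part of the original code.

```python
import numpy as np

# Five made-up integer class labels
y1 = [0, 1, 1, 2, 0]
n_classes = 3  # assumed number of classes, as in the iris dataset

# Same technique as in the text: pair row i with column y1[i] and set it to 1
y = np.zeros((len(y1), n_classes), dtype=int)
y[np.arange(len(y1)), y1] = 1

# Alternative: row k of the identity matrix is the one-hot vector for class k
y_eye = np.eye(n_classes, dtype=int)[y1]

print(y)
```

Sizing the array with len(y1) instead of a hard-coded 150 keeps the code correct if rows were dropped earlier.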

8. Saving data: df.to_excel(), df.to_csv()

header=True writes the column names (header row); index=False leaves out the row index.

df2.to_excel('./data/preprocess.xlsx',header=True,index=False)
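The same header and index arguments apply to to_csv. The sketch below writes to an in-memory buffer instead of a file so the output can be inspected directly; the two-row frame is made up. Note that writing .xlsx with to_excel also needs an Excel engine such as openpyxl to be installed.

```python
import io

import pandas as pd

# Made-up two-row frame
df2 = pd.DataFrame({"sepal_width": [3.5, 3.0], "species": [0, 1]})

# Write CSV text into a buffer: column names included, row index omitted
buffer = io.StringIO()
df2.to_csv(buffer, header=True, index=False)
text = buffer.getvalue()

# Read it back to confirm nothing was lost
restored = pd.read_csv(io.StringIO(text), header=0)

print(text)
```

Leaving index at its default of True would add an unnamed first column of row numbers to the saved file.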
