Python programming language: pandas data structures

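1 pandas data types

pandas provides two core data structures: Series and DataFrame. A Series is a one-dimensional data structure in which every element has an index, much like a one-dimensional array; the index labels can be numbers or strings.

[Figure: structure of a Series]

A DataFrame is a two-dimensional data structure that stores data as a table (similar to Excel), with rows and columns.

[Figure: structure of a DataFrame]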
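As a rough illustration of the difference between the two structures, the sketch below builds a small Series and a small DataFrame and inspects their dimensions. The variable names `ages` and `people` and the sample values are made up for this example.

```python
import pandas as pd

# A Series: one-dimensional, each value has an index label
ages = pd.Series([22, 23, 21], index=['wang', 'sun', 'zhang'])

# A DataFrame: two-dimensional, with row labels and column names
people = pd.DataFrame(
    {'age': [22, 23, 21], 'country': ['china', 'china', 'nbc']},
    index=['wang', 'sun', 'zhang'],
)

print(ages.ndim, ages.shape)      # 1 (3,)
print(people.ndim, people.shape)  # 2 (3, 2)

# Each column of a DataFrame is itself a Series
print(type(people['age']))
```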

The material above draws on:

"非常全面的Pandas入门教程" (A Very Comprehensive Pandas Tutorial), Summer1Li's blog, CSDN

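2 Creating data

2.1 Creating a Series

A Series is similar to a one-dimensional array and can hold any type of data, such as integers, strings, and floats. It consists of two parts: a set of values and the index associated with them.

import pandas as pd

# Without an explicit index, a default integer index from 0 to n-1 is created
s = pd.Series(['wang', '23', '1999'])
print(s)

# Assign the index separately afterwards
s.index = ['name', 'age', 'year']
print(s)
print('------------ separator -----------')

# Define the index labels directly with the index parameter
s2 = pd.Series(['wang', '23', '1999'], index=['name', 'age', 'year'])
print(s2)
print('------------ separator -----------')

# A Series can also be created from a dictionary of key-value pairs ({}); the keys become the index
s3 = pd.Series({'name': 'wang', 'age': 23, 'year': 1999})
print(s3)

# Simple data access through the index
print(s3.iloc[0])      # by position
print(s3.loc['name'])  # by label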
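The difference between `iloc` (position) and `loc` (label) also shows up when slicing. The following is a minimal sketch reusing the `s3` Series from above; the names `by_position` and `by_label` are introduced only for this example.

```python
import pandas as pd

s3 = pd.Series({'name': 'wang', 'age': 23, 'year': 1999})

# Position-based slice: the end position is excluded
by_position = s3.iloc[0:2]

# Label-based slice: the end label is included
by_label = s3.loc['name':'age']

print(by_position)
print(by_label)

# Mixed strings and integers are stored with dtype object
print(s3.dtype)
```

Both slices select the same two elements here, even though one end point is written as position 2 and the other as the label 'age'.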

2.2 Creating a DataFrame

A DataFrame is similar to a two-dimensional array or a table (such as an Excel sheet). It has both a row index and a column index and is usually created from a dictionary, where each key is a column name and each value is that column's contents.

import pandas as pd

# Without an explicit index, rows are labelled 0 to n-1; a dictionary ({}) is the usual input
message = pd.DataFrame({
    'name': ['wang', 'sun', 'zhang', 'li'],
    'age': ['22', '23', '21', '22'],
    'country': ['china', 'china', 'nbc', 'english'],
    'score': ['65', '76', '87', '92']
})
print(message)
print('------------ separator -----------')

# The index parameter sets the row labels; the columns parameter sets the column order
message1 = pd.DataFrame(
    data={
        'age': ['22', '23', '21', '22'],
        'country': ['china', 'china', 'nbc', 'english'],
        'score': ['65', '76', '87', '92']
    },  # the data dictionary must be closed before the next argument
    index=['wang', 'sun', 'zhang', 'li'],
    columns=['age', 'score', 'country']
)
print(message1)

'''
    name age  country score
0   wang  22    china    65
1    sun  23    china    76
2  zhang  21      nbc    87
3     li  22  english    92
------------ separator -----------
      age score  country
wang   22    65    china
sun    23    76    china
zhang  21    87      nbc
li     22    92  english
'''
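After creating a DataFrame it is often useful to check what was actually built. The sketch below rebuilds `message1` from above and inspects a few standard attributes; it is only an illustration, not part of the original example.

```python
import pandas as pd

message1 = pd.DataFrame(
    data={
        'age': ['22', '23', '21', '22'],
        'country': ['china', 'china', 'nbc', 'english'],
        'score': ['65', '76', '87', '92'],
    },
    index=['wang', 'sun', 'zhang', 'li'],
    columns=['age', 'score', 'country'],
)

print(message1.shape)          # (number of rows, number of columns)
print(list(message1.columns))  # column names, in the order given by columns=
print(list(message1.index))    # row labels
print(message1.dtypes)         # quoted numbers are stored as strings, not integers
```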

3 Series

3.1 Selecting one row of a DataFrame with loc returns a Series

index: returns the index labels
index[0]: square brackets after the attribute select a single label
keys(): same as index, returns the index labels
keys()[0]: square brackets after the method call select a single label
values: returns the values of the Series
size: returns the number of elements in the Series

import pandas as pd

# Selecting one row of a DataFrame with loc returns a Series
# First create a DataFrame
message = pd.DataFrame(
    data={
        'age': ['22', '23', '21', '22'],
        'country': ['china', 'china', 'nbc', 'english'],
        'score': ['65', '76', '87', '92']
    },
    index=['wang', 'sun', 'zhang', 'li'],
    columns=['age', 'score', 'country']
)
print(message)
print('------------ separator -----------')

# Select one row by its row label; type() shows that the result is a Series
# A single column can be selected by its column label in the same way, e.g. all_age = message['age']
first_people = message.loc['wang']
print(first_people)        # two columns are printed: the index labels, then the values
print(type(first_people))  # type() function
print('------------ separator -----------')
print(first_people.index)      # .index returns the index labels
print(first_people.index[1])   # .index[1] selects the second label
print(first_people.keys())     # keys() is equivalent to index
print(first_people.keys()[1])  # keys()[1] selects the second label
print(first_people.values)     # values returns the values of the Series
print(first_people.size)       # size returns the number of elements in the Series

'''
      age score  country
wang   22    65    china
sun    23    76    china
zhang  21    87      nbc
li     22    92  english
------------ separator -----------
age           22
score         65
country    china
Name: wang, dtype: object
<class 'pandas.core.series.Series'>
------------ separator -----------
Index(['age', 'score', 'country'], dtype='object')
score
Index(['age', 'score', 'country'], dtype='object')
score
['22' '65' 'china']
3
'''
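Whether a selection returns a Series or a DataFrame depends on whether a single label or a list of labels is passed. The sketch below illustrates this on a small made-up DataFrame; the variable names are chosen only for this example.

```python
import pandas as pd

message = pd.DataFrame(
    {'age': [22, 23], 'score': [65, 76]},
    index=['wang', 'sun'],
)

row = message.loc['wang']          # one row label -> Series
column = message['age']            # one column name -> Series
sub_frame = message[['age']]       # list of column names -> DataFrame
sub_rows = message.loc[['wang']]   # list of row labels -> DataFrame

for obj in (row, column, sub_frame, sub_rows):
    print(type(obj).__name__)
```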

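3.2 Series methods, similar to numpy.ndarray

Calling mean() and std() below gives a wrong result or fails, because the age column was created from the dictionary as 'age':['22','23','21','22'], so its dtype is object (strings) rather than int64. It should be changed to 'age':[22,23,21,22]. The value 5558030.5 in the output arises because the strings are concatenated to '22232122', which is then divided by 4; newer pandas versions may raise a TypeError here instead.

import pandas as pd

# Selecting one column of a DataFrame returns a Series
# First create a DataFrame
message = pd.DataFrame(
    data={
        'age': ['22', '23', '21', '22'],
        'country': ['china', 'china', 'nbc', 'english'],
        'score': ['65', '76', '87', '92']
    },
    index=['wang', 'sun', 'zhang', 'li'],
    columns=['age', 'score', 'country']
)
print(message)
print('------------ separator -----------')

# Select one column by its column label; the result is a Series
all_age = message['age']
print(all_age)  # two columns are printed: the index labels, then the values
print('------------ separator -----------')
print(all_age.mean())    # mean; the result is wrong because the values are strings
print(all_age.min())     # minimum
print(all_age.max())     # maximum
print(all_age.sample())  # a random element
# print(all_age.std())   # standard deviation; fails on strings

'''
wang     22
sun      23
zhang    21
li       22
Name: age, dtype: object
------------ separator -----------
5558030.5
li    22
'''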
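If the data already arrives as strings, one way to get correct statistics is to convert the column first. The following is a minimal sketch using `pd.to_numeric`; the name `age_numeric` is introduced only for this example.

```python
import pandas as pd

all_age = pd.Series(
    ['22', '23', '21', '22'],
    index=['wang', 'sun', 'zhang', 'li'],
    name='age',
)

# Convert the strings to integers before computing statistics
# (all_age.astype(int) would also work for these values)
age_numeric = pd.to_numeric(all_age)

print(age_numeric.dtype)
print(age_numeric.mean())  # 22.0
print(age_numeric.min())   # 21
print(age_numeric.max())   # 23
print(age_numeric.std())   # sample standard deviation
```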

3.3 Boolean subsetting

import pandas as pd

all_age = pd.Series([22, 24, 13, 22, 21, 12])
print(all_age)
print('------------ separator -----------')
print(all_age.describe())  # describe() returns several summary statistics at once
print(all_age.mean())      # mean of all ages
print(all_age[all_age > all_age.mean()])  # select all ages greater than the mean
print(all_age > all_age.mean())  # the comparison alone returns a Series of dtype bool; compare with the output of the previous statement

# Supply a boolean vector by hand to select specific elements
age_values = [True, True, False, False, False, False]  # select the ages at index 0 and 1
print(all_age[age_values])

'''
0    22
1    24
2    13
3    22
4    21
5    12
dtype: int64
------------ separator -----------
count     6.000000
mean     19.000000
std       5.138093
min      12.000000
25%      15.000000
50%      21.500000
75%      22.000000
max      24.000000
dtype: float64
19.0
0    22
1    24
3    22
4    21
dtype: int64
0     True
1     True
2    False
3     True
4     True
5    False
dtype: bool
0    22
1    24
dtype: int64
'''
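Boolean vectors can also be combined before they are used for subsetting. The sketch below uses the same `all_age` values; the thresholds 20 and 24 and the result names are arbitrary choices for this example.

```python
import pandas as pd

all_age = pd.Series([22, 24, 13, 22, 21, 12])

# Combine conditions with & (and), | (or), ~ (not);
# each condition needs its own parentheses
between = all_age[(all_age > 20) & (all_age < 24)]
not_over_20 = all_age[~(all_age > 20)]

print(between)
print(not_over_20)
```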

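3.4 Vectorized operations

import pandas as pd

all_age = pd.Series([22, 24, 13, 22, 21, 12])
print(all_age)
print('------------ separator -----------')

# Operations between vectors of the same length
print(all_age + all_age)
print(all_age * all_age)

# Operations between a vector and an integer scalar; the scalar is applied to every element ("broadcasting")
print(all_age + 100)
print(all_age * 2)

# Vectors of different lengths: elements are matched by index label, so only the first three
# have a value on both sides; the remaining results are "missing", shown as NaN ("not a number")
print(all_age + pd.Series([1, 10, 3]))

# Vectors with the default index labels are aligned automatically by label
# sort_index() sorts by index label, ascending by default; ascending=False below sorts in descending order
# sort_values() sorts by value and is used in the same way
# Example: 1. print the original order 2. reverse the order with sort_index() 3. add the two;
# even though one vector is reversed, the elements are aligned by index label. Observe the result.
print(all_age)
rev_ages = all_age.sort_index(ascending=False)  # descending
print(rev_ages)
print(all_age + rev_ages)

'''
0    22
1    24
2    13
3    22
4    21
5    12
dtype: int64
------------ separator -----------
0    44
1    48
2    26
3    44
4    42
5    24
dtype: int64
0    484
1    576
2    169
3    484
4    441
5    144
dtype: int64
0    122
1    124
2    113
3    122
4    121
5    112
dtype: int64
0    44
1    48
2    26
3    44
4    42
5    24
dtype: int64
0    23.0
1    34.0
2    16.0
3     NaN
4     NaN
5     NaN
dtype: float64
0    22
1    24
2    13
3    22
4    21
5    12
dtype: int64
5    12
4    21
3    22
2    13
1    24
0    22
dtype: int64
0    44
1    48
2    26
3    44
4    42
5    24
dtype: int64
'''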
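When the NaN results from mismatched indexes are not wanted, the `add` method with `fill_value` is one option. The sketch below contrasts it with the `+` operator; `short`, `plain`, and `filled` are names made up for this example.

```python
import pandas as pd

all_age = pd.Series([22, 24, 13, 22, 21, 12])
short = pd.Series([1, 10, 3])

# The + operator leaves NaN wherever a label exists on only one side
plain = all_age + short

# add() with fill_value=0 treats the missing side as 0 instead
filled = all_age.add(short, fill_value=0)

print(plain)
print(filled)
```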

4 DataFrame

Boolean vectors, automatic alignment, and vectorized operations

import pandas as pd

# First create a DataFrame
message = pd.DataFrame(
    data={
        'age': [22, 23, 21, 22],
        'country': ['china', 'china', 'nbc', 'english'],
        'score': ['65', '76', '87', '92']
    },
    # index=['wang', 'sun', 'zhang', 'li'],
    columns=['age', 'score', 'country']
)
print(message)
print('------------ separator -----------')

# Select a subset of rows with a boolean vector
# Note the difference between a Series and a DataFrame:
# print(all_age[all_age > all_age.mean()])  # on a Series: all ages greater than the mean
# print(all_age > all_age.mean())           # on a Series: the boolean Series itself
print(message[message['age'] > message['age'].mean()])

# Returns rows 0 and 1
print(message.loc[[True, True, False, False]])
print('------------ separator -----------')

# Vectorized operations
first_half = message[:3]   # slice: rows 0, 1, 2
second_half = message[2:]  # rows 2, 3
print(first_half)
print(second_half)
print(message + message)  # each element is operated on separately: numbers are doubled, strings are concatenated with themselves
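The two slices `first_half` and `second_half` can be used to see automatic alignment on a DataFrame. The sketch below assumes an all-numeric variant of the table (here `score` is stored as integers, unlike the strings used above) so that the arithmetic is unambiguous.

```python
import pandas as pd

# Assumption for this example: numeric columns only
message = pd.DataFrame({
    'age': [22, 23, 21, 22],
    'score': [65, 76, 87, 92],
})

first_half = message[:3]   # rows 0, 1, 2
second_half = message[2:]  # rows 2, 3

# Rows are aligned by label: only row 2 exists in both halves,
# so every other row of the result is NaN
total = first_half + second_half
print(total)
```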

5 Adding, changing, and deleting columns

5.1 Adding

import pandas as pd

# First create a DataFrame
message = pd.DataFrame(
    data={
        'age': [22, 23, 21, 22],
        'country': ['china', 'china', 'nbc', 'english'],
        'score': ['65', '76', '87', '92'],
        'born': [1999, 2000, 2009, 1998],
        'now': [2023, 2023, 2023, 2023]
    },
    # index=['wang', 'sun', 'zhang', 'li'],
    columns=['age', 'score', 'country', 'born', 'now']
)
print(message)
print('------------ separator -----------')

# Add columns
# Convert the birth year and the current year from integers to datetime,
# so that date arithmetic can be done in the next step
born_time = pd.to_datetime(message['born'], format='%Y')  # a full year-month-day date would use '%Y-%m-%d'
print(born_time)
now_time = pd.to_datetime(message['now'], format='%Y')
print(now_time)

# So far the datetime values are only held in variables and are not part of message;
# add born_time and now_time to message as new columns
message['born_time_1'], message['now_time'] = (born_time, now_time)
print(message)

'''
0   1999-01-01
1   2000-01-01
2   2009-01-01
3   1998-01-01
Name: born, dtype: datetime64[ns]
0   2023-01-01
1   2023-01-01
2   2023-01-01
3   2023-01-01
Name: now, dtype: datetime64[ns]
   age score  country  born   now born_time_1   now_time
0   22    65    china  1999  2023  1999-01-01 2023-01-01
1   23    76    china  2000  2023  2000-01-01 2023-01-01
2   21    87      nbc  2009  2023  2009-01-01 2023-01-01
3   22    92  english  1998  2023  1998-01-01 2023-01-01
'''
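Besides direct assignment, a column can also be added with `assign()`, which returns a new DataFrame. The sketch below shows both on a reduced version of the table; the column names `years` and `is_adult` and the threshold 18 are made up for this example.

```python
import pandas as pd

message = pd.DataFrame({
    'born': [1999, 2000, 2009, 1998],
    'now': [2023, 2023, 2023, 2023],
})

# Direct assignment adds the column to message itself
message['years'] = message['now'] - message['born']

# assign() returns a new DataFrame and leaves message unchanged
extended = message.assign(is_adult=message['years'] >= 18)

print(message)
print(extended)
```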

5.2 Changing: astype()

# Change columns
# Subtract born_time_1 from now_time to get the age in days
message['age_day'] = (message['now_time'] - message['born_time_1'])
print(message)

# Use astype('timedelta64[Y]') to convert the number of days to whole years
# Note: this unit conversion works on the older pandas version used here;
# pandas 2.0 and later may raise an error for the 'Y' unit
message['age_year'] = message['age_day'].astype('timedelta64[Y]')
print(message)

'''
   age score  country  born   now born_time_1   now_time   age_day
0   22    65    china  1999  2023  1999-01-01 2023-01-01 8766 days
1   23    76    china  2000  2023  2000-01-01 2023-01-01 8401 days
2   21    87      nbc  2009  2023  2009-01-01 2023-01-01 5113 days
3   22    92  english  1998  2023  1998-01-01 2023-01-01 9131 days
   age score  country  born   now born_time_1   now_time   age_day  age_year
0   22    65    china  1999  2023  1999-01-01 2023-01-01 8766 days      24.0
1   23    76    china  2000  2023  2000-01-01 2023-01-01 8401 days      23.0
2   21    87      nbc  2009  2023  2009-01-01 2023-01-01 5113 days      13.0
3   22    92  english  1998  2023  1998-01-01 2023-01-01 9131 days      24.0
'''
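Because the 'Y' conversion depends on the pandas version, the sketch below shows two alternatives that rely only on the `.dt` accessor. It rebuilds the two datetime columns from year strings; `approx_years` and `calendar_years` are names introduced for this example, and 365.2425 is the average length of a Gregorian year in days. Note that truncating the day count gives 13 and 24 for the last two rows, while the calendar difference is 14 and 25.

```python
import pandas as pd

born_time = pd.to_datetime(pd.Series(['1999', '2000', '2009', '1998']), format='%Y')
now_time = pd.to_datetime(pd.Series(['2023', '2023', '2023', '2023']), format='%Y')

age_day = now_time - born_time

# Whole days as integers, via the .dt accessor
days = age_day.dt.days

# Approximate years from the average year length, truncated to an integer
approx_years = (days / 365.2425).astype(int)

# Difference of the calendar years (exact here because every date is 1 January)
calendar_years = now_time.dt.year - born_time.dt.year

print(days.tolist())
print(approx_years.tolist())
print(calendar_years.tolist())
```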

5.3 Deleting: drop

# Delete values
# Delete a row: drop removes rows by default; pass axis=1 to remove columns
message = message.drop([2])
print(message)

# Delete a column with axis=1
message = message.drop(['age_day'], axis=1)
print(message)

'''
   age score  country  born   now born_time_1   now_time   age_day  age_year
0   22    65    china  1999  2023  1999-01-01 2023-01-01 8766 days      24.0
1   23    76    china  2000  2023  2000-01-01 2023-01-01 8401 days      23.0
3   22    92  english  1998  2023  1998-01-01 2023-01-01 9131 days      24.0
   age score  country  born   now born_time_1   now_time  age_year
0   22    65    china  1999  2023  1999-01-01 2023-01-01      24.0
1   23    76    china  2000  2023  2000-01-01 2023-01-01      23.0
3   22    92  english  1998  2023  1998-01-01 2023-01-01      24.0
'''
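`drop` also accepts the keyword arguments `index=` and `columns=`, which can read more clearly than `axis`. The sketch below uses a reduced, made-up version of the table and shows that `drop` returns a new DataFrame rather than modifying the original.

```python
import pandas as pd

message = pd.DataFrame({
    'age': [22, 23, 21, 22],
    'score': [65, 76, 87, 92],
    'born': [1999, 2000, 2009, 1998],
})

without_row = message.drop(index=[2])                   # remove the row labelled 2
without_cols = message.drop(columns=['score', 'born'])  # remove two columns

# drop returns a new DataFrame; message itself still has all rows and columns
print(without_row)
print(without_cols)
print(message.shape)
```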

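6 Importing and exporting data

6.1 CSV

# Import a CSV file
df = pd.read_csv('D://postgraduatestudies/py/mypythonstudy/gapdata0.csv', encoding="utf-8", sep=',')

# Save to a CSV file
# index controls whether the row labels are written; index=False leaves them out
message.to_csv('D://postgraduatestudies/py/mypythonstudy/message_3.csv', index=False)

With index=False the file contains only the data columns; without it, the row labels are written as an extra first column.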
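The effect of `index=False` can be checked without writing a file, because `to_csv` returns the CSV text as a string when no path is given. The sketch below uses a small made-up DataFrame and reads the text back through `io.StringIO`.

```python
import io
import pandas as pd

message = pd.DataFrame(
    {'age': [22, 23], 'country': ['china', 'nbc']},
    index=['wang', 'sun'],
)

# With no path argument, to_csv returns the CSV text
with_index = message.to_csv()
without_index = message.to_csv(index=False)

print(with_index)     # the header line starts with an empty field for the row labels
print(without_index)  # the header line contains only the column names

# Read the text back; index_col=0 restores the row labels
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(restored)
```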

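6.2 Excel

# Export to an Excel file
# To save a Series, it can first be converted to a single-column DataFrame with to_frame()
# age_1 = age.to_frame()

# Export as an .xls file
import xlwt  # install first with pip install xlwt; pandas 2.0 and later may no longer support writing .xls files
message.to_excel('D://postgraduatestudies/py/mypythonstudy/message_4.xls', index=False)

# Export as an .xlsx file
import openpyxl  # install openpyxl first
message.to_excel('D://postgraduatestudies/py/mypythonstudy/message_4.xlsx', index=False)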
