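Pandas
Pandas provides data types that are convenient for analysis, together with a wide range of data-analysis functions. It is conventionally imported with import pandas as pd. Pandas is built on NumPy and is often used together with NumPy and Matplotlib. It provides two main data types: Series (one-dimensional labeled data) and DataFrame (two-dimensional, and usable for higher-dimensional data). Both are extension types built on ndarray: where ndarray expresses the structure of the data (its dimensions), the pandas types express its application (the relationship between data and index). They are index-based data structures, so operations on the data are carried out through operations on the index.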
Series
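A Series consists of a set of data and the index associated with that data. The index can be automatic or user-defined, and a Series can be created in several ways. Basic operations: a.index, a.values, a["a"], a[0] (positional; recent pandas versions prefer a.iloc[0]), slicing with a[1:], and checking whether a label is in the index with "c" in a. Operations between two or more Series are aligned: values are matched on identical index labels. A Series can be modified by assignment, for example a["a"] = 9, and the change takes effect immediately.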
import pandas as pd
import numpy as np
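# create from a Python list
a = pd.Series([1,2,3,4], index=["a","b","c","d"])
print(a)
# create from a scalar; index cannot be omitted
b = pd.Series(2, index=["a","b","c","d"])
print(b)
# create from a dict
c = pd.Series({"a":1,"b":2})
print(c)
# the index selects which values are kept; labels missing from the dict become NaN
# (use a list for index: a set is unordered and is rejected by current pandas)
d = pd.Series({"a":1,"b":2}, index=["b", "a", "c"])
print(d)
# create from an ndarray
e = pd.Series(np.arange(5))
print(e)
f = pd.Series(np.arange(5), index=np.arange(9,4,-1))# values with a custom index
print(f)
# basic operations: reading the index and the values
a = pd.Series([1,2,3,4], ["a","b","c","d"])
print(a.index)
print(a.values)
print(a["b"])# label-based and position-based indexing can each be used alone, but not mixed
print(a[1:3])
"c" in a# True
print(a.get("f",100))# returns the default 100 because label "f" does not exist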
a    1
b    2
c    3
d    4
dtype: int64
a    2
b    2
c    2
d    2
dtype: int64
a    1
b    2
dtype: int64
b    2.0
a    1.0
c    NaN
dtype: float64
0    0
1    1
2    2
3    3
4    4
dtype: int32
9    0
8    1
7    2
6    3
5    4
dtype: int32
Index(['a', 'b', 'c', 'd'], dtype='object')
[1 2 3 4]
2
b    2
c    3
dtype: int64
100
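The alignment and in-place modification mentioned in the Series description are not exercised by the code above. The following is a minimal sketch of both, using two made-up Series (s1 and s2 are illustrative names, not from the original):

```python
import pandas as pd

# Two Series whose indexes only partly overlap
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Addition aligns on labels; a label present in only one Series gives NaN
total = s1 + s2
print(total)

# Assigning through a label changes the Series immediately
s1["a"] = 9
print(s1["a"])
```

Because one side is missing for labels "a" and "d", the result is converted to float so that it can hold NaN.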
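DataFrame
A DataFrame is a table of several columns that share one index. It has an index (the row index) and columns (the column index); automatic indexes start at 0. Creation: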
import pandas as pd
import numpy as np
# create from a two-dimensional ndarray
a = pd.DataFrame(np.arange(10).reshape(2,5))
print(a)
# create from a dict of Series
b = {"one":pd.Series([1,2,3],index=["a","b","c"]),
"two":pd.Series([6,7,8,9], index=["a","b","c","d"])}
c = pd.DataFrame(b)
print(c)
print(pd.DataFrame(b, index=["a","d"],columns=["one"]))
# create from a dict of lists
dl = {"one":[1,2,3],"two":[6,7,8]}
d = pd.DataFrame(dl, index=["a","b","c"])
print(d)
   0  1  2  3  4
0  0  1  2  3  4
1  5  6  7  8  9
   one  two
a  1.0    6
b  2.0    7
c  3.0    8
d  NaN    9
   one
a  1.0
d  NaN
   one  two
a    1    6
b    2    7
c    3    8
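Once a DataFrame exists, its row index, column index, and values can be read back in much the same way as for a Series. The sketch below assumes a small made-up table (df, col, row and cell are illustrative names) and uses .loc for label-based selection:

```python
import pandas as pd

df = pd.DataFrame({"one": [1, 2, 3], "two": [6, 7, 8]}, index=["a", "b", "c"])

print(df.index)     # row labels
print(df.columns)   # column labels
print(df.values)    # underlying 2-D ndarray

col = df["two"]             # one column, returned as a Series
row = df.loc["b"]           # one row selected by label, returned as a Series
cell = df.loc["b", "two"]   # a single value by row label and column label
print(col, row, cell, sep="\n")
```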
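Operations on Pandas data types
Changing the structure: add or reorder entries by reindexing with reindex; delete entries with drop; the fill_value argument fills the missing values created by reindexing. Operations on the Index type: .append(idx) concatenates another Index object; .difference(idx) computes the set difference and returns a new Index object; ……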
import numpy as np
import pandas as pd
dl = {"one":[1,2,3],"two":[6,7,8],"three":[4,5,9]}
d = pd.DataFrame(dl, index=["a","b","c"])
print(d)
print(d.drop("a"))
print(d.drop("one",axis=1))# axis=1 refers to columns
d = d.reindex(index=["b","c","a"])# reorder the rows
print(d)
d = d.reindex(columns=["three","one","two"])# reorder the columns
print(d)
# f = d.columns.insert(3,"new")
# f = d.reindex(columns=f, fill_value=20)
# print(f)
# Index operations
nc = d.columns.delete(2)
print(nc)
ni = d.index.insert(3,"m")
print(ni)
nd = d.reindex(index=ni,columns=nc)
print(nd)
n = pd.Series([1,2,3,4],index=["j","k","l","o"])
print(n)
print(n.drop(["j"]))# .drop returns a new Series and leaves the original unchanged
print(n)
   one  two  three
a    1    6      4
b    2    7      5
c    3    8      9
   one  two  three
b    2    7      5
c    3    8      9
   two  three
a    6      4
b    7      5
c    8      9
   one  two  three
b    2    7      5
c    3    8      9
a    1    6      4
   three  one  two
b      5    2    7
c      9    3    8
a      4    1    6
Index(['three', 'one'], dtype='object')
Index(['b', 'c', 'a', 'm'], dtype='object')
   three  one
b    5.0  2.0
c    9.0  3.0
a    4.0  1.0
m    NaN  NaN
j    1
k    2
l    3
o    4
dtype: int64
k    2
l    3
o    4
dtype: int64
j    1
k    2
l    3
o    4
dtype: int64
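The fill_value argument and the Index methods listed in the description do not run in the code above. A minimal sketch follows; the column label "extra" and the variable names are made up for illustration:

```python
import pandas as pd

d = pd.DataFrame({"one": [1, 2, 3], "two": [6, 7, 8]}, index=["a", "b", "c"])

# Add a column label at the end, then reindex; the new column is filled with 20
new_cols = d.columns.insert(len(d.columns), "extra")
filled = d.reindex(columns=new_cols, fill_value=20)
print(filled)

# Index objects are immutable; these methods return new Index objects
idx1 = pd.Index(["a", "b", "c"])
idx2 = pd.Index(["c", "d"])
joined = idx1.append(idx2)          # concatenation, duplicates are kept
only_in_1 = idx1.difference(idx2)   # labels in idx1 that are not in idx2
print(joined)
print(only_in_1)
```

Without fill_value, the new column would contain NaN and the existing integer columns would stay as they are.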
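Pandas arithmetic
Broadcasting: operands of different dimensions or sizes are first padded with NaN and then combined, so the padded positions evaluate to NaN. The four arithmetic operations come in two forms: operators (+, -, *, /) and methods (.add(), .sub(), .mul(), .div()). In operations between a Series and a DataFrame, the Series takes part along axis=1 by default. Comparison operations: between objects of the same dimension the sizes must match; between objects of different dimensions, broadcasting happens along axis 1 by default.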
import pandas as pd
import numpy as np
a = pd.DataFrame(np.arange(12).reshape(3,4))
print(a)
b = pd.DataFrame(np.arange(20).reshape(4,5))
print(b)
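print(a+b)# positions missing from either operand are padded and give NaN
# the method form of the four operations has the advantage of accepting extra arguments
print(a.add(b,fill_value=10))# replace missing values with a fixed value before adding
c = pd.Series(np.arange(4))
print(c)
print(b-c)# a Series takes part along axis=1 by default
# comparison operations
# print(a>b) # raises an error: the two DataFrames have different shapes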
print(a>c)
   0  1   2   3
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
    0   1   2   3   4
0   0   1   2   3   4
1   5   6   7   8   9
2  10  11  12  13  14
3  15  16  17  18  19
      0     1     2     3    4
0   0.0   2.0   4.0   6.0  NaN
1   9.0  11.0  13.0  15.0  NaN
2  18.0  20.0  22.0  24.0  NaN
3   NaN   NaN   NaN   NaN  NaN
      0     1     2     3     4
0   0.0   2.0   4.0   6.0  14.0
1   9.0  11.0  13.0  15.0  19.0
2  18.0  20.0  22.0  24.0  24.0
3  25.0  26.0  27.0  28.0  29.0
0    0
1    1
2    2
3    3
dtype: int32
      0     1     2     3    4
0   0.0   0.0   0.0   0.0  NaN
1   5.0   5.0   5.0   5.0  NaN
2  10.0  10.0  10.0  10.0  NaN
3  15.0  15.0  15.0  15.0  NaN
       0      1      2      3
0  False  False  False  False
1   True   True   True   True
2   True   True   True   True
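Since a Series is matched against the column labels by default, it may help to see the alternative. The sketch below reuses the same shapes as the example above and passes axis=0 to the method form so that the Series is matched against the row labels instead (by_columns and by_rows are illustrative names):

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(20).reshape(4, 5))
c = pd.Series(np.arange(4))

# Default (axis=1): the labels 0..3 of c are matched against the column labels of b
by_columns = b.sub(c)

# axis=0: the labels 0..3 of c are matched against the row labels of b
by_rows = b.sub(c, axis=0)

print(by_columns)
print(by_rows)
```

With axis=0 every row label of b has a partner in c, so no NaN appears; with the default, column 4 has no partner and becomes NaN.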
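Sorting data
Sorting by index along a given axis: .sort_index(). It sorts in ascending order by default and works on axis 0 (vertically, on the row index) by default; pass axis=1 to sort horizontally, on the column index. Sorting by value along a given axis: .sort_values(). It also sorts in ascending order and works on axis 0 by default.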
import pandas as pd
import numpy as np
# sorting by index
a = pd.DataFrame(np.arange(12).reshape(3,4), index=["a","b","c"])
print(a)
b = a.sort_index(ascending=False)# operates on axis 0 by default
print(b)
c = a.sort_index(axis=1, ascending=False)
print(c)
# sorting by value
d = a.sort_values(2, ascending=False)# sort the rows by the values in column 2
print(d)
e = a.sort_values("a", axis=1, ascending=False)# sort the columns by the values in row "a"
print(e)
   0  1   2   3
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
   0  1   2   3
c  8  9  10  11
b  4  5   6   7
a  0  1   2   3
    3   2  1  0
a   3   2  1  0
b   7   6  5  4
c  11  10  9  8
   0  1   2   3
c  8  9  10  11
b  4  5   6   7
a  0  1   2   3
    3   2  1  0
a   3   2  1  0
b   7   6  5  4
c  11  10  9  8
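The same two methods exist on a Series. The sketch below uses a made-up Series containing one missing value to show how sort_values treats NaN: with the default na_position="last", NaN ends up at the end in both ascending and descending order.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.0, np.nan, 1.0, 2.0], index=["d", "b", "a", "c"])

by_label = s.sort_index()                    # labels in ascending order
ascending = s.sort_values()                  # smallest first, NaN last
descending = s.sort_values(ascending=False)  # largest first, NaN still last

print(by_label)
print(ascending)
print(descending)
```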
Basic statistical analysis
Some functions: .sum() ...... .describe(): reports several summary statistics at once
import pandas as pd
import numpy as np
# Series
a = pd.Series(np.arange(3), index=["a","b","c"])
print(a)
print(a.describe())# the result is itself a Series of several statistics, so a value can be read by its label
print(a.describe()["mean"])
# DataFrame
b = pd.DataFrame(np.arange(12).reshape(3,4), index=["a","b","c"])
print(b.describe())
print(b.describe()[2])
a    0
b    1
c    2
dtype: int32
count    3.0
mean     1.0
std      1.0
min      0.0
25%      0.5
50%      1.0
75%      1.5
max      2.0
dtype: float64
1.0
         0    1     2     3
count  3.0  3.0   3.0   3.0
mean   4.0  5.0   6.0   7.0
std    4.0  4.0   4.0   4.0
min    0.0  1.0   2.0   3.0
25%    2.0  3.0   4.0   5.0
50%    4.0  5.0   6.0   7.0
75%    6.0  7.0   8.0   9.0
max    8.0  9.0  10.0  11.0
count     3.0
mean      6.0
std       4.0
min       2.0
25%       4.0
50%       6.0
75%       8.0
max      10.0
Name: 2, dtype: float64
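The individual statistics can also be called directly. The sketch below applies a few of them to the same 3x4 table; the choice of functions (sum, mean, max, idxmax) is illustrative, not a list taken from the original:

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(12).reshape(3, 4), index=["a", "b", "c"])

col_sums = b.sum()          # one value per column (axis 0 is the default)
row_sums = b.sum(axis=1)    # one value per row
col_means = b.mean()        # column means
row_max = b.max(axis=1)     # largest value in each row
max_label = b.idxmax()      # row label of the maximum in each column

print(col_sums, row_sums, col_means, row_max, max_label, sep="\n")
```

Each result is a Series, so a single statistic can be picked out by label just as with describe().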
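Cumulative statistics
Cumulative computations over the first n elements (running sum and similar operations). Window computations: rolling calculations.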
import numpy as np
import pandas as pd
b = pd.DataFrame(np.arange(12).reshape(3,4), index=["a","b","c"])
print(b)
print(b.cumsum())# axis 0 by default
print(b.cummin())
print(b.rolling(2).sum())# positions without enough neighbouring elements to fill the window give NaN
   0  1   2   3
a  0  1   2   3
b  4  5   6   7
c  8  9  10  11
    0   1   2   3
a   0   1   2   3
b   4   6   8  10
c  12  15  18  21
   0  1  2  3
a  0  1  2  3
b  0  1  2  3
c  0  1  2  3
      0     1     2     3
a   NaN   NaN   NaN   NaN
b   4.0   6.0   8.0  10.0
c  12.0  14.0  16.0  18.0
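Other cumulative and rolling functions follow the same pattern. The sketch below, on the same table, is one possible illustration; the min_periods argument is an addition not discussed in the original and lets an incomplete window produce a value instead of NaN:

```python
import numpy as np
import pandas as pd

b = pd.DataFrame(np.arange(12).reshape(3, 4), index=["a", "b", "c"])

running_product = b.cumprod()                    # running product down each column
running_max = b.cummax(axis=1)                   # running maximum across each row
window_mean = b.rolling(2).mean()                # mean over a window of 2 rows
partial_sum = b.rolling(2, min_periods=1).sum()  # first row uses a window of 1

print(running_product)
print(running_max)
print(window_mean)
print(partial_sum)
```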
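Correlation analysis
A covariance greater than 0 indicates positive correlation: .cov() returns the covariance matrix. Pearson correlation coefficient: .corr() returns the correlation-coefficient matrix.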
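As a minimal sketch of both methods, the example below uses made-up data in which y tends to rise with x and z falls linearly as x rises. The names x, y, z and df are illustrative only:

```python
import pandas as pd

# Made-up data: y tends to rise with x, z falls linearly as x rises
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])
z = pd.Series([10, 8, 6, 4, 2])

cov_xy = x.cov(y)     # covariance between two Series
corr_xy = x.corr(y)   # Pearson coefficient, between -1 and 1
corr_xz = x.corr(z)   # exact negative linear relationship
print(cov_xy, corr_xy, corr_xz)

df = pd.DataFrame({"x": x, "y": y, "z": z})
print(df.cov())    # covariance matrix of all column pairs
print(df.corr())   # correlation matrix of all column pairs
```

On a Series, .cov() and .corr() take the other Series as an argument and return a single number; on a DataFrame they return a square matrix indexed by the column names.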