Machine Learning Techniques (Part 2): Python Scientific Computing Modules (NumPy, Pandas)

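Contents

I. NumPy
1. Introduction, Installation, and Import
2. Common NumPy Operations

II. Pandas
1. Importing Dependencies
2. Creating a Series
3. Creating a DataFrame
4. Viewing Data in a DataFrame
5. Operating on DataFrame Data
6. Handling Missing Data in Pandas

III. Summary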

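I. NumPy

1. Introduction, Installation, and Import

This section uses functions from the NumPy library to perform basic operations on arrays. Becoming familiar with these data-processing methods lays a foundation for handling data and implementing low-level code in later machine learning experiments.

First install the required library with "pip install <package name>":

pip install numpy

You can also download the package's source files and install from source. If the library is already present in your environment (run pip list to see the installed packages), skip this step.

(Some Python libraries need other dependencies to be installed first. If installation with pip fails, install the dependencies and try again, or install from source.)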

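2. Common NumPy Operations

Import numpy, generate arrays with NumPy functions, and inspect their attributes.

import numpy as np # "as" gives the library an alias on import

np.zeros(shape, dtype=float, order='C')

Creates an all-zero array with the given shape. dtype is the data type. order='C' means row-major storage, as in the C language; order='F' means column-major storage.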

n = np.zeros([2, 2]) # create a 2x2 array of zeros
print(n)

''' Output:
[[0. 0.]
 [0. 0.]] '''
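The dtype and order parameters described above can be sketched as follows. This is only a minimal illustration; the variable name z and the 2x3 shape are chosen here and do not come from the original.

```python
import numpy as np

# Integer zeros stored in column-major (Fortran) order
z = np.zeros((2, 3), dtype=int, order='F')

print(z.dtype.kind)              # 'i' for a signed integer type
print(z.flags['F_CONTIGUOUS'])   # True: each column is contiguous in memory
print(z.flags['C_CONTIGUOUS'])   # False for this 2x3 array
```

The printed values of the array are identical in both orders; only the memory layout differs.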

np.ones(shape, dtype=None, order='C')

Creates an all-ones array; the parameters work the same way as for np.zeros().

m = np.ones([2, 2]) # create a 2x2 array of ones
print(m)

''' Output:
[[1. 1.]
 [1. 1.]] '''

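np.eye(N, M=None, k=0, dtype=float, order='C')

Creates a 2-D array with ones on a diagonal and zeros elsewhere. N is the number of rows; M is the number of columns, which defaults to N; k is the index of the diagonal, where 0 is the main diagonal.

e = np.eye(2) # create a 2x2 identity matrix
print(e)

''' Output:
[[1. 0.]
 [0. 1.]] '''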
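To show what the M and k parameters do, here is a small sketch with a non-square shape and an offset diagonal. The values 2, 3 and k=1 are arbitrary choices for illustration.

```python
import numpy as np

# 2 rows, 3 columns, ones on the first diagonal above the main one
e2 = np.eye(2, 3, k=1)
print(e2)
# [[0. 1. 0.]
#  [0. 0. 1.]]
```

A negative k would move the ones below the main diagonal instead.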

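numpy.empty(shape, dtype=float, order='C')

Creates an array without initializing its entries, so the values are whatever happened to be in memory and can differ from run to run.

print(np.empty([3, 3]))

''' Example output (the values are arbitrary):
[[0.00000000e+000 0.00000000e+000 0.00000000e+000]
 [0.00000000e+000 0.00000000e+000 1.54148482e-321]
 [8.34448956e-308 1.42404727e-306 1.44629801e-307]] '''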

numpy.full(shape, fill_value, dtype=None, order='C')
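The original gives only the signature of numpy.full. As a minimal sketch, it creates an array of the given shape in which every element equals fill_value. The example also prints a few array attributes (shape, ndim, size, dtype); the fill value 7 and the name f are arbitrary choices made here.

```python
import numpy as np

f = np.full([2, 3], 7)   # 2x3 array, every element is 7
print(f)
# [[7 7 7]
#  [7 7 7]]

print(f.shape)  # (2, 3): length of each dimension
print(f.ndim)   # 2: number of dimensions
print(f.size)   # 6: total number of elements
print(f.dtype)  # integer type inferred from the fill value (width depends on the platform)
```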

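II. Pandas

Pandas is used constantly when processing data in Python. It provides a wide range of data-processing and time-series methods, along with high-performance, easy-to-use data structures and analysis tools, which makes it one of the most important scientific computing modules in Python today.

1. Importing Dependencies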

import pandas as pd

import numpy as np

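2. Creating a Series

Series is one of the data structures in Pandas. It is an object similar to a one-dimensional array, made up of a set of data and an associated set of labels (the index).

Creating a Series with pandas

pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False) creates a Series. data is the data, which can be an array, a dict, and so on; index holds the index labels and, for array-like data, must have the same length as the data; dtype is the data type.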

s = pd.Series([1, 3, 5, np.nan, 6, 8]) # np.nan marks a missing value
print(s)

Output:

0 1.0

1 3.0

2 5.0

3 NaN

4 6.0

5 8.0

dtype: float64
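The index and name parameters from the signature above can be sketched like this. The labels, values, and the name 'score' are hypothetical and used only for illustration.

```python
import pandas as pd

# Explicit labels and a name for the Series
s2 = pd.Series([10, 20, 30], index=['x', 'y', 'z'], name='score')

print(s2['y'])            # 20, access by label
print(s2.index.tolist())  # ['x', 'y', 'z']
print(s2.name)            # score
```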

Creating a Series from an ndarray

data = np.array(['a', 'b', 'c', 'd'])
s = pd.Series(data)
print(s)

Output:

0 a

1 b

2 c

3 d

dtype: object

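Creating a Series from a dict

A dict can be passed as the input. If no index is specified, the dict keys become the index, in insertion order. If an index is passed, the values whose keys match the index labels are pulled out, and labels with no matching key get NaN.

data1 = {'a': 0., 'b': 1., 'c': 2.}
s = pd.Series(data1)
print(s)

Output: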

a 0.0

b 1.0

c 2.0

dtype: float64
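The case where an index is passed together with a dict can be sketched as follows. The index list, including the extra label 'd', is an assumption made here for illustration.

```python
import pandas as pd

data1 = {'a': 0., 'b': 1., 'c': 2.}

# Labels are looked up in the dict; 'd' has no entry, so it becomes NaN
s3 = pd.Series(data1, index=['b', 'c', 'd', 'a'])
print(s3)
# b    1.0
# c    2.0
# d    NaN
# a    0.0
# dtype: float64
```

The result follows the order of the index that was passed, not the order of the dict.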

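3. Creating a DataFrame

DataFrame is the most commonly used data structure in Pandas and resembles a table in a database. A DataFrame itself is two-dimensional, but hierarchical (MultiIndex) indexing lets it represent higher-dimensional data. It has both a row index and a column index, and can be seen as a dict of Series in which each Series is one column.

pandas.date_range(start=None, end=None, periods=None, freq="D", tz=None, normalize=False, name=None, closed=None, **kwargs)

Creates a DatetimeIndex, an index made up of a sequence of timestamps. start is the first date, end is the last date, periods is the number of entries, freq is the interval (D means daily), and tz is the time zone.

pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)

Creates a DataFrame. data is the data, index holds the row labels, and columns holds the column names.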

dates = pd.date_range('20230703', periods=7) # time series used as the row index
print(dates)
print("--"*16)
df = pd.DataFrame(np.random.randn(7,4), index=dates, columns=list('ABCD'))
print(df)

Output (the random values differ on every run):

DatetimeIndex(['2023-07-03', '2023-07-04', '2023-07-05', '2023-07-06',

'2023-07-07', '2023-07-08', '2023-07-09'],

dtype='datetime64[ns]', freq='D')

--------------------------------

A B C D

2023-07-03 0.363595 0.296730 -1.701725 -0.323202

2023-07-04 0.337387 1.545662 -0.598071 0.399948

2023-07-05 0.561122 -0.244265 1.874373 -0.500115

2023-07-06 -0.045536 -0.440416 0.679049 -0.343199

2023-07-07 -1.767418 0.141762 -3.015787 1.072641

2023-07-08 -1.085728 0.034458 -0.180272 -0.870009

2023-07-09 0.993971 0.589281 -1.142692 1.395541
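The start, end, and freq parameters of date_range can also be combined. The following is a small sketch, assuming a two-day interval ('2D') over the same week used above.

```python
import pandas as pd

# Every second day between the two end points
idx = pd.date_range(start='2023-07-03', end='2023-07-09', freq='2D')
print(idx)
print(len(idx))  # 4: July 3, 5, 7 and 9
```

Giving start and end fixes the range, so periods is not needed here.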

Creating a DataFrame from a dict

Input:

df1 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20230703'), # pd.Timestamp creates a timestamp
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    # Categorical is a pandas data type for categorical (statistical) variables.
                    # Categorical data can have a specific order, which is set by hand
                    # at creation time and is static.
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
print(df1)

Output:

A B C D E F

0 1.0 2023-07-03 1.0 3 test foo

1 1.0 2023-07-03 1.0 3 train foo

2 1.0 2023-07-03 1.0 3 test foo

3 1.0 2023-07-03 1.0 3 train foo
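The remark that a Categorical can carry a hand-set order can be sketched as follows. The ordering chosen here ('train' before 'test') is hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical ordering for illustration: 'train' ranks below 'test'
cat = pd.Categorical(["test", "train", "test", "train"],
                     categories=["train", "test"], ordered=True)

print(cat.codes.tolist())    # [1, 0, 1, 0]: position of each value in `categories`
print(cat.min(), cat.max())  # train test
```

Without ordered=True the categories still exist, but comparisons such as min and max are not defined.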

4. Viewing Data in a DataFrame

Viewing the top and bottom rows

head() and tail() return the first and last rows of a DataFrame, which is a quick way to check its contents.

Input:

data2 = np.arange(30).reshape(6, 5) # array of shape (6, 5)
df2 = pd.DataFrame(data2, index=['a', 'b', 'c', 'd', 'e', 'f'], columns=['A', 'B', 'C', 'D', 'E']) # create the DataFrame
print(df2)
print("--" * 10)
# DataFrame.head(n=5): show the first n rows (5 by default)
print(df2.head())
print("--" * 10)
# DataFrame.tail(n=5): show the last n rows; the parameter works as in head
print(df2.tail(3))

Output:

A B C D E

a 0 1 2 3 4

b 5 6 7 8 9

c 10 11 12 13 14

d 15 16 17 18 19

e 20 21 22 23 24

f 25 26 27 28 29

--------------------

A B C D E

a 0 1 2 3 4

b 5 6 7 8 9

c 10 11 12 13 14

d 15 16 17 18 19

e 20 21 22 23 24

--------------------

A B C D E

d 15 16 17 18 19

e 20 21 22 23 24

f 25 26 27 28 29

Displaying the index, columns, and data

Input:

print("index is :")
print(df2.index) # row index
print("columns is :")
print(df2.columns) # column index
print("values is :")
print(df2.values) # underlying data

Output:

index is :

Index(['a', 'b', 'c', 'd', 'e', 'f'], dtype='object')

columns is :

Index(['A', 'B', 'C', 'D', 'E'], dtype='object')

values is :

[[ 0 1 2 3 4]

[ 5 6 7 8 9]

[10 11 12 13 14]

[15 16 17 18 19]

[20 21 22 23 24]

[25 26 27 28 29]]

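Use loc to slice by row label and select by column name to get the data you want.

# DataFrame.loc: access a group of rows and columns by label or boolean array
print(df2.loc['a':'f':2, 'A']) # column A, rows 'a' through 'f' (f included), step 2

Output: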

a 0

c 10

e 20

Name: A, dtype: int64
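Besides label slices, loc accepts a boolean array, and iloc selects by integer position. The sketch below rebuilds df2 so that it runs on its own; the particular selections are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(30).reshape(6, 5),
                   index=list('abcdef'), columns=list('ABCDE'))

# Label-based: rows 'b' through 'd' (end label included), columns 'A' and 'C'
by_label = df2.loc['b':'d', ['A', 'C']]

# Boolean mask: rows where column 'A' is greater than 10
by_mask = df2.loc[df2['A'] > 10]

# Position-based: first two rows and first two columns (end position excluded)
by_pos = df2.iloc[0:2, 0:2]

print(by_label)
print(by_mask.index.tolist())  # ['d', 'e', 'f']
print(by_pos)
```

Note the difference in slicing: loc includes the end label, while iloc excludes the end position, as ordinary Python slices do.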

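Viewing summary statistics

DataFrame.describe(percentiles=None, include=None, exclude=None)

Generates descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset's distribution, excluding NaN values.

print(df2.describe()) # summary statistics for each numeric column

Output: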

A B C D E

count 6.000000 6.000000 6.000000 6.000000 6.000000

mean 12.500000 13.500000 14.500000 15.500000 16.500000

std 9.354143 9.354143 9.354143 9.354143 9.354143

min 0.000000 1.000000 2.000000 3.000000 4.000000

25% 6.250000 7.250000 8.250000 9.250000 10.250000

50% 12.500000 13.500000 14.500000 15.500000 16.500000

75% 18.750000 19.750000 20.750000 21.750000 22.750000

max 25.000000 26.000000 27.000000 28.000000 29.000000
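The percentiles parameter replaces the default quartiles with values of your choice. A minimal sketch, assuming the 10th and 90th percentiles are of interest:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(np.arange(30).reshape(6, 5),
                   index=list('abcdef'), columns=list('ABCDE'))

# Request the 10th and 90th percentiles instead of the default quartiles
stats = df2.describe(percentiles=[0.1, 0.9])
print(stats)
print(stats.loc['10%', 'A'])  # 2.5, interpolated between 0 and 5
```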

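5. Operating on DataFrame Data

Dropping unwanted rows or columns with drop

DataFrame.drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')

Removes rows or columns, either by giving label names together with the axis or by naming the rows or columns directly. labels holds the label names; axis selects rows or columns (0 for rows, 1 for columns); index names the rows to drop directly; columns names the columns to drop directly. With the default inplace=False, the operation leaves the original data unchanged and returns a new DataFrame with the rows or columns removed.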

data3 = np.arange(30).reshape(6,5)

df3 = pd.DataFrame(data3,index=['a','b','c','d','e', 'f'] , columns = ['A', 'B', 'C', 'D', 'E'])

a = df3.drop(['a'], axis=0) # axis=0 drops the given rows
b = df3.drop(['A'], axis=1) # axis=1 drops the given columns
print('-------original data df-----')
print(df3)
print('-------row dropped---------')
print(a)
print('-------column dropped---------')
print(b)

Output:

-------original data df-----
    A   B   C   D   E
a   0   1   2   3   4
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24
f  25  26  27  28  29

-------row dropped---------
    A   B   C   D   E
b   5   6   7   8   9
c  10  11  12  13  14
d  15  16  17  18  19
e  20  21  22  23  24
f  25  26  27  28  29

-------column dropped---------

B C D E

a 1 2 3 4

b 6 7 8 9

c 11 12 13 14

d 16 17 18 19

e 21 22 23 24

f 26 27 28 29
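The index, columns, errors, and inplace parameters described above can be sketched as follows. The labels being dropped, including the nonexistent column 'Z', are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.arange(30).reshape(6, 5),
                   index=list('abcdef'), columns=list('ABCDE'))

# index= / columns= name the axis directly, so axis= is not needed
trimmed = df3.drop(index=['a', 'b'], columns=['E'])
print(trimmed.shape)  # (4, 4)

# errors='ignore' skips labels that do not exist instead of raising KeyError
same = df3.drop(columns=['Z'], errors='ignore')
print(same.shape)     # (6, 5)

# inplace=True modifies df3 itself and returns None
df3.drop(index=['f'], inplace=True)
print(df3.index.tolist())  # ['a', 'b', 'c', 'd', 'e']
```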

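Combining two DataFrames row-wise (append / pd.concat)

DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=None)

Appends the rows of other to the end of the caller and returns a new object. other is the data to append. Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0; pd.concat gives the same result and is used here.

c = pd.concat([b, a], sort=True) # same result as the older b.append(a, sort=True)
print(b)
print('------new data after combining------')
print(c)

Output:

    B   C   D   E
a   1   2   3   4
b   6   7   8   9
c  11  12  13  14
d  16  17  18  19
e  21  22  23  24
f  26  27  28  29

------new data after combining------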

A B C D E

a NaN 1 2 3 4

b NaN 6 7 8 9

c NaN 11 12 13 14

d NaN 16 17 18 19

e NaN 21 22 23 24

f NaN 26 27 28 29

b 5.0 6 7 8 9

c 10.0 11 12 13 14

d 15.0 16 17 18 19

e 20.0 21 22 23 24

f 25.0 26 27 28 29
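As a minimal sketch of combining frames with pd.concat (the supported route in pandas 2.0 and later, where DataFrame.append is no longer available), the example below rebuilds a and b and shows two options: join='inner' and ignore_index=True. The name c2 is chosen here for illustration.

```python
import numpy as np
import pandas as pd

df3 = pd.DataFrame(np.arange(30).reshape(6, 5),
                   index=list('abcdef'), columns=list('ABCDE'))
a = df3.drop(['a'], axis=0)   # row 'a' removed
b = df3.drop(['A'], axis=1)   # column 'A' removed

# join='inner' keeps only the columns both frames share, so no NaN appears;
# ignore_index=True replaces the row labels with 0..n-1
c2 = pd.concat([b, a], join='inner', ignore_index=True)
print(c2.columns.tolist())  # ['B', 'C', 'D', 'E']
print(c2.shape)             # (11, 4)
```

With the default join='outer', the missing column A is kept and filled with NaN for the rows that came from b, as in the output above.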

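Restoring the index with reset_index so that the index becomes a column of the data

DataFrame.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')

Resets the index. level selects which index levels to remove; with the default None, all levels are removed. drop controls what happens to the old index: with drop=False (the default) it is inserted into the data as a column, and with drop=True it is discarded.

b.reset_index(inplace=True) # inplace=True modifies the original data; False returns new data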
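The original shows no output for reset_index. The effect of the drop parameter can be sketched on a small frame; the data and names here are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(6).reshape(3, 2),
                  index=['a', 'b', 'c'], columns=['B', 'C'])

kept = df.reset_index()              # old index becomes a column named 'index'
dropped = df.reset_index(drop=True)  # old index is discarded

print(kept.columns.tolist())     # ['index', 'B', 'C']
print(dropped.columns.tolist())  # ['B', 'C']
print(dropped.index.tolist())    # [0, 1, 2]
```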

6. Handling Missing Data in Pandas

Pandas represents missing data as NaN by default.

df6 = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
# reindex sets a new index; labels that did not exist before get NaN as their data
df6 = df6.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df6)

Output:

one two three

a -1.301964 -0.073759 -0.008598

b NaN NaN NaN

c -0.410554 -0.226690 0.508727

d NaN NaN NaN

e -0.420417 0.264073 -0.232839

f 0.273617 1.037572 0.623791

g NaN NaN NaN

h -0.136863 -0.964579 -0.173560

Checking for missing values

df7 = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
# reindex conforms the data to the new index; labels with no existing data are filled with NaN
df7 = df7.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df7['one'].isnull()) # isnull flags each missing value

Output:

a False

b True

c False

d True

e False

f False

g True

h False

Name: one, dtype: bool
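isnull also works on a whole DataFrame, which makes it easy to count missing values. A short sketch on a small hand-made frame (the data is hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, np.nan, 6.0]})

print(df.isnull().sum())             # NaN count per column: one 1, two 2
print(df.isnull().any(axis=1))       # True for rows containing at least one NaN
print(df['one'].notnull().tolist())  # [True, False, True]
```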

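Computing with missing data

When summing, NaN values are treated as 0. If every value is NaN, the sum is 0 by default (pandas 0.22 and later); pass min_count=1 to get NaN instead.

df8 = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df8 = df8.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df8)
print(df8['one'].sum())

Output: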

one two three

a 2.027398 0.056812 0.472538

b NaN NaN NaN

c 1.569014 -0.449907 2.184448

d NaN NaN NaN

e 0.178915 -0.259276 -0.350310

f -0.880906 0.132635 0.255886

g NaN NaN NaN

h -0.313819 0.464704 1.117843

2.580602384482807
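How sum treats NaN can be checked on two small Series. This is a sketch assuming pandas 0.22 or later, where an all-NaN sum is 0 unless min_count says otherwise; the values are made up for illustration.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 2.0])
all_nan = pd.Series([np.nan, np.nan])

print(s.sum())                   # 3.0, NaN is skipped
print(s.sum(skipna=False))       # nan, NaN propagates
print(all_nan.sum())             # 0.0 with the default min_count=0
print(all_nan.sum(min_count=1))  # nan, at least one valid value is required
```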

Replacing NaN with a scalar

DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Fills NA/NaN values using the specified value or method. value is the fill value; method is the fill method ('backfill', 'bfill', 'pad', 'ffill', or None).

df9 = pd.DataFrame(np.random.randn(3, 3), index=['a', 'c', 'e'], columns=['one', 'two', 'three'])
df9 = df9.reindex(['a', 'b', 'c'])
print(df9)
print("NaN replaced with '0':")
print(df9.fillna(0)) # fillna fills NaN with the given value

Output:

one two three

a -2.171272 -1.143815 -2.022801

b NaN NaN NaN

c 1.041079 0.523148 0.079901

NaN replaced with '0':

one two three

a -2.171272 -1.143815 -2.022801

b 0.000000 0.000000 0.000000

c 1.041079 0.523148 0.079901
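Beyond a single scalar, fillna accepts a dict of per-column values, and the ffill and bfill methods cover the forward and backward fill strategies named above. A minimal sketch on hand-made data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, 3.0],
                   'two': [np.nan, 5.0, np.nan]})

# A dict gives each column its own replacement value
per_column = df.fillna({'one': 0.0, 'two': -1.0})
print(per_column['two'].tolist())  # [-1.0, 5.0, -1.0]

# Forward fill: copy the last valid value downward
print(df.ffill()['one'].tolist())  # [1.0, 1.0, 3.0]

# Backward fill: copy the next valid value upward
print(df.bfill()['two'].tolist())  # [5.0, 5.0, nan]
```

A trailing NaN stays NaN under bfill because there is no later value to copy from.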

Dropping data that contains NaN

DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Removes missing values. how sets the rule (any: drop rows or columns that contain any NA value; all: drop rows or columns whose values are all NA).

df10 = pd.DataFrame(np.random.randn(5, 3), index=['a', 'c', 'e', 'f', 'h'], columns=['one', 'two', 'three'])
df10 = df10.reindex(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])
print(df10)
print("---"*10)
print(df10.dropna()) # the default how='any' drops every row containing a NaN; how='all' drops only rows whose values are all NaN

Output:

one two three

a 0.314272 0.359712 -0.384114

b NaN NaN NaN

c 1.105737 1.290604 2.356330

d NaN NaN NaN

e 0.995182 0.150164 1.460220

f 1.691751 2.155625 0.143025

g NaN NaN NaN

h 1.859988 -1.575294 0.063176

------------------------------

one two three

a 0.314272 0.359712 -0.384114

c 1.105737 1.290604 2.356330

e 0.995182 0.150164 1.460220

f 1.691751 2.155625 0.143025

h 1.859988 -1.575294 0.063176
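In the example above every incomplete row is entirely NaN, so how='any' and how='all' would give the same result. The difference shows up on a frame with a partially missing row, sketched here with made-up data, together with the thresh and subset parameters:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [1.0, np.nan, np.nan],
                   'two': [2.0, 5.0, np.nan],
                   'three': [3.0, 6.0, np.nan]},
                  index=['a', 'b', 'c'])

print(df.dropna().index.tolist())                # ['a']: any NaN removes the row
print(df.dropna(how='all').index.tolist())       # ['a', 'b']: only all-NaN rows are removed
print(df.dropna(thresh=2).index.tolist())        # ['a', 'b']: keep rows with at least 2 valid values
print(df.dropna(subset=['two']).index.tolist())  # ['a', 'b']: only column 'two' is checked
```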

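III. Summary

This article gave a systematic introduction to:

1. Operations in NumPy, the third-party Python library for scientific computing, including importing numpy, numpy data types, and a few simple numpy functions.

2. Basic commands in the third-party Python library Pandas, which provides many data-processing and time-series methods along with high-performance, easy-to-use data structures and analysis tools. The experiments covered common operations on the two data structures DataFrame and Series, as well as the handling of missing values, laying a foundation for later machine learning work.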
