Pandas is a powerful tool for data-science work; you can think of it as a programmable Excel. It is an extension of NumPy, and the two together can handle the majority of everyday data problems.

I will introduce the most commonly used Pandas features and functions from the following angles:

- Series
- DataFrames
- Missing Data
- GroupBy
- Merging, Joining, and Concatenating
- Operations
- Data Input and Output

Notes:
1. I originally learned computing in English, so for some technical terms I am not sure of the most accurate translation; those terms are kept as-is in this article.
2. Pandas and R overlap in quite a few features, but as a programmer I prefer Python because it lets me combine data analysis with the rest of my programs. If I find time later I will write an R tutorial, plus a reference comparing R and Python feature by feature.
# Series

A Series is very similar to a NumPy array. The differences: a Series carries axis labels, which need not be numbers, and the Series can be indexed by those labels, whereas an array can only be indexed by integer position. In addition, the data in a Series can be objects of any type, while a NumPy array is typically homogeneous (usually numeric). Seen this way, a Series behaves a bit like a hash table.
Let's look at a few examples first:

```python
import numpy as np
import pandas as pd
```

## Creating a Series

You can convert a list, NumPy array, or dictionary into a Series:

```python
labels = ['a','b','c']        # list
my_list = [10,20,30]          # list
arr = np.array([10,20,30])    # numpy array
d = {'a':10,'b':20,'c':30}    # dictionary
```
### Using lists

```python
pd.Series(data=my_list)
```

```
0    10
1    20
2    30
dtype: int64
```

```python
pd.Series(data=my_list,index=labels) # the two main Series parameters: data first, index second
```

```
a    10
b    20
c    30
dtype: int64
```

```python
pd.Series(my_list,labels) # in serious projects, prefer stating explicitly which value goes to which parameter
```

Later on I will introduce more data-science tools; some of them overlap with NumPy or pandas but are more specialized, and their output is more useful for business. So in projects with loose requirements on code style and reporting you can be casual, but in rigorous business projects it is best to enforce formatting standards.

```
a    10
b    20
c    30
dtype: int64
```
### Using NumPy arrays

```python
pd.Series(arr) # axis labels default to the integers 0..n-1
```

```
0    10
1    20
2    30
dtype: int64
```

```python
pd.Series(arr,labels)
```

```
a    10
b    20
c    30
dtype: int64
```

### Using a dictionary

```python
pd.Series(d) # keys become the labels, values become the data
```

```
a    10
b    20
c    30
dtype: int64
```
## Data types in a Series

A pandas Series can hold many kinds of data:

```python
pd.Series(data=labels)
```

```
0    a
1    b
2    c
dtype: object
```

```python
# even functions
pd.Series([sum,print,len])
```

```
0      <built-in function sum>
1    <built-in function print>
2      <built-in function len>
dtype: object
```

(Under Python 2 you must put from __future__ import print_function at the top of the file, otherwise print is a statement rather than a function object. The full list of built-in functions is in the Python documentation.)
## Using the index

The key to using a Series is using its index. A Series has two key parameters, data and index, so the index is every bit as important as the data; knowing how to use it makes lookups faster.

The next few examples practice using the index in a Series. First we create two Series, ser1 and ser2:

```python
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
ser1
```

```
USA        1
Germany    2
USSR       3
Japan      4
dtype: int64
```

```python
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
ser2
```

```
USA        1
Germany    2
Italy      5
Japan      4
dtype: int64
```
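Looking up a value by its label then works just like dictionary access; a quick sketch using the ser1 defined above:

```python
ser1['USA']   # -> 1
```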
Arithmetic is also aligned on the index:

```python
ser1 + ser2
```

```
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
```

Note that as soon as you do arithmetic the result is promoted to float, and it comes back sorted by index.
A quick aside: to add individual entries to a Series you can use set_value, which takes the index label followed by the value. (The append and add methods can only combine with another Series.)

```python
ser2.set_value('China',6)
ser2.set_value('Canada',7)
```

```
USA        1
Germany    2
Italy      5
Japan      4
China      6
Canada     7
dtype: int64
```
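Note that set_value was deprecated in later pandas releases; plain label assignment does the same thing and is what you will see in modern code:

```python
ser2['Canada'] = 7   # equivalent to ser2.set_value('Canada', 7)
```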
```python
ser1 + ser2
```

```
Canada     NaN
China      NaN
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
```

Only index labels shared by both Series produce a value; all other labels come back as NaN.
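If those NaNs are unwanted, the add method accepts a fill_value that stands in for labels missing from one side (this is standard Series API):

```python
ser1.add(ser2, fill_value=0)   # USSR -> 3.0 and Canada -> 7.0 instead of NaN
```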
# DataFrames

The importance of DataFrames goes without saying; they were inspired by R's data frames. A DataFrame is simply a collection of Series that share the same index. When I was learning R I found it hard to see the point of a DataFrame; the simple way to think of one is as a sheet in Excel, or a table in a database.

Let's look at a few examples:

```python
import pandas as pd
import numpy as np
from numpy.random import randn
np.random.seed(101) # fixing the seed makes the pseudo-random sequence reproducible: the generator derives each value from its internal state, so with the same seed you always get the same "random" numbers and these examples can be reproduced exactly
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df
```
| | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
## Selection and Indexing

Selecting data from a DataFrame:

```python
df['W']
```

```
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
```

```python
# Pass a list of column names -- note the double brackets
df[['W','Z']]
```
| | W | Z |
| --- | --- | --- |
| A | 2.706850 | 0.503826 |
| B | 0.651118 | 0.605965 |
| C | -2.018168 | -0.589001 |
| D | 0.188695 | 0.955057 |
| E | 0.190794 | 0.683509 |
SQL-style attribute syntax also works (not recommended):

```python
df.W
```

```
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
```

DataFrame columns are just Series:

```python
type(df['W'])
```

```
pandas.core.series.Series
```
### Creating a new column

```python
df['new'] = df['W'] + df['Y']
df
```
| | W | X | Y | Z | new |
| --- | --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
### Removing columns

```python
df.drop('new',axis=1) # axis=1 refers to columns, axis=0 to rows; a DataFrame has only these two axes
```
| | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Note: the inplace parameter defaults to False, i.e. the original data is not replaced. The drop above did not set inplace, so df is unchanged:

```python
df
```
| | W | X | Y | Z | new |
| --- | --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
Now let's set the inplace parameter:

```python
df.drop('new',axis=1,inplace=True)
df
```
| | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Rows can be dropped the same way:

```python
df.drop('E',axis=0)
```
| | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
### Selecting rows

```python
df.loc['A'] # selecting rows requires the loc or iloc accessor
```

```
W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64
```
Or select by integer position instead of label:

```python
df.iloc[2]
```

```
W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
```
### Selecting subsets of rows and columns

Careful: df['B','Y'] raises an error, and so does df['Y','B']. You have to go through .loc, e.g. df.loc['B','Y'] -- see the sketch after the table below.

```python
df.loc[['A','B'],['W','Y']]
```
| | W | Y |
| --- | --- | --- |
| A | 2.706850 | 0.907969 |
| B | 0.651118 | -0.848077 |
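As promised above: to pull out a single scalar, pass one row label and one column label to .loc (the value below is taken from the df shown earlier):

```python
df.loc['B','Y']   # -> -0.848077
```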
Be careful when passing a list that mixes row labels with column labels:

```python
df.loc[['B','X','A','Y']]
```

Here B and A are row labels but X and Y are column labels. Older pandas returned NaN rows for the labels missing from the index; newer versions raise a KeyError. Either way, this form does not do what you might hope, so avoid it.
## Conditional Selection

This is a very important feature, closely mirroring boolean masking in NumPy:

```python
df
```

| | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
```python
df > 0
```

| | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| A | True | True | True | True |
| B | True | False | False | True |
| C | False | True | True | False |
| D | True | False | False | True |
| E | True | True | True | True |
```python
df[df>0]
```

| | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | NaN | NaN | 0.605965 |
| C | NaN | 0.740122 | 0.528813 | NaN |
| D | 0.188695 | NaN | NaN | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
```python
# Note: in the case above, cells where the mask is False are filled with NaN
# (later, in machine learning, we can replace NaN with other values); in the
# next case, rows where the condition is False are simply not shown
df[df['W']>0]
```
| | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
```python
df[df['W']>0]['Y']
```

```
A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64
```
```python
df[df['W']>0][['Y','X']]
# Note! These are two separate bracket pairs: the first sets the condition,
# the second selects which columns to return
```
| | Y | X |
| --- | --- | --- |
| A | 0.907969 | 0.628133 |
| B | -0.848077 | -0.319318 |
| D | -0.933237 | -0.758872 |
| E | 2.605967 | 1.978757 |
```python
df[(df['W']>0) & (df['Y'] > 1)]
```

| | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
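For an "or" condition use the | operator with parentheses in the same way; Python's plain and/or keywords do not work element-wise on Series:

```python
df[(df['W']>0) | (df['Y'] > 1)]   # rows where either condition holds
```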
## More index details

Let's look at a few more ways to work with the index:

```python
df
```

| | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
```python
# reset the index to 0,1,...,n
df.reset_index()
# reset_index also takes an inplace parameter, and a drop parameter that
# discards the old index instead of keeping it as a column
```
| | index | W | X | Y | Z |
| --- | --- | --- | --- | --- | --- |
| 0 | A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| 1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| 2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| 3 | D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| 4 | E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
```python
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df # a new column has been added
```
| | W | X | Y | Z | States |
| --- | --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
```python
df.set_index('States')
```

| States | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
Because inplace was not set, df itself is untouched:

```python
df
```

| | W | X | Y | Z | States |
| --- | --- | --- | --- | --- | --- |
| A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
| B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
| C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
| D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
| E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
```python
df.set_index('States',inplace=True)
# as before, inplace=True modifies the original data
df
```
| States | W | X | Y | Z |
| --- | --- | --- | --- | --- |
| CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
| NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
| WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
| OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
| CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
## MultiIndex and index hierarchy

To build a compound (multi-level) index, first turn the labels into a list of tuples:

```python
# Index Levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index
```

```
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
```
```python
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df.index.names = ['Group','Num'] # name the two index levels
df
```

| Group | Num | A | B |
| --- | --- | --- | --- |
| G1 | 1 | 0.153661 | 0.167638 |
| | 2 | -0.765930 | 0.962299 |
| | 3 | 0.902826 | -0.537909 |
| G2 | 1 | -1.549671 | 0.435253 |
| | 2 | 1.259904 | -0.447898 |
| | 3 | 0.266207 | 0.412580 |
### Cross-sections with xs

The xs function returns a cross-section of the DataFrame:

```python
df.xs('G1')
```

| Num | A | B |
| --- | --- | --- |
| 1 | 0.153661 | 0.167638 |
| 2 | -0.765930 | 0.962299 |
| 3 | 0.902826 | -0.537909 |
```python
df.xs(('G1',1))
```

```
A    0.153661
B    0.167638
Name: (G1, 1), dtype: float64
```
```python
df.xs(1,level='Num')
```

| Group | A | B |
| --- | --- | --- |
| G1 | 0.153661 | 0.167638 |
| G2 | -1.549671 | 0.435253 |
# Missing Data

Pandas has several methods for handling missing data:

```python
import numpy as np
import pandas as pd
df = pd.DataFrame({'A':[1,2,np.nan],
                   'B':[5,np.nan,np.nan],
                   'C':[1,2,3]})
df
```
| | A | B | C |
| --- | --- | --- | --- |
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
| 2 | NaN | NaN | 3 |
```python
df.dropna() # drops every row that contains a NaN
```

| | A | B | C |
| --- | --- | --- | --- |
| 0 | 1.0 | 5.0 | 1 |

```python
df.dropna(axis=1) # drops every column that contains a NaN
df.dropna(thresh=2) # very handy: drops rows with fewer than thresh non-NaN values
# there is also a how parameter that chooses between dropping rows that are
# entirely NaN and rows containing any NaN -- see the sketch below
```

| | A | B | C |
| --- | --- | --- | --- |
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | NaN | 2 |
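As promised, a minimal sketch of the how parameter (standard dropna API):

```python
df.dropna(how='all')   # drop only rows in which every value is NaN
df.dropna(how='any')   # drop rows containing any NaN (the default)
```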
```python
df.fillna(value='FILL VALUE') # replaces every NaN with the given value
```

| | A | B | C |
| --- | --- | --- | --- |
| 0 | 1 | 5 | 1 |
| 1 | 2 | FILL VALUE | 2 |
| 2 | FILL VALUE | FILL VALUE | 3 |
```python
df['A'].fillna(value=df['A'].mean())
```

```
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
```
```python
df.fillna(value=df['B'].mean())
```

| | A | B | C |
| --- | --- | --- | --- |
| 0 | 1.0 | 5.0 | 1 |
| 1 | 2.0 | 5.0 | 2 |
| 2 | 5.0 | 5.0 | 3 |
So remember: dropna() and fillna() between them cover most NaN handling.
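Besides a constant, fillna can also propagate the previous valid value forward down each column; a quick sketch (recent pandas spells the same thing df.ffill()):

```python
df.fillna(method='ffill')   # forward-fill: each NaN takes the last valid value above it
```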
# Groupby

Groupby lets you group rows and call aggregate functions on each group -- the same idea as GROUP BY in a database.

```python
import pandas as pd
# Create dataframe
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
        'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
        'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
df
```
| | Company | Person | Sales |
| --- | --- | --- | --- |
| 0 | GOOG | Sam | 200 |
| 1 | GOOG | Charlie | 120 |
| 2 | MSFT | Amy | 340 |
| 3 | MSFT | Vanessa | 124 |
| 4 | FB | Carl | 243 |
| 5 | FB | Sarah | 350 |
Now we use the .groupby() method to group rows by company. This returns a DataFrameGroupBy object:

```python
df.groupby('Company')
```

```
<pandas.core.groupby.DataFrameGroupBy object at 0x113014128>
```
We can save the grouped object as a new variable:

```python
by_comp = df.groupby("Company")
```

And then call aggregate methods on it:

```python
by_comp.mean()
```
| | Sales |
| --- | --- |
| Company | |
| FB | 296.5 |
| GOOG | 160.0 |
| MSFT | 232.0 |
Which is equivalent to:

```python
df.groupby('Company').mean()
```
| | Sales |
| --- | --- |
| Company | |
| FB | 296.5 |
| GOOG | 160.0 |
| MSFT | 232.0 |
Standard deviation:

```python
by_comp.std()
```
| | Sales |
| --- | --- |
| Company | |
| FB | 75.660426 |
| GOOG | 56.568542 |
| MSFT | 152.735065 |
Minimum:

```python
by_comp.min()
```

| | Person | Sales |
| --- | --- | --- |
| Company | | |
| FB | Carl | 243 |
| GOOG | Charlie | 120 |
| MSFT | Amy | 124 |
Maximum:

```python
by_comp.max()
```

| | Person | Sales |
| --- | --- | --- |
| Company | | |
| FB | Sarah | 350 |
| GOOG | Sam | 200 |
| MSFT | Vanessa | 340 |
Count:

```python
by_comp.count()
```

| | Person | Sales |
| --- | --- | --- |
| Company | | |
| FB | 2 | 2 |
| GOOG | 2 | 2 |
| MSFT | 2 | 2 |
Descriptive statistics (much like the summary statistics you would compute in a database):

```python
by_comp.describe()
```

| | | Sales |
| --- | --- | --- |
| Company | | |
| FB | count | 2.000000 |
| FB | mean | 296.500000 |
| FB | std | 75.660426 |
| FB | min | 243.000000 |
| FB | 25% | 269.750000 |
| FB | 50% | 296.500000 |
| FB | 75% | 323.250000 |
| FB | max | 350.000000 |
| GOOG | count | 2.000000 |
| GOOG | mean | 160.000000 |
| GOOG | std | 56.568542 |
| GOOG | min | 120.000000 |
| GOOG | 25% | 140.000000 |
| GOOG | 50% | 160.000000 |
| GOOG | 75% | 180.000000 |
| GOOG | max | 200.000000 |
| MSFT | count | 2.000000 |
| MSFT | mean | 232.000000 |
| MSFT | std | 152.735065 |
| MSFT | min | 124.000000 |
| MSFT | 25% | 178.000000 |
| MSFT | 50% | 232.000000 |
| MSFT | 75% | 286.000000 |
| MSFT | max | 340.000000 |
Note: because of markdown table limitations, I have written the company name on every row; in a notebook each name is displayed only once per group.
```python
by_comp.describe().transpose() # transpose the summary
```

| Company | FB | FB | FB | FB | FB | FB | FB | FB | GOOG | GOOG | … | GOOG | GOOG | MSFT | MSFT | MSFT | MSFT | MSFT | MSFT | MSFT | MSFT |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | count | mean | std | min | 25% | 50% | 75% | max | count | mean | … | 75% | max | count | mean | std | min | 25% | 50% | 75% | max |
| Sales | 2.0 | 296.5 | 75.660426 | 243.0 | 269.75 | 296.5 | 323.25 | 350.0 | 2.0 | 160.0 | … | 180.0 | 200.0 | 2.0 | 232.0 | 152.735065 | 124.0 | 178.0 | 232.0 | 286.0 | 340.0 |
```python
by_comp.describe().transpose()['GOOG']
```

You can also transpose a single company's statistics this way; I'll skip printing the output.
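If you want a handful of specific aggregates rather than describe's full battery, the groupby object's .agg method (standard pandas API) takes a list of function names:

```python
by_comp['Sales'].agg(['mean', 'std', 'min', 'max'])
```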
# Merging, Joining, and Concatenating

These are the three main ways of combining DataFrames. I won't go through every flavor of join in detail; instead here is a quick tour of how the three methods differ.

## Concatenation

Concatenation is the most basic way to glue DataFrames together. The frames should have matching dimensions along the axis you concatenate on, and the axis parameter sets the direction; it defaults to 0 (stacking rows).

First let's build three DataFrames:
```python
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])

df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                    'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'],
                    'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
```
```python
df1
```

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
| 3 | A3 | B3 | C3 | D3 |
```python
df2
```

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 4 | A4 | B4 | C4 | D4 |
| 5 | A5 | B5 | C5 | D5 |
| 6 | A6 | B6 | C6 | D6 |
| 7 | A7 | B7 | C7 | D7 |
```python
df3
```

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 8 | A8 | B8 | C8 | D8 |
| 9 | A9 | B9 | C9 | D9 |
| 10 | A10 | B10 | C10 | D10 |
| 11 | A11 | B11 | C11 | D11 |
```python
pd.concat([df1,df2,df3]) # stacked along the rows (axis=0, the default)
```

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |
| 2 | A2 | B2 | C2 | D2 |
| 3 | A3 | B3 | C3 | D3 |
| 4 | A4 | B4 | C4 | D4 |
| 5 | A5 | B5 | C5 | D5 |
| 6 | A6 | B6 | C6 | D6 |
| 7 | A7 | B7 | C7 | D7 |
| 8 | A8 | B8 | C8 | D8 |
| 9 | A9 | B9 | C9 | D9 |
| 10 | A10 | B10 | C10 | D10 |
| 11 | A11 | B11 | C11 | D11 |
```python
pd.concat([df1,df2,df3],axis=1) # joined along the columns
```
| | A | B | C | D | A | B | C | D | A | B | C | D |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 | NaN | NaN | NaN | NaN |
| 7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 | NaN | NaN | NaN | NaN |
| 8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A8 | B8 | C8 | D8 |
| 9 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A9 | B9 | C9 | D9 |
| 10 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A10 | B10 | C10 | D10 |
| 11 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | A11 | B11 | C11 | D11 |
## Merging

Merging works like a SQL join: frames are combined on key columns (the key columns may contain the same values in both frames, or different ones).

The difference from join is that merge matches on shared columns (which is what you usually have in real data) and treats the two tables symmetrically, whereas join matches on the row index, with one frame acting as the primary table and the other attached to it.
```python
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
```
```python
left
```

| | A | B | key |
| --- | --- | --- | --- |
| 0 | A0 | B0 | K0 |
| 1 | A1 | B1 | K1 |
| 2 | A2 | B2 | K2 |
| 3 | A3 | B3 | K3 |
```python
right
```

| | C | D | key |
| --- | --- | --- | --- |
| 0 | C0 | D0 | K0 |
| 1 | C1 | D1 | K1 |
| 2 | C2 | D2 | K2 |
| 3 | C3 | D3 | K3 |
```python
pd.merge(left,right,how='inner',on='key')
```
| | A | B | key | C | D |
| --- | --- | --- | --- | --- | --- |
| 0 | A0 | B0 | K0 | C0 | D0 |
| 1 | A1 | B1 | K1 | C1 | D1 |
| 2 | A2 | B2 | K2 | C2 | D2 |
| 3 | A3 | B3 | K3 | C3 | D3 |
Now let's use two tables whose key values differ:

```python
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

pd.merge(left, right, on=['key1', 'key2'])
```
| | A | B | key1 | key2 | C | D |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | A0 | B0 | K0 | K0 | C0 | D0 |
| 1 | A2 | B2 | K1 | K0 | C1 | D1 |
| 2 | A2 | B2 | K1 | K0 | C2 | D2 |
```python
pd.merge(right,left, on=['key1', 'key2']) # note what swapping left and right does to the result
```
| | C | D | key1 | key2 | A | B |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | C0 | D0 | K0 | K0 | A0 | B0 |
| 1 | C1 | D1 | K1 | K0 | A2 | B2 |
| 2 | C2 | D2 | K1 | K0 | A2 | B2 |
```python
pd.merge(left, right, how='outer', on=['key1', 'key2'])
```
| | A | B | key1 | key2 | C | D |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | A0 | B0 | K0 | K0 | C0 | D0 |
| 1 | A1 | B1 | K0 | K1 | NaN | NaN |
| 2 | A2 | B2 | K1 | K0 | C1 | D1 |
| 3 | A2 | B2 | K1 | K0 | C2 | D2 |
| 4 | A3 | B3 | K2 | K1 | NaN | NaN |
| 5 | NaN | NaN | K2 | K0 | C3 | D3 |
```python
pd.merge(left, right, how='right', on=['key1', 'key2'])
```
| | A | B | key1 | key2 | C | D |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | A0 | B0 | K0 | K0 | C0 | D0 |
| 1 | A2 | B2 | K1 | K0 | C1 | D1 |
| 2 | A2 | B2 | K1 | K0 | C2 | D2 |
| 3 | NaN | NaN | K2 | K0 | C3 | D3 |
```python
pd.merge(left, right, how='left', on=['key1', 'key2'])
```
| | A | B | key1 | key2 | C | D |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | A0 | B0 | K0 | K0 | C0 | D0 |
| 1 | A1 | B1 | K0 | K1 | NaN | NaN |
| 2 | A2 | B2 | K1 | K0 | C1 | D1 |
| 3 | A2 | B2 | K1 | K0 | C2 | D2 |
| 4 | A3 | B3 | K2 | K1 | NaN | NaN |
## Joining

Joining combines the columns of two potentially differently-indexed DataFrames into a single DataFrame, matching rows on the index rather than on a key column:

```python
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

left.join(right)
```
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| K0 | A0 | B0 | C0 | D0 |
| K1 | A1 | B1 | NaN | NaN |
| K2 | A2 | B2 | C2 | D2 |
```python
left.join(right, how='outer')
```
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| K0 | A0 | B0 | C0 | D0 |
| K1 | A1 | B1 | NaN | NaN |
| K2 | A2 | B2 | C2 | D2 |
| K3 | NaN | NaN | C3 | D3 |
# Operations

Ha, at last we get to operations! Many of these methods closely resemble their R counterparts.

```python
import numpy as np
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df.head()
```
| | col1 | col2 | col3 |
| --- | --- | --- | --- |
| 0 | 1 | 444 | abc |
| 1 | 2 | 555 | def |
| 2 | 3 | 666 | ghi |
| 3 | 4 | 444 | xyz |
## Unique values

```python
df['col2'].unique() # the distinct values, returned as an array
df['col2'].nunique() # the number of distinct values
df['col2'].value_counts() # a count per value
```

```
444    2
555    1
666    1
Name: col2, dtype: int64
```
## Selecting data

```python
# select data matching a compound condition
newdf = df[(df['col1']>2) & (df['col2']==444)]
newdf
```

| | col1 | col2 | col3 |
| --- | --- | --- | --- |
| 3 | 4 | 444 | xyz |
## Applying functions

```python
def times2(x):
    return x*2

df['col1'].apply(times2) # apply runs a custom function element-wise over a column
```

```
0    2
1    4
2    6
3    8
Name: col1, dtype: int64
```
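apply also accepts built-ins and lambdas, which is how you will usually see it written; a quick sketch equivalent to times2 above:

```python
df['col1'].apply(lambda x: x*2)
```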
```python
df['col3'].apply(len)
```

```
0    3
1    3
2    3
3    3
Name: col3, dtype: int64
```
Removing a column permanently:

```python
del df['col1']
df
```

| | col2 | col3 |
| --- | --- | --- |
| 0 | 444 | abc |
| 1 | 555 | def |
| 2 | 666 | ghi |
| 3 | 444 | xyz |
## Getting column and index names

```python
df.columns
```

```
Index(['col2', 'col3'], dtype='object')
```

```python
df.index
```

```
RangeIndex(start=0, stop=4, step=1)
```
## Sorting

```python
df
```

| | col2 | col3 |
| --- | --- | --- |
| 0 | 444 | abc |
| 1 | 555 | def |
| 2 | 666 | ghi |
| 3 | 444 | xyz |
```python
df.sort_values(by='col2') # inplace=False by default
```
| | col2 | col3 |
| --- | --- | --- |
| 0 | 444 | abc |
| 3 | 444 | xyz |
| 1 | 555 | def |
| 2 | 666 | ghi |
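To sort the other way, pass ascending=False (a standard sort_values parameter):

```python
df.sort_values(by='col2', ascending=False)
```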
## Checking for null values

```python
df.isnull()
```

| | col2 | col3 |
| --- | --- | --- |
| 0 | False | False |
| 1 | False | False |
| 2 | False | False |
| 3 | False | False |
```python
# Drop rows with NaN values
df.dropna()
```

| | col2 | col3 |
| --- | --- | --- |
| 0 | 444 | abc |
| 1 | 555 | def |
| 2 | 666 | ghi |
| 3 | 444 | xyz |
## Replacing NaN values

```python
df = pd.DataFrame({'col1':[1,2,3,np.nan],
                   'col2':[np.nan,555,666,444],
                   'col3':['abc','def','ghi','xyz']})
df.head()
```

| | col1 | col2 | col3 |
| --- | --- | --- | --- |
| 0 | 1.0 | NaN | abc |
| 1 | 2.0 | 555.0 | def |
| 2 | 3.0 | 666.0 | ghi |
| 3 | NaN | 444.0 | xyz |
```python
df.fillna('FILL')
```

| | col1 | col2 | col3 |
| --- | --- | --- | --- |
| 0 | 1 | FILL | abc |
| 1 | 2 | 555 | def |
| 2 | 3 | 666 | ghi |
| 3 | FILL | 444 | xyz |
## Pivot tables

Building a pivot table is pretty neat -- it summarizes one column against combinations of the others:

```python
data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df
```
| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 0 | foo | one | x | 1 |
| 1 | foo | one | y | 3 |
| 2 | foo | two | x | 2 |
| 3 | bar | two | y | 5 |
| 4 | bar | one | x | 4 |
| 5 | bar | one | y | 1 |
```python
df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
```
| A | B | x | y |
| --- | --- | --- | --- |
| bar | one | 4.0 | 1.0 |
| bar | two | NaN | 5.0 |
| foo | one | 1.0 | 3.0 |
| foo | two | 2.0 | NaN |

(The x and y columns come from the C level.)
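pivot_table aggregates duplicate cells with the mean by default; the aggfunc parameter (standard pivot_table API) swaps in a different reducer:

```python
df.pivot_table(values='D', index=['A','B'], columns=['C'], aggfunc='sum')
```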
# Data Input and Output

pandas can read all kinds of data sources through its pd.read_ methods!

```python
import numpy as np
import pandas as pd
```

## CSV

### CSV input

```python
df = pd.read_csv('example') # simplest if the file sits in the working directory; otherwise give its path
df
```
| | a | b | c | d |
| --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |
| 3 | 12 | 13 | 14 | 15 |
### CSV output

```python
df.to_csv('example',index=False) # index=False keeps the row index from being written out as an extra column
```
## Excel

Pandas reads only the data from a workbook; images or formulas in the file can make the read fail.

### Excel input

```python
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1') # note that you must name the sheet (very old pandas spelled this parameter sheetname)
```
| | a | b | c | d |
| --- | --- | --- | --- | --- |
| 0 | 0 | 1 | 2 | 3 |
| 1 | 4 | 5 | 6 | 7 |
| 2 | 8 | 9 | 10 | 11 |
| 3 | 12 | 13 | 14 | 15 |
### Excel output

```python
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
```
## HTML

You first need to install Python's HTML parsing libraries:

```
pip install lxml
pip install html5lib
pip install BeautifulSoup4 # a personal favorite -- handy when writing simple scrapers
```

### HTML input

pandas' read_html automatically reads the tables out of a web page and returns them as a list of DataFrames:

```python
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
df[0]
```

The output is too large to list here; try it yourself.
### HTML output

```python
df.to_html('banklist.html') # don't forget the file extension
```
## SQL

You generally would not have pandas read a production database directly, for security and decoupling reasons. But the pandas.io.sql module can read from databases, given a suitable driver/API. The commonly used functions are listed below, with a worked example after the list:

- read_sql_table(table_name, con[, schema, ...]) -- read a database table into a DataFrame
- read_sql_query(sql, con[, index_col, ...]) -- read the result of a query into a DataFrame
- read_sql(sql, con[, index_col, ...]) -- read either a table or a query into a DataFrame
- DataFrame.to_sql(name, con[, flavor, ...]) -- write the DataFrame's records into a database

Here is a simple example:
```python
from sqlalchemy import create_engine
engine = create_engine('sqlite:///:memory:') # sqlite ships with Python's standard library, so an in-memory database needs no setup
df.to_sql('data', engine)
sql_df = pd.read_sql('data',con=engine)
sql_df
```
| | index | a | b | c | d |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | 0 | 1 | 2 | 3 |
| 1 | 1 | 4 | 5 | 6 | 7 |
| 2 | 2 | 8 | 9 | 10 | 11 |
| 3 | 3 | 12 | 13 | 14 | 15 |
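read_sql can also take a full SQL statement, or you can call read_sql_query explicitly (standard pandas API; 'data' is the table created above):

```python
pd.read_sql_query('SELECT a, b FROM data WHERE a > 0', con=engine)
```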