Python 数据处理实例练习1:Pandas与数据可视化入门

数据收集、清洗、整理与数据可视化是Python数据处理的第一步,本练习通过一个实际的数据集(加拿大移民人口数据),对Pandas,Matplotlib库进行基本讲解。主要的数据可视化在python下依赖matplotlib和pandas,如果已经安装了Anaconda发行版,则这两个库默认都已经安装,如果只是安装Jupyter Notebook,则可以直接通过命令行命令进行安装。
!pip install pandas
!pip install matplotlib
1. 数据集引入
示例数据集来自加拿大官方移民数据,数据年限仅截止到2013年。首先,需要导入numpy和pandas两个库,对数据进行基本分析。因为数据是excel格式的(这是最广泛的数据格式之一,还需要安装xlrd库),在Anaconda和标准Python发行版下通过下列两个命令可以分别实现安装。
!pip install xlrd
!conda install -c anaconda xlrd --yes
需要注意的一点是,在Jupyter Notebook中安装包之后,内核可能不能马上导入使用,可以点击Kernel菜单下的Restart选项重新启动Kernel就可以恢复正常。
import numpy as np # useful for many scientific computing in Python
import pandas as pd # primary data structure library
读取示例数据并显示头/尾。
df_can = pd.read_excel('https://ibm.box.com/shared/static/lw190pt9zpy5bd1ptyg2aw15awomz9pu.xlsx',
sheet_name='Canada by Citizenship',
skiprows=range(20),
skipfooter=2)
print ('Data read into a pandas dataframe!')
df_can.head()
# tip: 如果需要显示更多行可以指定数据,比如 df_can.head(10)
5 rows × 43 columns
df_can.tail()
5 rows × 43 columns
其他基本的查询指令,可以参考pandas的API文档。
df_can.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 43 columns):
Type 195 non-null object
Coverage 195 non-null object
OdName 195 non-null object
AREA 195 non-null int64
AreaName 195 non-null object
REG 195 non-null int64
RegName 195 non-null object
DEV 195 non-null int64
DevName 195 non-null object
1980 195 non-null int64
1981 195 non-null int64
1982 195 non-null int64
1983 195 non-null int64
1984 195 non-null int64
1985 195 non-null int64
1986 195 non-null int64
1987 195 non-null int64
1988 195 non-null int64
1989 195 non-null int64
1990 195 non-null int64
1991 195 non-null int64
1992 195 non-null int64
1993 195 non-null int64
1994 195 non-null int64
1995 195 non-null int64
1996 195 non-null int64
1997 195 non-null int64
1998 195 non-null int64
1999 195 non-null int64
2000 195 non-null int64
2001 195 non-null int64
2002 195 non-null int64
2003 195 non-null int64
2004 195 non-null int64
2005 195 non-null int64
2006 195 non-null int64
2007 195 non-null int64
2008 195 non-null int64
2009 195 non-null int64
2010 195 non-null int64
2011 195 non-null int64
2012 195 non-null int64
2013 195 non-null int64
dtypes: int64(37), object(6)
memory usage: 61.0+ KB
df_can.columns.values
array(['Type', 'Coverage', 'OdName', 'AREA', 'AreaName', 'REG', 'RegName',
'DEV', 'DevName', 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
2010, 2011, 2012, 2013], dtype=object)
df_can.index.values
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38,
39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51,
52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64,
65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77,
78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90,
91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103,
104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194],
dtype=int64)
print(type(df_can.columns))
print(type(df_can.index))
<class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.indexes.range.RangeIndex'>
df_can.columns.tolist()
df_can.index.tolist()
print (type(df_can.columns.tolist()))
print (type(df_can.index.tolist()))
<class 'list'>
<class 'list'>
# size of dataframe (rows, columns)
df_can.shape
(195, 43)
数据清洗与整理
对数据集需要做一些基本的清洗与整理,下列几个步骤分别去掉不需要的列,对部分列重新命名使得更具有可读性,并增加了一个汇总列。
# in pandas axis=0 represents rows (default) and axis=1 represents columns.
df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)
df_can.head(2)
2 rows × 38 columns
df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)
df_can.columns
Index([ 'Country', 'Continent', 'Region', 'DevName', 1980,
1981, 1982, 1983, 1984, 1985,
1986, 1987, 1988, 1989, 1990,
1991, 1992, 1993, 1994, 1995,
1996, 1997, 1998, 1999, 2000,
2001, 2002, 2003, 2004, 2005,
2006, 2007, 2008, 2009, 2010,
2011, 2012, 2013],
dtype='object')
df_can['Total'] = df_can.sum(axis=1)
df_can.isnull().sum()
2. Pandas中级功能,索引与选择
先看看当下的数据集信息,然后通过练习熟悉各种索引的使用方式。
df_can.describe()
8 rows × 35 columns
df_can.Country # 查找并返回所有国家列表
df_can[['Country', 1980, 1981, 1982, 1983, 1984, 1985]] # 返回特定年份的值
# 需要注意,国家名称是字符串类型而年份是整型
# 为了保证格式统一,可以将所有名称均改为整型
195 rows × 7 columns
df_can.set_index('Country', inplace=True)
# 将国家设置为索引项,与之相反的操作是 df_can.reset_index()
df_can.head(3)
3 rows × 38 columns
# 也可以去掉索引项的名称
df_can.index.name = None
# 1. 显示日本籍移民(所有列)
print(df_can.loc['Japan'])
# 其他实现方式
print(df_can.iloc[87])
print(df_can[df_can.index == 'Japan'].T.squeeze())
# 2. 2013年的数据
print(df_can.loc['Japan', 2013])
# 其他实现方式
print(df_can.iloc[87, 36]) # 2013年是最后一列,总共36列
982
982
# 3. 1980到1985年间的数据
print(df_can.loc['Japan', [1980, 1981, 1982, 1983, 1984, 1984]])
print(df_can.iloc[87, [3, 4, 5, 6, 7, 8]])
1980 701
1981 756
1982 598
1983 309
1984 246
1984 246
Name: Japan, dtype: object
1980 701
1981 756
1982 598
1983 309
1984 246
1985 198
Name: Japan, dtype: object
df_can.columns = list(map(str, df_can.columns))
# [print (type(x)) for x in df_can.columns.values] #<-- 检查是否正确,去掉前面的注释就可以运行
# 将要显示的年份转换为years列表,在后面显示图形时比较实用
years = list(map(str, range(1980, 2014)))
years
# 1. 通过判断创建布尔量序列
condition = df_can['Continent'] == 'Asia'
print (condition)
# 2. 将布尔序列传给数据集
df_can[condition]
49 rows × 38 columns
# 可以通过多个条件进行筛选
# l比如同时选择 AreaNAme = Asia 和RegName = Southern Asia
df_can[(df_can['Continent']=='Asia') & (df_can['Region']=='Southern Asia')]
#在使用逻辑操作符时, 需要用 '&' 和 '|' 取代 'and' 和 'or'
# 不同条件需要分别通过括号分开。
9 rows × 38 columns
print ('data dimensions:', df_can.shape)
print(df_can.columns)
df_can.head(2)
data dimensions: (195, 38)
Index(['Continent', 'Region', 'DevName', '1980', '1981', '1982', '1983',
'1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
'1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
'2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
'2011', '2012', '2013', 'Total'],
dtype='object')
2 rows × 38 columns
3. 使用Matplotlib用以实现数据可视化
# %是Jupyter Notebook的魔术命令,这里使用inline即在文件中显示内容
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0 版本要大于2.0.0
Matplotlib version: 3.0.2
print(plt.style.available)
mpl.style.use(['ggplot']) # optional: for ggplot-like style
['bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark-palette', 'seaborn-dark', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'seaborn', 'Solarize_Light2', 'tableau-colorblind10', '_classic_test']
haiti = df_can.loc['Haiti', years] # years参数见前节passing in years 1980 - 2013 to exclude the 'total' column
haiti.head()
1980 1666
1981 3692
1982 3498
1983 2860
1984 1418
Name: Haiti, dtype: object
haiti.plot()
haiti.index = haiti.index.map(int) # 将海地的索引项改为整数以显示年份
haiti.plot(kind='line')
plt.title('Immigration from Haiti')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')
plt.show() # 本行用以显示图形
haiti.plot(kind='line')
#可以在图形中添加标签
plt.title('Immigration from Haiti')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
# 也可以在指定位置插入数据
# syntax: plt.text(x, y, label)
plt.text(2000, 6000, '2010 Earthquake') # see note below
plt.show()

以下程序用以显示中国和印度籍移民图示
df_CI=df_can.loc[['China','India'],years]
df_CI.head()
1980198119821983198419851986198719881989…2004200520062007200820092010201120122013China5123668233081863152718161960264327584323…36619425843351827642300372962230391285023302434129India8880867081477338570442117150101891152210343…28235362103384828742282612945634235275093093333087
2 rows × 34 columns
df_CI.plot(kind='line')
<matplotlib.axes._subplots.AxesSubplot at 0x56c6d90>

上述的图形显然有问题,这主要是因为横轴纵轴错误,这是个常见的问题,需要通过transpose方法先修正。
df_CI = df_CI.transpose()
df_CI.head()
ChinaIndia198051238880198166828670198233088147198318637338198415275704
df_CI.index = df_CI.index.map(int)
df_CI.plot(kind='line')
plt.title('Immigration from China and India')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
# annotate the 2010 Earthquake.
# syntax: plt.text(x, y, label)
plt.show()

df_can.sort_values(by='Total', ascending=False, axis=0, inplace=True)
df_top5 = df_can.head(5)
df_top5 = df_top5[years].transpose()
print(df_top5)
India China United Kingdom of Great Britain and Northern Ireland \
1980 8880 5123 22045
1981 8670 6682 24796
1982 8147 3308 20620
1983 7338 1863 10015
1984 5704 1527 10170
1985 4211 1816 9564
1986 7150 1960 9470
1987 10189 2643 21337
1988 11522 2758 27359
1989 10343 4323 23795
1990 12041 8076 31668
1991 13734 14255 23380
1992 13673 10846 34123
1993 21496 9817 33720
1994 18620 13128 39231
1995 18489 14398 30145
1996 23859 19415 29322
1997 22268 20475 22965
1998 17241 21049 10367
1999 18974 30069 7045
2000 28572 35529 8840
2001 31223 36434 11728
2002 31889 31961 8046
2003 27155 36439 6797
2004 28235 36619 7533
2005 36210 42584 7258
2006 33848 33518 7140
2007 28742 27642 8216
2008 28261 30037 8979
2009 29456 29622 8876
2010 34235 30391 8724
2011 27509 28502 6204
2012 30933 33024 6195
2013 33087 34129 5827
Philippines Pakistan
1980 6051 978
1981 5921 972
1982 5249 1201
1983 4562 900
1984 3801 668
1985 3150 514
1986 4166 691
1987 7360 1072
1988 8639 1334
1989 11865 2261
1990 12509 2470
1991 12718 3079
1992 13670 4071
1993 20479 4777
1994 19532 4666
1995 15864 4994
1996 13692 9125
1997 11549 13073
1998 8735 9068
1999 9734 9979
2000 10763 15400
2001 13836 16708
2002 11707 15110
2003 12758 13205
2004 14004 13399
2005 18139 14314
2006 18400 13127
2007 19837 10124
2008 24887 8994
2009 28573 7217
2010 38617 6811
2011 36765 7468
2012 34315 11227
2013 29544 12603
df_top5.index = df_top5.index.map(int) #let's change the index values of df_top5 to type integer for plotting
df_top5.plot(kind='line', figsize=(14, 8)) # pass a tuple (x, y) size
plt.title('Immigration Trend of Top 5 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
plt.show()
