快速注册

Python 数据处理实例练习1：Pandas与数据可视化入门

魏智勇 2019-01-12 11:49:53

数据收集、清洗、整理与数据可视化是Python数据处理的第一步，本练习通过一个实际的数据集（加拿大移民人口数据），对Pandas，Matplotlib库进行基本讲解。主要的数据可视化在python下依赖matplotlib和pandas，如果已经安装了Anaconda发行版，则这两个库默认都已经安装，如果只是安装Jupyter Notebook，则可以直接通过命令行命令进行安装。

!pip install pandas
!pip install matplotlib

1. 数据集引入

示例数据集来自加拿大官方移民数据，数据年限仅截止到2013年。首先，需要导入numpy和pandas两个库，对数据进行基本分析。因为数据是excel格式的（这是最广泛的数据格式之一，还需要安装xlrd库），在Anaconda和标准Python发行版下通过下列两个命令可以分别实现安装。

!pip install xlrd
!conda install -c anaconda xlrd --yes

需要注意的一点是，在Jupyter Notebook中安装包之后，内核可能不能马上导入使用，可以点击Kernel菜单下的Restart选项重新启动Kernel就可以恢复正常。

import numpy as np  # useful for many scientific computing in Python
import pandas as pd # primary data structure library

读取示例数据并显示头/尾。

df_can = pd.read_excel('https://ibm.box.com/shared/static/lw190pt9zpy5bd1ptyg2aw15awomz9pu.xlsx',
                       sheet_name='Canada by Citizenship',
                       skiprows=range(20),
                       skipfooter=2)

print ('Data read into a pandas dataframe!')

df_can.head()
# tip: 如果需要显示更多行可以指定数据，比如 df_can.head(10)

5 rows × 43 columns

df_can.tail()

5 rows × 43 columns

其他基本的查询指令，可以参考pandas的API文档。

df_can.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 195 entries, 0 to 194
Data columns (total 43 columns):
Type        195 non-null object
Coverage    195 non-null object
OdName      195 non-null object
AREA        195 non-null int64
AreaName    195 non-null object
REG         195 non-null int64
RegName     195 non-null object
DEV         195 non-null int64
DevName     195 non-null object
1980        195 non-null int64
1981        195 non-null int64
1982        195 non-null int64
1983        195 non-null int64
1984        195 non-null int64
1985        195 non-null int64
1986        195 non-null int64
1987        195 non-null int64
1988        195 non-null int64
1989        195 non-null int64
1990        195 non-null int64
1991        195 non-null int64
1992        195 non-null int64
1993        195 non-null int64
1994        195 non-null int64
1995        195 non-null int64
1996        195 non-null int64
1997        195 non-null int64
1998        195 non-null int64
1999        195 non-null int64
2000        195 non-null int64
2001        195 non-null int64
2002        195 non-null int64
2003        195 non-null int64
2004        195 non-null int64
2005        195 non-null int64
2006        195 non-null int64
2007        195 non-null int64
2008        195 non-null int64
2009        195 non-null int64
2010        195 non-null int64
2011        195 non-null int64
2012        195 non-null int64
2013        195 non-null int64
dtypes: int64(37), object(6)
memory usage: 61.0+ KB

df_can.columns.values

array(['Type', 'Coverage', 'OdName', 'AREA', 'AreaName', 'REG', 'RegName',
       'DEV', 'DevName', 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
       1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
       1999, 2000, 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009,
       2010, 2011, 2012, 2013], dtype=object)

df_can.index.values

array([  0,   1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,
        13,  14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,
        26,  27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,
        39,  40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,
        52,  53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,
        65,  66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,
        78,  79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,
        91,  92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103,
       104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116,
       117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129,
       130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142,
       143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155,
       156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168,
       169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181,
       182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194],
      dtype=int64)

print(type(df_can.columns))
print(type(df_can.index))

<class 'pandas.core.indexes.base.Index'>
<class 'pandas.core.indexes.range.RangeIndex'>

df_can.columns.tolist()
df_can.index.tolist()

print (type(df_can.columns.tolist()))
print (type(df_can.index.tolist()))

<class 'list'>
<class 'list'>

# size of dataframe (rows, columns)
df_can.shape

(195, 43)

数据清洗与整理

对数据集需要做一些基本的清洗与整理，下列几个步骤分别去掉不需要的列，对部分列重新命名使得更具有可读性，并增加了一个汇总列。

# in pandas axis=0 represents rows (default) and axis=1 represents columns.
df_can.drop(['AREA','REG','DEV','Type','Coverage'], axis=1, inplace=True)
df_can.head(2)

2 rows × 38 columns

df_can.rename(columns={'OdName':'Country', 'AreaName':'Continent', 'RegName':'Region'}, inplace=True)
df_can.columns

Index([  'Country', 'Continent',    'Region',   'DevName',        1980,
              1981,        1982,        1983,        1984,        1985,
              1986,        1987,        1988,        1989,        1990,
              1991,        1992,        1993,        1994,        1995,
              1996,        1997,        1998,        1999,        2000,
              2001,        2002,        2003,        2004,        2005,
              2006,        2007,        2008,        2009,        2010,
              2011,        2012,        2013],
      dtype='object')

df_can['Total'] = df_can.sum(axis=1)

df_can.isnull().sum()

2. Pandas中级功能，索引与选择

先看看当下的数据集信息，然后通过练习熟悉各种索引的使用方式。

df_can.describe()

8 rows × 35 columns

df_can.Country  # 查找并返回所有国家列表

df_can[['Country', 1980, 1981, 1982, 1983, 1984, 1985]] # 返回特定年份的值
# 需要注意，国家名称是字符串类型而年份是整型
# 为了保证格式统一，可以将所有名称均改为整型

195 rows × 7 columns

df_can.set_index('Country', inplace=True)
# 将国家设置为索引项，与之相反的操作是 df_can.reset_index()

df_can.head(3)

3 rows × 38 columns

# 也可以去掉索引项的名称
df_can.index.name = None

# 1. 显示日本籍移民（所有列）
print(df_can.loc['Japan'])

# 其他实现方式
print(df_can.iloc[87])
print(df_can[df_can.index == 'Japan'].T.squeeze())

# 2. 2013年的数据
print(df_can.loc['Japan', 2013])

# 其他实现方式
print(df_can.iloc[87, 36]) # 2013年是最后一列，总共36列

982
982

# 3. 1980到1985年间的数据
print(df_can.loc['Japan', [1980, 1981, 1982, 1983, 1984, 1984]])
print(df_can.iloc[87, [3, 4, 5, 6, 7, 8]])

1980    701
1981    756
1982    598
1983    309
1984    246
1984    246
Name: Japan, dtype: object
1980    701
1981    756
1982    598
1983    309
1984    246
1985    198
Name: Japan, dtype: object

df_can.columns = list(map(str, df_can.columns))
# [print (type(x)) for x in df_can.columns.values] #<-- 检查是否正确，去掉前面的注释就可以运行

# 将要显示的年份转换为years列表，在后面显示图形时比较实用
years = list(map(str, range(1980, 2014)))
years

# 1. 通过判断创建布尔量序列
condition = df_can['Continent'] == 'Asia'
print (condition)

# 2. 将布尔序列传给数据集
df_can[condition]

49 rows × 38 columns

# 可以通过多个条件进行筛选
# l比如同时选择 AreaNAme = Asia 和RegName = Southern Asia

df_can[(df_can['Continent']=='Asia') & (df_can['Region']=='Southern Asia')]

#在使用逻辑操作符时， 需要用 &#x27;&' 和 '|' 取代 'and' 和 'or'
# 不同条件需要分别通过括号分开。

9 rows × 38 columns

print ('data dimensions:', df_can.shape)
print(df_can.columns)
df_can.head(2)

data dimensions: (195, 38)
Index(['Continent', 'Region', 'DevName', '1980', '1981', '1982', '1983',
       '1984', '1985', '1986', '1987', '1988', '1989', '1990', '1991', '1992',
       '1993', '1994', '1995', '1996', '1997', '1998', '1999', '2000', '2001',
       '2002', '2003', '2004', '2005', '2006', '2007', '2008', '2009', '2010',
       '2011', '2012', '2013', 'Total'],
      dtype='object')

2 rows × 38 columns

3. 使用Matplotlib用以实现数据可视化

# %是Jupyter Notebook的魔术命令，这里使用inline即在文件中显示内容
%matplotlib inline 

import matplotlib as mpl
import matplotlib.pyplot as plt

print ('Matplotlib version: ', mpl.__version__) # >= 2.0.0 版本要大于2.0.0

Matplotlib version:  3.0.2

print(plt.style.available)
mpl.style.use(['ggplot']) # optional: for ggplot-like style

['bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark-palette', 'seaborn-dark', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'seaborn', 'Solarize_Light2', 'tableau-colorblind10', '_classic_test']

haiti = df_can.loc['Haiti', years] # years参数见前节passing in years 1980 - 2013 to exclude the 'total' column
haiti.head()

1980    1666
1981    3692
1982    3498
1983    2860
1984    1418
Name: Haiti, dtype: object

haiti.plot()

haiti.index = haiti.index.map(int) # 将海地的索引项改为整数以显示年份
haiti.plot(kind='line')

plt.title('Immigration from Haiti')
plt.ylabel('Number of immigrants')
plt.xlabel('Years')

plt.show() # 本行用以显示图形

haiti.plot(kind='line')

#可以在图形中添加标签
plt.title('Immigration from Haiti')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

# 也可以在指定位置插入数据
# syntax: plt.text(x, y, label)
plt.text(2000, 6000, '2010 Earthquake') # see note below

plt.show()

以下程序用以显示中国和印度籍移民图示

df_CI=df_can.loc[['China','India'],years]
df_CI.head()

1980198119821983198419851986198719881989…2004200520062007200820092010201120122013China5123668233081863152718161960264327584323…36619425843351827642300372962230391285023302434129India8880867081477338570442117150101891152210343…28235362103384828742282612945634235275093093333087

2 rows × 34 columns

df_CI.plot(kind='line')

<matplotlib.axes._subplots.AxesSubplot at 0x56c6d90>

上述的图形显然有问题，这主要是因为横轴纵轴错误，这是个常见的问题，需要通过transpose方法先修正。

df_CI = df_CI.transpose()
df_CI.head()

ChinaIndia198051238880198166828670198233088147198318637338198415275704

df_CI.index = df_CI.index.map(int)
df_CI.plot(kind='line')

plt.title('Immigration from China and India')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')

# annotate the 2010 Earthquake. 
# syntax: plt.text(x, y, label)

plt.show()

df_can.sort_values(by='Total', ascending=False, axis=0, inplace=True)
df_top5 = df_can.head(5)
df_top5 = df_top5[years].transpose() 
print(df_top5)

     India  China  United Kingdom of Great Britain and Northern Ireland  \
1980   8880   5123                                              22045      
1981   8670   6682                                              24796      
1982   8147   3308                                              20620      
1983   7338   1863                                              10015      
1984   5704   1527                                              10170      
1985   4211   1816                                               9564      
1986   7150   1960                                               9470      
1987  10189   2643                                              21337      
1988  11522   2758                                              27359      
1989  10343   4323                                              23795      
1990  12041   8076                                              31668      
1991  13734  14255                                              23380      
1992  13673  10846                                              34123      
1993  21496   9817                                              33720      
1994  18620  13128                                              39231      
1995  18489  14398                                              30145      
1996  23859  19415                                              29322      
1997  22268  20475                                              22965      
1998  17241  21049                                              10367      
1999  18974  30069                                               7045      
2000  28572  35529                                               8840      
2001  31223  36434                                              11728      
2002  31889  31961                                               8046      
2003  27155  36439                                               6797      
2004  28235  36619                                               7533      
2005  36210  42584                                               7258      
2006  33848  33518                                               7140      
2007  28742  27642                                               8216      
2008  28261  30037                                               8979      
2009  29456  29622                                               8876      
2010  34235  30391                                               8724      
2011  27509  28502                                               6204      
2012  30933  33024                                               6195      
2013  33087  34129                                               5827      

      Philippines  Pakistan  
1980         6051       978  
1981         5921       972  
1982         5249      1201  
1983         4562       900  
1984         3801       668  
1985         3150       514  
1986         4166       691  
1987         7360      1072  
1988         8639      1334  
1989        11865      2261  
1990        12509      2470  
1991        12718      3079  
1992        13670      4071  
1993        20479      4777  
1994        19532      4666  
1995        15864      4994  
1996        13692      9125  
1997        11549     13073  
1998         8735      9068  
1999         9734      9979  
2000        10763     15400  
2001        13836     16708  
2002        11707     15110  
2003        12758     13205  
2004        14004     13399  
2005        18139     14314  
2006        18400     13127  
2007        19837     10124  
2008        24887      8994  
2009        28573      7217  
2010        38617      6811  
2011        36765      7468  
2012        34315     11227  
2013        29544     12603

df_top5.index = df_top5.index.map(int) #let&#x27;s change the index values of df_top5 to type integer for plotting
df_top5.plot(kind='line', figsize=(14, 8)) # pass a tuple (x, y) size
plt.title('Immigration Trend of Top 5 Countries')
plt.ylabel('Number of Immigrants')
plt.xlabel('Years')
plt.show()

科技读书

回应转发赞收藏赞赏

魏智勇

个人网站 http://www.weizhiyong.com 爱生活，爱运动，爱科学，爱自控。...

Python 数据处理实例练习1：Pandas与数据可视化入门

1. 数据集引入

数据清洗与整理

2. Pandas中级功能，索引与选择

3. 使用Matplotlib用以实现数据可视化

热门话题 · · · · · · ( 去话题广场 )