人口分析案例

人口分析案例

  • 需求:
    • 导入文件,查看原始数据
    • 将人口数据和各州的简称数据进行合并
    • 将合并的数据中重复的abbreviation列进行删除
    • 查看存在缺失数据的列
    • 找到有哪些state/region使得state的值为NaN,进行去重操作
    • 为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
    • 合并各州面积数据areas
    • 我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
    • 去除含有缺失数据的行
    • 找出2010年的全民人口数据
    • 计算各州的人口密度
    • 排序,并找出人口密度最高的州
1
2
3
import numpy as np
import pandas as pd
from pandas import DataFrame
1
2
3
4
5
6
7
8
9
10
11
# 导入文件,查看原始数据
abb = pd.read_csv('data\\state-abbrevs.csv') # state表示州的全称,abbreviation表示州的简称
print(abb.head())
area = pd.read_csv('data\\state-areas.csv') # state州全称,area(sq.mi)州面积
print(area.head())
pop = pd.read_csv('data\\state-population.csv') # state/region简称,population人口数量
print(pop.head())
# 将人口数据和各州的简称数据进行合并
abb_pop = pd.merge(abb,pop,left_on='abbreviation',right_on='state/region',how='outer') # 默认是内连接inner,为了保证数据完整性此处指定outer
print(abb_pop.head())

        state abbreviation
0     Alabama           AL
1      Alaska           AK
2     Arizona           AZ
3    Arkansas           AR
4  California           CA
        state  area (sq. mi)
0     Alabama          52423
1      Alaska         656425
2     Arizona         114006
3    Arkansas          53182
4  California         163707
  state/region     ages  year  population
0           AL  under18  2012   1117489.0
1           AL    total  2012   4817528.0
2           AL  under18  2010   1130966.0
3           AL    total  2010   4785570.0
4           AL  under18  2011   1125763.0
     state abbreviation state/region     ages  year  population
0  Alabama           AL           AL  under18  2012   1117489.0
1  Alabama           AL           AL    total  2012   4817528.0
2  Alabama           AL           AL  under18  2010   1130966.0
3  Alabama           AL           AL    total  2010   4785570.0
4  Alabama           AL           AL  under18  2011   1125763.0
1
2
3
4
5
6
# 将合并的数据中重复的abbreviation列进行删除
abb_pop.drop(labels='abbreviation',axis=1,inplace=True) # inplace=True将编辑的数据映射入原始数据,故不需要再次使用‘abb_pop=’来重新赋值
print(abb_pop.head())
# 查看存在缺失数据的列
print(abb_pop.isnull().any(axis=0)) # 存在空值的列为True
print(abb_pop.info()) # 返回数据基本信息,从数据数量看出哪些列存在空值
     state state/region     ages  year  population
0  Alabama           AL  under18  2012   1117489.0
1  Alabama           AL    total  2012   4817528.0
2  Alabama           AL  under18  2010   1130966.0
3  Alabama           AL    total  2010   4785570.0
4  Alabama           AL  under18  2011   1125763.0
state            True
state/region    False
ages            False
year            False
population       True
dtype: bool
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2544 entries, 0 to 2543
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   state         2448 non-null   object 
 1   state/region  2544 non-null   object 
 2   ages          2544 non-null   object 
 3   year          2544 non-null   int64  
 4   population    2524 non-null   float64
dtypes: float64(1), int64(1), object(3)
memory usage: 119.2+ KB
None
1
2
3
4
5
6
# 找到有哪些state/region使得state的值为NaN,进行去重操作
print(abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()) # series的unique方法按顺序返回出现唯一一次的内容
# 为找到的这些state/region的state项补上正确的值,从而去除掉state这一列的所有NaN
# 此处不能使用fillna填充,因为不是使用临近值和固定值填充;使用元素赋值的方式进行填充
# 1.先给USA的全称对应的空值进行批量赋值,首先找到USA对应的行数据。
print(abb_pop.loc[abb_pop['state/region'] == 'USA'])
['PR' 'USA']
     state state/region     ages  year   population
2496   NaN          USA  under18  1990   64218512.0
2497   NaN          USA    total  1990  249622814.0
2498   NaN          USA    total  1991  252980942.0
2499   NaN          USA  under18  1991   65313018.0
2500   NaN          USA  under18  1992   66509177.0
2501   NaN          USA    total  1992  256514231.0
2502   NaN          USA    total  1993  259918595.0
2503   NaN          USA  under18  1993   67594938.0
2504   NaN          USA  under18  1994   68640936.0
2505   NaN          USA    total  1994  263125826.0
2506   NaN          USA  under18  1995   69473140.0
2507   NaN          USA  under18  1996   70233512.0
2508   NaN          USA    total  1995  266278403.0
2509   NaN          USA    total  1996  269394291.0
2510   NaN          USA    total  1997  272646932.0
2511   NaN          USA  under18  1997   70920738.0
2512   NaN          USA  under18  1998   71431406.0
2513   NaN          USA    total  1998  275854116.0
2514   NaN          USA  under18  1999   71946051.0
2515   NaN          USA    total  2000  282162411.0
2516   NaN          USA  under18  2000   72376189.0
2517   NaN          USA    total  1999  279040181.0
2518   NaN          USA    total  2001  284968955.0
2519   NaN          USA  under18  2001   72671175.0
2520   NaN          USA    total  2002  287625193.0
2521   NaN          USA  under18  2002   72936457.0
2522   NaN          USA    total  2003  290107933.0
2523   NaN          USA  under18  2003   73100758.0
2524   NaN          USA    total  2004  292805298.0
2525   NaN          USA  under18  2004   73297735.0
2526   NaN          USA    total  2005  295516599.0
2527   NaN          USA  under18  2005   73523669.0
2528   NaN          USA    total  2006  298379912.0
2529   NaN          USA  under18  2006   73757714.0
2530   NaN          USA    total  2007  301231207.0
2531   NaN          USA  under18  2007   74019405.0
2532   NaN          USA    total  2008  304093966.0
2533   NaN          USA  under18  2008   74104602.0
2534   NaN          USA  under18  2013   73585872.0
2535   NaN          USA    total  2013  316128839.0
2536   NaN          USA    total  2009  306771529.0
2537   NaN          USA  under18  2009   74134167.0
2538   NaN          USA  under18  2010   74119556.0
2539   NaN          USA    total  2010  309326295.0
2540   NaN          USA  under18  2011   73902222.0
2541   NaN          USA    total  2011  311582564.0
2542   NaN          USA  under18  2012   73708179.0
2543   NaN          USA    total  2012  313873685.0
1
2
3
4
5
6
7
8
9
# 2.获取USA全称为空的数据对应的行索引
indexs = abb_pop.loc[abb_pop['state/region'] == 'USA'].index
print(indexs)
abb_pop.loc[indexs,'state'] = 'United States'
print(abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()) # 此处只剩PR
# 对于PR来说过程与USA相同
indexs = abb_pop.loc[abb_pop['state/region'] == 'PR'].index
abb_pop.loc[indexs,'state'] = 'Puerto Rico'
print(abb_pop.loc[abb_pop['state'].isnull()]['state/region'].unique()) # 此处没有空值了
Int64Index([2496, 2497, 2498, 2499, 2500, 2501, 2502, 2503, 2504, 2505, 2506,
            2507, 2508, 2509, 2510, 2511, 2512, 2513, 2514, 2515, 2516, 2517,
            2518, 2519, 2520, 2521, 2522, 2523, 2524, 2525, 2526, 2527, 2528,
            2529, 2530, 2531, 2532, 2533, 2534, 2535, 2536, 2537, 2538, 2539,
            2540, 2541, 2542, 2543],
           dtype='int64')
['PR']
[]
1
2
3
4
5
6
7
# 合并各州面积数据areas
abb_pop_area = pd.merge(abb_pop,area,how='outer')
# 我们会发现area(sq.mi)这一列有缺失数据,找出是哪些行
abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()] # 空对应的行数据
indexs = abb_pop_area.loc[abb_pop_area['area (sq. mi)'].isnull()].index
# 去除含有缺失数据的行
abb_pop_area.drop(labels=indexs,axis=0,inplace=True)

state state/region ages year population area (sq. mi)
2496 United States USA under18 1990 64218512.0 NaN
2497 United States USA total 1990 249622814.0 NaN
2498 United States USA total 1991 252980942.0 NaN
2499 United States USA under18 1991 65313018.0 NaN
2500 United States USA under18 1992 66509177.0 NaN
2501 United States USA total 1992 256514231.0 NaN
2502 United States USA total 1993 259918595.0 NaN
2503 United States USA under18 1993 67594938.0 NaN
2504 United States USA under18 1994 68640936.0 NaN
2505 United States USA total 1994 263125826.0 NaN
2506 United States USA under18 1995 69473140.0 NaN
2507 United States USA under18 1996 70233512.0 NaN
2508 United States USA total 1995 266278403.0 NaN
2509 United States USA total 1996 269394291.0 NaN
2510 United States USA total 1997 272646932.0 NaN
2511 United States USA under18 1997 70920738.0 NaN
2512 United States USA under18 1998 71431406.0 NaN
2513 United States USA total 1998 275854116.0 NaN
2514 United States USA under18 1999 71946051.0 NaN
2515 United States USA total 2000 282162411.0 NaN
2516 United States USA under18 2000 72376189.0 NaN
2517 United States USA total 1999 279040181.0 NaN
2518 United States USA total 2001 284968955.0 NaN
2519 United States USA under18 2001 72671175.0 NaN
2520 United States USA total 2002 287625193.0 NaN
2521 United States USA under18 2002 72936457.0 NaN
2522 United States USA total 2003 290107933.0 NaN
2523 United States USA under18 2003 73100758.0 NaN
2524 United States USA total 2004 292805298.0 NaN
2525 United States USA under18 2004 73297735.0 NaN
2526 United States USA total 2005 295516599.0 NaN
2527 United States USA under18 2005 73523669.0 NaN
2528 United States USA total 2006 298379912.0 NaN
2529 United States USA under18 2006 73757714.0 NaN
2530 United States USA total 2007 301231207.0 NaN
2531 United States USA under18 2007 74019405.0 NaN
2532 United States USA total 2008 304093966.0 NaN
2533 United States USA under18 2008 74104602.0 NaN
2534 United States USA under18 2013 73585872.0 NaN
2535 United States USA total 2013 316128839.0 NaN
2536 United States USA total 2009 306771529.0 NaN
2537 United States USA under18 2009 74134167.0 NaN
2538 United States USA under18 2010 74119556.0 NaN
2539 United States USA total 2010 309326295.0 NaN
2540 United States USA under18 2011 73902222.0 NaN
2541 United States USA total 2011 311582564.0 NaN
2542 United States USA under18 2012 73708179.0 NaN
2543 United States USA total 2012 313873685.0 NaN
1
2
3
4
5
6
7
# 找出2010年的全民人口数据(基于df做条件查询)
print(abb_pop_area.query('ages == "total" & year == 2010'))
# 计算各州的人口密度(人口/面积)
abb_pop_area['density of population'] = abb_pop_area['population'] / abb_pop_area['area (sq. mi)']
abb_pop_area.head()
# 排序,并找出人口密度最高的州
abb_pop_area.sort_values(by='density of population',axis=0,ascending=False).iloc[0]['state'] # ascending是否升序排序,默认升序为True,降序则为False。
                     state state/region   ages  year   population  \
3                  Alabama           AL  total  2010    4785570.0   
91                  Alaska           AK  total  2010     713868.0   
101                Arizona           AZ  total  2010    6408790.0   
189               Arkansas           AR  total  2010    2922280.0   
197             California           CA  total  2010   37333601.0   
283               Colorado           CO  total  2010    5048196.0   
293            Connecticut           CT  total  2010    3579210.0   
379               Delaware           DE  total  2010     899711.0   
389   District of Columbia           DC  total  2010     605125.0   
475                Florida           FL  total  2010   18846054.0   
485                Georgia           GA  total  2010    9713248.0   
570                 Hawaii           HI  total  2010    1363731.0   
581                  Idaho           ID  total  2010    1570718.0   
666               Illinois           IL  total  2010   12839695.0   
677                Indiana           IN  total  2010    6489965.0   
762                   Iowa           IA  total  2010    3050314.0   
773                 Kansas           KS  total  2010    2858910.0   
858               Kentucky           KY  total  2010    4347698.0   
869              Louisiana           LA  total  2010    4545392.0   
954                  Maine           ME  total  2010    1327366.0   
965                Montana           MT  total  2010     990527.0   
1050              Nebraska           NE  total  2010    1829838.0   
1061                Nevada           NV  total  2010    2703230.0   
1146         New Hampshire           NH  total  2010    1316614.0   
1157            New Jersey           NJ  total  2010    8802707.0   
1242            New Mexico           NM  total  2010    2064982.0   
1253              New York           NY  total  2010   19398228.0   
1338        North Carolina           NC  total  2010    9559533.0   
1349          North Dakota           ND  total  2010     674344.0   
1434                  Ohio           OH  total  2010   11545435.0   
1445              Oklahoma           OK  total  2010    3759263.0   
1530                Oregon           OR  total  2010    3837208.0   
1541              Maryland           MD  total  2010    5787193.0   
1626         Massachusetts           MA  total  2010    6563263.0   
1637              Michigan           MI  total  2010    9876149.0   
1722             Minnesota           MN  total  2010    5310337.0   
1733           Mississippi           MS  total  2010    2970047.0   
1818              Missouri           MO  total  2010    5996063.0   
1829          Pennsylvania           PA  total  2010   12710472.0   
1914          Rhode Island           RI  total  2010    1052669.0   
1925        South Carolina           SC  total  2010    4636361.0   
2010          South Dakota           SD  total  2010     816211.0   
2021             Tennessee           TN  total  2010    6356683.0   
2106                 Texas           TX  total  2010   25245178.0   
2117                  Utah           UT  total  2010    2774424.0   
2202               Vermont           VT  total  2010     625793.0   
2213              Virginia           VA  total  2010    8024417.0   
2298            Washington           WA  total  2010    6742256.0   
2309         West Virginia           WV  total  2010    1854146.0   
2394             Wisconsin           WI  total  2010    5689060.0   
2405               Wyoming           WY  total  2010     564222.0   
2490           Puerto Rico           PR  total  2010    3721208.0   
2539         United States          USA  total  2010  309326295.0   

      area (sq. mi)         midu  density of population  
3           52423.0    91.287603              91.287603  
91         656425.0     1.087509               1.087509  
101        114006.0    56.214497              56.214497  
189         53182.0    54.948667              54.948667  
197        163707.0   228.051342             228.051342  
283        104100.0    48.493718              48.493718  
293          5544.0   645.600649             645.600649  
379          1954.0   460.445752             460.445752  
389            68.0  8898.897059            8898.897059  
475         65758.0   286.597129             286.597129  
485         59441.0   163.409902             163.409902  
570         10932.0   124.746707             124.746707  
581         83574.0    18.794338              18.794338  
666         57918.0   221.687472             221.687472  
677         36420.0   178.197831             178.197831  
762         56276.0    54.202751              54.202751  
773         82282.0    34.745266              34.745266  
858         40411.0   107.586994             107.586994  
869         51843.0    87.676099              87.676099  
954         35387.0    37.509990              37.509990  
965        147046.0     6.736171               6.736171  
1050        77358.0    23.654153              23.654153  
1061       110567.0    24.448796              24.448796  
1146         9351.0   140.799273             140.799273  
1157         8722.0  1009.253268            1009.253268  
1242       121593.0    16.982737              16.982737  
1253        54475.0   356.094135             356.094135  
1338        53821.0   177.617157             177.617157  
1349        70704.0     9.537565               9.537565  
1434        44828.0   257.549634             257.549634  
1445        69903.0    53.778278              53.778278  
1530        98386.0    39.001565              39.001565  
1541        12407.0   466.445797             466.445797  
1626        10555.0   621.815538             621.815538  
1637        96810.0   102.015794             102.015794  
1722        86943.0    61.078373              61.078373  
1733        48434.0    61.321530              61.321530  
1818        69709.0    86.015622              86.015622  
1829        46058.0   275.966651             275.966651  
1914         1545.0   681.339159             681.339159  
1925        32007.0   144.854594             144.854594  
2010        77121.0    10.583512              10.583512  
2021        42146.0   150.825298             150.825298  
2106       268601.0    93.987655              93.987655  
2117        84904.0    32.677188              32.677188  
2202         9615.0    65.085075              65.085075  
2213        42769.0   187.622273             187.622273  
2298        71303.0    94.557817              94.557817  
2309        24231.0    76.519582              76.519582  
2394        65503.0    86.851900              86.851900  
2405        97818.0     5.768079               5.768079  
2490         3515.0  1058.665149            1058.665149  
2539            NaN          NaN                    NaN  





'District of Columbia'

本文提及的数据集下载地址:
链接:https://pan.baidu.com/s/1lhAiSXBw47MK9QdqwA05Lg
提取码:1111


Donate
  • Copyright: Copyright is owned by the author. For commercial reprints, please contact the author for authorization. For non-commercial reprints, please indicate the source.
  • Copyrights © 2019-2022 Woody
  • Visitors: | Views:

请我喝杯咖啡吧~

支付宝
微信