需求
加载数据
查看数据的基本信息
指定数据截取,将如下字段的数据进行提取,其他数据舍弃
cand_nm :候选人姓名
contbr_nm : 捐赠人姓名
contbr_st :捐赠人所在州
contbr_employer : 捐赠人所在公司
contbr_occupation : 捐赠人职业
contb_receipt_amt :捐赠数额(美元)
contb_receipt_dt : 捐款的日期
对新数据进行总览,查看是否存在缺失数据
用统计学指标快速描述数值型属性的概要
空值处理,可能因为忘记填写或者保密等等原因,相关字段出现了空值,将其填充为NOT PROVIDE
异常值处理,将捐款金额<=0的数据删除
新建一列为各个候选人所在党派party
查看party这一列中有哪些不同的元素
统计party列中各个元素出现次数
查看各个党派收到的政治献金总数contb_receipt_amt
查看具体每天各个党派收到的政治献金总数contb_receipt_amt
将表中日期格式转换为’yyyy-mm-dd’
查看老兵(捐献者职业)DISABLED VETERAN主要支持谁
1 2 3 4 5 COPY import numpy as npimport pandas as pddf = pd.read_csv('./data/usa_election.txt' ) df
COPY C:\Users\86156\.conda\envs\DL\lib\site-packages\IPython\core\interactiveshell.py:2714: DtypeWarning: Columns (6) have mixed types.Specify dtype option on import or set low_memory=False.
interactivity=interactivity, compiler=compiler, result=result)
cmte_id
cand_id
cand_nm
contbr_nm
contbr_city
contbr_st
contbr_zip
contbr_employer
contbr_occupation
contb_receipt_amt
contb_receipt_dt
receipt_desc
memo_cd
memo_text
form_tp
file_num
0
C00410118
P20002978
Bachmann, Michelle
HARVEY, WILLIAM
MOBILE
AL
3.6601e+08
RETIRED
RETIRED
250.0
20-JUN-11
NaN
NaN
NaN
SA17A
736166
1
C00410118
P20002978
Bachmann, Michelle
HARVEY, WILLIAM
MOBILE
AL
3.6601e+08
RETIRED
RETIRED
50.0
23-JUN-11
NaN
NaN
NaN
SA17A
736166
2
C00410118
P20002978
Bachmann, Michelle
SMITH, LANIER
LANETT
AL
3.68633e+08
INFORMATION REQUESTED
INFORMATION REQUESTED
250.0
05-JUL-11
NaN
NaN
NaN
SA17A
749073
3
C00410118
P20002978
Bachmann, Michelle
BLEVINS, DARONDA
PIGGOTT
AR
7.24548e+08
NONE
RETIRED
250.0
01-AUG-11
NaN
NaN
NaN
SA17A
749073
4
C00410118
P20002978
Bachmann, Michelle
WARDENBURG, HAROLD
HOT SPRINGS NATION
AR
7.19016e+08
NONE
RETIRED
300.0
20-JUN-11
NaN
NaN
NaN
SA17A
736166
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
536036
C00500587
P20003281
Perry, Rick
ANDERSON, MARILEE MRS.
INFO REQUESTED
XX
99999
INFORMATION REQUESTED PER BEST EFFORTS
INFORMATION REQUESTED PER BEST EFFORTS
2500.0
31-AUG-11
NaN
NaN
NaN
SA17A
751678
536037
C00500587
P20003281
Perry, Rick
TOLBERT, DARYL MR.
INFO REQUESTED
XX
99999
T.A.C.C.
LONGWALL MAINTENANCE FOREMAN
500.0
30-SEP-11
NaN
NaN
NaN
SA17A
751678
536038
C00500587
P20003281
Perry, Rick
GRANE, BRYAN F. MR.
INFO REQUESTED
XX
99999
INFORMATION REQUESTED PER BEST EFFORTS
INFORMATION REQUESTED PER BEST EFFORTS
500.0
29-SEP-11
NaN
NaN
NaN
SA17A
751678
536039
C00500587
P20003281
Perry, Rick
DUFFY, DAVID A. MR.
INFO REQUESTED
XX
99999
DUFFY EQUIPMENT COMPANY INC.
BUSINESS OWNER
2500.0
30-SEP-11
NaN
NaN
NaN
SA17A
751678
536040
C00500587
P20003281
Perry, Rick
GORMAN, CHRIS D. MR.
INFO REQUESTED
XX
99999
INFORMATION REQUESTED PER BEST EFFORTS
INFORMATION REQUESTED PER BEST EFFORTS
5000.0
29-SEP-11
REATTRIBUTION / REDESIGNATION REQUESTED (AUTOM...
NaN
REATTRIBUTION / REDESIGNATION REQUESTED (AUTOM...
SA17A
751678
536041 rows × 16 columns
COPY 0 Bachmann, Michelle
1 Bachmann, Michelle
2 Bachmann, Michelle
3 Bachmann, Michelle
4 Bachmann, Michelle
...
536036 Perry, Rick
536037 Perry, Rick
536038 Perry, Rick
536039 Perry, Rick
536040 Perry, Rick
Name: cand_nm, Length: 536041, dtype: object
COPY <class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cmte_id 536041 non-null object
1 cand_id 536041 non-null object
2 cand_nm 536041 non-null object
3 contbr_nm 536041 non-null object
4 contbr_city 536026 non-null object
5 contbr_st 536040 non-null object
6 contbr_zip 535973 non-null object
7 contbr_employer 525088 non-null object
8 contbr_occupation 530520 non-null object
9 contb_receipt_amt 536041 non-null float64
10 contb_receipt_dt 536041 non-null object
11 receipt_desc 8479 non-null object
12 memo_cd 49718 non-null object
13 memo_text 52740 non-null object
14 form_tp 536041 non-null object
15 file_num 536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB
contb_receipt_amt
file_num
count
5.360410e+05
536041.000000
mean
3.750373e+02
761472.107800
std
3.564436e+03
5148.893508
min
-3.080000e+04
723511.000000
25%
5.000000e+01
756218.000000
50%
1.000000e+02
763233.000000
75%
2.500000e+02
763621.000000
max
1.944042e+06
767394.000000
1 2 3 COPY df.fillna(value='NOT PROVIDE' ,inplace=True ) df.info()
COPY <class 'pandas.core.frame.DataFrame'>
RangeIndex: 536041 entries, 0 to 536040
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 cmte_id 536041 non-null object
1 cand_id 536041 non-null object
2 cand_nm 536041 non-null object
3 contbr_nm 536041 non-null object
4 contbr_city 536041 non-null object
5 contbr_st 536041 non-null object
6 contbr_zip 536041 non-null object
7 contbr_employer 536041 non-null object
8 contbr_occupation 536041 non-null object
9 contb_receipt_amt 536041 non-null float64
10 contb_receipt_dt 536041 non-null object
11 receipt_desc 536041 non-null object
12 memo_cd 536041 non-null object
13 memo_text 536041 non-null object
14 form_tp 536041 non-null object
15 file_num 536041 non-null int64
dtypes: float64(1), int64(1), object(14)
memory usage: 65.4+ MB
1 2 3 4 5 6 COPY df['contb_receipt_amt' ] <= 0 df.loc[df['contb_receipt_amt' ] <= 0 ] drop_indexs = df.loc[df['contb_receipt_amt' ] <= 0 ].index df.drop(labels=drop_indexs,axis=0 ,inplace=True ) df
cmte_id
cand_id
cand_nm
contbr_nm
contbr_city
contbr_st
contbr_zip
contbr_employer
contbr_occupation
contb_receipt_amt
contb_receipt_dt
receipt_desc
memo_cd
memo_text
form_tp
file_num
0
C00410118
P20002978
Bachmann, Michelle
HARVEY, WILLIAM
MOBILE
AL
3.6601e+08
RETIRED
RETIRED
250.0
20-JUN-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
736166
1
C00410118
P20002978
Bachmann, Michelle
HARVEY, WILLIAM
MOBILE
AL
3.6601e+08
RETIRED
RETIRED
50.0
23-JUN-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
736166
2
C00410118
P20002978
Bachmann, Michelle
SMITH, LANIER
LANETT
AL
3.68633e+08
INFORMATION REQUESTED
INFORMATION REQUESTED
250.0
05-JUL-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
749073
3
C00410118
P20002978
Bachmann, Michelle
BLEVINS, DARONDA
PIGGOTT
AR
7.24548e+08
NONE
RETIRED
250.0
01-AUG-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
749073
4
C00410118
P20002978
Bachmann, Michelle
WARDENBURG, HAROLD
HOT SPRINGS NATION
AR
7.19016e+08
NONE
RETIRED
300.0
20-JUN-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
736166
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
536036
C00500587
P20003281
Perry, Rick
ANDERSON, MARILEE MRS.
INFO REQUESTED
XX
99999
INFORMATION REQUESTED PER BEST EFFORTS
INFORMATION REQUESTED PER BEST EFFORTS
2500.0
31-AUG-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
751678
536037
C00500587
P20003281
Perry, Rick
TOLBERT, DARYL MR.
INFO REQUESTED
XX
99999
T.A.C.C.
LONGWALL MAINTENANCE FOREMAN
500.0
30-SEP-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
751678
536038
C00500587
P20003281
Perry, Rick
GRANE, BRYAN F. MR.
INFO REQUESTED
XX
99999
INFORMATION REQUESTED PER BEST EFFORTS
INFORMATION REQUESTED PER BEST EFFORTS
500.0
29-SEP-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
751678
536039
C00500587
P20003281
Perry, Rick
DUFFY, DAVID A. MR.
INFO REQUESTED
XX
99999
DUFFY EQUIPMENT COMPANY INC.
BUSINESS OWNER
2500.0
30-SEP-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
751678
536040
C00500587
P20003281
Perry, Rick
GORMAN, CHRIS D. MR.
INFO REQUESTED
XX
99999
INFORMATION REQUESTED PER BEST EFFORTS
INFORMATION REQUESTED PER BEST EFFORTS
5000.0
29-SEP-11
REATTRIBUTION / REDESIGNATION REQUESTED (AUTOM...
NOT PROVIDE
REATTRIBUTION / REDESIGNATION REQUESTED (AUTOM...
SA17A
751678
530314 rows × 16 columns
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 COPY parties = { 'Bachmann, Michelle' : 'Republican' , 'Romney, Mitt' : 'Republican' , 'Obama, Barack' : 'Democrat' , "Roemer, Charles E. 'Buddy' III" : 'Reform' , 'Pawlenty, Timothy' : 'Republican' , 'Johnson, Gary Earl' : 'Libertarian' , 'Paul, Ron' : 'Republican' , 'Santorum, Rick' : 'Republican' , 'Cain, Herman' : 'Republican' , 'Gingrich, Newt' : 'Republican' , 'McCotter, Thaddeus G' : 'Republican' , 'Huntsman, Jon' : 'Republican' , 'Perry, Rick' : 'Republican' } df['cand_nm' ].map (parties) df['party' ] = df['cand_nm' ].map (parties) df.head()
cmte_id
cand_id
cand_nm
contbr_nm
contbr_city
contbr_st
contbr_zip
contbr_employer
contbr_occupation
contb_receipt_amt
contb_receipt_dt
receipt_desc
memo_cd
memo_text
form_tp
file_num
party
0
C00410118
P20002978
Bachmann, Michelle
HARVEY, WILLIAM
MOBILE
AL
3.6601e+08
RETIRED
RETIRED
250.0
20-JUN-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
736166
Republican
1
C00410118
P20002978
Bachmann, Michelle
HARVEY, WILLIAM
MOBILE
AL
3.6601e+08
RETIRED
RETIRED
50.0
23-JUN-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
736166
Republican
2
C00410118
P20002978
Bachmann, Michelle
SMITH, LANIER
LANETT
AL
3.68633e+08
INFORMATION REQUESTED
INFORMATION REQUESTED
250.0
05-JUL-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
749073
Republican
3
C00410118
P20002978
Bachmann, Michelle
BLEVINS, DARONDA
PIGGOTT
AR
7.24548e+08
NONE
RETIRED
250.0
01-AUG-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
749073
Republican
4
C00410118
P20002978
Bachmann, Michelle
WARDENBURG, HAROLD
HOT SPRINGS NATION
AR
7.19016e+08
NONE
RETIRED
300.0
20-JUN-11
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
736166
Republican
1 2 COPY df['party' ].unique()
COPY array(['Republican', 'Democrat', 'Reform', 'Libertarian'], dtype=object)
1 2 COPY df['party' ].value_counts()
COPY Democrat 289999
Republican 234300
Reform 5313
Libertarian 702
Name: party, dtype: int64
1 2 COPY df.groupby('party' )['contb_receipt_amt' ].sum ()
COPY party
Democrat 8.259441e+07
Libertarian 4.132769e+05
Reform 3.429658e+05
Republican 1.251181e+08
Name: contb_receipt_amt, dtype: float64
1 2 COPY df.groupby(['contb_receipt_dt' ,'party' ])['contb_receipt_amt' ].sum ()
COPY contb_receipt_dt party
01-APR-11 Reform 50.00
Republican 12635.00
01-AUG-11 Democrat 182198.00
Libertarian 1000.00
Reform 1847.00
...
31-MAY-11 Republican 313839.80
31-OCT-11 Democrat 216971.87
Libertarian 4250.00
Reform 3205.00
Republican 751542.36
Name: contb_receipt_amt, Length: 1183, dtype: float64
1 2 3 4 5 6 7 8 9 COPY months = {'JAN' : 1 , 'FEB' : 2 , 'MAR' : 3 , 'APR' : 4 , 'MAY' : 5 , 'JUN' : 6 , 'JUL' : 7 , 'AUG' : 8 , 'SEP' : 9 , 'OCT' : 10 , 'NOV' : 11 , 'DEC' : 12 } def transformDate (d ): day,month,year = d.split('-' ) month = months[month] return '20' +year+'-' +str (month)+'-' +day df['contb_receipt_dt' ] = df['contb_receipt_dt' ].map (transformDate) df.head()
cmte_id
cand_id
cand_nm
contbr_nm
contbr_city
contbr_st
contbr_zip
contbr_employer
contbr_occupation
contb_receipt_amt
contb_receipt_dt
receipt_desc
memo_cd
memo_text
form_tp
file_num
party
0
C00410118
P20002978
Bachmann, Michelle
HARVEY, WILLIAM
MOBILE
AL
3.6601e+08
RETIRED
RETIRED
250.0
2011-6-20
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
736166
Republican
1
C00410118
P20002978
Bachmann, Michelle
HARVEY, WILLIAM
MOBILE
AL
3.6601e+08
RETIRED
RETIRED
50.0
2011-6-23
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
736166
Republican
2
C00410118
P20002978
Bachmann, Michelle
SMITH, LANIER
LANETT
AL
3.68633e+08
INFORMATION REQUESTED
INFORMATION REQUESTED
250.0
2011-7-05
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
749073
Republican
3
C00410118
P20002978
Bachmann, Michelle
BLEVINS, DARONDA
PIGGOTT
AR
7.24548e+08
NONE
RETIRED
250.0
2011-8-01
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
749073
Republican
4
C00410118
P20002978
Bachmann, Michelle
WARDENBURG, HAROLD
HOT SPRINGS NATION
AR
7.19016e+08
NONE
RETIRED
300.0
2011-6-20
NOT PROVIDE
NOT PROVIDE
NOT PROVIDE
SA17A
736166
Republican
1 2 3 4 COPY df['contbr_occupation' ] == 'DISABLED VETERAN' df_old = df.loc[df['contbr_occupation' ] == 'DISABLED VETERAN' ] df_old.groupby('cand_nm' )['contb_receipt_amt' ].sum ()
COPY cand_nm
Cain, Herman 300.00
Obama, Barack 4205.00
Paul, Ron 2425.49
Santorum, Rick 250.00
Name: contb_receipt_amt, dtype: float64
本文提及的数据集下载地址: 链接:https://pan.baidu.com/s/1lhAiSXBw47MK9QdqwA05Lg 提取码:1111