Handling Missing Data

Problems found in raw data:
1. Missing values (nulls)
2. Duplicate values
3. Abnormal values (outliers)

  • There are two forms of missing data
    • None
    • np.nan (NaN)
'''
The difference between the two forms of missing data
'''
import numpy as np
print(type(None))   # None is an object type
print(type(np.nan)) # NaN is a float
<class 'NoneType'>
<class 'float'>
  • Why does data analysis use the float form of null rather than the object form?
    • Data analysis constantly applies arithmetic to the raw data. If the nulls are in NaN form they do not interrupt or break the computation, because NaN can take part in arithmetic (the result is simply nan).
    • None cannot take part in arithmetic.
  • When pandas encounters a null in None form, it coerces it into NaN (as shown in the sketch after the traceback below).
print(np.nan + 1) # the result is still nan
print(None + 1)   # raises a TypeError at runtime
nan



---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-49-e38e0a464830> in <module>()
      1 print(np.nan + 1) # the result is still nan
----> 2 print(None + 1)   # raises a TypeError at runtime


TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
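  • A minimal sketch (my own addition, not part of the original code) of the None -> NaN coercion mentioned above:
import pandas as pd
s = pd.Series([1, None, 3]) # pandas stores the None as NaN in this numeric Series
print(s)                    # the missing entry is displayed as NaN
print(s.dtype)              # float64, because NaN is a float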

Handling Null Values with pandas

  • isnull
  • notnull
  • any
  • all
  • dropna
  • fillna
import pandas as pd
from pandas import DataFrame, Series
# Create a DataFrame containing some null values
df = DataFrame(np.random.randint(0, 100, size=(8, 6)))
df.iloc[2, 3] = None
df.iloc[4, 4] = np.nan
df.iloc[5, 2] = None
print(df)
    0   1     2     3     4   5
0  44  13  76.0  12.0  75.0  27
1   7  67  62.0  46.0  95.0  60
2  88  31   2.0   NaN  18.0  33
3  90   1  61.0  66.0  50.0  69
4  54  78  65.0  31.0   NaN  19
5  97   7   NaN  37.0  26.0  90
6  52  79   2.0  46.0   1.0  21
7   7  58  29.0  87.0   7.0  49
  • Method 1: filter out the null values (drop the rows that contain them)
    • isnull, notnull, any, all
      • isnull -> any
      • notnull -> all
print(df.isnull())
# Which rows contain a True?
print(df.isnull().any(axis=1)) # any: checks whether a row/column contains at least one True; returns True if so, otherwise False
# Using this boolean Series as a row index yields the rows that contain missing values, together with their row indices
print(df.loc[df.isnull().any(axis=1)])
drop_index = df.loc[df.isnull().any(axis=1)].index # row indices to be dropped
print(df.drop(labels=drop_index, axis=0))
       0      1      2      3      4      5
0  False  False  False  False  False  False
1  False  False  False  False  False  False
2  False  False  False   True  False  False
3  False  False  False  False  False  False
4  False  False  False  False   True  False
5  False  False   True  False  False  False
6  False  False  False  False  False  False
7  False  False  False  False  False  False
0    False
1    False
2     True
3    False
4     True
5     True
6    False
7    False
dtype: bool
    0   1     2     3     4   5
2  88  31   2.0   NaN  18.0  33
4  54  78  65.0  31.0   NaN  19
5  97   7   NaN  37.0  26.0  90
    0   1     2     3     4   5
0  44  13  76.0  12.0  75.0  27
1   7  67  62.0  46.0  95.0  60
3  90   1  61.0  66.0  50.0  69
6  52  79   2.0  46.0   1.0  21
7   7  58  29.0  87.0   7.0  49
print(df.notnull())
print(df.notnull().all(axis=1)) # all: checks whether a row/column contains any False; returns True only if every value is True
print(df.loc[df.notnull().all(axis=1)]) # indexing with the True rows skips the rows that contain missing values
      0     1      2      3      4     5
0  True  True   True   True   True  True
1  True  True   True   True   True  True
2  True  True   True  False   True  True
3  True  True   True   True   True  True
4  True  True   True   True  False  True
5  True  True  False   True   True  True
6  True  True   True   True   True  True
7  True  True   True   True   True  True
0     True
1     True
2    False
3     True
4    False
5    False
6     True
7     True
dtype: bool
    0   1     2     3     4   5
0  44  13  76.0  12.0  75.0  27
1   7  67  62.0  46.0  95.0  60
3  90   1  61.0  66.0  50.0  69
6  52  79   2.0  46.0   1.0  21
7   7  58  29.0  87.0   7.0  49
  • Method 2:

    • dropna: drops the rows or columns that contain missing values directly (a sketch of its how= and thresh= parameters follows the output below)
print(df.dropna(axis=0)) # drop rows that contain NaN
print(df.dropna(axis=1)) # drop columns that contain NaN
    0   1     2     3     4   5
0  44  13  76.0  12.0  75.0  27
1   7  67  62.0  46.0  95.0  60
3  90   1  61.0  66.0  50.0  69
6  52  79   2.0  46.0   1.0  21
7   7  58  29.0  87.0   7.0  49
    0   1   5
0  44  13  27
1   7  67  60
2  88  31  33
3  90   1  69
4  54  78  19
5  97   7  90
6  52  79  21
7   7  58  49
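  • A minimal sketch (my own addition) of dropna's other parameters: how= decides whether one or all values must be missing before a row is dropped, and thresh= keeps rows that still hold at least that many non-null values.
print(df.dropna(axis=0, how='any'))  # drop a row if it contains at least one NaN (the default)
print(df.dropna(axis=0, how='all'))  # drop a row only if every value in it is NaN
print(df.dropna(axis=0, thresh=5))   # keep rows that have at least 5 non-null values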
  • Fill (overwrite) the missing values
    • fillna
print(df.fillna(value=666))              # fill NaNs with a fixed value; this is rarely well justified, so it is of limited use
print(df.fillna(method='ffill', axis=1)) # NaNs are usually filled from neighbouring values: ffill copies the previous value, bfill the next one; axis chooses the horizontal or vertical direction
print(df.fillna(method='bfill', axis=0)) # fill from the next value along the vertical axis
'''
No fill is ever fully justified, so dropping is generally preferred over filling.
Only when dropping is too costly (too much data would be lost) should filling be used
(a mean-based fill is sketched after the output below).
'''
    0   1      2      3      4   5
0  44  13   76.0   12.0   75.0  27
1   7  67   62.0   46.0   95.0  60
2  88  31    2.0  666.0   18.0  33
3  90   1   61.0   66.0   50.0  69
4  54  78   65.0   31.0  666.0  19
5  97   7  666.0   37.0   26.0  90
6  52  79    2.0   46.0    1.0  21
7   7  58   29.0   87.0    7.0  49
      0     1     2     3     4     5
0  44.0  13.0  76.0  12.0  75.0  27.0
1   7.0  67.0  62.0  46.0  95.0  60.0
2  88.0  31.0   2.0   2.0  18.0  33.0
3  90.0   1.0  61.0  66.0  50.0  69.0
4  54.0  78.0  65.0  31.0  31.0  19.0
5  97.0   7.0   7.0  37.0  26.0  90.0
6  52.0  79.0   2.0  46.0   1.0  21.0
7   7.0  58.0  29.0  87.0   7.0  49.0
    0   1     2     3     4   5
0  44  13  76.0  12.0  75.0  27
1   7  67  62.0  46.0  95.0  60
2  88  31   2.0  66.0  18.0  33
3  90   1  61.0  66.0  50.0  69
4  54  78  65.0  31.0  26.0  19
5  97   7   2.0  37.0  26.0  90
6  52  79   2.0  46.0   1.0  21
7   7  58  29.0  87.0   7.0  49
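  • A middle ground worth noting (my own sketch, not from the original): fill each column's NaNs with that column's own mean, so the filled value at least stays on the same scale as its column.
print(df.fillna(value=df.mean())) # df.mean() is a per-column Series, so each column is filled with its own mean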
  • Worked example
    • Handle missing data and null values
    • The dataset used here is linked at the end of the article
df2 = pd.read_excel('testData.xlsx', engine='openpyxl')
print(df2.head())
df2 = df2[[1, 2, 3, 4]] # keep only the columns labelled 1, 2, 3 and 4
df2.head()
                 time  none     1     2     3     4  none1     5     6     7
0 2019-01-27 17:00:00   NaN -24.8 -18.2 -20.8 -18.8    NaN   NaN   NaN   NaN
1 2019-01-27 17:01:00   NaN -23.5 -18.8 -20.5 -19.8    NaN -15.2 -14.5 -16.0
2 2019-01-27 17:02:00   NaN -23.2 -19.2   NaN   NaN    NaN -13.0   NaN -14.0
3 2019-01-27 17:03:00   NaN -22.8 -19.2 -20.0 -20.5    NaN   NaN -12.2  -9.8
4 2019-01-27 17:04:00   NaN -23.2 -18.5 -20.0 -18.8    NaN -10.2 -10.8  -8.8
C:\Users\86156\.conda\envs\DL\lib\site-packages\openpyxl\worksheet\_reader.py:312: UserWarning: Unknown extension is not supported and will be removed
  warn(msg)

       1     2     3     4
0  -24.8 -18.2 -20.8 -18.8
1  -23.5 -18.8 -20.5 -19.8
2  -23.2 -19.2   NaN   NaN
3  -22.8 -19.2 -20.0 -20.5
4  -23.2 -18.5 -20.0 -18.8
# The rows that contain null values can simply be dropped
print(df2.dropna(axis=0))
print(df2.notnull().all(axis=1))
print(df2.loc[df2.notnull().all(axis=1)])
         1     2     3     4
0    -24.8 -18.2 -20.8 -18.8
1    -23.5 -18.8 -20.5 -19.8
3    -22.8 -19.2 -20.0 -20.5
4    -23.2 -18.5 -20.0 -18.8
7    -24.8 -18.0 -17.5 -17.2
...    ...   ...   ...   ...
1055 -26.2 -27.2 -28.8 -27.5
1056 -26.8 -27.5 -29.0 -27.8
1057 -27.2 -27.8 -29.0 -28.0
1058 -27.5 -27.0 -29.0 -28.0
1059 -27.0 -27.2 -29.0 -27.8

[982 rows x 4 columns]
0        True
1        True
2       False
3        True
4        True
        ...  
1055     True
1056     True
1057     True
1058     True
1059     True
Length: 1060, dtype: bool
         1     2     3     4
0    -24.8 -18.2 -20.8 -18.8
1    -23.5 -18.8 -20.5 -19.8
3    -22.8 -19.2 -20.0 -20.5
4    -23.2 -18.5 -20.0 -18.8
7    -24.8 -18.0 -17.5 -17.2
...    ...   ...   ...   ...
1055 -26.2 -27.2 -28.8 -27.5
1056 -26.8 -27.5 -29.0 -27.8
1057 -27.2 -27.8 -29.0 -28.0
1058 -27.5 -27.0 -29.0 -28.0
1059 -27.0 -27.2 -29.0 -27.8

[982 rows x 4 columns]
# Fill the null values (an interpolation-based alternative is sketched after the output below)
data = df2.fillna(method='ffill', axis=0).fillna(method='bfill', axis=0) # forward-fill once, then backward-fill once, so neither the head nor the tail of each column is left missing
print(data.isnull().any(axis=0)) # verify the fill is complete: any checks for a remaining True; none of the 4 columns contains one, so all return False
1    False
2    False
3    False
4    False
dtype: bool
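  • For an evenly spaced time series such as this one, a sketch of another option (my own assumption, not from the original): interpolate between neighbouring readings instead of copying them, then back-fill any leading NaNs that have no earlier value.
data = df2.interpolate(method='linear', axis=0).fillna(method='bfill', axis=0) # linear interpolation, then back-fill the head
print(data.isnull().any(axis=0)) # verify the fill is complete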

Handling Duplicate Data

# Build a data source that contains duplicate rows
df3 = DataFrame(np.random.randint(0,100,size=(8,4)))
df3.iloc[2] = [0, 0, 0, 0]
df3.iloc[5] = [0, 0, 0, 0]
print(df3)
    0   1   2   3
0  77  77  99  27
1  70  38  58  77
2   0   0   0   0
3  93  73  10   8
4  66  45  98  29
5   0   0   0   0
6  14  40  61  81
7  67  10  92  42
# Use drop_duplicates (a subset-based variant is sketched after the output below)
print(df3.drop_duplicates(keep='first')) # keep chooses which occurrence of a duplicate to retain; 'last' keeps the last one instead
print(df3.drop_duplicates(keep=False))   # keep=False drops every duplicated row
    0   1   2   3
0  77  77  99  27
1  70  38  58  77
2   0   0   0   0
3  93  73  10   8
4  66  45  98  29
6  14  40  61  81
7  67  10  92  42
    0   1   2   3
0  77  77  99  27
1  70  38  58  77
3  93  73  10   8
4  66  45  98  29
6  14  40  61  81
7  67  10  92  42
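  • A minimal sketch (my own addition) of two related options: duplicated() returns the boolean mask of repeated rows, and subset= restricts the comparison to particular columns.
print(df3.duplicated(keep='first'))                     # True for every repeated row after its first occurrence
print(df3.drop_duplicates(subset=[0, 1], keep='first')) # treat rows as duplicates when columns 0 and 1 match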

Handling Abnormal Data

  • Build a data source of 1000 rows and 3 columns (A, B, C) with values in the range 0-1, then clean out the abnormal values in column C, defined as values greater than twice the column's standard deviation (an alternative based on drop is sketched after the output below)
df4 = DataFrame(np.random.random(size=(1000, 3)), columns=['A', 'B', 'C'])
print(df4)
# The condition that marks a value as abnormal
twice_std = df4['C'].std() * 2
print(df4['C'] > twice_std)    # here True marks an abnormal value
print(~(df4['C'] > twice_std)) # negating it makes False the abnormal values; indexing with the True rows keeps the normal ones
print(df4.loc[~(df4['C'] > twice_std)])
            A         B         C
0    0.215693  0.959250  0.528953
1    0.136710  0.114971  0.941449
2    0.295452  0.295154  0.238188
3    0.187619  0.066753  0.999018
4    0.398977  0.446377  0.432628
..        ...       ...       ...
995  0.908120  0.968931  0.765458
996  0.649668  0.783493  0.058060
997  0.605072  0.751010  0.243683
998  0.304012  0.571452  0.004855
999  0.364485  0.937335  0.601111

[1000 rows x 3 columns]
0      False
1       True
2      False
3       True
4      False
       ...  
995     True
996    False
997    False
998    False
999     True
Name: C, Length: 1000, dtype: bool
0       True
1      False
2       True
3      False
4       True
       ...  
995    False
996     True
997     True
998     True
999    False
Name: C, Length: 1000, dtype: bool
            A         B         C
0    0.215693  0.959250  0.528953
2    0.295452  0.295154  0.238188
4    0.398977  0.446377  0.432628
5    0.402174  0.704057  0.327663
6    0.938079  0.919880  0.308305
..        ...       ...       ...
992  0.693904  0.050679  0.171809
994  0.663767  0.883826  0.119215
996  0.649668  0.783493  0.058060
997  0.605072  0.751010  0.243683
998  0.304012  0.571452  0.004855

[574 rows x 3 columns]
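  • An equivalent sketch (my own addition) that mirrors the drop(labels=..., axis=0) pattern used earlier: collect the index of the abnormal rows and drop them explicitly.
abnormal_index = df4.loc[df4['C'] > twice_std].index # row indices where C exceeds twice its standard deviation
print(df4.drop(labels=abnormal_index, axis=0))       # same result as the boolean indexing above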

Download link for the dataset mentioned in this article:
Link: https://pan.baidu.com/s/1lhAiSXBw47MK9QdqwA05Lg
Extraction code: 1111

