清洗數(shù)據(jù),高效率7步走!

導(dǎo)讀:本文使用泰坦尼克數(shù)據(jù)集,完整介紹數(shù)據(jù)清洗 7 步的具體操作過(guò)程。

數(shù)據(jù)清洗 (data cleaning) 是機(jī)器學(xué)習(xí)和深度學(xué)習(xí)進(jìn)入算法步前的一項(xiàng)重要任務(wù),我平時(shí)比較習(xí)慣使用的 7 個(gè)步驟,總結(jié)如下:
Step1 : read csv
Step2 : preview data
Step3: check null value for every column
Step4: complete null value
Step5: feature engineering
Step 5.1: delete some features
Step 5.2: create new feature
Step6: encode for categories columns
Step 6.1: Sklearn LabelEncode
Step 6.2: Pandas get_dummies
Step 7: check for data cleaning
01 讀入數(shù)據(jù)
這不廢話嗎,第一步就是讀入數(shù)據(jù)。
data_raw?=?pd.read_csv('../input/titanicdataset-traincsv/train.csv')
data_raw結(jié)果:
????PassengerId?Survived????Pclass??Name????Sex?Age?SibSp???Parch???Ticket??Fare????Cabin???Embarked
0????1???0???3???Braund,?Mr.?Owen?Harris?male????22.0????1???0???A/5?21171???7.2500??NaN?S
1????2???1???1???Cumings,?Mrs.?John?Bradley?(Florence?Briggs?Th...???female??38.0????1???0???PC?17599????71.2833?C85?C
2????3???1???3???Heikkinen,?Miss.?Laina??female??26.0????0???0???STON/O2.?3101282????7.9250??NaN?S
3????4???1???1???Futrelle,?Mrs.?Jacques?Heath?(Lily?May?Peel)????female??35.0????1???0???113803??53.1000?C123????S
4????5???0???3???Allen,?Mr.?William?Henry????male????35.0????0???0???373450??8.0500??NaN?S
...????...?...?...?...?...?...?...?...?...?...?...?...
886????887?0???2???Montvila,?Rev.?Juozas???male????27.0????0???0???211536??13.0000?NaN?S
887????888?1???1???Graham,?Miss.?Margaret?Edith????female??19.0????0???0???112053??30.0000?B42?S
888????889?0???3???Johnston,?Miss.?Catherine?Helen?"Carrie"????female??NaN?1???2???W./C.?6607??23.4500?NaN?S
889????890?1???1???Behr,?Mr.?Karl?Howell???male????26.0????0???0???111369??30.0000?C148????C
890????891?0???3???Dooley,?Mr.?Patrick?male????32.0????0???0???370376??7.7500??NaN?Q
891?rows?×?12?columns02 數(shù)據(jù)預(yù)覽
data_raw.info()
data_raw.describe(include='all')結(jié)果:
<class?'pandas.core.frame.DataFrame'>
RangeIndex:891?entries,?0?to?890
Data?columns?(total?12?columns):
PassengerId????891?non-null?int64
Survived???????891?non-null?int64
Pclass?????????891?non-null?int64
Name???????????891?non-null?object
Sex????????????891?non-null?object
Age????????????714?non-null?float64
SibSp??????????891?non-null?int64
Parch??????????891?non-null?int64
Ticket?????????891?non-null?object
Fare???????????891?non-null?float64
Cabin??????????204?non-null?object
Embarked???????889?non-null?object
dtypes:?float64(2),?int64(5),?object(5)
memory?usage:?83.7+?KB
PassengerId????Survived????Pclass??Name????Sex?Age?SibSp???Parch???Ticket??Fare????Cabin???Embarked
count????891.000000891.000000891.000000891891714.000000891.000000891.000000891891.000000204889
unique????NaN?NaN?NaN?8912????NaN?NaN?NaN?681?NaN?1473
top????NaN?NaN?NaN?Hakkarainen,?Mr.?Pekka?Pietari??male????NaN?NaN?NaN?1601????NaN?G6??S
freq????NaN?NaN?NaN?1577????NaN?NaN?NaN?7???NaN?4644
mean????446.0000000.3838382.308642??NaN?NaN?29.6991180.5230080.381594???NaN?32.204208???NaN?NaN
std????257.3538420.4865920.836071??NaN?NaN?14.5264971.1027430.806057???NaN?49.693429???NaN?NaN
min????1.0000000.0000001.000000????NaN?NaN?0.4200000.0000000.000000????NaN?0.000000????NaN?NaN
25%????223.5000000.0000002.000000??NaN?NaN?20.1250000.0000000.000000???NaN?7.910400????NaN?NaN
50%????446.0000000.0000003.000000??NaN?NaN?28.0000000.0000000.000000???NaN?14.454200???NaN?NaN
75%????668.5000001.0000003.000000??NaN?NaN?38.0000001.0000000.000000???NaN?31.000000???NaN?NaN
max????891.0000001.0000003.000000??NaN?NaN?80.0000008.0000006.000000???NaN?512.329200??NaN?N03 檢查null值
data1?=?data_raw.copy(deep=True)
data1.isnull().sum()結(jié)果:
PassengerId??????0
Survived?????????0
Pclass???????????0
Name?????????????0
Sex??????????????0
Age????????????177
SibSp????????????0
Parch????????????0
Ticket???????????0
Fare?????????????0
Cabin??????????687
Embarked?????????2
dtype:?int64Age 列 177 個(gè)空值,Cabin 687 個(gè)空值,一共才 891 行,估計(jì)沒(méi)啥價(jià)值了!Embarked 2 個(gè)。
04 補(bǔ)全空值
data1['Age'].fillna(data1['Age'].median(),?inplace?=?True)
data1['Embarked'].fillna(data1['Embarked'].mode()[0],?inplace?=?True)
data1.isnull().sum()補(bǔ)全操作check:
PassengerId??????0
Survived?????????0
Pclass???????????0
Name?????????????0
Sex??????????????0
Age??????????????0
SibSp????????????0
Parch????????????0
Ticket???????????0
Fare?????????????0
Cabin??????????687
Embarked?????????0
dtype:?int6405 特征工程
1. 干掉 3 列
drop_column?=?['PassengerId','Cabin',?'Ticket']
data1.drop(drop_column,?axis=1,?inplace?=?True)2. 增加 3 列
增加一列?FamilySize
data1['FamilySize']?=?data1?['SibSp']?+?data1['Parch']?+?1
data1打印結(jié)果:
Survived????Pclass??Name????Sex?Age?SibSp???Parch???Fare????Embarked????FamilySize
003????Braund,?Mr.?Owen?Harris?male????22.0107.2500????S???2
111????Cumings,?Mrs.?John?Bradley?(Florence?Briggs?Th...???female??38.01071.2833???C???2
213????Heikkinen,?Miss.?Laina??female??26.0007.9250????S???1
311????Futrelle,?Mrs.?Jacques?Heath?(Lily?May?Peel)????female??35.01053.1000???S???2
403????Allen,?Mr.?William?Henry????male????35.0008.0500????S???1
...????...?...?...?...?...?...?...?...?...?...
88602????Montvila,?Rev.?Juozas???male????27.00013.0000???S???1
88711????Graham,?Miss.?Margaret?Edith????female??19.00030.0000???S???1
88803????Johnston,?Miss.?Catherine?Helen?"Carrie"????female??28.01223.4500???S???4
88911????Behr,?Mr.?Karl?Howell???male????26.00030.0000???C???1
89003????Dooley,?Mr.?Patrick?male????32.0007.7500????Q???1
891?rows?×?10?columns再創(chuàng)建一列:
data1['IsAlone']?=?np.where(data1['FamilySize']?>?1,0,1)再創(chuàng)建一列:
data1['Title']?=?data1['Name'].str.split(",?",?expand=True)[1].str.split(".",?expand=True)[0]
data1結(jié)果:
Survived????Pclass??Name????Sex?Age?SibSp???Parch???Fare????Embarked????FamilySize??IsAlone?Title
003????Braund,?Mr.?Owen?Harris?male????22.0107.2500????S???20??Mr
111????Cumings,?Mrs.?John?Bradley?(Florence?Briggs?Th...???female??38.01071.2833???C???20??Mrs
213????Heikkinen,?Miss.?Laina??female??26.0007.9250????S???11??Miss
311????Futrelle,?Mrs.?Jacques?Heath?(Lily?May?Peel)????female??35.01053.1000???S???20??Mrs
403????Allen,?Mr.?William?Henry????male????35.0008.0500????S???11??Mr
...????...?...?...?...?...?...?...?...?...?...?...?...
88602????Montvila,?Rev.?Juozas???male????27.00013.0000???S???11??Rev
88711????Graham,?Miss.?Margaret?Edith????female??19.00030.0000???S???11??Miss
88803????Johnston,?Miss.?Catherine?Helen?"Carrie"????female??28.01223.4500???S???40??Miss
88911????Behr,?Mr.?Karl?Howell???male????26.00030.0000???C???11??Mr
89003????Dooley,?Mr.?Patrick?male????32.0007.7500????Q???11??Mr
891?rows?×?12?columns3. 分箱走起
data1['FareCut']?=?pd.qcut(data1['Fare'],?4)
data1['AgeCut']?=?pd.cut(data1['Age'].astype(int),?6)
data1結(jié)果:
????Survived????Pclass??Name????Sex?Age?SibSp???Parch???Fare????Embarked????FamilySize??IsAlone?Title???FareCut?AgeCut
0????0???3???Braund,?Mr.?Owen?Harris?male????22.0????1???0???7.2500??S???2???0???Mr??(-0.001,?7.91]??(13.333,?26.667]
1????1???1???Cumings,?Mrs.?John?Bradley?(Florence?Briggs?Th...???female??38.0????1???0???71.2833?C???2???0???Mrs?(31.0,?512.329]?(26.667,?40.0]
2????1???3???Heikkinen,?Miss.?Laina??female??26.0????0???0???7.9250??S???1???1???Miss????(7.91,?14.454]??(13.333,?26.667]
3????1???1???Futrelle,?Mrs.?Jacques?Heath?(Lily?May?Peel)????female??35.0????1???0???53.1000?S???2???0???Mrs?(31.0,?512.329]?(26.667,?40.0]
4????0???3???Allen,?Mr.?William?Henry????male????35.0????0???0???8.0500??S???1???1???Mr??(7.91,?14.454]??(26.667,?40.0]
...????...?...?...?...?...?...?...?...?...?...?...?...?...?...
886????0???2???Montvila,?Rev.?Juozas???male????27.0????0???0???13.0000?S???1???1???Rev?(7.91,?14.454]??(26.667,?40.0]
887????1???1???Graham,?Miss.?Margaret?Edith????female??19.0????0???0???30.0000?S???1???1???Miss????(14.454,?31.0]??(13.333,?26.667]
888????0???3???Johnston,?Miss.?Catherine?Helen?"Carrie"????female??28.0????1???2???23.4500?S???4???0???Miss????(14.454,?31.0]??(26.667,?40.0]
889????1???1???Behr,?Mr.?Karl?Howell???male????26.0????0???0???30.0000?C???1???1???Mr??(14.454,?31.0]??(13.333,?26.667]
890????0???3???Dooley,?Mr.?Patrick?male????32.0????0???0???7.7500??Q???1???1???Mr??(-0.001,?7.91]??(26.667,?40.0]
891?rows?×?14?columns06 編碼
1. LabelEncoder 方法
使用 Sklearn 的 LabelEncoder
from?sklearn.preprocessing?import?LabelEncoder
label?=?LabelEncoder()
data1['Sex_Code']?=?label.fit_transform(data1['Sex'])
data1['Embarked_Code']?=?label.fit_transform(data1['Embarked'])
data1['Title_Code']?=?label.fit_transform(data1['Title'])
data1['AgeBin_Code']?=?label.fit_transform(data1['AgeCut'])
data1['FareBin_Code']?=?label.fit_transform(data1['FareCut'])
data1結(jié)果 data1 選取某些列,算法模型終于能認(rèn)出它們了,多不容易!
2. get_dummies 方法
get_dummies 將長(zhǎng) DataFrame 變?yōu)閷?DataFrame:
pd.get_dummies(data1['Sex'])結(jié)果:
female????male
0????0???1
1????1???0
2????1???0
3????1???0
4????0???1
...????...?...
886????0???1
887????1???0
888????1???0
889????0???1
890????0???1
891?rows?×?2?columns而 LabelEncoder 編碼后,僅僅是把 Female 編碼為 0, male 編碼為 1.
label.fit_transform(data1['Sex'])
0??????1
1??????0
2??????0
3??????0
4??????1
??????..
886????1
887????0
888????0
889????1
890????1
Name:?Sex_Code,?Length:?891,?dtype:?int6407 再 check
#?Step?7:?data?cleaning?check
data1[data1_x_alg].info()
print('-'*50)
data1_dummy.info()結(jié)果:
<class?'pandas.core.frame.DataFrame'>
RangeIndex:891?entries,?0?to?890
Data?columns?(total?8?columns):
Sex_Code?????????891?non-null?int64
Pclass???????????891?non-null?int64
Embarked_Code????891?non-null?int64
Title_Code???????891?non-null?int64
SibSp????????????891?non-null?int64
Parch????????????891?non-null?int64
Age??????????????891?non-null?float64
Fare?????????????891?non-null?float64
dtypes:?float64(2),?int64(6)
memory?usage:?55.8?KB
--------------------------------------------------
<class?'pandas.core.frame.DataFrame'>
RangeIndex:891?entries,?0?to?890
Data?columns?(total?29?columns):
Pclass????????????????891?non-null?int64
SibSp?????????????????891?non-null?int64
Parch?????????????????891?non-null?int64
Age???????????????????891?non-null?float64
Fare??????????????????891?non-null?float64
FamilySize????????????891?non-null?int64
IsAlone???????????????891?non-null?int64
Sex_female????????????891?non-null?uint8
Sex_male??????????????891?non-null?uint8
Embarked_C????????????891?non-null?uint8
Embarked_Q????????????891?non-null?uint8
Embarked_S????????????891?non-null?uint8
Title_Capt????????????891?non-null?uint8
Title_Col?????????????891?non-null?uint8
Title_Don?????????????891?non-null?uint8
Title_Dr??????????????891?non-null?uint8
Title_Jonkheer????????891?non-null?uint8
Title_Lady????????????891?non-null?uint8
Title_Major???????????891?non-null?uint8
Title_Master??????????891?non-null?uint8
Title_Miss????????????891?non-null?uint8
Title_Mlle????????????891?non-null?uint8
Title_Mme?????????????891?non-null?uint8
Title_Mr??????????????891?non-null?uint8
Title_Mrs?????????????891?non-null?uint8
Title_Ms??????????????891?non-null?uint8
Title_Rev?????????????891?non-null?uint8
Title_Sir?????????????891?non-null?uint8
Title_the?Countess????891?non-null?uint8
dtypes:?float64(2),?int64(5),?uint8(22)
memory?usage:?68.0?KB
Great !
Done~?


