1. <strong id="7actg"></strong>
    2. <table id="7actg"></table>

    3. <address id="7actg"></address>
      <address id="7actg"></address>
      1. <object id="7actg"><tt id="7actg"></tt></object>

        清洗數(shù)據(jù),高效率7步走!

        共 10933字,需瀏覽 22分鐘

         ·

        2020-10-08 07:36


        導(dǎo)讀:本文使用泰坦尼克數(shù)據(jù)集,完整介紹數(shù)據(jù)清洗 7 步的具體操作過(guò)程。


        作者:zglg
        來(lái)源:Python與算法社區(qū)(ID:alg-channel)





        數(shù)據(jù)清洗 (data cleaning) 是機(jī)器學(xué)習(xí)和深度學(xué)習(xí)進(jìn)入算法步前的一項(xiàng)重要任務(wù),我平時(shí)比較習(xí)慣使用的 7 個(gè)步驟,總結(jié)如下:


        • Step1 : read csv

        • Step2 : preview data

        • Step3: check null value for every column

        • Step4: complete null value

        • Step5: feature engineering

          • Step 5.1: delete some features

          • Step 5.2: create new feature

        • Step6: encode for categories columns

          • Step 6.1: Sklearn LabelEncode

          • Step 6.2: Pandas get_dummies

        • Step 7: check for data cleaning



        01 讀入數(shù)據(jù)


        這不廢話嗎,第一步就是讀入數(shù)據(jù)。


        data_raw?=?pd.read_csv('../input/titanicdataset-traincsv/train.csv')
        data_raw


        結(jié)果:


        ????PassengerId?Survived????Pclass??Name????Sex?Age?SibSp???Parch???Ticket??Fare????Cabin???Embarked
        0????1???0???3???Braund,?Mr.?Owen?Harris?male????22.0????1???0???A/5?21171???7.2500??NaN?S
        1????2???1???1???Cumings,?Mrs.?John?Bradley?(Florence?Briggs?Th...???female??38.0????1???0???PC?17599????71.2833?C85?C
        2????3???1???3???Heikkinen,?Miss.?Laina??female??26.0????0???0???STON/O2.?3101282????7.9250??NaN?S
        3????4???1???1???Futrelle,?Mrs.?Jacques?Heath?(Lily?May?Peel)????female??35.0????1???0???113803??53.1000?C123????S
        4????5???0???3???Allen,?Mr.?William?Henry????male????35.0????0???0???373450??8.0500??NaN?S
        ...????...?...?...?...?...?...?...?...?...?...?...?...
        886????887?0???2???Montvila,?Rev.?Juozas???male????27.0????0???0???211536??13.0000?NaN?S
        887????888?1???1???Graham,?Miss.?Margaret?Edith????female??19.0????0???0???112053??30.0000?B42?S
        888????889?0???3???Johnston,?Miss.?Catherine?Helen?"Carrie"????female??NaN?1???2???W./C.?6607??23.4500?NaN?S
        889????890?1???1???Behr,?Mr.?Karl?Howell???male????26.0????0???0???111369??30.0000?C148????C
        890????891?0???3???Dooley,?Mr.?Patrick?male????32.0????0???0???370376??7.7500??NaN?Q
        891?rows?×?12?columns



        02 數(shù)據(jù)預(yù)覽


        data_raw.info()
        data_raw.describe(include='all')


        結(jié)果:


        <class?'pandas.core.frame.DataFrame'>
        RangeIndex:
        891?entries,?0?to?890
        Data?columns?(total?12?columns):
        PassengerId????891?non-null?int64
        Survived???????891?non-null?int64
        Pclass?????????891?non-null?int64
        Name???????????891?non-null?object
        Sex????????????891?non-null?object
        Age????????????714?non-null?float64
        SibSp??????????891?non-null?int64
        Parch??????????891?non-null?int64
        Ticket?????????891?non-null?object
        Fare???????????891?non-null?float64
        Cabin??????????204?non-null?object
        Embarked???????889?non-null?object
        dtypes:?float64(2),?int64(5),?object(5)
        memory?usage:?83.7+?KB
        PassengerId????Survived????Pclass??Name????Sex?Age?SibSp???Parch???Ticket??Fare????Cabin???Embarked
        count????891.000000891.000000891.000000891891714.000000891.000000891.000000891891.000000204889
        unique????NaN?NaN?NaN?8912????NaN?NaN?NaN?681?NaN?1473
        top????NaN?NaN?NaN?Hakkarainen,?Mr.?Pekka?Pietari??male????NaN?NaN?NaN?1601????NaN?G6??S
        freq????NaN?NaN?NaN?1577????NaN?NaN?NaN?7???NaN?4644
        mean????446.0000000.3838382.308642??NaN?NaN?29.6991180.5230080.381594???NaN?32.204208???NaN?NaN
        std????257.3538420.4865920.836071??NaN?NaN?14.5264971.1027430.806057???NaN?49.693429???NaN?NaN
        min????1.0000000.0000001.000000????NaN?NaN?0.4200000.0000000.000000????NaN?0.000000????NaN?NaN
        25%????223.5000000.0000002.000000??NaN?NaN?20.1250000.0000000.000000???NaN?7.910400????NaN?NaN
        50%????446.0000000.0000003.000000??NaN?NaN?28.0000000.0000000.000000???NaN?14.454200???NaN?NaN
        75%????668.5000001.0000003.000000??NaN?NaN?38.0000001.0000000.000000???NaN?31.000000???NaN?NaN
        max????891.0000001.0000003.000000??NaN?NaN?80.0000008.0000006.000000???NaN?512.329200??NaN?N



        03 檢查null值


        data1?=?data_raw.copy(deep=True)

        data1.isnull().sum()


        結(jié)果:


        PassengerId??????0
        Survived?????????0
        Pclass???????????0
        Name?????????????0
        Sex??????????????0
        Age????????????177
        SibSp????????????0
        Parch????????????0
        Ticket???????????0
        Fare?????????????0
        Cabin??????????687
        Embarked?????????2
        dtype:?int64


        Age 列 177 個(gè)空值,Cabin 687 個(gè)空值,一共才 891 行,估計(jì)沒(méi)啥價(jià)值了!Embarked 2 個(gè)。



        04 補(bǔ)全空值


        data1['Age'].fillna(data1['Age'].median(),?inplace?=?True)
        data1['Embarked'].fillna(data1['Embarked'].mode()[0],?inplace?=?True)

        data1.isnull().sum()


        補(bǔ)全操作check:


        PassengerId??????0
        Survived?????????0
        Pclass???????????0
        Name?????????????0
        Sex??????????????0
        Age??????????????0
        SibSp????????????0
        Parch????????????0
        Ticket???????????0
        Fare?????????????0
        Cabin??????????687
        Embarked?????????0
        dtype:?int64



        05 特征工程


        1. 干掉 3 列


        drop_column?=?['PassengerId','Cabin',?'Ticket']
        data1.drop(drop_column,?axis=1,?inplace?=?True)


        2. 增加 3 列

        增加一列?FamilySize

        data1['FamilySize']?=?data1?['SibSp']?+?data1['Parch']?+?1
        data1


        打印結(jié)果:


        Survived????Pclass??Name????Sex?Age?SibSp???Parch???Fare????Embarked????FamilySize
        003????Braund,?Mr.?Owen?Harris?male????22.0107.2500????S???2
        111????Cumings,?Mrs.?John?Bradley?(Florence?Briggs?Th...???female??38.01071.2833???C???2
        213????Heikkinen,?Miss.?Laina??female??26.0007.9250????S???1
        311????Futrelle,?Mrs.?Jacques?Heath?(Lily?May?Peel)????female??35.01053.1000???S???2
        403????Allen,?Mr.?William?Henry????male????35.0008.0500????S???1
        ...????...?...?...?...?...?...?...?...?...?...
        88602????Montvila,?Rev.?Juozas???male????27.00013.0000???S???1
        88711????Graham,?Miss.?Margaret?Edith????female??19.00030.0000???S???1
        88803????Johnston,?Miss.?Catherine?Helen?"Carrie"????female??28.01223.4500???S???4
        88911????Behr,?Mr.?Karl?Howell???male????26.00030.0000???C???1
        89003????Dooley,?Mr.?Patrick?male????32.0007.7500????Q???1
        891?rows?×?10?columns


        再創(chuàng)建一列:


        data1['IsAlone']?=?np.where(data1['FamilySize']?>?1,0,1)


        再創(chuàng)建一列:


        data1['Title']?=?data1['Name'].str.split(",?",?expand=True)[1].str.split(".",?expand=True)[0]
        data1


        結(jié)果:


        Survived????Pclass??Name????Sex?Age?SibSp???Parch???Fare????Embarked????FamilySize??IsAlone?Title
        003????Braund,?Mr.?Owen?Harris?male????22.0107.2500????S???20??Mr
        111????Cumings,?Mrs.?John?Bradley?(Florence?Briggs?Th...???female??38.01071.2833???C???20??Mrs
        213????Heikkinen,?Miss.?Laina??female??26.0007.9250????S???11??Miss
        311????Futrelle,?Mrs.?Jacques?Heath?(Lily?May?Peel)????female??35.01053.1000???S???20??Mrs
        403????Allen,?Mr.?William?Henry????male????35.0008.0500????S???11??Mr
        ...????...?...?...?...?...?...?...?...?...?...?...?...
        88602????Montvila,?Rev.?Juozas???male????27.00013.0000???S???11??Rev
        88711????Graham,?Miss.?Margaret?Edith????female??19.00030.0000???S???11??Miss
        88803????Johnston,?Miss.?Catherine?Helen?"Carrie"????female??28.01223.4500???S???40??Miss
        88911????Behr,?Mr.?Karl?Howell???male????26.00030.0000???C???11??Mr
        89003????Dooley,?Mr.?Patrick?male????32.0007.7500????Q???11??Mr
        891?rows?×?12?columns


        3. 分箱走起


        data1['FareCut']?=?pd.qcut(data1['Fare'],?4)
        data1['AgeCut']?=?pd.cut(data1['Age'].astype(int),?6)
        data1


        結(jié)果:


        ????Survived????Pclass??Name????Sex?Age?SibSp???Parch???Fare????Embarked????FamilySize??IsAlone?Title???FareCut?AgeCut
        0????0???3???Braund,?Mr.?Owen?Harris?male????22.0????1???0???7.2500??S???2???0???Mr??(-0.001,?7.91]??(13.333,?26.667]
        1????1???1???Cumings,?Mrs.?John?Bradley?(Florence?Briggs?Th...???female??38.0????1???0???71.2833?C???2???0???Mrs?(31.0,?512.329]?(26.667,?40.0]
        2????1???3???Heikkinen,?Miss.?Laina??female??26.0????0???0???7.9250??S???1???1???Miss????(7.91,?14.454]??(13.333,?26.667]
        3????1???1???Futrelle,?Mrs.?Jacques?Heath?(Lily?May?Peel)????female??35.0????1???0???53.1000?S???2???0???Mrs?(31.0,?512.329]?(26.667,?40.0]
        4????0???3???Allen,?Mr.?William?Henry????male????35.0????0???0???8.0500??S???1???1???Mr??(7.91,?14.454]??(26.667,?40.0]
        ...????...?...?...?...?...?...?...?...?...?...?...?...?...?...
        886????0???2???Montvila,?Rev.?Juozas???male????27.0????0???0???13.0000?S???1???1???Rev?(7.91,?14.454]??(26.667,?40.0]
        887????1???1???Graham,?Miss.?Margaret?Edith????female??19.0????0???0???30.0000?S???1???1???Miss????(14.454,?31.0]??(13.333,?26.667]
        888????0???3???Johnston,?Miss.?Catherine?Helen?"Carrie"????female??28.0????1???2???23.4500?S???4???0???Miss????(14.454,?31.0]??(26.667,?40.0]
        889????1???1???Behr,?Mr.?Karl?Howell???male????26.0????0???0???30.0000?C???1???1???Mr??(14.454,?31.0]??(13.333,?26.667]
        890????0???3???Dooley,?Mr.?Patrick?male????32.0????0???0???7.7500??Q???1???1???Mr??(-0.001,?7.91]??(26.667,?40.0]
        891?rows?×?14?columns



        06 編碼


        1. LabelEncoder 方法


        使用 Sklearn 的 LabelEncoder


        from?sklearn.preprocessing?import?LabelEncoder


        label?=?LabelEncoder()
        data1['Sex_Code']?=?label.fit_transform(data1['Sex'])
        data1['Embarked_Code']?=?label.fit_transform(data1['Embarked'])
        data1['Title_Code']?=?label.fit_transform(data1['Title'])
        data1['AgeBin_Code']?=?label.fit_transform(data1['AgeCut'])
        data1['FareBin_Code']?=?label.fit_transform(data1['FareCut'])
        data1


        結(jié)果 data1 選取某些列,算法模型終于能認(rèn)出它們了,多不容易!


        2. get_dummies 方法


        get_dummies 將長(zhǎng) DataFrame 變?yōu)閷?DataFrame:


        pd.get_dummies(data1['Sex'])


        結(jié)果:


        female????male
        0????0???1
        1????1???0
        2????1???0
        3????1???0
        4????0???1
        ...????...?...
        886????0???1
        887????1???0
        888????1???0
        889????0???1
        890????0???1
        891?rows?×?2?columns


        而 LabelEncoder 編碼后,僅僅是把 Female 編碼為 0, male 編碼為 1.


        label.fit_transform(data1['Sex'])
        0??????1
        1??????0
        2??????0
        3??????0
        4??????1
        ??????..
        886????1
        887????0
        888????0
        889????1
        890????1
        Name:?Sex_Code,?Length:?891,?dtype:?int64



        07 再 check


        #?Step?7:?data?cleaning?check
        data1[data1_x_alg].info()
        print('-'*50)
        data1_dummy.info()


        結(jié)果:


        <class?'pandas.core.frame.DataFrame'>
        RangeIndex:
        891?entries,?0?to?890
        Data?columns?(total?8?columns):
        Sex_Code?????????891?non-null?int64
        Pclass???????????891?non-null?int64
        Embarked_Code????891?non-null?int64
        Title_Code???????891?non-null?int64
        SibSp????????????891?non-null?int64
        Parch????????????891?non-null?int64
        Age??????????????891?non-null?float64
        Fare?????????????891?non-null?float64
        dtypes:?float64(2),?int64(6)
        memory?usage:?55.8?KB
        --------------------------------------------------
        <class?'pandas.core.frame.DataFrame'>
        RangeIndex:
        891?entries,?0?to?890
        Data?columns?(total?29?columns):
        Pclass????????????????891?non-null?int64
        SibSp?????????????????891?non-null?int64
        Parch?????????????????891?non-null?int64
        Age???????????????????891?non-null?float64
        Fare??????????????????891?non-null?float64
        FamilySize????????????891?non-null?int64
        IsAlone???????????????891?non-null?int64
        Sex_female????????????891?non-null?uint8
        Sex_male??????????????891?non-null?uint8
        Embarked_C????????????891?non-null?uint8
        Embarked_Q????????????891?non-null?uint8
        Embarked_S????????????891?non-null?uint8
        Title_Capt????????????891?non-null?uint8
        Title_Col?????????????891?non-null?uint8
        Title_Don?????????????891?non-null?uint8
        Title_Dr??????????????891?non-null?uint8
        Title_Jonkheer????????891?non-null?uint8
        Title_Lady????????????891?non-null?uint8
        Title_Major???????????891?non-null?uint8
        Title_Master??????????891?non-null?uint8
        Title_Miss????????????891?non-null?uint8
        Title_Mlle????????????891?non-null?uint8
        Title_Mme?????????????891?non-null?uint8
        Title_Mr??????????????891?non-null?uint8
        Title_Mrs?????????????891?non-null?uint8
        Title_Ms??????????????891?non-null?uint8
        Title_Rev?????????????891?non-null?uint8
        Title_Sir?????????????891?non-null?uint8
        Title_the?Countess????891?non-null?uint8
        dtypes:?float64(2),?int64(5),?uint8(22)
        memory?usage:?68.0?KB


        Great !


        Done~?



        劃重點(diǎn)?


        干貨直達(dá)?


        更多精彩?

        在公眾號(hào)對(duì)話框輸入以下關(guān)鍵詞
        查看更多優(yōu)質(zhì)內(nèi)容!

        PPT?|?讀書(shū)?|?書(shū)單?|?硬核?|?干貨?|?講明白?|?神操作
        大數(shù)據(jù)?|?云計(jì)算?|?數(shù)據(jù)庫(kù)?|?Python?|?可視化
        AI?|?人工智能?|?機(jī)器學(xué)習(xí)?|?深度學(xué)習(xí)?|?NLP
        5G?|?中臺(tái)?|?用戶畫(huà)像?|?1024?|?數(shù)學(xué)?|?算法?|?數(shù)字孿生

        據(jù)統(tǒng)計(jì),99%的大咖都完成了這個(gè)神操作
        ?


        瀏覽 34
        點(diǎn)贊
        評(píng)論
        收藏
        分享

        手機(jī)掃一掃分享

        分享
        舉報(bào)
        評(píng)論
        圖片
        表情
        推薦
        點(diǎn)贊
        評(píng)論
        收藏
        分享

        手機(jī)掃一掃分享

        分享
        舉報(bào)
        1. <strong id="7actg"></strong>
        2. <table id="7actg"></table>

        3. <address id="7actg"></address>
          <address id="7actg"></address>
          1. <object id="7actg"><tt id="7actg"></tt></object>
            园产乱人乱偷精品视频 | 超碰永久 | 午夜激情婷婷 | japanese丰满奶水 | 婷婷五月天在线播放 | 亚洲人一级电影 | 欧美日韩国产成人免费 | 巨大黑人videos精品hd | 性爽交视频 | 天天干在线播放 |