Kaggle Knowledge: Categorical Feature Handling and Accuracy Comparison

2024-05-16 17:30

Source: Coggle數(shù)據(jù)科學(xué) (Coggle Data Science)

This article is about 1,200 words; suggested reading time is 4 minutes.

This article uses the Ames, Iowa housing dataset for house-price analysis.


In this example, we compare the training time and prediction performance of HistGradientBoostingRegressor under different strategies for encoding categorical features. Specifically, we evaluate the following approaches:

        • 刪除分類特征;
        • 使用 OneHotEncoder;
        • 使用 OrdinalEncoder,將分類特征視為有序、等距的量;
        • 使用 OrdinalEncoder,并依賴于 HistGradientBoostingRegressor 估計(jì)器的原生類別支持。

We will work with the Ames, Iowa housing dataset, which contains both numerical and categorical features; the sale price of the houses is the target variable.

Step 1: Load the dataset

            
from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=42165, as_frame=True, return_X_y=True)

# Select only a subset of features of X to make the example faster to run
categorical_columns_subset = [
    "BldgType", "GarageFinish", "LotConfig", "Functional", "MasVnrType",
    "HouseStyle", "FireplaceQu", "ExterCond", "ExterQual", "PoolQC",
]
numerical_columns_subset = [
    "3SsnPorch", "Fireplaces", "BsmtHalfBath", "HalfBath", "GarageCars",
    "TotRmsAbvGrd", "BsmtFinSF1", "BsmtFinSF2", "GrLivArea", "ScreenPorch",
]

X = X[categorical_columns_subset + numerical_columns_subset]
X[categorical_columns_subset] = X[categorical_columns_subset].astype("category")

categorical_columns = X.select_dtypes(include="category").columns
n_categorical_features = len(categorical_columns)
n_numerical_features = X.select_dtypes(include="number").shape[1]

print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Number of categorical features: {n_categorical_features}")
print(f"Number of numerical features: {n_numerical_features}")


Step 2: Baseline model (drop the categorical features)

            
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

dropper = make_column_transformer(
    ("drop", make_column_selector(dtype_include="category")),
    remainder="passthrough",
)
hist_dropped = make_pipeline(dropper, HistGradientBoostingRegressor(random_state=42))


Step 3: One-hot encode the categorical features

from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = make_column_transformer(
    (
        OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
        make_column_selector(dtype_include="category"),
    ),
    remainder="passthrough",
)
hist_one_hot = make_pipeline(
    one_hot_encoder, HistGradientBoostingRegressor(random_state=42)
)

Step 4: Ordinal-encode the categorical features

            
import numpy as np

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        make_column_selector(dtype_include="category"),
    ),
    remainder="passthrough",
    # Use short feature names to make it easier to specify the categorical
    # variables in the HistGradientBoostingRegressor in the next step
    # of the pipeline.
    verbose_feature_names_out=False,
)
hist_ordinal = make_pipeline(
    ordinal_encoder, HistGradientBoostingRegressor(random_state=42)
)


Step 5: Native categorical support

            
hist_native = HistGradientBoostingRegressor(
    random_state=42, categorical_features="from_dtype"
)


Step 6: Compare model speed and accuracy

            
from sklearn.model_selection import cross_validate

scoring = "neg_mean_absolute_percentage_error"
n_cv_folds = 3

dropped_result = cross_validate(hist_dropped, X, y, cv=n_cv_folds, scoring=scoring)
one_hot_result = cross_validate(hist_one_hot, X, y, cv=n_cv_folds, scoring=scoring)
ordinal_result = cross_validate(hist_ordinal, X, y, cv=n_cv_folds, scoring=scoring)
native_result = cross_validate(hist_native, X, y, cv=n_cv_folds, scoring=scoring)
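Each `cross_validate` call returns a dict with `fit_time`, `score_time`, and `test_score` arrays (one entry per fold), which is all that is needed for the speed/accuracy comparison. A minimal sketch of how the results can be summarized (the `summarize` helper and the stand-in result values are ours, not from the original example):

```python
import numpy as np

def summarize(name, result):
    # cross_validate reports the negated MAPE because scikit-learn scorers
    # are maximized; flip the sign back for display.
    mape = -result["test_score"]
    fit_time = result["fit_time"]
    return (
        f"{name}: fit time = {fit_time.mean():.3f}s +/- {fit_time.std():.3f}, "
        f"MAPE = {mape.mean():.4f} +/- {mape.std():.4f}"
    )

# Stand-in dict shaped like cross_validate's output, for illustration only.
fake_result = {
    "fit_time": np.array([0.51, 0.48, 0.50]),
    "score_time": np.array([0.01, 0.01, 0.01]),
    "test_score": np.array([-0.072, -0.069, -0.075]),
}
print(summarize("Dropped", fake_result))
```

In the actual run, the same helper would be applied to `dropped_result`, `one_hot_result`, `ordinal_result`, and `native_result`.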


We can observe that the model using one-hot encoding is clearly the slowest. This is to be expected, since one-hot encoding creates one additional feature per category value (for each categorical feature), so many more split points need to be considered during fitting.

In theory, we expect native handling of categorical features to be slightly slower than treating categories as ordered quantities ('Ordinal'), because native handling requires sorting the categories.

In general, one can expect poorer predictions from one-hot encoded data, especially when tree depth or node count is limited: with one-hot encoded data, more split points, i.e. deeper trees, are needed to recover the equivalent of a single split point that native handling can obtain in one step.

The same holds when categories are treated as ordinal quantities: if the categories are A..F and the best split is ACF - BDE, the one-hot model will need 3 split points (one per category in the left node), while the non-native ordinal model will need 4 splits: 1 split to isolate A, 1 split to isolate F, and 2 splits to isolate C from BCDE.

Code link: https://scikit-learn.org/stable/auto_examples/ensemble/plot_gradient_boosting_categorical.html


Editor: 黃繼彥
