Kaggle Tip: Handling Categorical Variables and Comparing Accuracy
2024-05-16 17:30
Source: Coggle Data Science
In this article, we will analyze house prices using the Ames (Iowa) housing dataset.
In this example, we will compare the training time and prediction performance of HistGradientBoostingRegressor when using different strategies to encode categorical features. Specifically, we will evaluate the following approaches:
- dropping the categorical features;
- using OneHotEncoder;
- using OrdinalEncoder, treating the categories as ordered, equidistant quantities;
- using OrdinalEncoder and relying on the native categorical support of the HistGradientBoostingRegressor estimator.
We will work with the Ames (Iowa) housing dataset, which contains both numerical and categorical features; the target variable is the house sale price.
Step 1: Load the dataset
from sklearn.datasets import fetch_openml

X, y = fetch_openml(data_id=42165, as_frame=True, return_X_y=True)

# Select only a subset of features of X to make the example faster to run
categorical_columns_subset = [
    "BldgType", "GarageFinish", "LotConfig", "Functional", "MasVnrType",
    "HouseStyle", "FireplaceQu", "ExterCond", "ExterQual", "PoolQC",
]
numerical_columns_subset = [
    "3SsnPorch", "Fireplaces", "BsmtHalfBath", "HalfBath", "GarageCars",
    "TotRmsAbvGrd", "BsmtFinSF1", "BsmtFinSF2", "GrLivArea", "ScreenPorch",
]

X = X[categorical_columns_subset + numerical_columns_subset]
X[categorical_columns_subset] = X[categorical_columns_subset].astype("category")

categorical_columns = X.select_dtypes(include="category").columns
n_categorical_features = len(categorical_columns)
n_numerical_features = X.select_dtypes(include="number").shape[1]

print(f"Number of samples: {X.shape[0]}")
print(f"Number of features: {X.shape[1]}")
print(f"Number of categorical features: {n_categorical_features}")
print(f"Number of numerical features: {n_numerical_features}")
Step 2: Baseline model (drop the categorical features)
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.pipeline import make_pipeline

dropper = make_column_transformer(
    ("drop", make_column_selector(dtype_include="category")),
    remainder="passthrough",
)
hist_dropped = make_pipeline(dropper, HistGradientBoostingRegressor(random_state=42))
Step 3: One-hot encode the categorical features
from sklearn.preprocessing import OneHotEncoder

one_hot_encoder = make_column_transformer(
    (
        OneHotEncoder(sparse_output=False, handle_unknown="ignore"),
        make_column_selector(dtype_include="category"),
    ),
    remainder="passthrough",
)
hist_one_hot = make_pipeline(one_hot_encoder, HistGradientBoostingRegressor(random_state=42))
Step 4: Ordinal-encode the categorical features
import numpy as np
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=np.nan),
        make_column_selector(dtype_include="category"),
    ),
    remainder="passthrough",
    # Use short feature names to make it easier to specify the categorical
    # variables in the HistGradientBoostingRegressor in the next step
    # of the pipeline.
    verbose_feature_names_out=False,
)

hist_ordinal = make_pipeline(ordinal_encoder, HistGradientBoostingRegressor(random_state=42))
Step 5: Native categorical support
hist_native = HistGradientBoostingRegressor(
    random_state=42, categorical_features="from_dtype"
)
Step 6: Compare model speed and accuracy
from sklearn.model_selection import cross_validate
scoring = "neg_mean_absolute_percentage_error"
n_cv_folds = 3

dropped_result = cross_validate(hist_dropped, X, y, cv=n_cv_folds, scoring=scoring)
one_hot_result = cross_validate(hist_one_hot, X, y, cv=n_cv_folds, scoring=scoring)
ordinal_result = cross_validate(hist_ordinal, X, y, cv=n_cv_folds, scoring=scoring)
native_result = cross_validate(hist_native, X, y, cv=n_cv_folds, scoring=scoring)
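Each dict returned by cross_validate holds per-fold "fit_time" and "test_score" arrays; since we score with neg_mean_absolute_percentage_error, the scores are negative MAPE values. A small helper to average them into one comparison line per model (the summarize name and formatting are my own, not from the article):

```python
def summarize(name, result):
    """Average the per-fold output of sklearn.model_selection.cross_validate.

    `result` is the dict returned by cross_validate; the test scores are
    negative MAPE, so we negate them for readability.
    """
    fit_time = sum(result["fit_time"]) / len(result["fit_time"])
    mape = -sum(result["test_score"]) / len(result["test_score"])
    return f"{name:>8}: mean fit time {fit_time:.2f}s, mean MAPE {mape:.3f}"

# e.g. print(summarize("dropped", dropped_result)), and likewise for
# one_hot_result, ordinal_result, and native_result
```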
We can observe that the one-hot encoded model is clearly the slowest. This is expected, since one-hot encoding creates one additional feature per category value (for each categorical feature), so many more split points have to be considered during fitting.
In theory, we expect native handling of categorical features to be slightly slower than treating the categories as ordered quantities ("Ordinal"), because native handling requires sorting the categories.
In general, one can expect worse predictions from one-hot encoded data, especially when the tree depth or the number of nodes is limited: with one-hot encoding, more split points, i.e. deeper trees, are needed to recover a split equivalent to a single split point obtained with native handling.
The same applies when categories are treated as ordinal quantities: if the categories are A..F and the best split is ACF - BDE, the one-hot model will need 3 split points (one per category in the left node), while the non-native ordinal model will need 4 splits: 1 split to isolate A, 1 split to isolate F, and 2 splits to isolate C from BCDE.
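The split counts in this example can be checked mechanically: under an ordinal encoding, a tree needs one threshold at every boundary where adjacent categories fall on different sides of the partition, while one-hot encoding needs one split per category isolated into the left node. A short sketch (the helper names are mine, for illustration only):

```python
def ordinal_split_count(left, ordered_categories):
    # With categories encoded as 0..k-1 in the given order, a threshold
    # split is needed at every boundary where left-node membership
    # changes between adjacent categories.
    member = [c in left for c in ordered_categories]
    return sum(a != b for a, b in zip(member, member[1:]))

def one_hot_split_count(left):
    # One split per one-hot column of a category in the left node.
    return len(left)

left = {"A", "C", "F"}                      # best split: ACF | BDE
print(ordinal_split_count(left, "ABCDEF"))  # 4 splits for the ordinal model
print(one_hot_split_count(left))            # 3 splits for the one-hot model
# native categorical handling needs just 1 split
```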
Editor: 黃繼彥
