[012] Implementation and Comparison of Different Dataset Splitting and Validation Methods

We train our models hoping for more accurate results, but practical constraints (such as compute and time) usually force us to choose a splitting strategy. This article introduces several common strategies for dataset splitting and cross-validation together with their pros and cons, mainly covering the train-test split, k-fold cross-validation, and Leave One Out Cross-validation, including code-level implementations and a comparison of their results. It is well worth reading through once.
What is Model evaluation?
Model evaluation is a set of procedures that allows you to pick the best possible stable model. It is an essential part of the model development process: it reveals the model’s behavior and its predictive power, and indicates the balance between bias and variance on unseen data. As a starting point, split the given dataset into a train set and a test set. The model learns to predict using the train set, while the test set is used to assess the model’s performance.
Methods used for splitting
There are different strategies for splitting the data, and it is important to do it fairly, taking into account the special characteristics the attributes may have. For example, you could get biased predictions if the original data has a class imbalance, so for each case a specific method might be recommended.
The main methods covered in this article are the following:
Train-test split
k-fold cross-validation (K-Fold)
Leave One Out Cross-validation (LOOCV)
Train-test split

regular train-test split using sklearn — image by the author
It is a way to split the dataset into two parts according to a specific percentage. It is easy and quick, and it might be appropriate when comparing different algorithms to decide which one to consider further.
The train_test_split method within the sklearn.model_selection module is widely utilized to split the original dataset. A common split ratio is 80/20 or 70/30.
“You can split the training set into train and validation sets with the same split ratio as above” — Stack Overflow discussion
I used stratify here because the original dataset has an imbalance in the target class (500/268).
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# set the seed to keep code reproducibility
seed = 7

# apply the splitting (X, y hold the features and the target of the original dataset)
x_train, x_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.33,
    stratify=y,  # to keep class balance during splitting
    random_state=seed
)

model = LogisticRegression(max_iter=1000)
model.fit(x_train, y_train)
result = model.score(x_test, y_test)
print(f'accuracy is: {result*100.0: 0.3f}')
The resulting accuracy is: 76.378
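Following the Stack Overflow note quoted above, the training set could be split once more to carve out a validation set. A minimal sketch, reusing x_train and y_train from the split above (the names x_tr/x_val are only illustrative, not from the original article):

# split the training data again to obtain a validation set (same ratio as before)
x_tr, x_val, y_tr, y_val = train_test_split(
    x_train, y_train,
    test_size=0.33,
    stratify=y_train,  # keep the class balance in the validation set as well
    random_state=seed
)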
Pros:
Easy to implement
Quick execution, less computation time
Cons:
Inaccurate accuracy estimate if the split is not random
Might be a cause of underfitting if the original dataset has limited data points
K-fold cross-validation

k-fold split procedure — image by the author
To enhance the model accuracy estimate and avoid the disadvantages of the regular split, we need to add more generalization to the splitting process. In this strategy, we repeat the train-test split multiple times at random. For each split, or fold, the accuracy is calculated; the algorithm then aggregates the accuracies from all splits and averages them. That way, all the data points are involved in measuring the model accuracy, which gives a better estimate.
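As a rough sketch of what this loop looks like under the hood (assuming X and y are NumPy arrays and seed is the one defined earlier; this plain stratified 10-fold loop is illustrative and not part of the original code):

import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression

# one model fit per fold; every point lands in the test fold exactly once
kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
fold_accuracies = []
for train_idx, test_idx in kf.split(X, y):
    m = LogisticRegression(max_iter=1000)
    m.fit(X[train_idx], y[train_idx])  # assumes X, y support NumPy-style indexing
    fold_accuracies.append(m.score(X[test_idx], y[test_idx]))
print('mean accuracy over folds: %.3f' % np.mean(fold_accuracies))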
For this example, we will use the RepeatedStratifiedKFold() within the sklearn library to assess the model since it repeats stratified folds n-times with a different random scheme in each iteration.
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from scipy.stats import sem
import numpy as np
import matplotlib.pyplot as plt

cv_repeated = RepeatedStratifiedKFold(
    n_splits=10,
    n_repeats=16,
    random_state=seed
)

scores_r = cross_val_score(
    model,
    X, y,
    scoring='accuracy',
    cv=cv_repeated,
    n_jobs=-1
)

print('Accuracy: %.3f (%.3f)' % (scores_r.mean(), scores_r.std()))
The resulting accuracy is: 0.775 (0.042)
Accessing the model accuracies across each fold
It is a good idea to investigate more on the distribution of the estimates for better judgments.
# evaluate a model with a given number of repeats
def assess_model(X, y, repeats):
    # define the cv folds
    cv = RepeatedStratifiedKFold(
        n_splits=10,
        n_repeats=repeats,
        random_state=seed)
    # create the model
    model = LogisticRegression(max_iter=1000)
    # evaluate the model
    scores = cross_val_score(
        model,
        X, y,
        scoring='accuracy',
        cv=cv,
        n_jobs=-1)
    return scores
Then we will use the sem() method from the scipy library to calculate the standard error for each sample.
repeats = range(1, 16)
res = list()
for rep in repeats:
    scores = assess_model(X, y, rep)
    print('Sample_%d mean=%.4f se=%.3f' % (rep, np.mean(scores), sem(scores)))
    res.append(scores)

Let’s visualize the sample accuracies with a boxplot to better understand the results.
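A minimal sketch of how such a boxplot could be produced with matplotlib, assuming res holds the per-repeat score arrays collected above:

# one box per repeat count; showmeans adds the green mean marker
plt.boxplot(res, labels=[str(r) for r in repeats], showmeans=True)
plt.xlabel('number of repeats')
plt.ylabel('accuracy')
plt.show()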

accuracy across splits — image by the author
The orange line represents the median of the distribution of the accuracy while the green triangle indicates the arithmetic mean.
As demonstrated in the graph above, the model accuracy stabilizes at around 6 to 7 repeats, which suggests the number of repeats to harness (0.775 (0.042) accuracy).
Pros:
Higher accuracy
Handles class imbalances better
Less probability of underfitting
cons:
More prone to overfitting, so we need to monitor the accuracies across folds
High computational power and more execution time
Leave-One-Out Cross-validation

leave one out cross-validation — image by the author
In this strategy, the algorithm holds out a single data point in each fold and trains the model on all the remaining points. The held-out point is then used to calculate the model accuracy; this process is repeated for every data point. The final accuracy is calculated by averaging the accuracies of all folds.
“In this strategy, we create n models for n observations in the data.”
from sklearn.model_selection import LeaveOneOut

loocv = LeaveOneOut()
model = LogisticRegression(max_iter=1000)
res = cross_val_score(model, X, y, cv=loocv)
print('Accuracy: %.3f (%.3f)' % (np.mean(res), np.std(res)))
The resulting accuracy is: 0.776 (0.417)
Pros:
Very efficient if the dataset is limited, since we want to use as much training data as possible when fitting the model
It has the best error estimate possible for a single new data point
cons:
Computationally expensive if the dataset is large in size or if a lot of different parameter sets are being tested.
The best way to test whether to use LOOCV or not is to run KFold-CV with a large k value — consider 25 or 50, and gauge how long it would take to train the model.
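As a rough sketch of that check (the choice of 25 folds and the timing approach here are illustrative, not from the original article):

import time
from sklearn.model_selection import KFold, cross_val_score

# time a 25-fold run as a proxy for how costly a full LOOCV run would be
start = time.time()
scores_25 = cross_val_score(
    model, X, y,
    cv=KFold(n_splits=25, shuffle=True, random_state=seed),
    n_jobs=-1
)
print('25-fold accuracy: %.3f, elapsed: %.1f s' % (scores_25.mean(), time.time() - start))

If this already takes noticeably long, a full LOOCV run (one model fit per observation) will be proportionally slower.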
Takeaways and Closing notes
We explored the most common strategies for splitting the data and evaluating machine learning models effectively. Each method has its pros and cons; however, there are some tips we may consider when choosing one.
K-fold cross-validation is the rule of thumb for comparing different algorithms’ performance; the most common values of k are 3, 5, and 10.
Start with the regular train-test split to get a baseline estimate of a specific algorithm’s performance.
Leave-one-out cross-validation (LOOCV) is a deterministic estimate, since there is no random sampling of the training dataset; the other strategies, on the other hand, produce stochastic estimates. LOOCV might be appropriate when you need an accurate estimate of the performance.
