Machine Learning Tutorial Part 11: Cross-Validation Techniques
What is Cross-Validation?
Cross-validation is a technique for estimating how well a model generalizes to unseen data. Instead of relying on a single train/test split, it splits the dataset into training and validation sets multiple times, which gives a more reliable performance estimate and helps detect overfitting.
K-Fold Cross-Validation
K-Fold splits the data into k subsets (folds). The model trains on k-1 folds and validates on the remaining fold. This repeats k times, so each fold serves as the validation set exactly once, and the k scores are averaged.
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_iris
import numpy as np

# Example data so the snippet runs end to end; any feature matrix X and label vector y work
X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=5)
model = LogisticRegression(max_iter=1000)  # higher max_iter avoids convergence warnings
accuracies = []

for train_index, val_index in kf.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    accuracies.append(accuracy_score(y_val, preds))

print("Average Accuracy:", np.mean(accuracies))
Stratified K-Fold
Stratified K-Fold ensures each fold has approximately the same class distribution as the full dataset. It's ideal for classification tasks with imbalanced classes, where a plain K-Fold split could leave a fold with few or no samples of a minority class.
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
for train_idx, test_idx in skf.split(X, y):  # split() needs y so it can balance classes per fold
    # Same training logic as in the K-Fold example above
    pass
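As a side note, when a classifier and an integer cv are passed to scikit-learn's cross_val_score, stratified folds are used automatically. A minimal sketch, assuming the X, y, and model defined above:

from sklearn.model_selection import cross_val_score

# cv=5 with a classifier uses StratifiedKFold under the hood
scores = cross_val_score(model, X, y, cv=5)
print("Stratified 5-fold accuracy:", scores.mean())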
Leave-One-Out Cross-Validation (LOOCV)
Each sample is used once as the validation set while the remaining samples form the training set. The resulting estimate has low bias, but the model must be fit once per sample, which is computationally expensive for large datasets.
from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
for train_idx, test_idx in loo.split(X):
    # Train on all samples but one, validate on the held-out sample
    pass
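LOOCV can also be run in one call with cross_val_score. A minimal sketch, reusing the X, y, and model assumed above:

from sklearn.model_selection import cross_val_score, LeaveOneOut

# Each fit scores a single held-out sample (0 or 1 for accuracy),
# so the mean over all n scores is the LOOCV accuracy estimate.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(), scoring="accuracy")
print("LOOCV Accuracy:", loo_scores.mean())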
Summary
- K-Fold: A good trade-off between computational cost and reliability of the estimate.
- Stratified K-Fold: Maintains the class distribution in each fold; preferred for classification.
- LOOCV: Thorough and nearly unbiased, but high variance and slow, since it requires one fit per sample.