Machine Learning Tutorial Part 11: Cross-Validation Techniques

What is Cross-Validation?

Cross-validation is a technique for evaluating the generalization performance of a model. It repeatedly splits the dataset into training and validation sets, which gives a more reliable estimate of how the model performs on unseen data and helps detect overfitting.
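
For a quick end-to-end illustration, scikit-learn's cross_val_score helper runs the split-train-evaluate loop in a single call. This is only a sketch: the synthetic dataset from make_classification is an assumed stand-in so the snippet runs on its own, not data from the tutorial series.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Assumed synthetic data, purely for demonstration
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# cv=5 runs 5-fold cross-validation and returns one score per fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("Fold scores:", scores)
print("Mean accuracy:", scores.mean())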

๐Ÿ” K-Fold Cross-Validation

K-Fold splits the data into k subsets (folds). The model trains on k-1 folds and validates on the remaining fold. This repeats k times, so every sample is used for validation exactly once, and the k scores are averaged.

from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

# X and y are the feature matrix and label vector
# (e.g., from an earlier part of this series or the snippet above)
kf = KFold(n_splits=5, shuffle=True, random_state=42)  # shuffle so folds don't depend on row order
model = LogisticRegression(max_iter=1000)  # higher iteration cap avoids convergence warnings

accuracies = []

# Each iteration trains on 4 folds and validates on the held-out fold
for train_index, val_index in kf.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    model.fit(X_train, y_train)
    preds = model.predict(X_val)
    accuracies.append(accuracy_score(y_val, preds))

print("Average Accuracy:", np.mean(accuracies))

Stratified K-Fold

Stratified K-Fold ensures each fold has approximately the same class distribution as the full dataset. It's ideal for classification tasks with imbalanced classes.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
stratified_accuracies = []

for train_idx, val_idx in skf.split(X, y):
    # Same training logic as above; each fold keeps y's class proportions
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    stratified_accuracies.append(accuracy_score(y[val_idx], preds))

print("Stratified Average Accuracy:", np.mean(stratified_accuracies))

Leave-One-Out Cross-Validation (LOOCV)

Each sample is used once as the validation set while all remaining samples form the training set. The estimate has low bias, but it can have high variance and is computationally expensive for large datasets.

from sklearn.model_selection import LeaveOneOut

loo = LeaveOneOut()
loo_accuracies = []

for train_idx, val_idx in loo.split(X):
    # Train on all samples but one, validate on the single held-out sample
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    loo_accuracies.append(accuracy_score(y[val_idx], preds))

print("LOOCV Average Accuracy:", np.mean(loo_accuracies))

Summary

  • K-Fold: Good balance between computation time and the reliability of the estimate.
  • Stratified K-Fold: Maintains class distribution in every fold; preferred for classification.
  • LOOCV: Exhaustive but slow, and the resulting estimate can have high variance (all three are compared in the sketch below).
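
As a wrap-up, here is a minimal sketch that runs all three strategies through scikit-learn's cross_val_score on the same model and data. It assumes model, X, and y are defined as in the snippets above (or the synthetic stand-in from the first snippet); expect the LOOCV run to take noticeably longer.

from sklearn.model_selection import cross_val_score, KFold, StratifiedKFold, LeaveOneOut

# Assumes model, X, and y are already defined as in the snippets above
strategies = {
    "K-Fold": KFold(n_splits=5, shuffle=True, random_state=42),
    "Stratified K-Fold": StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    "LOOCV": LeaveOneOut(),
}

for name, cv in strategies.items():
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"{name}: mean accuracy = {scores.mean():.3f} ({len(scores)} folds)")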
