Machine Learning Tutorial Part 8: Feature Engineering & Selection
🔧 What is Feature Engineering?
Feature engineering is the process of creating new input features or modifying existing ones to improve the performance of a machine learning model. It often involves:
- Handling missing data
- Converting categorical data into numeric form
- Scaling and normalizing values
- Creating new features from existing ones
💡 Examples of Feature Engineering
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Sample data
df = pd.DataFrame({
    'Age': [25, 30, 45],
    'Gender': ['Male', 'Female', 'Female']
})

# One-hot encode Gender (creates Gender_Female and Gender_Male columns)
df_encoded = pd.get_dummies(df, columns=['Gender'])

# Scale Age to zero mean and unit variance
scaler = StandardScaler()
df_encoded['Age_scaled'] = scaler.fit_transform(df_encoded[['Age']])
print(df_encoded)
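The list above also mentions handling missing data and creating new features from existing ones. Here is a minimal sketch of both, using scikit-learn's SimpleImputer; the Height_cm and Weight_kg columns and the derived BMI feature are hypothetical examples, not part of the dataset above.

import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical data with missing values
df = pd.DataFrame({
    'Height_cm': [170, None, 160],
    'Weight_kg': [65, 80, None]
})

# Fill missing values with each column's mean
imputer = SimpleImputer(strategy='mean')
df[['Height_cm', 'Weight_kg']] = imputer.fit_transform(df[['Height_cm', 'Weight_kg']])

# Create a new feature from existing ones (BMI = weight / height^2)
df['BMI'] = df['Weight_kg'] / (df['Height_cm'] / 100) ** 2
print(df)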
📉 What is Feature Selection?
Feature selection involves choosing the most relevant features from your dataset to improve model performance and reduce overfitting. There are three main methods:
- Filter Methods: Score each feature with a statistical measure (e.g., correlation, chi-square), independently of any model; a quick sketch follows this list.
- Wrapper Methods: Use a predictive model to evaluate subsets (e.g., recursive feature elimination).
- Embedded Methods: Feature selection happens during model training itself (e.g., Lasso regression); see the sketch after the RFE example below.
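As a quick illustration of the filter approach, here is a minimal sketch using SelectKBest with the chi-square test on the Iris dataset. Note that chi-square requires non-negative feature values, which holds for Iris measurements.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)

# Keep the 2 features with the highest chi-square scores
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)
print("Chi-square scores:", selector.scores_)
print("Selected feature mask:", selector.get_support())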
⚙️ Example: Recursive Feature Elimination (RFE)
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE

X, y = load_iris(return_X_y=True)

# RFE repeatedly fits the model and removes the weakest feature
# until only n_features_to_select remain
model = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
rfe = RFE(model, n_features_to_select=2)
fit = rfe.fit(X, y)
print("Selected Features:", fit.support_)
print("Feature Ranking:", fit.ranking_)
📌 Summary
- Feature engineering improves data quality and model performance.
- Feature selection reduces dimensionality, speeds up training, and improves generalization.
- Use tools like get_dummies, StandardScaler, and RFE from scikit-learn.