Feature Engineering
Normalization, encoding, selection, extraction, handling missing data
Feature engineering is the process of transforming raw data into informative inputs that ML models can learn from effectively. For traditional (non-deep-learning) ML, features are usually hand-crafted, and their quality strongly affects model performance.
Key Points
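- Normalisation / Standardisation: scale features to similar ranges. Min-max scaling maps values to a fixed range such as [0, 1]; z-score standardisation gives zero mean and unit variance. Critical for distance-based models (e.g., k-NN, k-means, SVMs)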
- One-Hot Encoding: convert categorical variables to binary columns (e.g., colour → is_red, is_blue)
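- Label Encoding: map categories to integers. The integers imply an ordering that linear and distance-based models treat as meaningful, so this is safest for ordinal categories or tree-based models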
- Feature Selection: remove irrelevant/redundant features (correlation analysis, mutual information, LASSO)
- Feature Extraction: create new features from raw data (e.g., TF-IDF from text, PCA components)
- Handling Missing Data: impute with mean/median/mode, or use a model that handles missing values natively (e.g., XGBoost)
- Outlier Treatment: remove, cap (winsorize), or transform (log) extreme values
- Feature Interaction: create cross features (e.g., age × income for loan risk)
- Time Series: lag features, rolling averages, calendar features (day of week, is_holiday)
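Several of the points above (median imputation, min-max and z-score scaling, one-hot encoding) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production recipe: the helper names and the toy `ages` and `colours` data are made up for the example, and in practice a library such as scikit-learn (`SimpleImputer`, `StandardScaler`, `OneHotEncoder`) would normally be used instead.

```python
from statistics import mean, median, pstdev


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Centre on the mean and divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:  # constant feature: avoid division by zero
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Turn each category into a dict of binary indicator columns."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


# Hypothetical toy data, invented for this sketch.
ages = [20, None, 40, 30, 30]
colours = ["red", "blue", "red"]

ages_filled = impute_median(ages)         # missing age replaced by the median
ages_minmax = min_max_scale(ages_filled)  # values now lie in [0, 1]
ages_z = z_score(ages_filled)             # zero mean, unit variance
colour_columns = one_hot(colours)         # one binary column per colour
```

When a model is evaluated on held-out data, the statistics used here (median, min, max, mean, standard deviation, category list) should be computed on the training split only and then reused on the test split, so that no information leaks from test to train.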
Real-World Example
Airbnb's pricing model engineers hundreds of features from raw listing data: days since last booking, host response rate, neighbourhood review score trends. The features often matter more than the algorithm choice for tabular data.
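Features of this kind, such as days since last booking, along with lag, rolling-average, and calendar features, can be derived from a simple daily series. The sketch below uses only the Python standard library; the booking counts, dates, and function names are hypothetical and are not taken from any real pricing system. Each feature looks only at earlier rows, so a value never depends on the day it is meant to help predict.

```python
from datetime import date, timedelta


def lag(values, k):
    """Shift values k >= 1 steps forward in time; the first k rows have no history."""
    return [None] * k + values[:-k]


def rolling_mean(values, window):
    """Mean of the previous `window` values, excluding the current row."""
    out = []
    for i in range(len(values)):
        if i < window:
            out.append(None)  # not enough history yet
        else:
            out.append(sum(values[i - window:i]) / window)
    return out


def days_since_last_booking(dates, bookings):
    """Days elapsed since the most recent earlier day with at least one booking."""
    out, last = [], None
    for d, b in zip(dates, bookings):
        out.append(None if last is None else (d - last).days)
        if b > 0:
            last = d
    return out


# Hypothetical daily booking counts for one listing, oldest first.
start = date(2024, 1, 1)  # a Monday
dates = [start + timedelta(days=i) for i in range(6)]
bookings = [2, 0, 0, 1, 0, 3]

lag_1 = lag(bookings, 1)
rolling_3 = rolling_mean(bookings, 3)
since_last = days_since_last_booking(dates, bookings)
day_of_week = [d.weekday() for d in dates]   # Monday is 0, Sunday is 6
is_weekend = [d.weekday() >= 5 for d in dates]
```

Rows where a feature is `None` have too little history; they can be dropped or imputed, depending on the model.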