Feature engineering is the process of transforming raw data into informative inputs that ML models can learn from effectively. For traditional ML (non-deep learning), features are often hand-crafted, and their quality can strongly affect model performance.

Key Points

  • Normalisation / Standardisation: rescale features to comparable ranges (min-max to [0, 1], or z-score to zero mean and unit variance); critical for distance-based models such as k-NN and k-means
  • One-Hot Encoding: convert categorical variables to binary columns (e.g., colour → is_red, is_blue)
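  • Label Encoding: map categories to integers; the integers imply an ordering, so this suits ordinal categories (e.g., low < medium < high) and can mislead linear or distance-based models on unordered ones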
  • Feature Selection: remove irrelevant/redundant features (correlation analysis, mutual information, LASSO)
  • Feature Extraction: create new features from raw data (e.g., TF-IDF from text, PCA components)
  • Handling Missing Data: impute with mean/median/mode, or use a model that handles missing values natively (e.g., XGBoost)
  • Outlier Treatment: remove, cap (winsorize), or transform (log) extreme values
  • Feature Interaction: create cross features (e.g., age × income for loan risk)
  • Time Series: lag features, rolling averages, calendar features (day of week, is_holiday)
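
The scaling and one-hot encoding points above can be sketched in plain Python. This is a minimal illustration using only the standard library, not a production implementation; the function names (min_max_scale, z_score, one_hot) and the sample data are made up for this example.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """Standardise values to zero mean and unit (population) variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def one_hot(categories):
    """Turn one categorical column into one binary column per category."""
    levels = sorted(set(categories))
    return {
        f"is_{level}": [int(c == level) for c in categories]
        for level in levels
    }


# Hypothetical sample data for illustration only.
ages = [20, 30, 40, 50, 60]
colours = ["red", "blue", "red", "green"]

scaled_ages = min_max_scale(ages)
standardised_ages = z_score(ages)
colour_columns = one_hot(colours)
```

In a real pipeline the minimum, maximum, mean, and standard deviation would normally be computed on the training set only and then reused on validation and test data, so that no information leaks from held-out rows.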
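
The missing-data, outlier, and time-series points can be sketched the same way. The helpers below (impute_median, cap_values, lag, rolling_mean) are hypothetical names written for this example, and the capping bounds are passed in by hand; in practice they would usually be taken from percentiles of the training data.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def cap_values(values, lower, upper):
    """Cap each value to the closed interval [lower, upper]."""
    return [min(max(v, lower), upper) for v in values]


def lag(values, k=1):
    """Shift a series by k steps (assumes 0 <= k <= len(values)).

    The first k entries have no history and are set to None.
    """
    return [None] * k + values[: len(values) - k]


def rolling_mean(values, window):
    """Trailing mean over the current and the previous window-1 values."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)  # not enough history yet
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


# Hypothetical sample data for illustration only.
incomes = [30, None, 50, 40, 1000]
imputed = impute_median(incomes)
capped = cap_values(imputed, 30, 100)

sales = [10, 12, 14, 16]
sales_lag_1 = lag(sales, 1)
sales_roll_2 = rolling_mean(sales, 2)
```

When the target is the current value of the same series, rolling statistics are usually computed on the lagged series rather than the raw one, so the feature never includes the value being predicted.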

Real-World Example

Airbnb's pricing model engineers hundreds of features from raw listing data: days since last booking, host response rate, neighbourhood review score trends. The features often matter more than the algorithm choice for tabular data.
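
As a rough illustration of how a feature like "days since last booking" might be derived from raw dates, here is a small sketch. It is not Airbnb's code: the function names, the booking dates, and the choice of calendar features are all assumptions made for this example.

```python
from datetime import date


def days_since_last_booking(booking_dates, as_of):
    """Days between as_of and the most recent booking on or before it.

    Bookings after as_of are ignored so the feature only uses
    information available at prediction time. Returns None when
    there is no earlier booking.
    """
    past = [d for d in booking_dates if d <= as_of]
    if not past:
        return None
    return (as_of - max(past)).days


def calendar_features(day):
    """Simple calendar features for a single date."""
    return {
        "day_of_week": day.weekday(),  # Monday = 0 ... Sunday = 6
        "is_weekend": int(day.weekday() >= 5),
        "month": day.month,
    }


# Hypothetical listing history for illustration only.
bookings = [date(2024, 3, 1), date(2024, 3, 20), date(2024, 4, 15)]
as_of = date(2024, 4, 1)

recency = days_since_last_booking(bookings, as_of)
features = {"days_since_last_booking": recency, **calendar_features(as_of)}
```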