A model that performs perfectly on training data but fails on new data is useless. Rigorous evaluation using the right metrics for your problem type is essential before trusting a model in production.

Key Points

  • Accuracy: fraction of correct predictions — misleading when classes are imbalanced
  • Precision: of all predicted positives, how many were actually positive (minimise false positives)
  • Recall (Sensitivity): of all actual positives, how many did we catch (minimise false negatives)
  • F1 Score: harmonic mean of precision and recall — good for imbalanced datasets
  • AUC-ROC: area under the ROC curve; model's ability to discriminate classes regardless of threshold
  • Confusion Matrix: table of TP, FP, TN, FN — gives full picture of classification errors
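  • RMSE / MAE: regression metrics; root mean squared error penalises large errors more heavily than mean absolute error does
  • Cross-Validation: k-fold splits data into k parts and trains k times, holding out a different part for validation each time; averaging the k scores gives a more stable estimate than a single split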
  • Bias-Variance Trade-off: high bias = underfitting; high variance = overfitting
  • Learning Curves: plot training/validation error vs dataset size — diagnose under/overfitting
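
To make the regression metrics and k-fold cross-validation concrete, here is a minimal sketch in plain Python. The function names, the toy target values, and the "predict the training mean" baseline are all assumptions chosen for illustration; in practice you would plug in a real model and would likely use a library implementation.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large errors count more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts the same."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val

def cross_validate_mean_model(y, k=5):
    """Score a baseline that always predicts the mean of its training fold."""
    scores = []
    for train, val in k_fold_indices(len(y), k):
        mean = sum(y[j] for j in train) / len(train)
        y_val = [y[j] for j in val]
        preds = [mean] * len(val)
        scores.append((rmse(y_val, preds), mae(y_val, preds)))
    return scores

# Hypothetical regression targets, for illustration only.
y = [3.0, 5.0, 2.5, 7.0, 4.5, 6.0, 3.5, 8.0, 5.5, 4.0]
scores = cross_validate_mean_model(y, k=5)
avg_rmse = sum(s[0] for s in scores) / len(scores)
avg_mae = sum(s[1] for s in scores) / len(scores)
print(f"mean RMSE over folds: {avg_rmse:.3f}, mean MAE over folds: {avg_mae:.3f}")
```

Averaging over the k folds is what reduces the dependence on any one lucky or unlucky split. RMSE is never smaller than MAE on the same predictions, and a large gap between them suggests a few large errors dominate.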

Metric      Formula                  Use When
---------   ----------------------   ---------------------------------------------
Accuracy    (TP+TN)/(TP+TN+FP+FN)    Balanced classes
Precision   TP/(TP+FP)               False positives are costly (spam filter)
Recall      TP/(TP+FN)               False negatives are costly (cancer detection)
F1          2×(P×R)/(P+R)            Imbalanced classes, both errors matter
AUC-ROC     Area under ROC curve     Ranking quality across all thresholds

(P = precision, R = recall)
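
The formulas in the table translate directly into code. The sketch below computes them from the four confusion-matrix counts; the function name and the example counts are hypothetical, and returning 0.0 when a denominator is zero is one common convention, assumed here for simplicity.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumption: an undefined ratio (zero denominator) is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts: 60 actual positives among 1,000 samples.
m = classification_metrics(tp=40, fp=10, tn=930, fn=20)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is 0.970 while recall is only 0.667, which shows how a high accuracy figure can coexist with a third of the positives being missed.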

Real-World Example

A cancer screening model with 99% accuracy might sound great, until you learn that 99% of patients don't have cancer, so predicting "no cancer" for everyone also achieves 99% accuracy while catching no cases at all (a recall of 0). Recall, the share of true cancer cases the model catches, is the metric that matters most here.
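
The arithmetic behind this example can be checked with a short sketch. The cohort size (1,000 patients, 10 with cancer) is an assumed figure chosen to match the 1% prevalence described above, and the "model" is simply a list of all-negative predictions.

```python
# Hypothetical cohort: 1,000 patients, 10 of whom have cancer (1% prevalence).
y_true = [1] * 10 + [0] * 990
# A "model" that predicts "no cancer" for every patient.
y_pred = [0] * len(y_true)

# Tally the confusion matrix.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}")
# prints: accuracy=0.99, recall=0.00
```

All ten cancer cases land in the false-negative cell, so accuracy alone hides the fact that the model is useless for screening.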