Model Evaluation | Machine Learning | AI / ML

A model that performs perfectly on training data but fails on new data is useless. Rigorous evaluation using the right metrics for your problem type is essential before trusting a model in production.

Key Points

Accuracy: fraction of correct predictions — misleading when classes are imbalanced
Precision: of all predicted positives, how many were actually positive (minimise false positives)
Recall (Sensitivity): of all actual positives, how many did we catch (minimise false negatives)
F1 Score: harmonic mean of precision and recall — good for imbalanced datasets
AUC-ROC: area under the ROC curve; model's ability to discriminate classes regardless of threshold
Confusion Matrix: table of TP, FP, TN, FN — gives full picture of classification errors
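RMSE / MAE: regression metrics; root mean squared error penalises large errors more heavily than mean absolute error
Cross-Validation: k-fold splits data into k parts; trains k times, each time validating on a different held-out part; averaging the k scores reduces evaluation variance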
Bias-Variance Trade-off: high bias = underfitting; high variance = overfitting
Learning Curves: plot training/validation error vs dataset size — diagnose under/overfitting
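
To make the k-fold and RMSE/MAE points concrete, here is a minimal sketch in plain Python. The helper names (k_fold_indices, cross_validate_mean_predictor), the toy targets, and the "always predict the training mean" baseline are all assumptions chosen for illustration. The folds are contiguous and unshuffled, whereas real data would normally be shuffled first.

```python
import math


def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large errors dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate_mean_predictor(y, k):
    """Score a baseline that always predicts the mean of the training folds."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k):
        train_mean = sum(y[i] for i in train_idx) / len(train_idx)
        y_val = [y[i] for i in val_idx]
        y_hat = [train_mean] * len(y_val)
        scores.append((rmse(y_val, y_hat), mae(y_val, y_hat)))
    return scores


# Toy regression targets (made up); 20.0 acts as an outlier.
y = [3.0, 5.0, 4.0, 6.0, 20.0, 5.0]
fold_scores = cross_validate_mean_predictor(y, k=3)

# Averaging over folds gives a steadier estimate than a single split.
mean_rmse = sum(s[0] for s in fold_scores) / len(fold_scores)
mean_mae = sum(s[1] for s in fold_scores) / len(fold_scores)
print(f"mean RMSE = {mean_rmse:.3f}, mean MAE = {mean_mae:.3f}")
```

Each sample lands in exactly one validation fold, so every data point is used for both training and evaluation across the k runs. RMSE is never smaller than MAE on the same predictions, and the gap widens when a few errors are much larger than the rest.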

Metric	Formula	Use When
Accuracy	(TP+TN)/(TP+TN+FP+FN)	Balanced classes
Precision	TP/(TP+FP)	False positives are costly (spam filter)
Recall	TP/(TP+FN)	False negatives are costly (cancer detection)
F1	2×(P×R)/(P+R)	Imbalanced classes, both errors matter
AUC-ROC	Area under curve	Comparing how well models rank positives above negatives, independent of threshold (does not measure probability calibration)
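
The formulas in the table can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name classification_metrics and the example counts are made up, and returning 0.0 when a denominator is zero is a convention assumed here (libraries differ on how they handle that case).

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts.

    Assumption: a metric with a zero denominator is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a binary classifier evaluated on 100 samples.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 85/100, precision is 40/50 and recall is 40/45, so the same classifier looks different depending on which kind of error the metric emphasises.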

Real-World Example

A cancer screening model with 99% accuracy might sound great, until you learn that 99% of patients don't have cancer, so predicting "no cancer" for everyone also achieves 99% accuracy while detecting no cases at all (a recall of 0%). Recall, the share of true cancer cases the model catches, is the metric that actually matters here.
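
The arithmetic behind this example can be checked in a few lines. The cohort size of 1,000 patients is an assumption made for illustration; only the 1% prevalence comes from the scenario above.

```python
# Hypothetical cohort: 1,000 patients, 1% of whom have cancer.
n_patients = 1000
n_cancer = 10

y_true = [1] * n_cancer + [0] * (n_patients - n_cancer)  # 1 = cancer
y_pred = [0] * n_patients  # the "model" predicts "no cancer" for everyone

# Count the confusion-matrix cells for the positive (cancer) class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

accuracy = (tp + tn) / n_patients
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2%}, recall = {recall:.2%}")
```

All 990 healthy patients are classified correctly, which is where the 99% accuracy comes from, while all 10 cancer cases are missed, so recall is zero.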
