Model Evaluation
Accuracy, precision, recall, F1, AUC-ROC, confusion matrix, cross-validation
A model that performs perfectly on training data but fails on new data is useless. Rigorous evaluation using the right metrics for your problem type is essential before trusting a model in production.
Key Points
- Accuracy: fraction of correct predictions — misleading when classes are imbalanced
- Precision: of all predicted positives, how many were actually positive (minimise false positives)
- Recall (Sensitivity): of all actual positives, how many did we catch (minimise false negatives)
- F1 Score: harmonic mean of precision and recall; useful on imbalanced datasets when both false positives and false negatives matter (it ignores true negatives)
- AUC-ROC: area under the ROC curve; model's ability to discriminate classes regardless of threshold
- Confusion Matrix: table of TP, FP, TN, FN — gives full picture of classification errors
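- RMSE / MAE: regression metrics; root mean squared error penalises large errors more heavily than mean absolute error
- Cross-Validation: k-fold splits data into k parts; trains k times, holding out a different part for validation each time; averaging the k scores reduces evaluation variance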
- Bias-Variance Trade-off: high bias = underfitting; high variance = overfitting
- Learning Curves: plot training and validation error against training set size; a large persistent gap suggests overfitting, while two high, converged errors suggest underfitting
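The regression metrics and k-fold procedure from the list above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names and the mean-predictor baseline are assumptions made for the example, and the folds are taken in order without shuffling (in practice the data is usually shuffled first).

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: squaring makes large errors dominate."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: every unit of error counts equally."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs so each index is validated exactly once."""
    start = 0
    for fold in range(k):
        # Spread the remainder over the first n % k folds.
        size = n // k + (1 if fold < n % k else 0)
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

def cross_validate_mean_baseline(y, k):
    """Hypothetical baseline: predict the training-fold mean, score RMSE per fold."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(y), k):
        mean = sum(y[i] for i in train_idx) / len(train_idx)
        y_val = [y[i] for i in val_idx]
        scores.append(rmse(y_val, [mean] * len(y_val)))
    return scores

if __name__ == "__main__":
    y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    fold_scores = cross_validate_mean_baseline(y, k=3)
    # Report the per-fold scores and their average.
    print(fold_scores, sum(fold_scores) / len(fold_scores))
```

Reporting the spread of the per-fold scores alongside their mean gives a rough sense of how much the estimate depends on the particular split.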
| Metric | Formula | Use When |
|---|---|---|
| Accuracy | (TP+TN)/(TP+TN+FP+FN) | Balanced classes |
| Precision | TP/(TP+FP) | False positives are costly (spam filter) |
| Recall | TP/(TP+FN) | False negatives are costly (cancer detection) |
| F1 | 2×(P×R)/(P+R) | Imbalanced classes, both errors matter |
| AUC-ROC | Area under the ROC curve (TPR vs FPR) | Ranking quality matters or no threshold is fixed yet (does not measure probability calibration) |
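The formulas in the table can be computed directly from the four confusion-matrix counts. The sketch below assumes binary labels encoded as 0 and 1; the helper names are illustrative, and returning 0.0 when a denominator is zero is a convention chosen for this example. AUC-ROC is computed through its pairwise interpretation: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels encoded as 0/1."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 0:
            tn += 1
        else:  # predicted 0, actually 1
            fn += 1
    return tp, fp, tn, fn

def safe_div(num, den):
    """Assumed convention for this sketch: an undefined ratio is reported as 0.0."""
    return num / den if den else 0.0

def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }

def auc_roc(y_true, scores):
    """Pairwise AUC: chance that a random positive outscores a random negative.

    O(n_pos * n_neg), fine for small examples; needs at least one of each class.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

if __name__ == "__main__":
    y_true = [1, 1, 0, 0, 1, 0]
    y_pred = [1, 0, 0, 1, 1, 0]                 # hard labels at some threshold
    scores = [0.9, 0.4, 0.3, 0.6, 0.8, 0.1]     # made-up model scores
    print(classification_metrics(y_true, y_pred))
    print(auc_roc(y_true, scores))
```

Note that the first four metrics depend on the threshold used to produce `y_pred`, while `auc_roc` uses only the ordering of the scores.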
Real-World Example
A cancer screening model with 99% accuracy might sound great, until you learn that 99% of patients don't have cancer, so predicting "no cancer" for everyone also achieves 99% accuracy while catching zero cases (recall = 0). Recall, the share of true cancer cases the model actually catches, is the metric that matters most here.
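The arithmetic behind this example can be checked in a few lines. The cohort below (1,000 patients, 10 of whom have cancer) is a hypothetical chosen only to match the 99%/1% split described above.

```python
# Hypothetical cohort: 1% prevalence, matching the example in the text.
n_patients = 1000
n_cancer = 10

y_true = [1] * n_cancer + [0] * (n_patients - n_cancer)
y_pred = [0] * n_patients  # a "model" that predicts "no cancer" for everyone

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)

accuracy = (tp + tn) / n_patients
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
# prints: accuracy=0.99 recall=0.00
```

Precision is undefined for this predictor, since it makes no positive predictions at all (TP + FP = 0).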