ML Pipeline & MLOps
Data prep, training, evaluation, deployment, monitoring, model registry
A production ML pipeline is far more than just model training. It includes data ingestion, validation, feature engineering, training, evaluation, deployment, and monitoring — all automated, versioned, and reproducible. MLOps is the discipline that operationalises this lifecycle.
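The lifecycle described above can be sketched as a chain of small, testable stages. This is a minimal illustration, not a production design: the in-memory dataset, the function names (`ingest`, `validate`, `fingerprint`, `train`, `evaluate`, `run_pipeline`) and the toy one-parameter model are all assumptions made for the example.

```python
import hashlib
import json


def ingest():
    # Hypothetical in-memory dataset standing in for a real data source.
    return [{"x": 1.0, "y": 2.1}, {"x": 2.0, "y": 3.9}, {"x": 3.0, "y": 6.2}]


def validate(rows):
    # Fail fast on missing or non-numeric fields before any training happens.
    for row in rows:
        if not all(isinstance(row.get(k), (int, float)) for k in ("x", "y")):
            raise ValueError(f"invalid row: {row}")
    return rows


def fingerprint(rows):
    # Content hash of the data, so a run can record exactly what it trained on.
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


def train(rows):
    # Least-squares slope through the origin: a stand-in for real training.
    num = sum(r["x"] * r["y"] for r in rows)
    den = sum(r["x"] ** 2 for r in rows)
    return {"slope": num / den}


def evaluate(model, rows):
    # Mean absolute error of the fitted line on the given rows.
    errors = [abs(model["slope"] * r["x"] - r["y"]) for r in rows]
    return {"mae": sum(errors) / len(errors)}


def run_pipeline():
    rows = validate(ingest())
    model = train(rows)
    return {
        "data_version": fingerprint(rows),
        "model": model,
        "metrics": evaluate(model, rows),
    }
```

Calling `run_pipeline()` returns the data fingerprint, the fitted model, and its metrics together, which is the kind of record an experiment tracker would store for each run. Because the fingerprint depends only on the data content, two runs on identical data report the same `data_version`.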
Key Points
- MLOps = DevOps principles applied to ML: automation, versioning, CI/CD, monitoring
- Data Versioning: track changes to datasets (DVC, Delta Lake); treat training data like code, so any run can be reproduced from an exact dataset version
- Feature Store: centralised repository of curated features shared across models (Feast, Hopsworks)
- Experiment Tracking: log hyperparameters, metrics, and artefacts for every run (MLflow, W&B)
- Model Registry: store, version, and stage models (staging → production) with metadata
- CI/CD for ML: automated pipeline from data validation → training → evaluation → deployment
- Model Serving: REST endpoints (FastAPI, TF Serving), batch scoring, streaming inference
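- Monitoring: detect data drift (the input distribution shifts), concept drift (the relationship between inputs and the target changes), and performance decay
- A/B Testing & Shadow Mode: validate a new model against the current one on real production traffic; in shadow mode the new model scores live requests but its predictions are only logged, not served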
- Retraining Triggers: scheduled, event-driven (drift detected), or performance threshold breach
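The monitoring and retraining-trigger points above can be illustrated with a drift check on a single numeric feature. The sketch below uses the Population Stability Index (PSI), one common drift statistic among several; tools such as Evidently AI offer their own implementations, which this does not reproduce. The function names, the 10-bin layout, and the 0.2 threshold (a widely quoted rule of thumb, not a standard) are assumptions for the example.

```python
import math
import random


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # guard against a constant feature

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp so values outside the reference range land in the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # Floor each proportion at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def should_retrain(psi_value, threshold=0.2):
    # Assumed rule of thumb: PSI above 0.2 signals a shift worth acting on.
    return psi_value > threshold


if __name__ == "__main__":
    rng = random.Random(0)
    training = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    live = [rng.gauss(1.5, 1.0) for _ in range(5000)]  # simulated shifted traffic
    score = psi(training, live)
    print(f"PSI={score:.3f} retrain={should_retrain(score)}")
```

A check like this detects data drift only. Concept drift and performance decay need labelled outcomes, so they are usually tracked by comparing live predictions with ground truth once it arrives.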
| MLOps Stage | Tools | Key Concern |
|---|---|---|
| Data Management | DVC, Great Expectations, Delta Lake | Validation, versioning, lineage |
| Experiment Tracking | MLflow, Weights & Biases, Comet | Reproducibility, comparison |
| Training | SageMaker, Vertex AI, Databricks | Scale, cost, GPU utilisation |
| Model Registry | MLflow Registry, SageMaker Model Registry | Versioning, staging, rollback |
| Serving | FastAPI, TF Serving, Triton, BentoML | Latency, throughput, scalability |
| Monitoring | Evidently AI, WhyLabs, Grafana | Data drift, model decay |
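The "Versioning, staging, rollback" concern in the Model Registry row can be made concrete with a toy in-memory registry. This is a sketch under stated assumptions: it does not mirror the MLflow or SageMaker registry APIs, and the class names, stage labels (`staging`, `production`, `archived`, `rolled_back`), and the `auc` metric are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    version: int
    metrics: dict
    stage: str = "staging"  # staging, production, archived, or rolled_back


@dataclass
class ModelRegistry:
    versions: list = field(default_factory=list)

    def register(self, metrics):
        # Every trained model gets a new, monotonically increasing version.
        mv = ModelVersion(version=len(self.versions) + 1, metrics=dict(metrics))
        self.versions.append(mv)
        return mv

    def production(self):
        return next((v for v in self.versions if v.stage == "production"), None)

    def promote(self, candidate, metric="auc", min_gain=0.0):
        # Evaluation gate: promote only if the candidate is at least as good
        # as the current production model plus the required gain.
        current = self.production()
        if current is not None:
            if candidate.metrics[metric] < current.metrics[metric] + min_gain:
                return False
            current.stage = "archived"
        candidate.stage = "production"
        return True

    def rollback(self):
        # Restore the most recently archived version to production.
        current = self.production()
        archived = [v for v in self.versions if v.stage == "archived"]
        if current is None or not archived:
            return None
        previous = max(archived, key=lambda v: v.version)
        current.stage, previous.stage = "rolled_back", "production"
        return previous
```

In a CI/CD pipeline, `promote` would run after the evaluation step, so a model that underperforms the production version stays in staging instead of being deployed.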
Real-World Example
Uber's Michelangelo platform processes petabytes of ride data to train hundreds of ML models, including surge pricing, ETA estimation, and fraud detection. The platform automates the full lifecycle: feature engineering → training → deployment → monitoring.