Auto-Scaling

Auto-scaling dynamically adjusts compute capacity in response to demand, optimizing cost while maintaining performance. Reactive scaling responds to current metrics (CPU, memory, request rate); predictive scaling uses ML to forecast demand and pre-provisions capacity before traffic arrives; scheduled scaling handles predictable patterns (business hours, weekly cycles). AWS Auto Scaling, Kubernetes HPA and KEDA, and GCP Managed Instance Groups each implement one or more of these modes. KEDA adds event-driven scaling from zero, which gives serverless-like economics.
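
The three modes can run side by side. One simple way to combine them, shown here as a minimal sketch and not as any provider's actual implementation, is to take the largest capacity that any mode asks for, so that no mode can scale in below what another requires. The names `ScheduledRule`, `scheduled_floor`, and `combined_capacity` are hypothetical, and the weekday-only rule is an assumption made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ScheduledRule:
    # Hypothetical rule: a minimum capacity that applies on weekdays
    # between start_hour (inclusive) and end_hour (exclusive).
    start_hour: int
    end_hour: int
    min_capacity: int


def scheduled_floor(now: datetime, rules: list[ScheduledRule], default_min: int) -> int:
    """Return the minimum capacity required by the schedule at time `now`."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    floors = [
        r.min_capacity
        for r in rules
        if is_weekday and r.start_hour <= now.hour < r.end_hour
    ]
    return max(floors, default=default_min)


def combined_capacity(
    reactive: int,
    forecast: int,
    now: datetime,
    rules: list[ScheduledRule],
    default_min: int,
    max_capacity: int,
) -> int:
    """Pick the largest capacity requested by any mode, capped at max_capacity."""
    floor = scheduled_floor(now, rules, default_min)
    return min(max_capacity, max(reactive, forecast, floor))


if __name__ == "__main__":
    rules = [ScheduledRule(start_hour=8, end_hour=18, min_capacity=50)]
    monday_morning = datetime(2024, 1, 8, 9, 0)
    print(combined_capacity(30, 40, monday_morning, rules, default_min=10, max_capacity=100))
```

Taking the maximum is deliberately conservative: it favors availability over cost, and the scheduled floor guarantees capacity even when the reactive and predictive signals are both low.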

Key Points

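Reactive scaling: AWS ASG scales when CloudWatch alarms fire (for example, CPU > 70% for 2 consecutive 5-minute periods); detection alone then takes about 10 minutes, and new instances need roughly 3 more minutes to launch and start serving.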
Kubernetes HPA: scales Deployment replica count based on CPU/memory utilization or custom metrics (for example, from Prometheus); the controller evaluates every 15 seconds by default.
KEDA (Kubernetes Event-Driven Autoscaling): scales pods based on external queue depth (SQS, Kafka, RabbitMQ) — enables scale-to-zero for batch workloads.
Predictive scaling: the AWS Predictive Scaling ML model uses up to 14 days of metric history to forecast load and can pre-provision capacity ahead of a predicted spike (for example, 10 minutes early).
Scheduled scaling: set minimum capacity to 50 instances on weekday mornings, reduce to 10 at night — eliminates reactive lag for known traffic patterns.
Cooldown period: prevents thrashing. After a scale-out, wait (for example) 300 seconds before evaluating another scale-out; scale-in is usually kept at least as conservative, because releasing capacity too early forces an immediate scale-out again.
Scale-in protection: mark instances that are processing in-flight jobs as protected from scale-in, so auto-scaling does not terminate them mid-job.
Target Tracking: simplest policy — maintain average CPU at 60%; AWS continuously adjusts capacity to hit the target, handling both scale-out and scale-in automatically.
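
The target-tracking and cooldown points above can be sketched together. The proportional formula used here, desired = ceil(current × metric / target) with a small tolerance band, is the one Kubernetes documents for HPA; applying it to AWS target tracking is an assumption, since AWS does not publish its exact algorithm. The class name `CooldownScaler`, its parameters, and the use of plain seconds for time are all hypothetical choices made for illustration.

```python
import math


def target_tracking_desired(
    current_capacity: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
) -> int:
    """Proportional target tracking: scale capacity by metric / target.

    If the metric is within `tolerance` of the target, do nothing.
    """
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_capacity
    return math.ceil(current_capacity * current_metric / target_metric)


class CooldownScaler:
    """Applies target tracking, but ignores changes during a cooldown window."""

    def __init__(
        self,
        min_capacity: int,
        max_capacity: int,
        scale_out_cooldown: float = 300.0,  # seconds, assumed default
        scale_in_cooldown: float = 300.0,   # seconds, assumed default
    ) -> None:
        self.min_capacity = min_capacity
        self.max_capacity = max_capacity
        self.scale_out_cooldown = scale_out_cooldown
        self.scale_in_cooldown = scale_in_cooldown
        self.last_change: float | None = None  # time of the last scaling action

    def step(
        self,
        now: float,
        current_capacity: int,
        current_metric: float,
        target_metric: float,
    ) -> int:
        """Return the capacity to run at time `now` (in seconds)."""
        desired = target_tracking_desired(current_capacity, current_metric, target_metric)
        desired = max(self.min_capacity, min(self.max_capacity, desired))
        if desired == current_capacity:
            return current_capacity

        scaling_out = desired > current_capacity
        cooldown = self.scale_out_cooldown if scaling_out else self.scale_in_cooldown
        if self.last_change is not None and now - self.last_change < cooldown:
            return current_capacity  # still cooling down, hold steady

        self.last_change = now
        return desired


if __name__ == "__main__":
    scaler = CooldownScaler(min_capacity=2, max_capacity=20)
    capacity = 4
    # (time in seconds, average CPU %) samples against a 60% target
    for t, cpu in [(0, 90), (120, 90), (300, 90), (400, 20)]:
        capacity = scaler.step(t, capacity, cpu, 60)
        print(t, cpu, capacity)
```

In this sketch a single timestamp gates both directions, so a recent scale-out also delays the next scale-in. That is one possible design; real systems often track the two directions separately.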

Real-World Example

Netflix uses predictive auto-scaling calibrated to TV show airing schedules — they scale out AWS capacity 30 minutes before major show premieres and awards shows, avoiding the reactive lag that caused streaming failures in early deployments. Their Scryer system forecasts capacity needs 24 hours in advance.
