Alerting Strategy
SLO-based alerting, alert fatigue reduction, runbooks, on-call rotation
Alerting is the part of observability with the most direct operational impact. Poor alerting causes alert fatigue: engineers start ignoring pages because too many are false positives. Moving from static thresholds to SLO-based alerting (multi-window burn rate) typically cuts noise while still catching real incidents quickly. Every alert should link to a runbook in the alert body, with triage steps, escalation contacts, and remediation actions. On-call rotations managed in PagerDuty or Opsgenie spread the burden fairly while maintaining response SLAs, and escalation policies ensure an alert reaches a human even if the primary responder is unavailable.
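To make the burn-rate idea concrete, here is a minimal sketch of a multi-window check. It assumes a 99.9% SLO over 30 days and the long/short window pairs and multipliers popularized by the Google SRE Workbook; the names `BurnRule`, `burn_rate`, and `firing_rules` are illustrative, and the error ratios would in practice come from a metrics backend rather than a hand-built dict.

```python
from __future__ import annotations

from dataclasses import dataclass

SLO_TARGET = 0.999              # assumption: 99.9% success over a 30-day window
ERROR_BUDGET = 1 - SLO_TARGET   # fraction of requests allowed to fail


@dataclass(frozen=True)
class BurnRule:
    long_window: str    # window that proves the burn is significant
    short_window: str   # window that proves the burn is still happening
    threshold: float    # burn-rate multiplier that triggers the rule
    severity: str       # hypothetical routing label


# 14.4x for 1h spends about 2% of a 30-day budget, 6x for 6h about 5%,
# and 1x for 3d about 10%.
RULES = [
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
]


def burn_rate(error_ratio: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / ERROR_BUDGET


def firing_rules(error_ratios: dict[str, float]) -> list[BurnRule]:
    """Return rules whose long AND short windows both exceed the threshold."""
    fired = []
    for rule in RULES:
        long_burn = burn_rate(error_ratios[rule.long_window])
        short_burn = burn_rate(error_ratios[rule.short_window])
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            fired.append(rule)
    return fired


if __name__ == "__main__":
    # Hypothetical observed error ratios per window during a fresh incident.
    observed = {"5m": 0.02, "30m": 0.02, "1h": 0.02, "6h": 0.002, "3d": 0.0005}
    for rule in firing_rules(observed):
        print(f"{rule.long_window}/{rule.short_window} >= {rule.threshold}x -> {rule.severity}")
```

Requiring both windows to exceed the threshold is what keeps the alert from lingering: once the short window recovers, the rule stops firing even though the long window still remembers the incident.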
Key Points
- Alert on symptoms, not causes: page on "user-facing error rate > 1%" not "CPU > 80%" — CPU spikes may be harmless; user errors always require investigation
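- Multi-window burn rate alerts: for a 30-day SLO, a 1h window at 14.4x and a 6h window at 6x catch fast-burning incidents early, while a 3d window at 1x catches slow leaks; pair each with a shorter confirmation window (for example 5m, 30m, and 6h) so the alert clears soon after the burn stops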
- Every alert must have: severity (P1–P4), runbook link, affected service/SLI, error budget remaining, and a clear title answering "what is broken for whom"
- Alert deduplication in Alertmanager: group alerts by (alertname, cluster, service) to prevent 100 identical pages for one Kubernetes node failure
- Inhibition rules: suppress downstream service alerts when the upstream dependency is already alerting — prevents alert storms from single root causes
- Alert fatigue reduction: review alert history weekly; retire alerts whose action rate over the trailing 90 days is below 5%; convert noisy alerts to Slack-only notifications before removing them
- On-call hygiene: primary + secondary rotation, a cap on interrupted sleep per engineer (for example, 8 hours per week; for comparison, the Google SRE book recommends at most two incidents per on-call shift), monthly on-call retrospectives
- Runbooks should be executable, not encyclopedic: include copy-paste kubectl/aws-cli commands, decision trees ("if X see Y, else see Z"), and escalation phone numbers
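The grouping and inhibition points above can be illustrated with a small simulation. This is a sketch of the concepts only, not Alertmanager's implementation: alerts are plain label dicts, and `group_alerts`, `inhibit`, and the `tier` label are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Labels used to collapse alerts into one notification, as in the bullet above.
GROUP_BY = ("alertname", "cluster", "service")


def group_alerts(alerts):
    """Bucket alerts by the GROUP_BY labels; each bucket becomes one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in GROUP_BY)
        groups[key].append(alert)
    return dict(groups)


def matches(alert, matchers):
    """True if the alert carries every label/value pair in matchers."""
    return all(alert.get(label) == value for label, value in matchers.items())


def inhibit(alerts, source_match, target_match, equal):
    """Drop target alerts while a source alert with the same `equal` labels is firing."""
    sources = [a for a in alerts if matches(a, source_match)]
    kept = []
    for alert in alerts:
        suppressed = matches(alert, target_match) and any(
            src is not alert and all(src.get(l) == alert.get(l) for l in equal)
            for src in sources
        )
        if not suppressed:
            kept.append(alert)
    return kept


if __name__ == "__main__":
    # One node failure produces 100 pod alerts but a single notification group.
    pods = [
        {"alertname": "PodNotReady", "cluster": "prod", "service": "checkout",
         "pod": f"checkout-{i}"}
        for i in range(100)
    ]
    print(len(group_alerts(pods)), "notification group(s)")

    # A firing upstream alert suppresses downstream alerts in the same cluster.
    upstream = {"alertname": "DatabaseDown", "cluster": "prod", "tier": "upstream"}
    downstream = {"alertname": "HighErrorRate", "cluster": "prod", "tier": "downstream"}
    remaining = inhibit(
        [upstream, downstream],
        source_match={"alertname": "DatabaseDown"},
        target_match={"tier": "downstream"},
        equal=["cluster"],
    )
    print([a["alertname"] for a in remaining])
```

The `equal` list matters: without it, an upstream failure in one cluster would silence downstream alerts everywhere, hiding unrelated incidents.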
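The weekly fatigue review described above can also be scripted. The following is a rough sketch, assuming per-alert counts of how often an alert fired and how often someone took action over the trailing 90 days; the `AlertStats` shape, the `review` function, and the channel names are hypothetical, and the 5% cutoff comes from the bullet above.

```python
from dataclasses import dataclass

MIN_ACTION_RATE = 0.05  # cutoff from the review policy: below 5% is noise


@dataclass(frozen=True)
class AlertStats:
    fired: int      # times the alert fired in the trailing 90 days
    actioned: int   # times a responder actually did something about it
    channel: str    # "page" or "slack" (hypothetical routing values)


def review(stats_by_alert):
    """Suggest keep / demote_to_slack / retire for each alert."""
    decisions = {}
    for name, stats in stats_by_alert.items():
        if stats.fired == 0:
            # No firings means no evidence either way; leave it alone.
            decisions[name] = "keep"
            continue
        action_rate = stats.actioned / stats.fired
        if action_rate >= MIN_ACTION_RATE:
            decisions[name] = "keep"
        elif stats.channel == "page":
            # Noisy pager alerts are first demoted, not deleted.
            decisions[name] = "demote_to_slack"
        else:
            # Already Slack-only and still ignored: candidate for removal.
            decisions[name] = "retire"
    return decisions


if __name__ == "__main__":
    history = {
        "ErrorBudgetBurn": AlertStats(fired=10, actioned=9, channel="page"),
        "DiskWarn": AlertStats(fired=200, actioned=4, channel="page"),
        "OldCheck": AlertStats(fired=50, actioned=1, channel="slack"),
    }
    for alert, decision in review(history).items():
        print(alert, "->", decision)
```

Demoting before retiring keeps a record of the alert's behavior in a low-cost channel, so the team can confirm it really is noise before deleting the rule.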
Real-World Example
PagerDuty's own engineering team reported a 75% reduction in mean time to acknowledge (MTTA) after switching from threshold-based to SLO burn-rate alerts — engineers trusted pages more because nearly all were actionable.