Alerting Strategy

Alerting is often the part of observability with the most direct operational impact: poor alerting causes alert fatigue, where engineers start ignoring pages because too many are false positives. Moving from static threshold alerts to SLO-based alerts (multi-window burn rate) cuts noise while still catching real incidents quickly. Every alert should link to a runbook in the alert body, covering triage steps, escalation contacts, and remediation actions. On-call rotations managed in PagerDuty or OpsGenie spread the burden fairly while maintaining response SLAs, and escalation policies ensure an alert reaches a human even if the primary responder is unavailable.
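To make the burn-rate idea concrete, here is a minimal Python sketch. Burn rate is the observed error rate divided by the error budget, so 1x spends exactly the whole budget over the SLO period. The thresholds follow the commonly cited Google SRE Workbook scheme for a 30-day SLO: 14.4x over 1 hour consumes 2% of the budget (0.02 × 720 h / 1 h), and 6x over 6 hours consumes 5%. The 99.9% target, the paired short windows (5m, 30m, 6h), and the page/ticket split are assumptions for illustration, not something this page specifies.

```python
from dataclasses import dataclass

# Assumed SLO: 99.9% of requests succeed, so 0.1% may fail.
ERROR_BUDGET = 0.001


def burn_rate(errors: int, requests: int, budget: float = ERROR_BUDGET) -> float:
    """Observed error rate as a multiple of the rate that exactly spends the budget."""
    if requests == 0:
        return 0.0
    return (errors / requests) / budget


@dataclass(frozen=True)
class BurnRule:
    long_window: str   # window that proves the burn is significant
    short_window: str  # window that proves the burn is still happening
    threshold: float   # burn-rate multiple that must be exceeded
    action: str        # hypothetical routing label


# Thresholds from the text; short windows and actions are assumptions.
RULES = [
    BurnRule("1h", "5m", 14.4, "page"),
    BurnRule("6h", "30m", 6.0, "page"),
    BurnRule("3d", "6h", 1.0, "ticket"),
]


def evaluate(rates: dict[str, float]) -> list[str]:
    """Return the rules that fire: both windows must exceed the threshold."""
    fired = []
    for rule in RULES:
        long_hot = rates[rule.long_window] > rule.threshold
        short_hot = rates[rule.short_window] > rule.threshold
        if long_hot and short_hot:
            fired.append(f"{rule.action}:{rule.long_window}")
    return fired


if __name__ == "__main__":
    # Fast burn: both the 1h and 5m windows are far above 14.4x.
    fast = {"5m": 20.0, "30m": 3.0, "1h": 15.0, "6h": 2.0, "3d": 0.5}
    print(evaluate(fast))
```

Requiring the short window as well means the alert stops firing soon after the service recovers, even though the long window still remembers the incident.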

Key Points

Alert on symptoms, not causes: page on "user-facing error rate > 1%" not "CPU > 80%" — CPU spikes may be harmless; user errors always require investigation
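Multi-window burn rate alerts: a burn rate of 1x spends exactly the full error budget over the SLO period. A 1h window at 14.4x plus a 6h window at 6x catches fast-burning incidents early; a 3d window at 1x catches slow leaks. Pair each long window with a shorter one so the alert clears soon after recovery.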
Every alert must have: severity (P1–P4), runbook link, affected service/SLI, error budget remaining, and a clear title answering "what is broken for whom"
Alert deduplication in Alertmanager: group alerts by (alertname, cluster, service) to prevent 100 identical pages for one Kubernetes node failure
Inhibition rules: suppress downstream service alerts when the upstream dependency is already alerting — prevents alert storms from single root causes
Alert fatigue reduction: review alerts weekly; retire any alert where fewer than 5% of firings over the past 90 days led to action, and demote noisy alerts to Slack-only notifications before removing them
On-call hygiene: primary + secondary rotation, maximum 8 hours of interrupted sleep per week per engineer (for comparison, the Google SRE book caps load at two incidents per 12-hour on-call shift), monthly on-call retrospectives
Runbooks should be executable, not encyclopedic: include copy-paste kubectl/aws-cli commands, decision trees ("if X see Y, else see Z"), and escalation phone numbers
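The deduplication and inhibition points above can be modelled in a few lines of Python. This is a simplified sketch of the behaviour, not Alertmanager's implementation: in Alertmanager itself these ideas are configured through `group_by` on a route and `inhibit_rules` with an `equal` label list. The alert names, labels, and the helper functions `group_alerts` and `inhibit` are hypothetical.

```python
from collections import defaultdict

# Labels used to bucket alerts, mirroring the grouping named in the text.
GROUP_BY = ("alertname", "cluster", "service")


def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts by their group_by label values: one notification per bucket."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


def _matches(alert, matcher):
    """True when the alert carries every label/value pair in the matcher."""
    return all(alert.get(k) == v for k, v in matcher.items())


def inhibit(alerts, source_match, target_match, equal):
    """Drop target alerts while a source alert with the same `equal` labels fires."""
    sources = [a for a in alerts if _matches(a, source_match)]
    kept = []
    for alert in alerts:
        is_target = _matches(alert, target_match) and not _matches(alert, source_match)
        shares_scope = any(
            all(alert.get(label) == src.get(label) for label in equal)
            for src in sources
        )
        if not (is_target and shares_scope):
            kept.append(alert)
    return kept


if __name__ == "__main__":
    # One node failure produces 100 pod alerts that collapse into one group.
    pods = [
        {"alertname": "KubePodNotReady", "cluster": "prod",
         "service": "checkout", "pod": f"checkout-{i}"}
        for i in range(100)
    ]
    print(len(group_alerts(pods)))

    # The upstream database alert suppresses downstream errors in its cluster.
    firing = [
        {"alertname": "DatabaseDown", "cluster": "prod", "service": "postgres"},
        {"alertname": "HighErrorRate", "cluster": "prod", "service": "checkout"},
        {"alertname": "HighErrorRate", "cluster": "staging", "service": "checkout"},
    ]
    kept = inhibit(firing, {"alertname": "DatabaseDown"},
                   {"alertname": "HighErrorRate"}, ["cluster"])
    print([(a["alertname"], a["cluster"]) for a in kept])
```

Scoping inhibition with the `equal` labels matters: without it, an upstream outage in one cluster would also silence unrelated alerts in another.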

Real-World Example

PagerDuty's own engineering team reported a 75% reduction in mean time to acknowledge (MTTA) after switching from threshold-based to SLO burn-rate alerts — engineers trusted pages more because nearly all were actionable.
