Incident Management | Observability & Operations | System Design

Incident management is the structured process of detecting, triaging, communicating, and resolving service degradations while keeping mean time to recovery (MTTR) as low as possible. Tools like PagerDuty and Opsgenie orchestrate on-call routing, escalation, and stakeholder communication. Severity classification (P1–P4) defines response SLAs: a P1 (complete outage or a large share of users impacted) requires a response within minutes, while a P4 (cosmetic issue) can wait for business hours. War rooms (incident bridges, typically a Zoom call or a Slack channel) bring the incident commander, domain experts, and communications lead together, with clear role separation to prevent confusion and duplicate work.

Key Points

Incident commander (IC) role: single decision-maker who coordinates the investigation, delegates hands-on work rather than doing it themselves, and drives toward mitigation; the role is borrowed from the Incident Command System (ICS) used in firefighting and emergency response
PagerDuty escalation policies: primary on-call → secondary (after 5 min no-ack) → manager (after 15 min) → all-hands bridge; configure per service with different policies for P1 vs P3
Severity matrix: P1 = complete outage or >20% users impacted, 15-min ack SLA; P2 = major degradation, 30-min ack; P3 = partial degradation, 2-hour ack; P4 = minor/cosmetic, next-business-day
Communications lead role: drafts external status page updates every 15–30 minutes during P1/P2; keeps executives informed without interrupting engineering investigation
Triage hierarchy: verify the alert is real → define customer impact → identify blast radius → choose mitigation path (rollback is usually fastest) → implement fix → monitor for 30 minutes → close
Rollback-first principle: if a deploy landed less than 30 minutes before the incident began, roll back immediately and investigate afterwards; a rollback typically restores service in minutes, whereas root cause analysis can take hours
Slack incident channels: auto-created with naming convention #incident-YYYY-MM-DD-<service>; archived post-incident; all decisions logged for postmortem timeline reconstruction
MTTR benchmarks: Google SRE target < 30 minutes for P1; industry median is 4.2 hours (Atlassian State of Incident Management 2023); achieving < 1 hour requires practiced runbooks and game days
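
The ack SLAs, escalation chain, rollback-first rule, and channel naming convention listed above can be sketched as plain data and small helper functions. This is a minimal illustration, not the PagerDuty or Slack API: every name here (ACK_SLA, ESCALATION_STEPS, ack_deadline, paged_so_far, should_roll_back_first, incident_channel) is hypothetical. The list above gives no timing for the all-hands bridge, so the 30-minute step is an assumed placeholder, and P4's "next business day" SLA is modeled as None rather than a fixed duration.

```python
from datetime import date, datetime, timedelta
from typing import List, Optional

# Ack SLAs from the severity matrix. P4 is "next business day", which is not
# a fixed duration, so it is modeled as None (an assumption of this sketch).
ACK_SLA = {
    "P1": timedelta(minutes=15),
    "P2": timedelta(minutes=30),
    "P3": timedelta(hours=2),
    "P4": None,
}

# Escalation chain as (minutes without ack, who gets paged).
# 5 and 15 minutes come from the list above; 30 minutes for the all-hands
# bridge is an assumed placeholder, since no timing is given for that step.
ESCALATION_STEPS = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (15, "manager"),
    (30, "all-hands bridge"),
]


def ack_deadline(severity: str, detected_at: datetime) -> Optional[datetime]:
    """Return the latest acceptable ack time, or None when no fixed SLA applies."""
    sla = ACK_SLA[severity]
    return None if sla is None else detected_at + sla


def paged_so_far(minutes_unacked: float) -> List[str]:
    """Return everyone paged after this many minutes without an acknowledgement."""
    return [who for threshold, who in ESCALATION_STEPS if minutes_unacked >= threshold]


def should_roll_back_first(last_deploy: Optional[datetime], incident_start: datetime) -> bool:
    """Rollback-first rule: a deploy landed less than 30 minutes before the incident."""
    if last_deploy is None or last_deploy > incident_start:
        return False
    return incident_start - last_deploy < timedelta(minutes=30)


def incident_channel(day: date, service: str) -> str:
    """Build a channel name following #incident-YYYY-MM-DD-<service>."""
    # Lowercasing and hyphenating the service name is an assumption of this sketch.
    slug = service.strip().lower().replace(" ", "-")
    return f"#incident-{day.isoformat()}-{slug}"


if __name__ == "__main__":
    started = datetime(2024, 1, 15, 10, 20)
    deployed = datetime(2024, 1, 15, 10, 0)
    print(ack_deadline("P1", started))               # 2024-01-15 10:35:00
    print(paged_so_far(7))                           # ['primary on-call', 'secondary on-call']
    print(should_roll_back_first(deployed, started)) # True
    print(incident_channel(started.date(), "Payments API"))  # #incident-2024-01-15-payments-api
```

Keeping the thresholds in data rather than in branching logic makes it straightforward to give P1 and P3 different escalation chains per service, as the escalation-policy point suggests.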

Real-World Example

GitHub uses an incident management framework with a dedicated Incident Commander rotation (separate from on-call engineering). Its public status page updates are scripted templates filled in by the comms lead rather than by engineers, a practice that reduced customer confusion during the 2018 and 2022 major outages.
