Incident Management
On-call tooling (PagerDuty, OpsGenie), severity classification, war rooms
Incident management is the structured process of detecting, triaging, communicating, and resolving service degradations while minimizing mean time to recovery (MTTR). Tools like PagerDuty and OpsGenie orchestrate on-call routing, escalation, and stakeholder communication. Severity classification (P1–P4) defines response SLAs: a P1 (complete outage or a large share of users impacted) must be acknowledged within minutes, while a P4 (cosmetic issue) can wait for business hours. War rooms (incident bridges, typically a Zoom call or Slack channel) bring the incident commander, domain experts, and communications lead together with clear role separation to prevent confusion and duplicate work.
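The severity tiers can be expressed as a small lookup. The sketch below is illustrative only: it assumes the thresholds and acknowledgement SLAs listed under Key Points, and the names `Severity`, `classify`, and the `degradation` labels are hypothetical, not part of PagerDuty or OpsGenie.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Severity:
    name: str
    ack_sla: Optional[timedelta]  # None stands for "next business day"


# Acknowledgement SLAs as given in the severity matrix of this section.
P1 = Severity("P1", timedelta(minutes=15))
P2 = Severity("P2", timedelta(minutes=30))
P3 = Severity("P3", timedelta(hours=2))
P4 = Severity("P4", None)


def classify(complete_outage: bool, users_impacted_pct: float, degradation: str) -> Severity:
    """Map observed impact to a severity tier.

    `degradation` is a hypothetical label: "major", "partial", or "minor".
    """
    if complete_outage or users_impacted_pct > 20:
        return P1
    if degradation == "major":
        return P2
    if degradation == "partial":
        return P3
    return P4
```

For example, `classify(False, 35.0, "partial")` lands in P1 because the user-impact threshold takes precedence over the degradation label.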
Key Points
- Incident commander (IC) role: single decision-maker who coordinates the investigation, delegates hands-on work rather than doing it personally, and drives toward mitigation; the role is modeled on the Incident Command System (ICS) used by emergency services
- PagerDuty escalation policies: primary on-call → secondary (after 5 min no-ack) → manager (after 15 min) → all-hands bridge; configure per service with different policies for P1 vs P3
- Severity matrix: P1 = complete outage or >20% users impacted, 15-min ack SLA; P2 = major degradation, 30-min ack; P3 = partial degradation, 2-hour ack; P4 = minor/cosmetic, next-business-day
- Communications lead role: drafts external status page updates every 15–30 minutes during P1/P2; keeps executives informed without interrupting engineering investigation
- Triage sequence: verify the alert is real → define customer impact → identify blast radius → choose mitigation path (rollback is usually fastest) → implement fix → monitor for 30 minutes → close
- Rollback-first principle: if a deploy preceded the incident by less than 30 minutes, roll back immediately and investigate afterward; a rollback restores service in minutes, whereas root cause analysis can take hours
- Slack incident channels: auto-created with naming convention #incident-YYYY-MM-DD-<service>; archived post-incident; all decisions logged for postmortem timeline reconstruction
- MTTR benchmarks: Google SRE target < 30 minutes for P1; industry median is 4.2 hours (Atlassian State of Incident Management 2023); achieving < 1 hour requires practiced runbooks and game days
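Three of the points above (the escalation chain, the rollback-first rule, and the channel naming convention) reduce to small functions. This is a minimal sketch rather than PagerDuty's or Slack's API: every function name and the data layout are hypothetical, and it assumes the escalation thresholds are cumulative minutes since the alert fired.

```python
from datetime import date, datetime, timedelta

# (minutes without acknowledgement, who gets paged). Thresholds are treated as
# cumulative minutes since the alert fired, which is an assumption. The
# all-hands bridge is left out because the text gives no timing for it.
ESCALATION_STEPS = [
    (0, "primary on-call"),
    (5, "secondary on-call"),
    (15, "manager"),
]


def paged_so_far(minutes_unacked: float) -> list[str]:
    """Return everyone who should have been paged after this long without an ack."""
    return [who for threshold, who in ESCALATION_STEPS if minutes_unacked >= threshold]


def should_roll_back_first(last_deploy: datetime, incident_start: datetime) -> bool:
    """Rollback-first rule: a deploy landed less than 30 minutes before the incident."""
    gap = incident_start - last_deploy
    return timedelta(0) <= gap < timedelta(minutes=30)


def incident_channel(day: date, service: str) -> str:
    """Build a channel name following #incident-YYYY-MM-DD-<service>."""
    # Slack channel names are lowercase, so the service name is normalized.
    return f"#incident-{day.isoformat()}-{service.lower()}"
```

A deploy at 12:00 followed by an incident at 12:20 makes `should_roll_back_first` return `True`; with a deploy at 11:00 it returns `False`, and the responders fall back to the normal triage sequence.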
Real-World Example
GitHub uses an incident management framework with a dedicated Incident Commander rotation (separate from on-call engineering). Its public status page updates are scripted templates filled in by the comms lead, not by engineers, a practice that reduced customer confusion during the 2018 and 2022 major outages.