Blameless Postmortems
Timeline reconstruction, contributing factors, action items, cultural safety
A blameless postmortem (also called a post-incident review, or PIR) is a structured analysis of an incident that focuses on systemic failure modes rather than individual mistakes. It rests on the principle that engineers acted rationally given the information and tools available to them at the time. The output is a timeline, a list of contributing factors (not a single root cause, because complex systems fail through several interacting factors), and action items with owners and due dates that address systemic gaps. Google's SRE book popularized the format, and companies like Etsy and Netflix have made blameless culture a cornerstone of their engineering identity.
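The three outputs named above can be modelled as a small record. The following is a minimal sketch, not a standard schema: the class names, field names, and the `incomplete_action_items` helper are all assumptions made for illustration, and the sample incident is invented.

```python
from dataclasses import dataclass, field
from datetime import date, datetime
from typing import Optional


@dataclass
class TimelineEntry:
    at: datetime          # when the event or decision happened
    description: str      # what happened, stated without assigning blame


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None   # person accountable for the item
    due: Optional[date] = None    # date the item should be finished


@dataclass
class Postmortem:
    title: str
    timeline: list[TimelineEntry] = field(default_factory=list)
    contributing_factors: list[str] = field(default_factory=list)
    action_items: list[ActionItem] = field(default_factory=list)

    def incomplete_action_items(self) -> list[ActionItem]:
        """Return action items that lack an owner or a due date."""
        return [a for a in self.action_items if a.owner is None or a.due is None]


# Hypothetical incident used only to exercise the sketch.
pm = Postmortem(
    title="payment-service latency incident",
    contributing_factors=[
        "config mistake",
        "lack of alerting",
        "missing circuit breaker",
    ],
    action_items=[
        ActionItem(
            "add p99 latency alert for payment-service",
            owner="alice",
            due=date(2024, 2, 1),
        ),
        ActionItem("improve monitoring"),  # no owner, no due date
    ],
)

for item in pm.incomplete_action_items():
    print("needs an owner and a due date:", item.description)
```

Keeping contributing factors as a list rather than a single field mirrors the idea that there is no one root cause. The helper makes it easy to block publication until every action item has an owner and a date.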
Key Points
- Blameless principle: focus on what conditions made the failure possible, not who made a mistake; engineers who fear blame will hide information, slowing future incident response
- Timeline reconstruction: use log timestamps, deployment records, alert fire times, and Slack message history to build a minute-by-minute chronology of events and decisions
- Contributing factors (not "root cause"): complex systems have multiple contributing factors — a config mistake, a lack of alerting, a missing circuit breaker; list all of them
- Five Whys technique: ask "why" recursively for each contributing factor until you reach a systemic or process gap, not a human error; stop when the answer is actionable
- Action items must be SMART: Specific, Measurable, Assignable, Relevant, Time-bound; vague actions ("improve monitoring") are useless; "add p99 latency alert for payment-service by 2024-02-01 — @alice" is actionable
- Postmortem SLA: write first draft within 48 hours of incident resolution while memory is fresh; review with team within 5 business days; publish internally within 2 weeks
- Leading vs lagging indicators: postmortem action items should create leading indicators (signals that fire before the next incident, such as an alert on a rising error rate), not just fix the immediate symptom
- Postmortem culture maturity: share postmortems broadly (across the engineering org, not just the team) — learning spreads, duplicate mistakes decrease, trust in engineering increases
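Two of the points above are mechanical enough to sketch in code: merging events from several sources into one chronology, and deriving the draft, review, and publication deadlines from the resolution time. This is an illustrative sketch only. The `Event` type, the function names, and the sample events are hypothetical. It assumes every timestamp is timezone-aware, and the business-day count skips weekends but ignores public holidays.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class Event(NamedTuple):
    at: datetime   # timezone-aware timestamp
    source: str    # e.g. "deploy", "alert", "chat"
    text: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into one chronological list."""
    merged = [event for src in sources for event in src]
    # Aware datetimes compare by instant, so mixed time zones sort correctly.
    # Mixing naive and aware timestamps raises TypeError instead.
    return sorted(merged, key=lambda event: event.at)


def add_business_days(start: datetime, days: int) -> datetime:
    """Advance by `days` weekdays (Mon-Fri); public holidays are ignored."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            days -= 1
    return current


def postmortem_deadlines(resolved_at: datetime) -> dict[str, datetime]:
    """Deadlines for the 48-hour / 5-business-day / 2-week schedule."""
    return {
        "first_draft": resolved_at + timedelta(hours=48),
        "team_review": add_business_days(resolved_at, 5),
        "published": resolved_at + timedelta(weeks=2),
    }


# Invented sample data: one list per source, as exported from each tool.
utc = timezone.utc
deploys = [Event(datetime(2024, 1, 15, 14, 2, tzinfo=utc), "deploy", "config change rolled out")]
alerts = [Event(datetime(2024, 1, 15, 14, 9, tzinfo=utc), "alert", "p99 latency alert fired")]
chat = [
    Event(datetime(2024, 1, 15, 14, 5, tzinfo=utc), "chat", "on-call notices slow checkouts"),
    Event(datetime(2024, 1, 15, 14, 20, tzinfo=utc), "chat", "rollback decided"),
]

timeline = build_timeline(deploys, alerts, chat)
for event in timeline:
    print(event.at.strftime("%H:%M"), f"[{event.source}]", event.text)

deadlines = postmortem_deadlines(datetime(2024, 1, 15, 15, 0, tzinfo=utc))
for name, when in deadlines.items():
    print(name, when.date())
```

Tagging each event with its source keeps the provenance visible in the merged timeline. That helps when two systems disagree about when something happened.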
Real-World Example
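Etsy's engineering blog, Code as Craft, became a benchmark for blameless culture, notably through John Allspaw's 2012 post "Blameless PostMortems and a Just Culture." Etsy has published detailed public postmortems that cover both "what we got wrong" and "what we got right," and this openness is often described as a recruitment differentiator, with candidates citing the postmortem culture as a reason for joining.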