Site Reliability Engineering (SRE), codified by Google in the 2016 SRE Book, applies software engineering to operations with the goal of building scalable, highly reliable systems. The central mechanism is the error budget: a 99.9% availability SLO leaves 0.1% of the window as allowable downtime, which is about 43.8 minutes in an average month of roughly 43,830 minutes. Teams spend the budget on risky changes such as new deployments; when it is exhausted, deployments freeze until it replenishes. This turns reliability from a qualitative goal into a quantifiable engineering constraint.
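
The arithmetic behind the 43.8-minute figure can be sketched in a few lines of Python. This is a minimal illustration, not part of any SRE tooling: the function name `error_budget` and the constant `AVERAGE_MONTH` are made up for this example, and the window is assumed to be an average Gregorian month (365.25 days / 12).

```python
from datetime import timedelta

# Assumption: "month" means an average Gregorian month (365.25 / 12 = 30.4375 days).
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def error_budget(slo: float, window: timedelta = AVERAGE_MONTH) -> timedelta:
    """Return the allowable downtime in `window` for an availability SLO.

    `slo` is a fraction, e.g. 0.999 for "three nines".
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    # Error budget = (100% - SLO target) of the measurement window.
    return window * (1.0 - slo)


if __name__ == "__main__":
    minutes = error_budget(0.999).total_seconds() / 60
    print(f"99.9% SLO: about {minutes:.1f} minutes per average month")
```

The window definition matters: with a fixed 30-day window the same SLO yields about 43.2 minutes, so a team should state which window its budget is measured over.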

Key Points

  • SLI (Service Level Indicator): the actual measurement, such as request success rate, p99 latency, or availability percentage.
  • SLO (Service Level Objective): the target — "99.9% of requests complete in <200ms" — agreed internally between SRE and product.
  • SLA (Service Level Agreement): the contractual commitment to customers, usually backed by financial penalties; typically set looser than the internal SLO to provide headroom.
  • Error budget: 100% - SLO target; the allowable unreliability per window — 99.9% SLO = 43.8 min/month error budget.
  • Toil: manual, repetitive, automatable operational work. Google's SRE guidance caps toil at 50% of each engineer's time; the remainder goes to engineering projects.
  • Eliminating toil: automate runbooks, build self-healing systems, and write auto-remediation scripts. Treat every page that fires twice without automation as a bug.
  • SRE vs DevOps: SRE is a concrete implementation of DevOps philosophy — SRE uses error budgets as the mechanism for balancing reliability and velocity.
  • Postmortem culture: every SEV-1/SEV-2 incident triggers a blameless postmortem, typically within 48 hours, that produces action items tracked to completion.
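
The relationship between SLI, SLO, and error budget in the points above can be sketched for a request-based SLI. This is a hypothetical illustration under stated assumptions: the names `WindowStats`, `sli`, `budget_consumed`, and `deploys_allowed` are invented for this example, and the freeze rule (block deployments once the budget is fully consumed) is the simple policy described in the introduction, not a specific Google implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowStats:
    """Request counts for one SLO window (hypothetical structure)."""
    total_requests: int
    good_requests: int  # requests that met the SLI criterion, e.g. success in <200ms


def sli(stats: WindowStats) -> float:
    """Measured SLI: fraction of good requests in the window."""
    if stats.total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully compliant
    return stats.good_requests / stats.total_requests


def budget_consumed(stats: WindowStats, slo: float) -> float:
    """Fraction of the error budget used so far; 1.0 or more means exhausted."""
    allowed_bad = (1.0 - slo) * stats.total_requests
    actual_bad = stats.total_requests - stats.good_requests
    if allowed_bad == 0:
        return 0.0 if actual_bad == 0 else float("inf")
    return actual_bad / allowed_bad


def deploys_allowed(stats: WindowStats, slo: float) -> bool:
    """Simple freeze policy: risky changes stop once the budget is spent."""
    return budget_consumed(stats, slo) < 1.0


if __name__ == "__main__":
    stats = WindowStats(total_requests=1_000_000, good_requests=999_400)
    print(f"SLI: {sli(stats):.4%}")
    print(f"Budget consumed: {budget_consumed(stats, 0.999):.0%}")
    print(f"Deploys allowed: {deploys_allowed(stats, 0.999)}")
```

With a 99.9% SLO, one million requests allow roughly 1,000 failures; 600 failures therefore use about 60% of the budget and deployments continue, while 1,500 failures would exhaust it and trigger a freeze.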

Real-World Example

Google's Search SLO is cited as 99.999% availability, which allows about 5.26 minutes of downtime per year. Google's SRE teams automate toil away so aggressively that a team of 10 SREs can manage 500+ microservices. Error budget consumption is reviewed in weekly SRE/Product meetings as a first-class engineering metric.