Google SRE formalized the SLI → SLO → SLA hierarchy: a Service Level Indicator (SLI) is a quantitative measure of service behavior (e.g., request success rate); a Service Level Objective (SLO) is an internal reliability target (e.g., 99.9% success over 30 days); and a Service Level Agreement (SLA) is a contractual commitment, typically with financial penalties for breach. The error budget is the unreliability the SLO permits (1 - SLO): the acceptable amount of failure that funds feature velocity. The gap between the stricter internal SLO and the looser external SLA is a separate safety margin, not the error budget. Multi-window, multi-burn-rate alerts on error budget consumption catch incidents that are burning the budget quickly, well before the full measurement window would show an SLO violation.
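
The definitions above reduce to a few lines of arithmetic. The following is a minimal Python sketch, not a reference implementation: the function names (`sli`, `error_budget_minutes`, `budget_remaining`) are hypothetical, and treating a window with no traffic as fully good is an assumption made here for illustration.

```python
def sli(good_events: int, total_events: int) -> float:
    """Request-based SLI: ratio of good events to total events."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as "no failures"
    return good_events / total_events


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Error budget (1 - SLO) expressed as minutes of full outage per window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_events == 0:
        return 1.0
    allowed_bad = (1.0 - slo) * total_events  # failures the SLO tolerates
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

For example, `error_budget_minutes(0.999)` is about 43.2, and a service with 500 failures in 1,000,000 requests against a 99.9% SLO has spent roughly half of its budget.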

Key Points

  • SLIs must be measurable at the boundary the user experiences: success rate = good_requests / total_requests; latency = % of requests completing under threshold
  • Avoid time-based availability (uptime %) as an SLI; prefer request success rate because it reflects actual user experience, not just ping reachability
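  • SLO window: rolling 30-day windows are a common choice; a 99.9% SLO allows 0.1% × 43,200 minutes = 43.2 minutes of full downtime per 30-day window
  • Error budget burn rate: a 1x burn rate exhausts the budget in exactly 30 days; a 14.4x burn rate exhausts it in about 50 hours (30 days / 14.4) and consumes 2% of the budget in 1 hour; a 6x burn rate consumes 5% in 6 hours. Trigger a page at 14.4x over 1 hour OR 6x over 6 hours (Google multi-window alerting)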
  • SLO-based alerts fire only when error budget is being consumed meaningfully, drastically reducing alert noise vs threshold-based alerting on raw metrics
  • Velocity vs reliability: when the error budget is exhausted, freeze feature deployments and invest the team's time in reliability improvements; this is the key SRE trade-off mechanism
  • Customer-facing SLAs should be set looser than internal SLOs (e.g., a 99.9% SLA backed by a 99.95% internal SLO) to absorb measurement noise and give engineering headroom before financial penalties apply
  • SLI categories: availability (successful requests / total), latency (requests faster than threshold / total), freshness (queries served with data age < threshold / total queries), correctness (valid responses / total)
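
The burn-rate bullet above can be sketched in Python as follows. The 14.4x/1 h and 6x/6 h thresholds come from the text; the short confirmation windows (5 and 30 minutes), the `BurnRateRule` type, and all function names are assumptions added for illustration, following the commonly described pattern of pairing each long window with a shorter one so the alert stops firing once the incident is over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurnRateRule:
    long_window_min: int   # window that proves the burn is sustained
    short_window_min: int  # assumed confirmation window: still burning now?
    threshold: float       # burn-rate multiple that triggers the rule


# Thresholds from the text; short windows are illustrative assumptions.
PAGE_RULES = (
    BurnRateRule(long_window_min=60, short_window_min=5, threshold=14.4),
    BurnRateRule(long_window_min=360, short_window_min=30, threshold=6.0),
)


def burn_rate(error_ratio: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo)


def budget_consumed(rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the window's budget spent at `rate` sustained for `hours`."""
    return rate * hours / (window_days * 24)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone at a constant burn rate."""
    return window_days * 24 / rate


def firing_rules(error_ratio_by_window: dict[int, float], slo: float) -> list[BurnRateRule]:
    """Return the rules whose long AND short windows both exceed the threshold.

    `error_ratio_by_window` maps window length in minutes to the error ratio
    observed over that window. A page is warranted if ANY rule fires.
    """
    fired = []
    for rule in PAGE_RULES:
        long_burn = burn_rate(error_ratio_by_window[rule.long_window_min], slo)
        short_burn = burn_rate(error_ratio_by_window[rule.short_window_min], slo)
        if long_burn >= rule.threshold and short_burn >= rule.threshold:
            fired.append(rule)
    return fired
```

With a 99.9% SLO, a 2% error ratio over both the last hour and the last 5 minutes is a burn rate of about 20x, so the first rule fires. If the 5-minute ratio has already dropped to zero, nothing fires even though the 1-hour ratio is still elevated.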

| Term | Definition | Who Sets It | Real-World Example |
| --- | --- | --- | --- |
| SLI | Quantitative measure of service behavior; ratio of good events to total events | SRE / Engineering | Share of payment API requests that return 2xx within 500 ms, measured over 1-minute windows (e.g., 99.5%) |
| SLO | Internal target for SLI performance over a rolling time window | SRE + Product | Payment API SLI >= 99.9% success rate over rolling 30 days (43.2 min error budget per window) |
| SLA | External contractual commitment; breach triggers financial or other consequences | Legal + Business | Stripe SLA: 99.9% API uptime/month; breach triggers 10% service credit up to 30% of monthly bill |
| Error Budget | Allowable unreliability = 1 - SLO; consumed by outages, bad deploys, infra failures | SRE (computed) | 99.9% SLO → 0.1% budget → 43.2 min per 30 days; a 5-minute full outage consumes 11.6% of budget |
| Burn Rate | Speed at which error budget is consumed, relative to the rate that would spend it exactly over the SLO window | SRE Alert Rule | 14.4x burn rate = 2% of a 30-day budget consumed per hour (exhausted in about 50 hours); triggers P1 page with immediate response required |
| Error Budget Policy | Agreement on what happens when budget is exhausted (freeze features, focus on reliability) | VP Eng + Product | If budget < 10% remaining, halt new feature launches; SRE and dev pair on reliability sprint |
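
The Error Budget and Error Budget Policy rows can be illustrated with a short Python sketch. It assumes an incident is a full outage (every request fails for its duration) and uses the 10% freeze threshold from the table; the function names and the returned action strings are hypothetical.

```python
def incident_budget_fraction(outage_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget spent by one full outage.

    Assumes 100% of requests fail during the outage; a partial outage
    would consume proportionally less.
    """
    budget_minutes = (1.0 - slo) * window_days * 24 * 60
    return outage_minutes / budget_minutes


def policy_action(remaining_fraction: float, freeze_below: float = 0.10) -> str:
    """Map remaining budget to an action, mirroring the example policy row."""
    if remaining_fraction <= 0.0:
        return "freeze deployments"      # budget exhausted
    if remaining_fraction < freeze_below:
        return "halt feature launches"   # under 10% left
    return "ship normally"


# A 5-minute full outage against a 99.9% / 30-day SLO:
spent = incident_budget_fraction(5, 0.999)
print(f"{spent:.1%} of budget consumed")  # 11.6% of budget consumed
print(policy_action(1.0 - spent))         # ship normally
```

Keeping the threshold as a parameter reflects that the policy is a negotiated agreement between engineering and product rather than a fixed constant.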

Real-World Example

Google Search maintains a 99.999% SLO (5.26 minutes downtime/year); when the error budget is at risk, the release process automatically halts and SRE takes over all changes — a process enforced by tooling, not manual policy.