SLI / SLO / SLA

Google SRE formalized the SLI → SLO → SLA hierarchy: a Service Level Indicator (SLI) is a quantitative measure of service behavior (e.g., request success rate); a Service Level Objective (SLO) is an internal reliability target (e.g., 99.9% success over 30 days); and a Service Level Agreement (SLA) is a contractual commitment, typically with financial penalties for breach. The error budget is the complement of the SLO (1 - SLO): the amount of unreliability the team is allowed to spend on feature velocity. The gap between a stricter internal SLO and a looser SLA is a separate safety margin, not the error budget. Error budget burn rate alerts (multi-window, multi-burn-rate) catch incidents that are consuming the budget quickly, well before the full 30-day measurement would show a missed target.
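The budget arithmetic is simple enough to sketch. The snippet below is a minimal illustration, not a prescribed implementation; the function name `error_budget_minutes` and the default 30-day window are assumptions chosen for this example.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full unavailability the SLO tolerates over the window.

    slo is a fraction (0.999 for 99.9%). The budget is the complement of
    the SLO, scaled by the length of the measurement window.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


if __name__ == "__main__":
    # Compare how quickly the budget shrinks as nines are added.
    for target in (0.99, 0.999, 0.9999):
        minutes = error_budget_minutes(target)
        print(f"{target:.2%} over 30 days -> {minutes:.1f} min of budget")
```

Expressing the budget in minutes assumes a total outage; for a partial outage the same budget can be read as a count of failed requests instead.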

Key Points

SLIs must be measurable at the boundary the user experiences: success rate = good_requests / total_requests; latency = % of requests completing under threshold
Avoid time-based availability (uptime %, ping reachability) as an SLI; prefer request success rate, which reflects what users actually experience
SLO window: rolling 30-day windows are standard; a 99.9% SLO = 43.2 minutes of error budget per month
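Error budget burn rate: a 1x burn rate exhausts the budget in exactly 30 days; a 14.4x burn rate exhausts it in 50 hours and spends 2% of the budget in 1 hour. Page at 14.4x over 1 hour OR 6x over 6 hours (5% of the budget), each confirmed by a short window 1/12 as long (Google multi-window, multi-burn-rate alerting)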
SLO-based alerts fire only when error budget is being consumed meaningfully, drastically reducing alert noise vs threshold-based alerting on raw metrics
Feature velocity vs reliability: when the error budget is exhausted, freeze feature deployments and spend the team's time on reliability improvements; this is the key SRE trade-off mechanism
Customer-facing SLAs should be looser than internal SLOs (e.g., a 99.9% SLA backed by a 99.95% SLO) to absorb measurement noise and give engineering headroom before financial penalties apply
SLI categories: availability (successful requests / total), latency (requests faster than threshold / total), freshness (queries served with data age < threshold / total queries), correctness (valid responses / total)
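The SLI ratio and the burn-rate paging rule from the points above can be sketched as follows. This is a simplified illustration under stated assumptions: the names `sli`, `burn_rate`, `hours_to_exhaustion`, `PageRule`, and `should_page` are hypothetical, the per-window SLI values are assumed to come from a metrics backend, and the short confirmation window that Google's approach pairs with each long window is omitted for brevity.

```python
from dataclasses import dataclass


def sli(good_events: int, total_events: int) -> float:
    """Ratio of good events to total events; an empty window counts as healthy."""
    return 1.0 if total_events == 0 else good_events / total_events


def burn_rate(observed_sli: float, slo: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return (1.0 - observed_sli) / (1.0 - slo)


def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Hours until a full budget is gone if this burn rate is sustained."""
    return float("inf") if rate <= 0 else window_days * 24 / rate


@dataclass(frozen=True)
class PageRule:
    long_window_hours: int  # lookback window the SLI is measured over
    threshold: float        # burn rate at or above which the rule fires


# 14.4x over 1 hour and 6x over 6 hours, as in the key point above.
PAGE_RULES = (PageRule(1, 14.4), PageRule(6, 6.0))


def should_page(sli_by_window_hours: dict[int, float], slo: float) -> bool:
    """Page if any rule's window is burning at or above its threshold."""
    return any(
        burn_rate(sli_by_window_hours[rule.long_window_hours], slo) >= rule.threshold
        for rule in PAGE_RULES
    )


if __name__ == "__main__":
    slo = 0.999
    # Hypothetical counts: 2% of requests failed in the last hour.
    last_hour = sli(good_events=9_800, total_events=10_000)
    windows = {1: last_hour, 6: 0.9995}
    print("1h burn rate:", round(burn_rate(last_hour, slo), 1))
    print("page:", should_page(windows, slo))
```

The rules are combined with a logical OR: the fast rule catches sharp outages, and the slower rule catches sustained degradation that never reaches 14.4x. At a sustained 14.4x, a 30-day budget lasts 50 hours.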

Term	Definition	Who Sets It	Real-World Example
SLI	Quantitative measure of service behavior; ratio of good events to total events	SRE / Engineering	99.5% of payment API requests return 2xx within 500ms, measured over 1-minute windows
SLO	Internal target for SLI performance over a rolling time window	SRE + Product	Payment API SLI >= 99.9% success rate over rolling 30 days (43.2 min error budget/month)
SLA	External contractual commitment; breach triggers financial or other consequences	Legal + Business	Illustrative payment-provider SLA: 99.9% API uptime/month; breach triggers a 10% service credit, capped at 30% of the monthly bill
Error Budget	Allowable unreliability = 1 - SLO; consumed by outages, bad deploys, infra failures	SRE (computed)	99.9% SLO → 0.1% budget → 43.2 min/month; a 5-minute incident consumes 11.6% of budget
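Burn Rate	Speed at which error budget is consumed, relative to the pace that would exhaust it exactly at the end of the window	SRE Alert Rule	14.4x burn rate = 2% of a 30-day budget spent per hour, exhausted in 50 hours; sustained over 1 hour it triggers a P1 page with immediate response required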
Error Budget Policy	Agreement on what happens when budget is exhausted (freeze features, focus on reliability)	VP Eng + Product	If budget < 10% remaining, halt new feature launches; SRE and dev pair on reliability sprint
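The Error Budget and Error Budget Policy rows can be checked with a few lines of arithmetic. The sketch below assumes an incident is a total outage (every request fails) and uses the 10% remaining-budget threshold from the table; the function names `budget_consumed` and `feature_freeze` are hypothetical.

```python
def budget_consumed(outage_minutes: float, slo: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget used by a full outage of this length."""
    budget_minutes = (1.0 - slo) * window_days * 24 * 60
    return outage_minutes / budget_minutes


def feature_freeze(consumed_fraction: float, min_remaining: float = 0.10) -> bool:
    """True when the remaining budget falls below the policy threshold."""
    return (1.0 - consumed_fraction) < min_remaining


if __name__ == "__main__":
    # A 5-minute total outage against a 99.9% / 30-day SLO.
    used = budget_consumed(outage_minutes=5, slo=0.999)
    print(f"budget consumed: {used:.1%}")
    print("freeze features:", feature_freeze(used))
```

For partial outages, a more faithful measure is failed requests divided by the number of failures the budget allows, since minutes of downtime overstate the impact when only some requests fail.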

Real-World Example

Google Search reportedly maintains a 99.999% SLO (about 5.26 minutes of downtime per year); when the error budget is at risk, the release process is described as halting automatically, with SRE taking over all changes, so enforcement comes from tooling rather than manual policy.
