Reliability is the probability that a system performs its intended function correctly for a specified period of time. It is distinct from availability, which measures only the fraction of time the system is up. The key metrics are MTBF (Mean Time Between Failures), which captures how often failures occur, and MTTR (Mean Time To Recover), which captures how quickly the system heals. A reliable system maximizes MTBF and minimizes MTTR: MTBF is driven by fault-tolerant architecture and code quality, MTTR by automated recovery, runbooks, and observability. Chaos engineering, pioneered by Netflix's Chaos Monkey, proactively injects failures to validate resilience assumptions.
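
To make the two metrics concrete, here is a minimal sketch that computes MTBF, MTTR, and availability from an incident summary, using the standard definitions (uptime or downtime divided by failure count). The `IncidentLog` class and all the numbers are hypothetical, chosen only for illustration, and the sketch assumes at least one failure in the window.

```python
from dataclasses import dataclass


@dataclass
class IncidentLog:
    """Hypothetical summary of one observation window (times in hours)."""
    total_uptime_h: float
    total_downtime_h: float
    failures: int  # assumed to be >= 1

    def mtbf(self) -> float:
        # Mean Time Between Failures: total uptime / number of failures.
        return self.total_uptime_h / self.failures

    def mttr(self) -> float:
        # Mean Time To Recover: total downtime / number of failures.
        return self.total_downtime_h / self.failures

    def availability(self) -> float:
        # Steady-state availability: MTBF / (MTBF + MTTR).
        return self.mtbf() / (self.mtbf() + self.mttr())


# Illustrative numbers: a 720-hour month with 4 one-hour outages.
baseline = IncidentLog(total_uptime_h=716.0, total_downtime_h=4.0, failures=4)
# Same failure rate, but each outage is resolved in half the time.
faster_recovery = IncidentLog(total_uptime_h=716.0, total_downtime_h=2.0, failures=4)
# Same one-hour outages, but failures happen half as often.
fewer_failures = IncidentLog(total_uptime_h=716.0, total_downtime_h=2.0, failures=2)

for name, log in [("baseline", baseline),
                  ("halved MTTR", faster_recovery),
                  ("doubled MTBF", fewer_failures)]:
    print(f"{name}: MTBF={log.mtbf():.1f}h MTTR={log.mttr():.2f}h "
          f"availability={log.availability():.5f}")
```

In this sketch, halving MTTR and doubling MTBF land on the same availability figure, so the choice between them comes down to which is cheaper to achieve.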

Key Points

  • MTBF = total uptime / number of failures; improving MTBF requires redundancy, better testing, and root-cause elimination, not just faster on-call response.
  • MTTR = total downtime / number of failures; the fastest path to a lower MTTR is automated detection plus automated rollback, which removes the human from the loop for common failure modes.
  • Availability = MTBF / (MTBF + MTTR); doubling MTBF and halving MTTR yield exactly the same availability, but halving MTTR is often faster to achieve operationally.
  • Fault tolerance tiers: fail-fast (detect and surface immediately), fail-safe (revert to safe state), fail-operational (degrade gracefully, maintain core function).
  • Chaos engineering must target production (or a production-faithful staging env); chaos in dev is theater, since dev environments do not have production's dependencies.
  • Retrying with exponential backoff and jitter prevents a thundering herd when a dependency recovers; always set a retry budget (max attempts or a deadline).
  • Idempotent operations make retries safe: design APIs so that repeating a request has the same effect as making it once (use idempotency keys for payments).
  • Blast radius containment: deploy in waves (1%→10%→50%→100%), use feature flags as an instant kill switch, and partition data by customer to limit failure scope.
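
The retry and idempotency points above can be sketched together. The following is a minimal illustration, not a production client: `TransientError`, `backoff_delay`, `call_with_retries`, and `flaky_charge` are hypothetical names, the "full jitter" variant (a uniform random delay up to the exponential cap) is one common choice among several, and the default parameters are arbitrary.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, max_attempts=5, deadline_s=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Call operation(idempotency_key) until success or the budget runs out."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    # One key shared by every attempt, so the server can deduplicate repeats.
    idempotency_key = str(uuid.uuid4())
    start = clock()
    for attempt in range(max_attempts):
        try:
            return operation(idempotency_key)
        except TransientError:
            delay = backoff_delay(attempt)
            out_of_attempts = attempt == max_attempts - 1
            out_of_time = (clock() - start) + delay > deadline_s
            if out_of_attempts or out_of_time:
                raise  # Retry budget exhausted: surface the failure.
            sleep(delay)


# Usage: a fake payment endpoint that fails twice, then succeeds.
seen_keys = []
charges = {}


def flaky_charge(key):
    seen_keys.append(key)
    if len(seen_keys) < 3:
        raise TransientError("dependency unavailable")
    # Server-side dedup: a repeated key returns the stored result.
    return charges.setdefault(key, "charged once")


result = call_with_retries(flaky_charge, sleep=lambda seconds: None)
print(result, "after", len(seen_keys), "attempts")
```

The retry budget here is enforced two ways (attempt count and wall-clock deadline), and the loop gives up before sleeping if the next delay would overrun the deadline. Passing `sleep` and `clock` as parameters is a design choice that keeps the sketch testable without real waiting.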

Real-World Example

Netflix's Chaos Monkey has been running in production since 2010, randomly terminating EC2 instances during business hours. This forced engineering teams to build genuine resilience — services that had not been chaos-tested consistently failed during AWS AZ outages while chaos-hardened services survived transparently.