Availability is the fraction of time a system is operational, expressed as a percentage SLA. The gap between 99.9% ("three nines") and 99.999% ("five nines") is enormous: three nines permits ~8.7 hours of downtime per year while five nines permits only ~5 minutes. Availability is achieved through redundancy (no single points of failure), failover automation, and health-check-driven traffic routing. Composite system availability multiplies individual component availabilities — three 99.9% components in series yields only 99.7% end-to-end.

Key Points

  • Serial dependencies multiply unavailability: A=99.9%, B=99.9%, C=99.9% in series → 99.7% composite — always minimize critical-path depth.
  • Active-active deployments across multiple AZs or regions eliminate single-AZ failures; active-passive cuts costs but adds failover delay.
  • Health checks must test actual business logic, not just port liveness — shallow health checks mask degraded dependencies.
  • Circuit breakers prevent cascading failures by fast-failing calls to degraded upstreams, shedding load before the caller times out.
  • Bulkhead pattern isolates failure domains — separate thread pools or connection pools for critical vs. non-critical paths.
  • Blue-green and canary deployments reduce deployment-caused downtime, which is the #1 source of avoidable outages.
  • SLA vs SLO vs SLI: SLI is the metric (latency), SLO is your internal target (P99 < 200 ms), SLA is the external contractual commitment with penalties.
  • Error budgets (100% − SLO) give teams a risk budget — burning 50% of the monthly error budget mid-month triggers a freeze on risky deploys.
AvailabilityDowntime / YearDowntime / MonthDowntime / WeekTypical Use Case
99% ("two nines")3.65 days7.31 hours1.68 hoursInternal tools, batch jobs
99.9% ("three nines")8.77 hours43.83 minutes10.08 minutesStandard SaaS, most APIs
99.95%4.38 hours21.92 minutes5.04 minutesBusiness-critical SaaS
99.99% ("four nines")52.60 minutes4.38 minutes1.01 minutesFinancial platforms, AWS core
99.999% ("five nines")5.26 minutes26.30 seconds6.05 secondsTelecom, cardiac monitors
99.9999% ("six nines")31.56 seconds2.63 seconds< 1 secondAir traffic control, nuclear

Real-World Example

AWS S3 advertises 99.99% availability and 99.999999999% (eleven nines) durability — these are distinct: availability is about access, durability is about data survival. The 2017 S3 us-east-1 outage dropped effective availability below 99% for ~4 hours, illustrating why large customers use multi-region replication.