Availability
Uptime SLAs (99.9% ≈ 8.8 hr/yr · 99.99% ≈ 52.6 min/yr), redundancy models
Availability is the fraction of time a system is operational, usually expressed as a percentage and committed to in an SLA. The gap between 99.9% ("three nines") and 99.999% ("five nines") is enormous: three nines permits ~8.8 hours of downtime per year, while five nines permits only ~5.3 minutes. Availability is achieved through redundancy (no single points of failure), failover automation, and health-check-driven traffic routing. For components in series, composite availability is the product of the individual availabilities: three 99.9% components in series yield only about 99.7% end-to-end (0.999³ ≈ 0.997).
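The series arithmetic above, and the effect of redundancy, can be sketched in a few lines of Python. This is a minimal illustration, not a modelling tool: the function names are made up for this example, and the parallel formula assumes replica failures are independent, which real systems only approximate.

```python
from math import prod

def serial_availability(components):
    """Availability of a chain where every component must be up."""
    return prod(components)

def parallel_availability(replicas):
    """Availability of redundant replicas where any single one being up is enough.

    Assumption: replicas fail independently of each other.
    """
    return 1 - prod(1 - a for a in replicas)

three_in_series = serial_availability([0.999, 0.999, 0.999])
two_replicas = parallel_availability([0.999, 0.999])

print(f"{three_in_series:.4%}")  # 99.7003%
print(f"{two_replicas:.4%}")     # 99.9999%
```

Under the independence assumption, adding a second 99.9% replica in parallel recovers far more availability than is lost by chaining three such components in series, which is why redundancy is applied to each critical-path hop.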
Key Points
- Serial dependencies multiply availabilities, so their unavailability roughly adds up: A=99.9%, B=99.9%, C=99.9% in series → ~99.7% composite. Always minimize critical-path depth.
- Active-active deployments across multiple AZs or regions tolerate single-AZ failures with no failover step; active-passive cuts cost but adds failover delay.
- Health checks must test actual business logic, not just port liveness — shallow health checks mask degraded dependencies.
- Circuit breakers prevent cascading failures by fast-failing calls to degraded upstreams, shedding load before the caller times out.
- Bulkhead pattern isolates failure domains — separate thread pools or connection pools for critical vs. non-critical paths.
- Blue-green and canary deployments reduce deployment-caused downtime, a leading source of avoidable outages.
- SLA vs SLO vs SLI: SLI is the metric (latency), SLO is your internal target (P99 < 200 ms), SLA is the external contractual commitment with penalties.
- Error budgets (100% − SLO) give teams a risk budget: for example, a policy might freeze risky deploys when more than 50% of the monthly error budget is burned before mid-month.
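To make the circuit-breaker point concrete, here is a minimal single-threaded sketch. The class, its parameters (`failure_threshold`, `reset_timeout`, `clock`), and `CircuitOpenError` are hypothetical names chosen for illustration, not the API of any particular library; production breakers typically add thread safety, rolling failure windows, and metrics.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to fail fast before a trial call
        self.clock = clock                          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail immediately instead of waiting on a degraded dependency.
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each outbound request, for instance `breaker.call(fetch_profile, user_id)` with a hypothetical `fetch_profile`, and treat `CircuitOpenError` as a signal to serve a fallback rather than queue behind a slow dependency.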
| Availability | Downtime / Year | Downtime / Month | Downtime / Week | Typical Use Case |
|---|---|---|---|---|
| 99% ("two nines") | 3.65 days | 7.31 hours | 1.68 hours | Internal tools, batch jobs |
| 99.9% ("three nines") | 8.77 hours | 43.83 minutes | 10.08 minutes | Standard SaaS, most APIs |
| 99.95% | 4.38 hours | 21.92 minutes | 5.04 minutes | Business-critical SaaS |
| 99.99% ("four nines") | 52.60 minutes | 4.38 minutes | 1.01 minutes | Financial platforms, AWS core |
| 99.999% ("five nines") | 5.26 minutes | 26.30 seconds | 6.05 seconds | Telecom, cardiac monitors |
| 99.9999% ("six nines") | 31.56 seconds | 2.63 seconds | < 1 second | Air traffic control, nuclear |
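The downtime figures in the table follow from a single multiplication: the unavailable fraction times the length of the period. The sketch below recomputes them, assuming a 365.25-day year (8,766 hours) and a month of one twelfth of a year, which is the convention the table's numbers are consistent with. The function and variable names are illustrative.

```python
HOURS_PER_YEAR = 365.25 * 24  # 8766 hours, assumed convention for "year"

def allowed_downtime_seconds(availability_pct, period_hours):
    """Seconds of downtime permitted in a period at a given availability percentage."""
    unavailable_fraction = 1 - availability_pct / 100
    return unavailable_fraction * period_hours * 3600

periods = {
    "year": HOURS_PER_YEAR,
    "month": HOURS_PER_YEAR / 12,
    "week": 7 * 24,
}

for pct in (99.0, 99.9, 99.95, 99.99, 99.999, 99.9999):
    row = {name: allowed_downtime_seconds(pct, hours) for name, hours in periods.items()}
    # Print each budget in seconds, rounded for readability.
    print(pct, {name: round(seconds, 2) for name, seconds in row.items()})
```

The same function gives an error budget in time units: with a 99.9% monthly SLO, the budget is roughly 43.8 minutes. For six nines, the weekly allowance works out to about 0.60 seconds, which the table rounds to "< 1 second".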
Real-World Example
AWS S3 Standard is designed for 99.99% availability and 99.999999999% (eleven nines) durability. These are distinct: availability is about access, durability is about data survival. The February 2017 S3 outage in us-east-1 lasted roughly 4 hours, several times the ~52.6 minutes that a 99.99% target allows in an entire year, illustrating why large customers use multi-region replication.