Disaster Recovery
RTO, RPO, backup and restore strategies, failover planning
Disaster Recovery (DR) defines how a system recovers from catastrophic failures that exceed normal fault-tolerance mechanisms: data center fires, region-wide cloud outages, ransomware, or accidental mass deletions. The two fundamental metrics are RTO (Recovery Time Objective: maximum tolerable downtime) and RPO (Recovery Point Objective: maximum tolerable data loss, expressed as a window of time). Smaller RTO and RPO require more expensive infrastructure: a full-capacity hot standby with synchronous replication brings RPO close to zero and RTO down to minutes, but roughly doubles infrastructure cost. DR plans must be tested regularly through disaster recovery drills; an untested DR plan is not a DR plan.
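To make the two metrics concrete, the following minimal sketch computes the actual downtime and data-loss window of an incident from three timestamps and compares them with the objectives. The names `Objectives` and `assess_incident`, and all timestamps, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data loss, as a window of time


def assess_incident(last_recoverable, failure, restored, objectives):
    """Compare what actually happened in an incident with the RTO/RPO targets."""
    downtime = restored - failure            # how long the service was down
    data_loss = failure - last_recoverable   # writes after this point are gone
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss <= objectives.rpo,
    }


# Illustrative incident: last replicated write at 11:58, failure at 12:00,
# service restored at 13:30, against a 1-hour RTO and a 5-minute RPO.
objectives = Objectives(rto=timedelta(hours=1), rpo=timedelta(minutes=5))
report = assess_incident(
    last_recoverable=datetime(2024, 1, 1, 11, 58),
    failure=datetime(2024, 1, 1, 12, 0),
    restored=datetime(2024, 1, 1, 13, 30),
    objectives=objectives,
)
print(report["rto_met"], report["rpo_met"])  # False True
```

In this example the RPO is met (2 minutes of lost writes) while the RTO is missed (90 minutes of downtime), which shows that the two objectives are independent and have to be tracked separately.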
Key Points
- RTO is a business requirement, not a technical one — the cost of downtime (lost revenue, SLA penalties, reputation) must be calculated before choosing a DR tier.
- RPO determines backup frequency and replication strategy: RPO of 1 hour → hourly snapshots; RPO of 5 minutes → streaming replication or CDC; RPO ≈ 0 → synchronous multi-region writes.
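- Backup 3-2-1 rule: 3 copies of data, on 2 different media types, with 1 copy offsite. A common cloud equivalent is cross-region replication plus an archival storage class such as Amazon S3 Glacier Deep Archive.
- Synchronous replication keeps RPO at 0 for acknowledged writes but adds at least one round trip to the replica to every write (often tens of milliseconds or more across regions, depending on distance); asynchronous replication has lower write latency but can lose writes that had not yet been replicated at failover.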
- Chaos Game Days — scheduled DR drills where the team simulates a region failure — are the only reliable way to validate RTO targets before a real disaster.
- Database point-in-time recovery (PITR) protects against logical corruption and accidental deletions that replication propagates — enable PITR on all production databases.
- Runbooks must be executable by on-call engineers under stress at 3 AM. Automate every step possible; manual steps drift out of date and lengthen actual recovery time.
- Multi-cloud DR is an extreme option, but regulated industries (finance, healthcare) use it so that a single cloud provider is not a single point of failure.
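The 3-2-1 rule from the list above is simple enough to check mechanically against a backup inventory. The sketch below is one possible way to do that; the `BackupCopy` fields and the example inventory are assumptions made for illustration, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str  # e.g. a region or site name (illustrative)
    media: str     # e.g. "block-storage", "object-storage", "tape"
    offsite: bool  # True if stored away from the primary site


def satisfies_3_2_1(copies):
    """Return True if the copies meet the 3-2-1 rule.

    The production copy counts as one of the three copies.
    """
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


# Hypothetical inventory: production volume, same-region snapshot,
# and an archive copy in another region.
inventory = [
    BackupCopy("region-a", "block-storage", offsite=False),
    BackupCopy("region-a", "object-storage", offsite=False),
    BackupCopy("region-b", "archive-storage", offsite=True),
]
print(satisfies_3_2_1(inventory))      # True
print(satisfies_3_2_1(inventory[:2]))  # False: two copies, none offsite
```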
| DR Tier | Strategy | RTO | RPO | Relative Cost | Example Use Case |
|---|---|---|---|---|---|
| Tier 0: Backup & Restore | Cold backups in object storage; restore from scratch | 24–72 hours | 24 hours | 1× | Internal tools, dev environments |
| Tier 1: Pilot Light | Core infrastructure pre-provisioned but scaled to zero; data synced | 4–8 hours | 1–4 hours | 1.2× | Non-critical SaaS, SMB workloads |
| Tier 2: Warm Standby | Scaled-down replica running, DB synced, can scale up on failover | 30 min – 2 hours | 5–30 minutes | 1.5× | E-commerce, B2B platforms |
| Tier 3: Hot Standby (Active-Passive) | Full-capacity replica, automated failover via DNS/load balancer | < 15 minutes | < 5 minutes | 2× | Banking, healthcare systems |
| Tier 4: Active-Active Multi-Region | Traffic split across regions; synchronous or async replication | < 1 minute | ≈ 0 (sync) or seconds (async) | 2–3× | Global payment networks, trading |
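One way to use the table is to pick the cheapest tier whose worst-case RTO and RPO still satisfy the business requirement. The sketch below encodes the upper bound of each range from the table. Three modeling choices are assumptions rather than facts from the table: "seconds" of RPO for asynchronous Tier 4 is modeled as 10 seconds, the 2–3× cost range is taken at its upper end, and strict bounds such as "< 15 minutes" are treated as inclusive.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    max_rto: timedelta    # worst-case recovery time for this tier
    max_rpo: timedelta    # worst-case data-loss window for this tier
    relative_cost: float  # multiple of the primary infrastructure cost


TIERS = [
    Tier("Tier 0: Backup & Restore", timedelta(hours=72), timedelta(hours=24), 1.0),
    Tier("Tier 1: Pilot Light", timedelta(hours=8), timedelta(hours=4), 1.2),
    Tier("Tier 2: Warm Standby", timedelta(hours=2), timedelta(minutes=30), 1.5),
    Tier("Tier 3: Hot Standby (Active-Passive)", timedelta(minutes=15), timedelta(minutes=5), 2.0),
    # Assumption: asynchronous replication, with "seconds" modeled as 10 s.
    Tier("Tier 4: Active-Active Multi-Region", timedelta(minutes=1), timedelta(seconds=10), 3.0),
]


def cheapest_tier(rto: timedelta, rpo: timedelta) -> Optional[Tier]:
    """Return the lowest-cost tier meeting both objectives, or None if none does."""
    candidates = [t for t in TIERS if t.max_rto <= rto and t.max_rpo <= rpo]
    return min(candidates, key=lambda t: t.relative_cost, default=None)


choice = cheapest_tier(rto=timedelta(hours=4), rpo=timedelta(hours=1))
print(choice.name)  # Tier 2: Warm Standby
```

A result of `None` means that no tier in the table meets the requirement as modeled, which is a signal to revisit either the requirement or the architecture.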
Real-World Example
GitLab's 2017 database incident is a canonical DR cautionary tale: an accidental `rm -rf` on the primary database's data directory destroyed data, and all five backup methods failed or were incomplete. The result was roughly 18 hours of downtime and about 6 hours of lost data, both consequences of an untested DR plan. GitLab published a detailed postmortem and invested heavily in automated backup verification; backup restore testing now runs automatically every 24 hours.
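Automated backup verification can be sketched generically as "restore into a scratch location, then check the result". The example below is not GitLab's implementation: the function names are hypothetical, and a plain file copy stands in for the real restore step, which in practice would load the backup into a scratch database and run sanity queries against it.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large backups are not loaded into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(backup_file: Path, expected_sha256: str) -> bool:
    """Restore a backup into a scratch directory and verify its checksum.

    Returns False if the backup is missing, empty, or does not match the
    checksum recorded when the backup was taken.
    """
    if not backup_file.is_file() or backup_file.stat().st_size == 0:
        return False
    with tempfile.TemporaryDirectory() as scratch:
        restored = Path(scratch) / backup_file.name
        shutil.copy2(backup_file, restored)  # stand-in for the real restore step
        return sha256_of(restored) == expected_sha256
```

Running a check like this on a schedule and alerting on any `False` result turns "we have backups" into "we have backups that were restorable as of the last run".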