Disaster Recovery (DR) defines how a system recovers from catastrophic failures that exceed normal fault-tolerance mechanisms: data center fires, region-wide cloud outages, ransomware, or accidental mass deletions. The two fundamental metrics are RTO (Recovery Time Objective: maximum tolerable downtime) and RPO (Recovery Point Objective: maximum tolerable data loss, measured as a window of time). Smaller RTO and RPO require more expensive infrastructure; a hot standby with synchronous replication achieves near-zero RTO/RPO but costs roughly 2× the primary infrastructure, because a full-capacity replica must run continuously. DR plans must be tested regularly through disaster recovery drills; an untested DR plan is not a DR plan.
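The two metrics can be made concrete with a small sketch. The names `RecoveryObjectives`, `worst_case_data_loss`, and `drill_meets_objectives` are hypothetical, and the worst-case formula assumes each backup captures the state at the moment it starts and only becomes usable once it finishes.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    """Targets agreed with the business (hypothetical container)."""
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data loss, expressed as a time window


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta) -> timedelta:
    """Assumed model: a failure just before the next backup completes loses
    everything written since the last completed backup started."""
    return backup_interval + backup_duration


def drill_meets_objectives(objectives: RecoveryObjectives,
                           measured_downtime: timedelta,
                           measured_data_loss: timedelta) -> bool:
    """Compare what a DR drill actually measured against the targets."""
    return (measured_downtime <= objectives.rto
            and measured_data_loss <= objectives.rpo)


targets = RecoveryObjectives(rto=timedelta(hours=4), rpo=timedelta(hours=1))
loss = worst_case_data_loss(timedelta(minutes=45), timedelta(minutes=10))
print(loss, drill_meets_objectives(targets, timedelta(hours=3), loss))
```

Under this model a backup schedule needs some headroom below the RPO, since the time a backup takes to complete counts toward the exposure window.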

Key Points

  • RTO is a business requirement, not a technical one — the cost of downtime (lost revenue, SLA penalties, reputation) must be calculated before choosing a DR tier.
  • RPO determines backup frequency and replication strategy: RPO of 1 hour → hourly snapshots; RPO of 5 minutes → streaming replication or CDC; RPO ≈ 0 → synchronous multi-region writes.
  • Backup 3-2-1 rule: 3 copies of data, on 2 different media types, with 1 copy offsite. A modern cloud equivalent is cross-region replication plus an archival storage class such as Amazon S3 Glacier Deep Archive.
  • Synchronous replication guarantees RPO=0 but adds at least one network round trip to the replica to every write (typically tens of milliseconds between regions, and well over 100 ms between continents); asynchronous replication is faster but allows data loss on failover.
  • Chaos Game Days — scheduled DR drills where the team simulates a region failure — are the only reliable way to validate RTO targets before a real disaster.
  • Database point-in-time recovery (PITR) protects against logical corruption and accidental deletions that replication propagates — enable PITR on all production databases.
  • Runbooks must be executable by on-call engineers under stress at 3 AM, so automate every step possible; manual runbooks drift out of date and lengthen actual recovery time.
  • Multi-cloud DR is extreme but used by regulated industries (finance, healthcare) to remove dependence on a single cloud provider, which is otherwise a single point of failure.
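
The 3-2-1 rule from the list above lends itself to an automated check. The following is a minimal sketch, assuming a hypothetical `BackupCopy` inventory record; the media labels are illustrative, and the primary copy is counted as one of the three copies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical inventory record)."""
    name: str
    media: str     # illustrative labels, e.g. "block-storage", "object-storage"
    offsite: bool  # True if stored outside the primary site or region


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """3 copies, on at least 2 media types, with at least 1 copy offsite."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


inventory = [
    BackupCopy("primary-db-volume", "block-storage", offsite=False),
    BackupCopy("same-region-snapshot", "object-storage", offsite=False),
    BackupCopy("cross-region-archive", "archive-storage", offsite=True),
]
print(satisfies_3_2_1(inventory))
```

A check like this could run against a backup inventory on a schedule, so that a silently removed offsite copy is noticed before it is needed.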

DR Tier | Strategy | RTO | RPO | Relative Cost | Example Use Case
--- | --- | --- | --- | --- | ---
Tier 0: Backup & Restore | Cold backups in object storage; restore from scratch | 24–72 hours | 24 hours | (not given) | Internal tools, dev environments
Tier 1: Pilot Light | Core infrastructure pre-provisioned but scaled to zero; data synced | 4–8 hours | 1–4 hours | 1.2× | Non-critical SaaS, SMB workloads
Tier 2: Warm Standby | Scaled-down replica running, DB synced, can scale up on failover | 30 min – 2 hours | 5–30 minutes | 1.5× | E-commerce, B2B platforms
Tier 3: Hot Standby (Active-Passive) | Full-capacity replica, automated failover via DNS/load balancer | < 15 minutes | < 5 minutes | (not given) | Banking, healthcare systems
Tier 4: Active-Active Multi-Region | Traffic split across regions; synchronous or async replication | < 1 minute | ≈ 0 (sync) or seconds (async) | 2–3× | Global payment networks, trading
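
Choosing a tier from the table can be sketched as a lookup: pick the lowest tier whose worst-case RTO and RPO still fit the business requirement. The `DrTier` type and `cheapest_tier` function are hypothetical; the bounds are the upper ends of the ranges in the table, tier order is used as a stand-in for cost, and Tier 4 is assumed to use synchronous replication (RPO of zero).

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class DrTier:
    name: str
    worst_rto: timedelta  # upper end of the RTO range in the table
    worst_rpo: timedelta  # upper end of the RPO range in the table


# Ordered from cheapest to most expensive (assumption: cost rises with tier).
TIERS = [
    DrTier("Tier 0: Backup & Restore", timedelta(hours=72), timedelta(hours=24)),
    DrTier("Tier 1: Pilot Light", timedelta(hours=8), timedelta(hours=4)),
    DrTier("Tier 2: Warm Standby", timedelta(hours=2), timedelta(minutes=30)),
    DrTier("Tier 3: Hot Standby (Active-Passive)",
           timedelta(minutes=15), timedelta(minutes=5)),
    DrTier("Tier 4: Active-Active Multi-Region",
           timedelta(minutes=1), timedelta(0)),  # assumes synchronous writes
]


def cheapest_tier(required_rto: timedelta,
                  required_rpo: timedelta) -> Optional[DrTier]:
    """Return the first (cheapest) tier that meets both targets, else None."""
    for tier in TIERS:
        if tier.worst_rto <= required_rto and tier.worst_rpo <= required_rpo:
            return tier
    return None


choice = cheapest_tier(timedelta(hours=1), timedelta(minutes=15))
print(choice.name if choice else "no tier meets these targets")
```

Returning `None` rather than the closest tier is deliberate in this sketch: targets that no tier can meet are a signal to renegotiate the requirement, not to pick silently.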

Real-World Example

GitLab's 2017 database incident is a canonical DR cautionary tale: an accidental `rm -rf` on the primary database's data directory destroyed production data, and all five backup and replication methods had failed or were incomplete. The result was roughly 18 hours of downtime (the actual recovery time) and about 6 hours of lost data (the actual recovery point), caused by an untested DR plan. GitLab published a detailed postmortem and invested heavily in automated backup verification; backup restore testing now runs automatically every 24 hours.
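
Automated backup verification can be illustrated with a small restore test. This is not GitLab's tooling; it is a minimal sketch that uses SQLite from the Python standard library as a stand-in for a production database, with hypothetical helper names `make_backup` and `verify_restore` and a hypothetical `issues` table.

```python
import sqlite3


def make_backup(source: sqlite3.Connection) -> sqlite3.Connection:
    """Copy the live database into a separate in-memory database."""
    backup = sqlite3.connect(":memory:")
    source.backup(backup)  # sqlite3's online backup API (Python 3.7+)
    return backup


def verify_restore(source: sqlite3.Connection,
                   backup: sqlite3.Connection,
                   table: str) -> bool:
    """Restore test: the backup must pass an integrity check and hold the
    same number of rows as the source. `table` must be a trusted name."""
    integrity = backup.execute("PRAGMA integrity_check").fetchone()[0]
    if integrity != "ok":
        return False
    query = f'SELECT COUNT(*) FROM "{table}"'
    source_rows = source.execute(query).fetchone()[0]
    backup_rows = backup.execute(query).fetchone()[0]
    return source_rows == backup_rows


# Hypothetical primary database with a few rows.
primary = sqlite3.connect(":memory:")
primary.execute("CREATE TABLE issues (id INTEGER PRIMARY KEY, title TEXT)")
primary.executemany("INSERT INTO issues (title) VALUES (?)",
                    [("first",), ("second",), ("third",)])
primary.commit()

backup = make_backup(primary)
print(verify_restore(primary, backup, "issues"))
```

A real verification job would restore into an isolated environment and compare against expectations recorded at backup time, since the live source keeps changing; the row-count comparison here only shows the shape of the check.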