Disaster Recovery | Non-Functional Requirements | System Design

Disaster Recovery (DR) defines how a system recovers from catastrophic failures that exceed normal fault-tolerance mechanisms: data center fires, region-wide cloud outages, ransomware, or accidental mass deletions. The two fundamental metrics are RTO (Recovery Time Objective: maximum tolerable downtime) and RPO (Recovery Point Objective: maximum tolerable data loss, measured as a window of time). Smaller RTO and RPO require more expensive infrastructure; a hot standby with synchronous replication achieves near-zero RTO/RPO but roughly doubles infrastructure cost, because a full-capacity copy of the primary must run at all times. DR plans must be tested regularly through disaster recovery drills; an untested DR plan is not a DR plan.
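To make the two metrics concrete, here is a minimal sketch of how a drill or incident could be scored against its targets. The `Objectives` class, the `evaluate_recovery` function, and the timestamps are hypothetical names and values chosen for illustration; they do not come from any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Objectives:
    rto: timedelta  # maximum tolerable downtime
    rpo: timedelta  # maximum tolerable data-loss window


def evaluate_recovery(objectives, failed_at, last_recoverable_at, restored_at):
    """Compare one incident (or drill) against the RTO/RPO targets."""
    # Downtime runs from the failure until service is restored.
    downtime = restored_at - failed_at
    # Data loss is everything written after the last recoverable point.
    data_loss = failed_at - last_recoverable_at
    return {
        "downtime": downtime,
        "data_loss": data_loss,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss <= objectives.rpo,
    }


# Illustrative targets: 1 hour of downtime, 15 minutes of data loss.
targets = Objectives(rto=timedelta(hours=1), rpo=timedelta(minutes=15))
report = evaluate_recovery(
    targets,
    failed_at=datetime(2024, 1, 1, 12, 0),
    last_recoverable_at=datetime(2024, 1, 1, 11, 50),
    restored_at=datetime(2024, 1, 1, 13, 30),
)
```

In this made-up drill the last recoverable write is 10 minutes old, which is inside the RPO, but service takes 90 minutes to come back, which misses the RTO. The two metrics are independent, and a plan can pass one while failing the other.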

Key Points

RTO is a business requirement, not a technical one — the cost of downtime (lost revenue, SLA penalties, reputation) must be calculated before choosing a DR tier.
RPO determines backup frequency and replication strategy: RPO of 1 hour → hourly snapshots; RPO of 5 minutes → streaming replication or CDC; RPO ≈ 0 → synchronous multi-region writes.
Backup 3-2-1 rule: 3 copies of data, on 2 different media types, with 1 copy offsite — modern cloud equivalent is cross-region + Glacier Deep Archive.
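Synchronous replication keeps RPO at 0 for acknowledged writes but adds at least one round trip to the replica to every write (typically tens of milliseconds cross-region, and well over 100 ms between continents); asynchronous replication is faster, but writes that have not yet reached the replica are lost on failover.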
Chaos Game Days — scheduled DR drills where the team simulates a region failure — are the only reliable way to validate RTO targets before a real disaster.
Database point-in-time recovery (PITR) protects against logical corruption and accidental deletions that replication propagates — enable PITR on all production databases.
Runbooks must be executable by on-call engineers under stress at 3 AM — automate every step possible; manual runbooks degrade in accuracy and slow RTO.
Multi-cloud DR is extreme but used by regulated industries (finance, healthcare) to remove the cloud provider itself as a single point of failure.
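Two of the points above, the RPO-to-strategy mapping and the 3-2-1 rule, reduce to simple checks. The sketch below is one possible encoding. The function names are hypothetical, and the cutoff of one hour between streaming replication and snapshots is an assumption, since the text only gives sample values (5 minutes and 1 hour).

```python
from datetime import timedelta


def strategy_for_rpo(rpo):
    """Map an RPO target to a replication strategy.

    Assumption: any RPO under one hour is treated as needing streaming
    replication or CDC; the source only names 5 minutes and 1 hour.
    """
    if rpo <= timedelta(0):
        return "synchronous multi-region writes"
    if rpo < timedelta(hours=1):
        return "streaming replication or CDC"
    return "periodic snapshots, taken at least once per RPO window"


def satisfies_3_2_1(copies):
    """Check the 3-2-1 rule.

    `copies` is a list of (media_type, is_offsite) tuples, one per copy
    of the data, counting the primary copy.
    """
    media_types = {media for media, _ in copies}
    has_offsite = any(offsite for _, offsite in copies)
    return len(copies) >= 3 and len(media_types) >= 2 and has_offsite


# Illustrative inventory: primary volume, same-region object storage,
# and a cross-region archive copy.
inventory = [
    ("block-volume", False),
    ("object-storage", False),
    ("archive-storage", True),
]
```

With this inventory `satisfies_3_2_1(inventory)` holds; dropping the archive copy breaks both the copy count and the offsite requirement.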

DR Tier	Strategy	RTO	RPO	Relative Cost	Example Use Case
Tier 0: Backup & Restore	Cold backups in object storage; restore from scratch	24–72 hours	24 hours	1×	Internal tools, dev environments
Tier 1: Pilot Light	Core infrastructure pre-provisioned but scaled to zero; data synced	4–8 hours	1–4 hours	1.2×	Non-critical SaaS, SMB workloads
Tier 2: Warm Standby	Scaled-down replica running, DB synced, can scale up on failover	30 min – 2 hours	5–30 minutes	1.5×	E-commerce, B2B platforms
Tier 3: Hot Standby (Active-Passive)	Full-capacity replica, automated failover via DNS/load balancer	< 15 minutes	< 5 minutes	2×	Banking, healthcare systems
Tier 4: Active-Active Multi-Region	Traffic split across regions; synchronous or async replication	< 1 minute	≈ 0 (sync) or seconds (async)	2–3×	Global payment networks, trading
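
The table can be read as a lookup: pick the cheapest tier whose worst-case RTO and RPO still fit the business requirement. The sketch below encodes the upper bound of each range from the table. It assumes synchronous replication for Tier 4 (RPO of 0) and uses the top of its 2–3× cost range; `TIERS` and `cheapest_tier` are hypothetical names.

```python
from datetime import timedelta

# (name, worst-case RTO, worst-case RPO, relative cost), taken from the
# upper bound of each range in the table above.
TIERS = [
    ("Tier 0: Backup & Restore", timedelta(hours=72), timedelta(hours=24), 1.0),
    ("Tier 1: Pilot Light", timedelta(hours=8), timedelta(hours=4), 1.2),
    ("Tier 2: Warm Standby", timedelta(hours=2), timedelta(minutes=30), 1.5),
    ("Tier 3: Hot Standby (Active-Passive)", timedelta(minutes=15), timedelta(minutes=5), 2.0),
    # Assumption: synchronous replication, so RPO is 0; cost uses the 3x bound.
    ("Tier 4: Active-Active Multi-Region", timedelta(minutes=1), timedelta(0), 3.0),
]


def cheapest_tier(required_rto, required_rpo):
    """Return the name of the cheapest tier meeting both targets, or None."""
    candidates = [
        tier for tier in TIERS
        if tier[1] <= required_rto and tier[2] <= required_rpo
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda tier: tier[3])[0]
```

For example, a requirement of 1 hour RTO and 10 minutes RPO rules out Warm Standby, whose worst-case RTO is 2 hours, and lands on Hot Standby. Using worst-case values is a deliberately conservative choice; a team with drill data showing better numbers could substitute its own measurements.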

Real-World Example

GitLab's 2017 database incident is a canonical DR cautionary tale: an accidental `rm -rf` on the primary database's data directory destroyed data, and all five backup methods failed or were incomplete. GitLab.com was down for roughly 18 hours and permanently lost about 6 hours of data, far outside any reasonable RTO or RPO. The root problem was an untested DR plan. GitLab published a detailed postmortem and invested heavily in automated backup verification; backup restore testing now runs automatically every 24 hours.
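
The lesson generalizes: a backup only counts once it has been restored and checked. The sketch below illustrates the idea with SQLite from the Python standard library as a stand-in. It is not GitLab's tooling, which runs against PostgreSQL, and the function names and the row-count threshold are hypothetical.

```python
import sqlite3


def make_backup(source_path, backup_path):
    """Copy a live SQLite database using the online backup API."""
    source = sqlite3.connect(source_path)
    target = sqlite3.connect(backup_path)
    try:
        source.backup(target)
    finally:
        target.close()
        source.close()


def verify_backup(backup_path, table, min_rows):
    """Open the backup and confirm it is intact and not suspiciously empty.

    `table` must be a trusted identifier; it is interpolated into SQL.
    """
    try:
        conn = sqlite3.connect(backup_path)
        try:
            # Structural check of the database file itself.
            status = conn.execute("PRAGMA integrity_check").fetchone()[0]
            # Sanity check that the expected data is actually present.
            rows = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        finally:
            conn.close()
    except sqlite3.DatabaseError:
        # Corrupt file, wrong file, or missing table: the backup is unusable.
        return False
    return status == "ok" and rows >= min_rows
```

A scheduled job could call `make_backup` and then `verify_backup`, alerting when the result is false. The row-count floor is a crude check, but it catches the case where a backup job succeeds while writing an empty or truncated file.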
