Database Reliability | Databases & Data Architecture | System Design

Database reliability encompasses the practices that keep data services available, recoverable, and performant under failure conditions. Connection pooling (PgBouncer, HikariCP, ProxySQL) is the first line of defense against connection exhaustion — PostgreSQL pays ~5 MB RAM and fork overhead per connection. Point-in-time recovery (PITR) combines full base backups with continuous WAL archival, enabling restoration to any second within the retention window. Automated failover (Patroni, AWS RDS Multi-AZ, Aurora Global Database) promotes a replica to primary within 15–60 seconds of detecting primary failure.

Key Points

Connection pooling modes: session (one server connection per client session — compatible but inefficient), transaction (server connection released after each transaction — most common), statement (released after each statement — incompatible with PREPARE/LISTEN).
PgBouncer in transaction mode supports 10,000+ client connections multiplexed onto 50–100 server connections; HikariCP (Java) maintains a pool per application instance with configurable min/max size.
Backup strategies: logical (pg_dump, mysqldump — portable, slow for large DBs), physical (pg_basebackup, Percona XtraBackup — fast, version-specific), continuous WAL archival (WAL-G, pgBackRest) for PITR.
RTO target drives backup strategy: pg_dump restore of 1 TB takes 2–4 hours (high RTO); Aurora PITR can restore to any second in 5–15 minutes using automated snapshot + binlog replay.
AWS RDS Multi-AZ uses synchronous replication to a standby in another AZ; failover is automatic (1–2 minutes) via DNS CNAME flip — your application must handle connection resets.
Aurora Global Database replicates with <1 second lag to secondary regions; promotes secondary to primary in <1 minute for cross-region DR scenarios.
Read-replica promotion is not instantaneous: the replica must first apply all pending WAL/binlog before accepting writes — monitor replica lag to estimate promotion time.
Vacuum bloat and table bloat monitoring: pg_bloat_check or pgstattuple extension identifies tables/indexes where >20% of space is dead tuples; schedule targeted VACUUM FULL off-hours.

Real-World Example

GitLab uses PgBouncer in transaction mode in front of their PostgreSQL primaries, sustaining 40k+ concurrent application connections with only ~400 actual server connections. Notion hit a catastrophic connection storm when deploying a new service version that opened connections without pooling; their incident post-mortem led to mandatory PgBouncer for all PostgreSQL access.

←PreviousEvent Streaming as Storage NextVector Databases→