Database reliability encompasses the practices that keep data services available, recoverable, and performant under failure conditions. Connection pooling (PgBouncer, HikariCP, ProxySQL) is the first line of defense against connection exhaustion — PostgreSQL pays ~5 MB RAM and fork overhead per connection. Point-in-time recovery (PITR) combines full base backups with continuous WAL archival, enabling restoration to any second within the retention window. Automated failover (Patroni, AWS RDS Multi-AZ, Aurora Global Database) promotes a replica to primary within 15–60 seconds of detecting primary failure.

Key Points

  • Connection pooling modes: session (one server connection per client session — compatible but inefficient), transaction (server connection released after each transaction — most common), statement (released after each statement — incompatible with PREPARE/LISTEN).
  • PgBouncer in transaction mode supports 10,000+ client connections multiplexed onto 50–100 server connections; HikariCP (Java) maintains a pool per application instance with configurable min/max size.
  • Backup strategies: logical (pg_dump, mysqldump — portable, slow for large DBs), physical (pg_basebackup, Percona XtraBackup — fast, version-specific), continuous WAL archival (WAL-G, pgBackRest) for PITR.
  • RTO target drives backup strategy: pg_dump restore of 1 TB takes 2–4 hours (high RTO); Aurora PITR can restore to any second in 5–15 minutes using automated snapshot + binlog replay.
  • AWS RDS Multi-AZ uses synchronous replication to a standby in another AZ; failover is automatic (1–2 minutes) via DNS CNAME flip — your application must handle connection resets.
  • Aurora Global Database replicates with <1 second lag to secondary regions; promotes secondary to primary in <1 minute for cross-region DR scenarios.
  • Read-replica promotion is not instantaneous: the replica must first apply all pending WAL/binlog before accepting writes — monitor replica lag to estimate promotion time.
  • Vacuum bloat and table bloat monitoring: pg_bloat_check or pgstattuple extension identifies tables/indexes where >20% of space is dead tuples; schedule targeted VACUUM FULL off-hours.

Real-World Example

GitLab uses PgBouncer in transaction mode in front of their PostgreSQL primaries, sustaining 40k+ concurrent application connections with only ~400 actual server connections. Notion hit a catastrophic connection storm when deploying a new service version that opened connections without pooling; their incident post-mortem led to mandatory PgBouncer for all PostgreSQL access.