Database Selection

Choosing a database engine is one of the most consequential architecture decisions, because migrating a large production dataset to a different engine later is extraordinarily expensive. The right choice depends on four axes: consistency requirements (ACID vs eventual), query patterns (point lookups vs aggregations vs graph traversals), scale envelope (rows, throughput, data volume), and operational cost (managed vs self-hosted, licensing). No single engine wins on every dimension, so systems at scale often use polyglot persistence: each data domain is matched to the engine best suited to it.
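As a rough illustration of how the four axes can drive a per-domain choice, the sketch below describes each data domain as a small profile and maps it to an engine family. The `WorkloadProfile` type, the `suggest_engine` function, and the example domains are hypothetical names introduced here; the thresholds (100k operations per second, more than 3 hops) are taken from the rules of thumb in this section and are starting points, not hard limits.

```python
from dataclasses import dataclass
from enum import Enum


class Consistency(Enum):
    ACID = "acid"
    EVENTUAL = "eventual"


class QueryPattern(Enum):
    POINT_LOOKUP = "point_lookup"
    RELATIONAL = "relational"
    AGGREGATION = "aggregation"
    GRAPH_TRAVERSAL = "graph_traversal"


@dataclass(frozen=True)
class WorkloadProfile:
    """Hypothetical description of one data domain along the four axes."""
    name: str
    consistency: Consistency
    query_pattern: QueryPattern
    peak_ops_per_sec: int
    write_heavy: bool = False
    multi_region_writes: bool = False
    max_join_depth: int = 1


def suggest_engine(p: WorkloadProfile) -> str:
    """Return a first-pass engine family; thresholds are illustrative only."""
    if p.query_pattern is QueryPattern.AGGREGATION:
        return "columnar store"
    if p.query_pattern is QueryPattern.GRAPH_TRAVERSAL and p.max_join_depth > 3:
        return "graph database"
    if (p.consistency is Consistency.EVENTUAL and p.write_heavy
            and p.multi_region_writes and p.peak_ops_per_sec > 100_000):
        return "Cassandra"
    if (p.query_pattern is QueryPattern.POINT_LOOKUP
            and p.peak_ops_per_sec > 100_000):
        return "key-value store (DynamoDB or Redis)"
    # Default: stay on the general-purpose relational engine.
    return "PostgreSQL"


# Polyglot persistence: one decision per data domain, not per system.
domains = [
    WorkloadProfile("orders", Consistency.ACID, QueryPattern.RELATIONAL, 2_000),
    WorkloadProfile("sessions", Consistency.EVENTUAL,
                    QueryPattern.POINT_LOOKUP, 250_000),
    WorkloadProfile("sensor_events", Consistency.EVENTUAL,
                    QueryPattern.POINT_LOOKUP, 400_000,
                    write_heavy=True, multi_region_writes=True),
    WorkloadProfile("fraud_rings", Consistency.ACID,
                    QueryPattern.GRAPH_TRAVERSAL, 500, max_join_depth=6),
    WorkloadProfile("revenue_reports", Consistency.EVENTUAL,
                    QueryPattern.AGGREGATION, 50),
]
plan = {d.name: suggest_engine(d) for d in domains}
for name, engine in plan.items():
    print(f"{name}: {engine}")
```

A real evaluation would also weigh operational cost and team expertise, which this sketch deliberately leaves out; its only purpose is to show that the decision is made per data domain and defaults to the relational engine.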

Key Points

Start with PostgreSQL as the default: it handles relational, JSONB semi-structured, full-text, and geospatial (via PostGIS) workloads. Switch only when you hit a concrete, measured limitation.
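If your access pattern is 100% key lookups at >100k RPS with no ad-hoc queries, consider DynamoDB or Redis (Redis only if the working set fits in memory and its durability model is acceptable); paying for key-value throughput is usually cheaper than scaling and over-indexing Postgres for the same load.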
Cassandra wins for multi-region active-active, write-heavy time-series (>100k writes/sec) where last-write-wins or application-level conflict resolution is acceptable.
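Graph use cases (fraud rings, recommendation, network topology) usually justify a graph DB once traversal depth exceeds 3–4 hops; in SQL, each extra hop in a self-join or recursive CTE multiplies the intermediate result by the average fan-out, so cost grows roughly exponentially with depth.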
Operational complexity increases sharply with exotic engines: DynamoDB is fully managed and needs almost no operational work beyond capacity and key design, while self-managed Cassandra requires deep expertise in compaction tuning and repair.
Workload shape and data volume determine the storage engine: on aggregation-heavy OLAP workloads that scan a few columns across many rows, columnar stores (Redshift, BigQuery, Snowflake, ClickHouse) can outperform row-stores by 10–100x.
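Evaluate SLAs: the Aurora SLA commits to 99.99% availability for Multi-AZ clusters, and the DynamoDB SLA commits to 99.999% for global tables (99.99% for single-Region tables). An SLA is a service-credit commitment rather than a guarantee of uptime, so compare it against your availability target and design backups and failover separately to meet your RTO/RPO requirements.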
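Licensing and total cost: MySQL Community and Percona XtraDB Cluster are open source with no license fee, so their cost is the infrastructure plus the engineers who operate them; Aurora MySQL costs roughly 2–3x the equivalent raw EC2+EBS but removes most storage management work.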
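To make the availability figures above concrete, the following sketch converts an availability percentage into a downtime budget. The function name and the 99.995% target are hypothetical; the arithmetic is simply the period length multiplied by the unavailable fraction. It assumes a 365-day year, and providers typically measure SLAs per billing month, so check the SLA text for the actual measurement window.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Minutes of allowed downtime in the period for a given availability %."""
    return period_minutes * (1 - availability_pct / 100)


# SLA figures quoted in the text above.
slas = {
    "Aurora Multi-AZ": 99.99,
    "DynamoDB global tables": 99.999,
}

# Hypothetical business target, chosen only for illustration.
target_pct = 99.995

for engine, pct in slas.items():
    budget = downtime_budget_minutes(pct)
    verdict = "meets" if pct >= target_pct else "misses"
    print(f"{engine}: {budget:.2f} min/year allowed, {verdict} {target_pct}% target")
# Aurora Multi-AZ: 52.56 min/year allowed, misses 99.995% target
# DynamoDB global tables: 5.26 min/year allowed, meets 99.995% target
```

The difference between four and five nines is about 47 minutes per year, which matters only if the rest of the request path (application servers, network, dependencies) is at least as available as the database.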

Real-World Example

Pinterest has used MySQL for its core social graph (pins, boards, users), HBase for large-scale analytics, and Redis for caching and real-time feeds. Dropbox evaluated many options and chose MySQL for reliability, using MyRocks (the RocksDB-based MySQL storage engine) for roughly 2x compression on cold data. GitHub has used MySQL for core metadata, Redis for job queues, and Elasticsearch for code search.
