Data Warehousing | Databases & Data Architecture | System Design

A data warehouse is a centralized analytical store that integrates data from multiple operational sources into a consistent, historical, and queryable schema. Modern cloud data warehouses — Amazon Redshift, Google BigQuery, and Snowflake — use columnar storage, vectorized query execution, and MPP to deliver sub-second query performance on petabyte-scale datasets. The dominant modeling approach is the star schema (fact table surrounded by dimension tables) or its normalized variant, the snowflake schema, both designed to minimize joins during GROUP BY / aggregate queries.

Key Points

Star schema: one central fact table (e.g., fact_sales: order_id, customer_key, product_key, date_key, amount) linked to dimension tables (dim_customer, dim_product, dim_date); queries join fact to 1–4 dimensions.
Snowflake schema normalizes dimension tables (dim_product → dim_category → dim_department); reduces redundancy but adds join depth — most warehouses prefer star for query performance.
Columnar compression in Redshift: ZSTD encoding on high-cardinality columns, AZ64 for numerics; typical compression ratio 3–5x reducing storage cost and I/O.
BigQuery separates storage (Colossus, charged at $20/TB/month) from compute (slots, charged on-demand at $6.25/TB scanned or flat-rate slots); zero infrastructure management.
Snowflake's multi-cluster shared data architecture: storage (S3/Azure Blob/GCS) is shared; compute "virtual warehouses" (VWs) are independent clusters that scale up/down in 2–3 seconds.
Partitioning and clustering: Redshift distribution keys co-locate related rows on the same node for JOIN efficiency; sort keys enable zone maps to skip irrelevant blocks.
BigQuery partitioned tables (by ingestion time or column) + clustered tables (by up to 4 columns) dramatically reduce bytes scanned — critical for controlling cost.
ELT over ETL: modern warehouses are powerful enough to transform data inside the warehouse; dbt (data build tool) manages SQL-based transformations as versioned, tested models.

Real-World Example

Lyft built their data warehouse on Hive, then migrated to Presto + S3 (data lake), and finally standardized on Snowflake for BI; the migration reduced P95 query time from 45 minutes to 4 minutes. Airbnb's "Minerva" metrics layer sits on top of Druid+Hive and provides a consistent definition of metrics (e.g., "bookings") across all BI tools to avoid metric inconsistency.

←PreviousOLTP vs OLAP NextData Lake & Lakehouse→