Data Pipelines & ETL/ELT | Databases & Data Architecture | System Design

Data pipelines move, transform, and load data between systems, either in scheduled batch windows or continuously in streaming mode. Apache Spark dominates large-scale batch processing with its in-memory DAG execution engine; Apache Flink leads stateful stream processing with exactly-once semantics and event-time windowing. dbt (data build tool) has become the standard for SQL-based analytical transformations inside data warehouses, while Apache Airflow remains the dominant DAG-based workflow orchestrator for scheduling and dependency management.

Key Points

Batch pipelines: process data in bounded chunks (hourly, daily); simpler to debug but introduce latency; suitable when downstream consumers tolerate stale data.
Streaming pipelines: process events within milliseconds of arrival; complex windowing and state management; required for real-time dashboards, fraud detection, and alerting.
Apache Spark: RDD → DataFrame → Dataset API; Catalyst optimizer rewrites logical plans; Tungsten engine uses off-heap memory and code generation for vectorized execution.
Apache Flink: true streaming (not micro-batch); supports event-time processing with watermarks to handle out-of-order events; RocksDB-backed state for large stateful operators.
dbt models are SQL SELECT statements; dbt compiles them into CREATE TABLE AS SELECT or CREATE VIEW statements; lineage graph is auto-generated from ref() calls between models.
Apache Airflow DAGs are Python code defining tasks and dependencies; scheduler uses a database (PostgreSQL/MySQL) for state; executors: LocalExecutor, CeleryExecutor, KubernetesExecutor.
ELT vs ETL: modern approach loads raw data first into the warehouse, then transforms in-place using Spark or dbt — avoids complex pre-load transformation and preserves raw data for reprocessing.
Data quality: Great Expectations, Soda, or dbt tests validate column nulls, uniqueness, referential integrity, and value ranges; failed tests can halt pipelines before bad data propagates.

Real-World Example

Airbnb open-sourced Apache Airflow (originally called Chronos internally) after building it to orchestrate their ML and analytics pipelines. Stripe uses a combination of Spark (batch risk scoring), Flink (real-time fraud signals), and dbt (BI-layer metrics) — three layers serving different freshness requirements from the same raw event data.

←PreviousData Lake & Lakehouse NextEvent Streaming as Storage→