Data Pipelines & ETL/ELT
Batch vs streaming, Apache Spark, Flink, dbt, Airflow
Data pipelines move, transform, and load data between systems, either in scheduled batch windows or continuously in streaming mode. Apache Spark dominates large-scale batch processing with its in-memory DAG execution engine; Apache Flink leads stateful stream processing with exactly-once semantics and event-time windowing. dbt (data build tool) has become the standard for SQL-based analytical transformations inside data warehouses, while Apache Airflow remains the dominant DAG-based workflow orchestrator for scheduling and dependency management.
Key Points
- Batch pipelines: process data in bounded chunks (hourly, daily); simpler to debug but introduce latency; suitable when downstream consumers tolerate stale data.
- Streaming pipelines: process events within milliseconds of arrival; complex windowing and state management; required for real-time dashboards, fraud detection, and alerting.
- Apache Spark: RDD → DataFrame → Dataset API; Catalyst optimizer rewrites logical plans; Tungsten engine uses off-heap memory and code generation for vectorized execution.
- Apache Flink: true streaming (not micro-batch); supports event-time processing with watermarks to handle out-of-order events; RocksDB-backed state for large stateful operators.
- dbt models are SQL SELECT statements; dbt compiles them into CREATE TABLE AS SELECT or CREATE VIEW statements; lineage graph is auto-generated from ref() calls between models.
- Apache Airflow DAGs are Python code defining tasks and dependencies; scheduler uses a database (PostgreSQL/MySQL) for state; executors: LocalExecutor, CeleryExecutor, KubernetesExecutor.
- ELT vs ETL: modern approach loads raw data first into the warehouse, then transforms in-place using Spark or dbt — avoids complex pre-load transformation and preserves raw data for reprocessing.
- Data quality: Great Expectations, Soda, or dbt tests validate column nulls, uniqueness, referential integrity, and value ranges; failed tests can halt pipelines before bad data propagates.
Real-World Example
Airbnb open-sourced Apache Airflow (originally called Chronos internally) after building it to orchestrate their ML and analytics pipelines. Stripe uses a combination of Spark (batch risk scoring), Flink (real-time fraud signals), and dbt (BI-layer metrics) — three layers serving different freshness requirements from the same raw event data.