Event Streaming as Storage

Apache Kafka is often treated purely as a messaging system, but its durable, ordered, and replayable log also makes it a viable storage layer for event-sourced systems. Log compaction retains at least the latest value for each key within a partition, so a compacted topic can serve as a changelog store: a materialized view of current state over a stream of updates. Confluent Schema Registry checks Avro, Protobuf, and JSON Schema definitions against a configured compatibility level whenever a new schema version is registered, which helps prevent consumer breakage as producers evolve their event schemas.
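The compaction rule can be illustrated with a small in-memory model. This is a minimal sketch in plain Python, not Kafka's log cleaner: the names compact and materialize, the record shape, and the sample keys are all assumptions made for illustration, and a value of None stands in for a tombstone.

```python
from typing import Optional

# A record is (key, value); a value of None models a tombstone (delete marker).
Record = tuple[str, Optional[str]]


def compact(log: list[Record]) -> list[Record]:
    """Keep only the last record per key, preserving the original log order."""
    last_index = {key: i for i, (key, _) in enumerate(log)}
    return [rec for i, rec in enumerate(log) if last_index[rec[0]] == i]


def materialize(log: list[Record]) -> dict[str, str]:
    """Replay a log into a key -> current value view; tombstones remove keys."""
    view: dict[str, str] = {}
    for key, value in log:
        if value is None:
            view.pop(key, None)
        else:
            view[key] = value
    return view


log = [
    ("user-1", "alice@old.example"),
    ("user-2", "bob@example"),
    ("user-1", "alice@new.example"),
    ("user-2", None),  # tombstone: user-2 was deleted
]
compacted = compact(log)
```

Replaying either log yields the same view, which is why a consumer can bootstrap current state from a compacted topic without reading every historical update. What is lost is the intermediate history for each key.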

Key Points

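Kafka log compaction: the log cleaner periodically scans the closed segments of a partition and removes records whose key has a newer value later in the same partition. The active segment is never compacted, so the log converges toward a snapshot of the latest value per key rather than being one at every instant. A record with a null value (a tombstone) marks its key for deletion.
Compacted topics (cleanup.policy=compact) retain the latest value per key indefinitely; useful for materializing current state (e.g., a user profile topic where key=user_id, value=current profile JSON).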
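Event sourcing on Kafka: every state change is an immutable event appended to a topic; current state is derived by replaying events from offset 0, which gives a complete audit trail by design. This requires a topic that keeps every event (long or unlimited time-based retention), not a compacted one, because compaction discards older events for a key.
Consumer offset management: consumers commit their read position (the offset of the next record to read) to the internal __consumer_offsets topic or to an external store; on restart, they resume from the last committed offset, so records processed after the last commit may be processed again.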
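Exactly-once semantics in Kafka (EOS): producer idempotence (the broker deduplicates retries using a producer ID and per-partition sequence numbers) plus the transactional API (atomic writes across multiple partitions and topics, including the consumer's offset commit). Combined with consumers that set isolation.level=read_committed, this gives exactly-once processing for read-process-write pipelines that stay within Kafka; side effects in external systems still need idempotent writes or their own transaction handling.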
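Schema Registry stores schemas under subjects (by default <topic>-key and <topic>-value); compatibility levels: BACKWARD (consumers using the new schema can read data written with the previous schema; this is the default), FORWARD (consumers using the previous schema can read data written with the new schema), FULL (both). The _TRANSITIVE variants check against all earlier versions rather than only the latest one.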
Kafka Streams (KStream/KTable API) enables stateful stream processing within the Kafka ecosystem: joins, windowed aggregations, and changelog-backed state stores without a separate processing cluster.
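Tiered Storage (KIP-405; early access in Kafka 3.6, production-ready in 3.9) offloads closed log segments to remote object storage such as S3 or GCS, enabling very long retention at object-storage cost, which matters for event-sourced systems that need indefinite history. It does not apply to compacted topics.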
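The producer idempotence point above can be made concrete with a toy model of broker-side deduplication. This is a simplified sketch under stated assumptions, not the broker's actual implementation: PartitionLog, OutOfOrderSequence, and the integer producer ID are hypothetical names, and real brokers also track producer epochs and work on batches rather than single records.

```python
class OutOfOrderSequence(Exception):
    """Raised when a producer skips ahead, which would imply a lost record."""


class PartitionLog:
    """Simplified model of one partition with broker-side duplicate detection."""

    def __init__(self) -> None:
        self.records: list[str] = []
        self._last_seq: dict[int, int] = {}  # producer id -> last appended sequence

    def append(self, producer_id: int, seq: int, value: str) -> bool:
        """Append once per (producer_id, seq); return False for a duplicate retry."""
        last = self._last_seq.get(producer_id, -1)
        if seq <= last:
            # Retry of a record that is already in the log: acknowledge, do not append.
            return False
        if seq != last + 1:
            raise OutOfOrderSequence(f"expected {last + 1}, got {seq}")
        self.records.append(value)
        self._last_seq[producer_id] = seq
        return True


log = PartitionLog()
log.append(producer_id=7, seq=0, value="order-created")
log.append(producer_id=7, seq=1, value="order-paid")
# Suppose the acknowledgement for seq=1 was lost, so the producer retries it.
retried = log.append(producer_id=7, seq=1, value="order-paid")
```

The retry is acknowledged without a second append, so the log holds each record once even though the producer sent one of them twice. Deduplication of this kind covers a single producer session and partition; atomicity across partitions is what the transactional API adds.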

Real-World Example

LinkedIn built Kafka to solve internal data pipeline fragmentation, and has since reported its clusters handling more than 7 trillion messages per day. Zalando uses Kafka as the single source of truth for order state: each order event is appended to a compacted order topic, and downstream services (inventory, shipping, finance) replay it from the beginning to rebuild their own local state. Because the topic is compacted, such a replay yields the latest record per order key rather than every intermediate event.
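The replay pattern in the example above, where several services derive their own state from one shared log, might look roughly like the following. This is an illustrative sketch only and not Zalando's implementation: the Service class, the view functions, and the event fields are invented for the example, and the topic is modelled as an uncompacted, single-partition Python list whose index plays the role of the offset.

```python
from dataclasses import dataclass, field
from typing import Callable

Event = dict[str, str]
State = dict[str, str]


@dataclass
class Service:
    """A downstream consumer that folds order events into its own local state."""
    apply: Callable[[State, Event], None]
    state: State = field(default_factory=dict)
    committed_offset: int = 0  # next offset to read, like a committed Kafka offset

    def poll(self, topic: list[Event]) -> None:
        for event in topic[self.committed_offset:]:
            self.apply(self.state, event)
            self.committed_offset += 1


def shipping_view(state: State, event: Event) -> None:
    # Shipping only tracks orders that are paid and not yet shipped.
    if event["type"] == "paid":
        state[event["order_id"]] = "ready_to_ship"
    elif event["type"] == "shipped":
        state.pop(event["order_id"], None)


def finance_view(state: State, event: Event) -> None:
    # Finance keeps the latest payment status per order.
    if event["type"] in ("created", "paid"):
        state[event["order_id"]] = event["type"]


order_topic: list[Event] = [
    {"order_id": "o-1", "type": "created"},
    {"order_id": "o-1", "type": "paid"},
    {"order_id": "o-2", "type": "created"},
]

shipping = Service(shipping_view)
finance = Service(finance_view)
shipping.poll(order_topic)  # a new consumer replays from offset 0
finance.poll(order_topic)

order_topic.append({"order_id": "o-1", "type": "shipped"})
shipping.poll(order_topic)  # resumes at offset 3 and reads only the new event
```

Each service reads the same events but keeps a different view and its own offset, so a new service can be added later and catch up by replaying the topic from the start.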
