Three Pillars | Observability & Operations | System Design

Observability is the ability to infer a system's internal state from its external outputs. It is commonly described in terms of three pillars: logs (discrete events), metrics (numeric time-series aggregates), and distributed traces (request flow across services). Correlation IDs are injected at the ingress point and propagated through every service call, so a single user request can be linked across all three pillars. Tools like OpenTelemetry provide a vendor-neutral SDK that captures all three signal types with a unified instrumentation model. Together the pillars answer "what happened?" (logs), "how often and how bad?" (metrics), and "where did latency come from?" (traces).
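The propagation idea can be sketched in a few lines of standard-library Python. This is a minimal illustration, not how OpenTelemetry or any particular gateway implements it: the helper names (`ensure_traceparent`, `child_headers`, `log_event`) are hypothetical, headers are assumed to be a plain dict with lowercase keys, and validation is simplified to a format check on a W3C-style `traceparent` value (`version-traceid-spanid-flags`).

```python
import json
import re
import secrets

# Simplified W3C-style traceparent: version 00, 32-hex trace ID,
# 16-hex span ID, 2-hex flags. Real validation has more rules.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Create a fresh traceparent value at the ingress point."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-01"


def ensure_traceparent(headers: dict) -> dict:
    """Reuse a well-formed incoming traceparent, otherwise generate one."""
    value = headers.get("traceparent", "")
    if not TRACEPARENT_RE.match(value):
        value = new_traceparent()
    return {**headers, "traceparent": value}


def child_headers(headers: dict) -> dict:
    """Headers for an outgoing call: same trace ID, new span ID."""
    version, trace_id, _parent_span, flags = headers["traceparent"].split("-")
    child = f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"
    return {**headers, "traceparent": child}


def trace_id_of(headers: dict) -> str:
    """Extract the trace ID that links logs, metrics, and traces."""
    return headers["traceparent"].split("-")[1]


def log_event(headers: dict, message: str, **fields) -> str:
    """Emit a structured log line stamped with the request's trace ID."""
    return json.dumps({"trace_id": trace_id_of(headers), "message": message, **fields})


if __name__ == "__main__":
    ingress = ensure_traceparent({})   # no incoming header: generate one
    downstream = child_headers(ingress)  # forwarded to the next service
    print(log_event(ingress, "request received"))
    print(log_event(downstream, "calling inventory service"))
```

Both log lines carry the same `trace_id`, which is what lets a log search and a trace viewer agree on which request they are showing.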

Key Points

Logs capture discrete, timestamped events with arbitrary payload; best for debugging specific incidents
Metrics are pre-aggregated numeric values (counters, gauges, histograms) sampled at regular intervals; efficient at scale for alerting and dashboards
Distributed traces record causal chains of spans across service boundaries, enabling latency attribution and dependency mapping
Correlation IDs (trace IDs) must be generated at the outermost entry point (API gateway or load balancer) and forwarded via HTTP headers (e.g., X-Trace-ID or W3C traceparent)
OpenTelemetry (CNCF) provides a single instrumentation layer exporting to any backend (Jaeger, Grafana Tempo, Datadog, New Relic) via OTLP
The three pillars complement each other: a metric alert fires → you open logs for that time window → trace IDs in logs take you to the full request trace
Exemplars link a specific metric data point to a trace ID, enabling one-click drill-down from a histogram spike to the offending trace
Volume and cardinality are the primary cost drivers: high-cardinality labels multiply the number of metric time series, while trace and log volume grows with every request, so tail-based sampling and log-level filtering are essential for cost control
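Tail-based sampling, mentioned in the last point, can be illustrated with a small decision function. This is a sketch under stated assumptions: the `Span` type, the 500 ms slow threshold, and the 1% baseline rate are all hypothetical values chosen for illustration, and a real collector would also need to buffer spans until a trace is complete.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    error: bool = False


def keep_trace(
    spans: List[Span],
    slow_ms: float = 500.0,        # hypothetical "slow" threshold
    baseline_rate: float = 0.01,   # hypothetical share of normal traces kept
    rng: Callable[[], float] = random.random,
) -> bool:
    """Decide whether to store a trace after all of its spans have arrived.

    The decision is made at the tail (end) of the trace, so it can use
    information that head-based sampling does not have yet: whether any
    span failed and how slow the slowest span was.
    """
    if not spans:
        return False
    if any(s.error for s in spans):
        return True  # always keep failed requests
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True  # always keep slow requests
    return rng() < baseline_rate  # keep a small random share of the rest


if __name__ == "__main__":
    fast_ok = [Span("t1", 12.0), Span("t1", 30.0)]
    failed = [Span("t2", 15.0, error=True)]
    slow = [Span("t3", 20.0), Span("t3", 900.0)]
    for name, trace in [("fast_ok", fast_ok), ("failed", failed), ("slow", slow)]:
        print(name, keep_trace(trace))
```

The design choice is that interesting traces (errors, high latency) are always retained, while routine traffic is sampled down to a small baseline that still shows normal behavior.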

Three pillars of observability unified by correlation IDs propagated across every service boundary

Real-World Example

At Netflix, a single user-facing error shows up as a metrics anomaly that triggers an alert. Engineers open Kibana with logs filtered by trace_id, then jump to Zipkin to see the full request waterfall. The whole path takes under 60 seconds because correlation IDs are enforced at the Zuul gateway.
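The metric-to-log-to-trace drill-down can be mimicked with in-memory data. The sketch below is not the tooling named above; `LatencyHistogram` and `logs_for_trace` are hypothetical stand-ins that show how an exemplar stored alongside a histogram bucket gives a trace ID, which then filters the logs.

```python
from bisect import bisect_left
from typing import Dict, List, Optional


class LatencyHistogram:
    """Histogram that remembers one exemplar trace ID per bucket."""

    def __init__(self, bounds_ms: List[float]):
        # Upper bounds are inclusive; the final bucket catches everything else.
        self.bounds = sorted(bounds_ms) + [float("inf")]
        self.counts = [0] * len(self.bounds)
        self.exemplars: List[Optional[str]] = [None] * len(self.bounds)

    def observe(self, value_ms: float, trace_id: str) -> None:
        i = bisect_left(self.bounds, value_ms)  # first bound >= value
        self.counts[i] += 1
        self.exemplars[i] = trace_id  # keep the most recent trace per bucket

    def exemplar_for_slowest(self) -> Optional[str]:
        """Trace ID attached to the highest non-empty bucket, if any."""
        for i in reversed(range(len(self.bounds))):
            if self.counts[i]:
                return self.exemplars[i]
        return None


def logs_for_trace(records: List[Dict], trace_id: str) -> List[Dict]:
    """Filter structured log records down to a single request."""
    return [r for r in records if r.get("trace_id") == trace_id]


if __name__ == "__main__":
    hist = LatencyHistogram([100, 500])
    hist.observe(40, "t1")
    hist.observe(450, "t2")
    hist.observe(1200, "t3")  # the spike an alert would fire on

    logs = [
        {"trace_id": "t1", "message": "ok"},
        {"trace_id": "t3", "message": "upstream timeout"},
    ]
    suspect = hist.exemplar_for_slowest()
    print(suspect, logs_for_trace(logs, suspect))
```

In a real deployment the histogram, log store, and trace store are separate systems; the shared trace ID is the only thing this sketch assumes they have in common.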
