Three Pillars
Logs, metrics, distributed traces; correlation IDs across pillars
Observability is the ability to infer a system's internal state from its external outputs. It is commonly described in terms of three pillars: logs (discrete events), metrics (numeric time-series aggregates), and distributed traces (request flow across services). A correlation ID is generated at the ingress point and propagated through every service call, so a single user request can be linked across all three pillars. OpenTelemetry provides vendor-neutral APIs and SDKs that capture all three signal types with a unified instrumentation model. Together the pillars answer "what happened?" (logs), "how often and how bad?" (metrics), and "where did latency come from?" (traces).
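To make the propagation step concrete, here is a minimal sketch of how a service might create and forward a W3C `traceparent` header using only the Python standard library. The helper names (`new_traceparent`, `parse_traceparent`, `outgoing_headers`) are hypothetical, and the sketch assumes header names arrive as lowercase dictionary keys. In practice an OpenTelemetry SDK would normally handle this for you.

```python
import re
import secrets

# W3C Trace Context layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the ingress point (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # The spec forbids version ff and all-zero trace or span IDs.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, flags


def outgoing_headers(incoming: dict) -> dict:
    """Build headers for a downstream call: keep the trace ID, mint a new span ID."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed is None:
        # No valid context arrived, so this service acts as the ingress point.
        return {"traceparent": new_traceparent()}
    trace_id, _parent_span_id, flags = parsed
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}
```

Every hop keeps the same trace ID and replaces only the span ID, which is what lets a backend stitch the spans of one request into a single trace.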
Key Points
- Logs capture discrete, timestamped events with arbitrary payload; best for debugging specific incidents
- Metrics are pre-aggregated numeric values (counters, gauges, histograms) sampled at regular intervals; efficient at scale for alerting and dashboards
- Distributed traces record causal chains of spans across service boundaries, enabling latency attribution and dependency mapping
- Correlation IDs (trace IDs) should be generated at the outermost entry point (API gateway or load balancer) and forwarded on every downstream call via HTTP headers, preferably the standard W3C traceparent header rather than a custom one such as X-Trace-ID
- OpenTelemetry (CNCF) provides a single instrumentation layer that exports via OTLP to compatible backends (Jaeger, Grafana Tempo, Datadog, New Relic)
- The three pillars complement each other: a metric alert fires → you open logs for that time window → trace IDs in logs take you to the full request trace
- Exemplars link a specific metric data point to a trace ID, enabling one-click drill-down from a histogram spike to the offending trace
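- Volume and cardinality are the primary cost drivers: logs and traces grow with request volume, while metrics grow with the number of unique label combinations, so tail-based sampling, log-level filtering, and bounded label sets are essential for cost control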
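The tail-based sampling mentioned in the last point can be sketched as a decision made once a trace has finished, when all of its spans are known. This is an illustrative sketch only: the `Span` type, the `keep_trace` function, the 500 ms latency threshold, and the 1% baseline rate are all assumptions, not values taken from any particular collector.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool = False


def keep_trace(spans, latency_threshold_ms=500.0, baseline_rate=0.01,
               rng=random.random):
    """Decide whether to keep a completed trace (hypothetical policy)."""
    if any(span.is_error for span in spans):
        return True  # always keep traces that contain an error
    slowest = max((span.duration_ms for span in spans), default=0.0)
    if slowest >= latency_threshold_ms:
        return True  # always keep traces that contain a slow span
    # Keep only a small random share of healthy, fast traffic.
    return rng() < baseline_rate
```

Because the decision happens after the request completes, the interesting traces (errors and slow requests) are retained in full while routine traffic is thinned out, which head-based sampling cannot guarantee.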
Three pillars of observability unified by correlation IDs propagated across every service boundary
Real-World Example
At Netflix, a single user-facing error surfaces as a metrics anomaly that triggers an alert. Engineers open the logs in Kibana filtered by trace_id, then jump to Zipkin to see the full request waterfall. The whole path takes under 60 seconds because correlation IDs are enforced at the Zuul gateway.
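Filtering logs by trace_id only works if every log line carries the ID. The sketch below shows one way that might be done with Python's standard `logging` and `contextvars` modules. The logger name `checkout`, the `current_trace_id` variable, and the sample trace ID are hypothetical, and this is not a description of Netflix's actual implementation.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request currently being handled (hypothetical name).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the current trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so a log backend can index trace_id."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(stream):
    logger = logging.getLogger("checkout")  # hypothetical service name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


stream = io.StringIO()
log = build_logger(stream)
# In a real service this would be set once per request from the incoming header.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("payment declined")
```

With the ID stored in a context variable, application code logs as usual and the filter adds the field, so a search on trace_id returns every line written while handling that request.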