Observability is the ability to infer a system's internal state from its external outputs. It is commonly described in terms of three pillars: logs (discrete events), metrics (numeric time-series aggregates), and distributed traces (request flow across services). Correlation IDs are injected at the ingress point and propagated through every service call, so a single user request can be linked across all three pillars. Tools like OpenTelemetry provide a vendor-neutral SDK that captures all three signal types with a unified instrumentation model. Together, the three pillars answer "what happened?" (logs), "how often and how bad?" (metrics), and "where did latency come from?" (traces).
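
The propagation step can be sketched with the W3C traceparent header, whose value has the form version-traceid-parentid-flags in lowercase hex. This is a minimal illustration, not OpenTelemetry's propagator: the function names new_traceparent, parse_traceparent, and outgoing_headers are hypothetical, the headers are assumed to arrive as a dict with lowercase keys, and only the four-field form is accepted.

```python
import re
import secrets
from typing import Optional, Tuple

# traceparent layout: 2-hex version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the ingress point (hypothetical helper)."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def parse_traceparent(header: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_span_id, flags), or None if the value is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid according to the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def outgoing_headers(incoming: dict) -> dict:
    """Reuse the caller's trace ID if one is present, otherwise start a new trace."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed is None:
        return {"traceparent": new_traceparent()}
    trace_id, _parent_id, flags = parsed
    # Same trace ID, fresh span ID for this hop.
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}
```

Each service would call something like outgoing_headers on the request it received before calling the next service, so the trace ID stays constant while the span ID changes per hop.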

Key Points

  • Logs capture discrete, timestamped events with arbitrary payload; best for debugging specific incidents
  • Metrics are pre-aggregated numeric values (counters, gauges, histograms) sampled at regular intervals; efficient at scale for alerting and dashboards
  • Distributed traces record causal chains of spans across service boundaries, enabling latency attribution and dependency mapping
  • Correlation IDs (trace IDs) should be generated at the outermost entry point (API gateway or load balancer) whenever the incoming request does not already carry one, then forwarded via HTTP headers (e.g., a custom X-Trace-ID header or the standard W3C traceparent header)
  • OpenTelemetry (CNCF) provides a single instrumentation layer exporting to any backend (Jaeger, Grafana Tempo, Datadog, New Relic) via OTLP
  • The three pillars complement each other: a metric alert fires → you open logs for that time window → trace IDs in logs take you to the full request trace
  • Exemplars link a specific metric data point to a trace ID, enabling one-click drill-down from a histogram spike to the offending trace
  • Volume and cardinality are the primary cost drivers: traces and logs are high-volume and carry high-cardinality fields (trace IDs, user IDs), so tail-based sampling and log-level filtering are essential for cost control
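
Tail-based sampling, mentioned in the last point, can be sketched as a decision made once all spans of a trace have arrived. This is a simplified illustration and not the policy of any particular collector: the Span fields, the 500 ms threshold, and the 5% baseline rate are assumptions chosen for the example.

```python
import random
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    service: str
    duration_ms: float
    error: bool = False


def keep_trace(spans, slow_ms=500.0, baseline_rate=0.05, rng=random.random):
    """Decide whether to store a completed trace (hypothetical policy)."""
    if not spans:
        return False
    if any(s.error for s in spans):
        return True                  # always keep traces that contain an error
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True                  # always keep traces with a slow span
    return rng() < baseline_rate     # keep a small random share of normal traffic
```

Because the decision waits for the whole trace, errors and slow requests are never dropped, while routine traffic is thinned out. The rng parameter exists only so the random choice can be replaced with a fixed value in tests.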

[Figure: Three Pillars of Observability. LOGS: discrete events, unstructured or JSON (example: 2024-01-15 ERROR payment failed, trace_id: abc123, user_id: usr_****). METRICS: numeric time series, counter / gauge / histogram (example: http_requests_total). TRACES: request causality, spans and timing (example spans: api-gateway 48ms, order-svc 32ms, payment-svc 18ms, db-query 9ms). CORRELATION LAYER: trace_id / correlation_id propagated via the W3C traceparent header.]

Three pillars of observability unified by correlation IDs propagated across every service boundary
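
A log line that carries the trace ID, like the one in the figure, can be produced with Python's standard logging module. This is a minimal sketch: the names current_trace_id, TraceIdFilter, JsonFormatter, and build_logger are hypothetical, the logger name "payment-svc" is borrowed from the figure, and the timestamp is left out to keep the output short.

```python
import contextvars
import io
import json
import logging

# Holds the trace ID of the request currently being handled.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Attach the active trace ID to every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
        })


def build_logger(stream):
    """Create a logger that writes JSON lines with a trace_id field to stream."""
    logger = logging.getLogger("payment-svc")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def demo():
    """Log one error under trace ID abc123 and return the raw output."""
    buffer = io.StringIO()
    log = build_logger(buffer)
    current_trace_id.set("abc123")
    log.error("payment failed")
    return buffer.getvalue()
```

Request-handling code would set current_trace_id once per request, after which every log call in that context picks up the ID without passing it explicitly.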

Real-World Example

At Netflix, a single user-facing error surfaces as a metrics anomaly that triggers an alert. Engineers open Kibana logs filtered by trace_id, then jump to Zipkin to see the full request waterfall. The whole path takes under 60 seconds because correlation IDs are enforced at the Zuul gateway.
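
The drill-down path in this example (metric, then logs, then trace) can be sketched on toy in-memory data. This illustrates only the linking by trace ID; it does not reflect how Kibana, Zipkin, or any Netflix system works internally, and the Exemplar, LogLine, SpanRecord, and drill_down names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    value_ms: float   # the measured latency behind one metric data point
    trace_id: str     # the trace that produced it


@dataclass
class LogLine:
    trace_id: str
    service: str
    message: str


@dataclass
class SpanRecord:
    trace_id: str
    service: str
    duration_ms: float


def drill_down(exemplars, logs, spans, threshold_ms):
    """Follow the slowest exemplar above threshold_ms to its logs and spans."""
    slow = [e for e in exemplars if e.value_ms >= threshold_ms]
    if not slow:
        return None
    worst = max(slow, key=lambda e: e.value_ms)
    related_logs = [line for line in logs if line.trace_id == worst.trace_id]
    # Longest spans first, so the largest latency contributors come first.
    waterfall = sorted(
        (s for s in spans if s.trace_id == worst.trace_id),
        key=lambda s: s.duration_ms,
        reverse=True,
    )
    return worst.trace_id, related_logs, waterfall
```

The only thing connecting the three data sets is the shared trace_id field, which is why the workflow depends on the ID being present on every signal.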