Observability
Structured logging, metrics, distributed tracing — the three pillars
Observability is the ability to infer the internal state of a system from its external outputs. Its three pillars are structured logs (discrete events), metrics (aggregated measurements over time), and distributed traces (end-to-end request journeys). Unlike monitoring, which alerts on known failure modes, observability supports answering questions you didn't think to ask before an incident. The OpenTelemetry project provides vendor-neutral APIs and SDKs for instrumenting all three signals. Modern observability platforms (Datadog, Honeycomb, Grafana Stack) correlate the signals through a common trace ID, which can shorten root cause identification considerably in complex microservice topologies.
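As a rough illustration of correlating signals through a common trace ID, the sketch below emits one JSON object per log line using only the Python standard library. The field names, the `JsonFormatter` class, and the `new_ids` and `build_logger` helpers are assumptions made for this example; a real service would normally take the IDs from its tracing SDK rather than generate them itself.

```python
import json
import logging
import secrets


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    # Hypothetical correlation fields; adapt to your own log schema.
    FIELDS = ("trace_id", "span_id", "service", "request_id")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in self.FIELDS:
            # Values passed via `extra=` become attributes on the record;
            # missing ones are logged as null.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload)


def new_ids() -> dict:
    """Random IDs sized like W3C trace IDs (16 bytes) and span IDs (8 bytes)."""
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}


def build_logger(name: str = "checkout") -> logging.Logger:
    """Attach a stream handler that uses the JSON formatter."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info(
        "payment authorized",
        extra={**new_ids(), "service": "checkout", "request_id": "req-1"},
    )
```

Because every line carries the same `trace_id` as the corresponding trace, a log search can pivot directly to the trace view and back.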
Key Points
- Structured logs (JSON) are machine-parseable; include trace_id, span_id, service, user_id, and request_id in every log line for correlation.
- RED method for service metrics: Rate (requests/sec), Errors (error rate %), Duration (latency percentiles). Together these three catch most user-facing service degradation.
- USE method for infrastructure: Utilization, Saturation, Errors — applies to CPU, memory, disk I/O, network, and connection pools.
- Distributed tracing propagates a trace context (W3C TraceContext or B3 headers) across service boundaries, allowing reconstruction of full call graphs.
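- Sampling strategy: head-based sampling (Jaeger default, random %) decides at the start of a request, so it can drop the rare slow or failing traces; tail-based sampling (e.g., Honeycomb) decides after the trace completes and can keep slow/error traces, which is critical for P99 debugging.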
- SLO-based alerting fires on error budget burn rate (e.g., burning 5% of monthly budget in 1 hour) rather than raw metric thresholds — reduces false positives significantly.
- Cardinality is the observability tax: every distinct combination of label values creates its own time series, so high-cardinality labels (user_id on metrics) cause storage explosion in time-series DBs; use traces for high-cardinality data, metrics for aggregates.
- Running the OpenTelemetry Collector as a sidecar decouples instrumentation from the vendor backend: services export to the Collector, so switching from Jaeger to Honeycomb becomes a Collector configuration change rather than a code change.
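To make the trace-context propagation point concrete, here is a minimal sketch of parsing a W3C Trace Context `traceparent` header and deriving the header for an outgoing call. It handles only version `00` of the format; the `TraceContext` type and both function names are invented for this illustration, and in practice an OpenTelemetry propagator would do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: version, 32-hex trace-id, 16-hex parent-id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is not valid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero trace or parent IDs are invalid.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # The lowest flag bit carries the "sampled" decision.
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing request: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"
```

Each service keeps the incoming trace ID and substitutes its own span ID, which is what lets a backend stitch the spans into a single call graph.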
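The head-versus-tail sampling distinction can be sketched as a decision function that runs only after all spans of a trace have been collected. The `Span` type, the 500 ms threshold, and the 1% baseline rate are hypothetical values chosen for illustration, not defaults of any particular product.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    error: bool = False


def keep_trace(
    spans: List[Span],
    slow_ms: float = 500.0,          # assumed latency threshold
    baseline_rate: float = 0.01,     # assumed sample rate for healthy traces
    rng: Callable[[], float] = random.random,
) -> bool:
    """Tail-based decision: made once the whole trace is available."""
    # Always keep traces that contain an error.
    if any(span.error for span in spans):
        return True
    # Always keep traces with at least one slow span.
    if max((span.duration_ms for span in spans), default=0.0) >= slow_ms:
        return True
    # Otherwise keep a small random share as a baseline.
    return rng() < baseline_rate
```

A head-based sampler would only be able to apply the last line, since at the start of a request neither the error status nor the duration is known yet. The cost of the tail-based approach is that spans must be buffered until the trace completes.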
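The burn-rate example in the SLO bullet can be worked through with a little arithmetic. The sketch below assumes a 30-day (720-hour) SLO window and a 99.9% availability target, neither of which is stated in the notes; the function names are also invented for this example.

```python
def burn_rate(
    budget_fraction: float,
    alert_window_h: float,
    slo_window_h: float = 30 * 24,  # assumed 30-day SLO window
) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent.

    A burn rate of 1 means the budget lasts exactly the SLO window.
    """
    if alert_window_h <= 0:
        raise ValueError("alert_window_h must be positive")
    return budget_fraction * slo_window_h / alert_window_h


def error_rate_threshold(slo_target: float, rate: float) -> float:
    """Error ratio that corresponds to a given burn rate for an SLO target."""
    return rate * (1.0 - slo_target)


if __name__ == "__main__":
    rate = burn_rate(budget_fraction=0.05, alert_window_h=1.0)
    threshold = error_rate_threshold(slo_target=0.999, rate=rate)
    print(f"burn rate: {rate:.1f}x, alert when error ratio exceeds {threshold:.2%}")
```

Under these assumptions, burning 5% of the budget in 1 hour corresponds to a burn rate of about 36, which for a 99.9% target means alerting when roughly 3.6% of requests fail over the past hour.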
Real-World Example
Slack's 2022 outage postmortem reportedly showed that trace correlation between the Vitess database layer and application services cut mean time to diagnosis from 45 minutes to under 8 minutes, compared with similar incidents before distributed tracing was adopted. The return on observability is measured in incident duration.