Observability | Non-Functional Requirements | System Design

Observability is the ability to infer the internal state of a system from its external outputs. Its three pillars are structured logs (discrete events), metrics (aggregated measurements over time), and distributed traces (end-to-end request journeys). Unlike monitoring, which alerts on known failure modes, observability lets you answer questions you did not think to ask before an incident. The OpenTelemetry project provides vendor-neutral APIs and SDKs for instrumenting all three signals. Modern observability platforms (Datadog, Honeycomb, the Grafana stack) correlate the signals through a common trace ID, which can substantially shorten root-cause identification in complex microservice topologies.
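To make the trace-ID correlation concrete, here is a minimal sketch of structured logging that uses only the Python standard library. It is not the OpenTelemetry SDK: the names `JsonFormatter`, `new_trace_context`, and `build_logger` are hypothetical, and in a real service the IDs would normally come from the active span rather than being generated locally.

```python
import io
import json
import logging
import secrets


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # These attributes are attached by the LoggerAdapter below.
            "service": getattr(record, "service", None),
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        return json.dumps(payload)


def new_trace_context() -> dict:
    # Assumption: IDs follow W3C Trace Context sizes
    # (16-byte trace ID, 8-byte span ID, hex-encoded).
    return {"trace_id": secrets.token_hex(16), "span_id": secrets.token_hex(8)}


def build_logger(stream, service: str) -> logging.LoggerAdapter:
    """Return a logger that stamps every line with service and trace fields."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(f"demo.{service}")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logging.LoggerAdapter(logger, {"service": service, **new_trace_context()})


if __name__ == "__main__":
    buffer = io.StringIO()
    log = build_logger(buffer, "checkout")
    log.info("payment authorized")
    print(buffer.getvalue().strip())
```

Because every line carries the same `trace_id`, a log search on that value can be joined against the trace with the same ID in whichever backend stores the spans.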

Key Points

Structured logs (JSON) are machine-parseable; include trace_id, span_id, service, user_id, and request_id in every log line for correlation.
RED method for service metrics: Rate (requests/sec), Errors (error rate %), Duration (latency percentiles). Together these three signals catch most user-facing service degradation.
USE method for infrastructure: Utilization, Saturation, Errors — applies to CPU, memory, disk I/O, network, and connection pools.
Distributed tracing propagates a trace context (W3C TraceContext or B3 headers) across service boundaries, allowing reconstruction of full call graphs.
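Sampling strategy: head-based sampling (the Jaeger default, a random percentage chosen when the trace starts) drops most slow and failed traces along with the healthy ones; tail-based sampling (e.g., Honeycomb) decides after the trace completes, so it can keep slow/error traces, which is critical for P99 debugging.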
SLO-based alerting fires on error budget burn rate (e.g., burning 5% of monthly budget in 1 hour) rather than raw metric thresholds — reduces false positives significantly.
Cardinality is the observability tax: high-cardinality labels (user_id on metrics) cause storage explosion in time-series DBs; use traces for high-cardinality, metrics for aggregates.
Running the OpenTelemetry Collector as a sidecar decouples instrumentation from the vendor backend: switching from Jaeger to Honeycomb becomes a Collector configuration change rather than an application code change.
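The trace-context point above can be illustrated with a small sketch that parses a W3C `traceparent` header and builds the header a service would send downstream. It handles only the version `00` layout (`version-traceid-parentid-flags`) and ignores `tracestate`. The helper names `parse_traceparent` and `child_traceparent` are hypothetical; in practice an OpenTelemetry propagator would usually do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex, 32 hex trace ID, 16 hex parent ID, 2 hex flags.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the parsed context, or None if the header is not valid version 00."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid; this sketch also rejects other versions.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Build the outgoing header: same trace ID, fresh span ID, same sampled flag."""
    span_id = secrets.token_hex(8)
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"


if __name__ == "__main__":
    incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
    ctx = parse_traceparent(incoming)
    if ctx is not None:
        print(child_traceparent(ctx))
```

The trace ID stays constant across every hop while each service contributes its own span ID, which is what lets a backend reassemble the full call graph.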
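The burn-rate idea from the SLO alerting point can be worked through numerically. The sketch below assumes a 30-day SLO period and defines burn rate as the observed error ratio divided by the error budget (1 minus the SLO target). The function names and default values are illustrative rather than taken from any particular alerting tool.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target
    return error_ratio / error_budget


def burn_rate_threshold(
    budget_fraction: float, window_hours: float, period_hours: float = 30 * 24
) -> float:
    """Burn rate that consumes `budget_fraction` of the budget within the window."""
    return budget_fraction * period_hours / window_hours


def should_alert(
    errors: int,
    requests: int,
    slo_target: float,
    budget_fraction: float,
    window_hours: float,
    period_hours: float = 30 * 24,
) -> bool:
    """Fire when the window's burn rate reaches the configured threshold."""
    if requests == 0:
        return False  # no traffic in the window, nothing to burn
    observed = burn_rate(errors / requests, slo_target)
    return observed >= burn_rate_threshold(budget_fraction, window_hours, period_hours)


if __name__ == "__main__":
    # 99.9% SLO; alert if 5% of the 30-day budget burns within 1 hour.
    print(burn_rate_threshold(0.05, 1))
    print(should_alert(errors=40, requests=1000, slo_target=0.999,
                       budget_fraction=0.05, window_hours=1))
```

Under these assumptions, burning 5% of a 30-day budget in 1 hour corresponds to a burn rate of about 36, which for a 99.9% SLO means an error ratio of roughly 3.6% sustained over that hour. A raw threshold such as "error rate above 1%" has no such link to the budget, which is why burn-rate alerts tend to be less noisy.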

Real-World Example

Slack's 2022 outage postmortem reported that trace correlation between their Vitess database layer and application services reduced mean time to diagnosis from 45 minutes to under 8 minutes, compared to similar incidents before distributed tracing was adopted. The return on observability is measured in incident duration.
