Distributed Tracing

Distributed tracing follows a single request as it flows through multiple services, recording each operation as a span; spans nest under a root span, and together they form one trace. OpenTelemetry (OTel) has become the de facto standard for instrumentation, providing auto-instrumentation for Java, Python, Node.js, Go, and .NET that captures spans from common libraries with little or no code change. Backends like Jaeger, Zipkin, and AWS X-Ray store and visualize traces; Grafana Tempo offers cost-effective trace storage backed by object storage. Sampling strategy is critical: head-based sampling (decide at ingress, before the outcome is known) reduces volume uniformly, while tail-based sampling (decide after the trace completes) can retain every slow or failed trace at the cost of a buffering layer.
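
The difference between the two sampling strategies can be sketched in a few lines of Python. This is a minimal illustration, not how any particular SDK or collector implements sampling: the `Span` class, the function names, and both thresholds are hypothetical. The head-based decision uses only the trace ID, while the tail-based decision inspects the finished spans.

```python
from dataclasses import dataclass
from typing import Sequence

# Hypothetical values chosen for illustration only.
HEAD_SAMPLE_RATIO = 0.10    # keep roughly 10% of traces at ingress
SLOW_THRESHOLD_MS = 500.0   # a trace with a span slower than this is "slow"


@dataclass
class Span:
    trace_id: str           # 32 hex characters (128 bits)
    name: str
    duration_ms: float
    is_error: bool = False


def head_sample(trace_id: str, ratio: float = HEAD_SAMPLE_RATIO) -> bool:
    """Decide at ingress, before the outcome of the request is known.

    The verdict is derived from the trace ID alone, so every service that
    sees the same ID reaches the same decision without coordination.
    """
    # Treat the low 64 bits of the trace ID as a uniform random number.
    low_bits = int(trace_id[-16:], 16)
    return low_bits < ratio * (1 << 64)


def tail_sample(spans: Sequence[Span], slow_ms: float = SLOW_THRESHOLD_MS) -> bool:
    """Decide after all spans of one trace have been buffered."""
    has_error = any(span.is_error for span in spans)
    slowest = max((span.duration_ms for span in spans), default=0.0)
    return has_error or slowest > slow_ms


if __name__ == "__main__":
    trace_id = "f" * 32  # placeholder ID whose low bits fall outside the 10%
    failed_trace = [
        Span(trace_id, "GET /checkout", 120.0),
        Span(trace_id, "charge-card", 95.0, is_error=True),
    ]
    print("head keeps:", head_sample(trace_id))      # False: dropped at ingress
    print("tail keeps:", tail_sample(failed_trace))  # True: the trace has an error
```

The demo shows the trade-off: the head-based sampler drops this failed request because its verdict is fixed before the error happens, whereas the tail-based sampler keeps it, but only because every span was held until the trace finished.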

Key Points

Trace: a directed acyclic graph (DAG) of spans representing the entire lifecycle of one request; identified by a globally unique trace_id (128-bit)
Span: a single unit of work with a name, start/end timestamps, a status (Unset, OK, or ERROR), and attributes (key-value pairs); each span has its own 64-bit span_id, and child spans link to their parent via parent_span_id
W3C traceparent header (00-{trace_id}-{parent_span_id}-{flags}, where 00 is the version) is the standard propagation format; prefer it over the legacy B3 (X-B3-*) headers in new services
OpenTelemetry Collector: a vendor-neutral proxy that receives spans (OTLP, Zipkin, Jaeger), transforms them (attribute enrichment, sampling), and exports them to multiple backends simultaneously
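Head-based sampling: the keep/drop decision is made when the trace starts, e.g., 10% uniform sampling at the gateway; simple, but rare error traces are dropped at the same rate as everything else (90% of them at a 10% rate); use probabilistic or rate-limiting samplers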
Tail-based sampling: buffer all spans of a trace in memory for a decision window (~30 seconds), then keep or drop the whole trace based on its outcome (error, slow, high-error-rate service); implemented by the OTel Collector's tail sampling processor (tail_sampling)
Grafana Tempo + Loki integration: with a derived field configured on the Loki data source, trace IDs in log lines link directly to the matching Tempo trace in Grafana; log-to-trace correlation needs no infrastructure beyond Tempo and Loki themselves
Instrumentation priority: auto-instrument HTTP clients and servers, database drivers, and message queue producers/consumers first; as a rule of thumb, these account for most (roughly 80%) of cross-service latency
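
To make the trace_id, parent_span_id, and traceparent relationship concrete, here is a small sketch of context propagation using only the Python standard library. `SpanContext`, `start_root`, `inject`, and `extract` are hypothetical names for this example and are not the OpenTelemetry API. The sketch handles version 00 of the header only and assumes lowercase hex, as the W3C format requires.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# version "00", 32-hex trace ID, 16-hex caller span ID, 2-hex flags
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")
SAMPLED_FLAG = 0x01


@dataclass(frozen=True)
class SpanContext:
    trace_id: str                  # 128-bit, shared by every span in the trace
    span_id: str                   # 64-bit, unique to this span
    parent_span_id: Optional[str]  # None for the root span
    sampled: bool


def new_span_id() -> str:
    return secrets.token_hex(8)    # 8 random bytes -> 16 hex characters


def start_root(sampled: bool = True) -> SpanContext:
    """Start a new trace at the edge of the system."""
    return SpanContext(secrets.token_hex(16), new_span_id(), None, sampled)


def inject(ctx: SpanContext) -> str:
    """Build the traceparent value to send with an outgoing request."""
    flags = SAMPLED_FLAG if ctx.sampled else 0x00
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags:02x}"


def extract(header: str) -> Optional[SpanContext]:
    """Parse an incoming traceparent and start a child span from it.

    Returns None when the header is malformed, so the caller can fall
    back to starting a new root trace.
    """
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, caller_span_id, flags = match.groups()
    # All-zero IDs are invalid in the W3C format.
    if trace_id == "0" * 32 or caller_span_id == "0" * 16:
        return None
    sampled = bool(int(flags, 16) & SAMPLED_FLAG)
    # Same trace, fresh span ID, and the caller's span becomes the parent.
    return SpanContext(trace_id, new_span_id(), caller_span_id, sampled)


if __name__ == "__main__":
    root = start_root()
    header = inject(root)          # service A sends this to service B
    child = extract(header)        # service B continues the same trace
    print(header)
    print(child)
```

The span ID that a service writes into the header is its own; the receiver reads that value as its parent_span_id. The trace_id stays the same at every hop, which is what lets a backend reassemble the spans into one trace.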

Real-World Example

Uber built Jaeger as its distributed tracing system and later open-sourced it. The deployment reportedly handles over 1 billion spans per day, and Uber is credited with pioneering adaptive tail-based sampling, which dynamically adjusts sample rates per service based on observed error rates; this is reported to cut trace storage by 95% while retaining all error traces.
