Metrics | Observability & Operations | System Design

Metrics are numeric measurements sampled over time and stored as time series; they are the foundation of dashboards and alerting at scale. Prometheus-style systems define four core metric types: counters (monotonically increasing, e.g., request count), gauges (point-in-time value, e.g., memory used), histograms (bucketed distributions for latency percentiles), and summaries (client-side quantile calculation). Two frameworks guide which metrics to instrument: the RED method (Rate, Errors, Duration) for request-driven services, and the USE method (Utilization, Saturation, Errors) for resources such as CPUs and queues.
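To make the counter semantics concrete, here is a minimal sketch of how a per-second rate can be derived from raw counter samples. It is a simplification and not how PromQL's rate() is implemented (rate() also extrapolates to the window boundaries); the function names and sample values are hypothetical.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def counter_increase(samples: List[Sample]) -> float:
    """Total increase across the samples, treating any drop as a counter reset."""
    total = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # A drop means the process restarted and the counter began again
            # at 0, so the whole current value counts as new increase.
            total += curr
    return total


def per_second_rate(samples: List[Sample]) -> float:
    """Average per-second rate over the time span covered by the samples."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0


# Hypothetical scrapes every 15 s; the counter resets between t=30 and t=45.
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
rate = per_second_rate(samples)  # increase of 100 over 60 s
```

A gauge needs no such handling: its latest sample is already the value of interest, which is why values that can decrease belong in gauges rather than counters.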

Key Points

Counter: monotonically increasing value that resets to zero only when the process restarts; use rate() in PromQL to get the per-second rate; never use a counter for values that can decrease
Gauge: snapshot value that can go up or down; suitable for queue depth, active connections, memory usage
Histogram: pre-defined buckets (e.g., 0.005s, 0.01s, 0.025s, 0.05s, 0.1s, 0.5s, 1s); enables server-side quantile estimation with histogram_quantile(); preferred over summaries in distributed systems because bucket counts can be summed across instances
Summary: client-side quantile streaming (e.g., p50, p90, p99); precomputed quantiles cannot be meaningfully aggregated across instances, so avoid summaries for horizontally scaled services
RED method targets: Rate (requests/sec), Errors (error rate %), Duration (latency histogram); apply per service endpoint
USE method targets: Utilization (% busy), Saturation (queue/wait depth), Errors (hardware/driver errors); apply per resource (CPU, disk, network, DB connection pool)
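Cardinality explosion: each unique combination of label values creates a new time series, so the series count for a metric is bounded by the product of each label's distinct values (e.g., 20 endpoints × 5 status codes × 4 methods = up to 400 series); high-cardinality labels (user_id, request_id) can OOM Prometheus; as a rule of thumb, keep label cardinality under ~1000 per metric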
Exemplars (OpenMetrics extension) attach a trace_id to a specific histogram observation, enabling direct navigation from a p99 spike to the offending trace
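The histogram and summary points above can be illustrated with a small sketch: cumulative bucket counts from several instances are summed, and a quantile is then estimated by linear interpolation inside the bucket that contains the target rank. This mirrors the general idea behind histogram_quantile() but is a simplified approximation, not the Prometheus implementation; the bucket bounds, counts, and function names are assumptions for illustration.

```python
from typing import Dict, List

INF = float("inf")
Buckets = Dict[float, int]  # upper bound ("le") -> cumulative observation count


def merge(histograms: List[Buckets]) -> Buckets:
    """Sum cumulative bucket counts across instances (same bounds assumed)."""
    merged: Buckets = {}
    for histogram in histograms:
        for le, count in histogram.items():
            merged[le] = merged.get(le, 0) + count
    return merged


def estimate_quantile(q: float, buckets: Buckets) -> float:
    """Estimate the q-quantile by interpolating within the matching bucket."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if le == INF:
                # No upper edge to interpolate towards; fall back to the
                # highest finite bound.
                return lower_bound
            in_bucket = count - lower_count
            return lower_bound + (le - lower_bound) * (rank - lower_count) / in_bucket
        lower_bound, lower_count = le, count
    return lower_bound


# Hypothetical latency histograms (seconds) from two service instances.
instance_a = {0.1: 80, 0.5: 95, 1.0: 100, INF: 100}
instance_b = {0.1: 10, 0.5: 60, 1.0: 90, INF: 100}

merged = merge([instance_a, instance_b])
p90 = estimate_quantile(0.9, merged)  # falls in the (0.5, 1.0] bucket
```

The merge step is what a summary cannot offer: each instance would export only its own precomputed p90, and there is no sound way to combine those into a fleet-wide p90. The estimate is only as precise as the bucket boundaries allow.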

Aspect	RED Method	USE Method
Focus	Request-driven microservices	Infrastructure resources (CPU, disk, network, pools)
R / U	Rate — requests per second per endpoint	Utilization — % time resource is busy
E	Errors — % of requests returning 5xx / failures	Errors — device errors, dropped packets, driver faults
D / S	Duration — latency distribution (p50/p95/p99)	Saturation — queue depth, wait time, pending tasks
Primary Tool	Prometheus + Grafana (http_requests_total counter, http_request_duration_seconds histogram)	node_exporter, cAdvisor, cloud-native metrics
Alert Example	Error rate > 1% for 5 min on /api/checkout	CPU utilization > 80% sustained for 10 min
Blind Spot	Does not reveal the underlying cause (e.g., a resource constraint)	Does not reveal user-facing impact directly
Best Used By	On-call SRE during active incident	Capacity planners and infrastructure engineers
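The RED alert example in the table (error rate > 1% for 5 min) combines a threshold with a duration requirement, so that a brief spike does not page anyone. A minimal sketch of that evaluation logic follows, assuming one evaluation per minute over per-minute (requests, errors) counts; the function names and data are hypothetical and this is not how any particular alerting engine is implemented.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (request count, error count) for one minute


def error_ratio(requests: float, errors: float) -> float:
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return errors / requests if requests > 0 else 0.0


def alert_fires(
    windows: List[Window],
    threshold: float = 0.01,
    for_windows: int = 5,
) -> bool:
    """True once the ratio exceeds the threshold in `for_windows` consecutive windows."""
    consecutive = 0
    for requests, errors in windows:
        if error_ratio(requests, errors) > threshold:
            consecutive += 1
            if consecutive >= for_windows:
                return True
        else:
            # Any healthy window resets the pending duration.
            consecutive = 0
    return False


# Hypothetical per-minute data for /api/checkout.
brief_spike = [(1000, 5), (1000, 50), (1000, 50), (1000, 4),
               (1000, 50), (1000, 50), (1000, 3)]
sustained = [(1000, 5)] + [(1000, 20)] * 5

spike_alert = alert_fires(brief_spike)     # never 5 bad minutes in a row
sustained_alert = alert_fires(sustained)   # 2% errors for 5 minutes
```

Treating zero-traffic windows as healthy is a design choice made here for simplicity; whether an absence of requests should reset, pause, or itself trigger an alert depends on the service.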

Real-World Example

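Google SRE popularized the Four Golden Signals (latency, traffic, errors, saturation). The RED method is commonly described as a request-focused derivative of these signals, while the USE method was developed independently by Brendan Gregg for resource analysis. Google's SRE guidance recommends building service dashboards around these signals before adding custom metrics.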
