Metrics
Counters, gauges, histograms, summaries; RED method, USE method
Metrics are numeric measurements sampled over time and stored as time series; they are the foundation of dashboards and alerting at scale. Four core metric types cover most use cases: counters (monotonically increasing, e.g., request count), gauges (point-in-time values, e.g., memory used), histograms (bucketed distributions for latency percentiles), and summaries (client-side quantile calculation). Two frameworks guide which metrics to instrument: the RED method (Rate, Errors, Duration) for request-driven services, and the USE method (Utilization, Saturation, Errors) for resources such as CPUs, disks, and queues.
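To make the difference between the types concrete, here is a minimal in-memory sketch of a counter, a gauge, and a histogram. The class and method names are illustrative assumptions, not the API of any real client library, and a summary is left out because it needs a streaming quantile estimator.

```python
import bisect
import math


class Counter:
    """Running total that may only go up, e.g. requests served."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that may go up or down, e.g. memory used."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Sorts observations into buckets identified by their upper bound."""

    def __init__(self, bounds):
        # A final +Inf bucket catches everything above the largest bound.
        self.bounds = sorted(bounds) + [math.inf]
        self.counts = [0] * len(self.bounds)  # per-bucket, not cumulative
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left finds the first bound >= value (bounds are inclusive).
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, smallest bound first."""
        total, out = 0, []
        for bound, n in zip(self.bounds, self.counts):
            total += n
            out.append((bound, total))
        return out
```

The cumulative form returned by `cumulative()` mirrors how bucketed histograms are usually exposed: each bucket counts every observation at or below its upper bound.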
Key Points
- Counter: monotonically non-decreasing value that resets to zero only when the process restarts; use rate() in PromQL to get the per-second rate (it compensates for resets); never use a counter for values that can decrease
- Gauge: snapshot value that can go up or down; suitable for queue depth, active connections, memory usage
- Histogram: pre-defined buckets (e.g., 0.005s, 0.01s, 0.025s, 0.05s, 0.1s, 0.5s, 1s); enables server-side quantile calculation with histogram_quantile(); preferred over summaries in distributed systems
- Summary: client-side streaming quantiles (e.g., p50, p90, p99); the precomputed quantiles cannot be meaningfully aggregated across instances (only the _sum and _count series can), so avoid summaries for horizontally scaled services
- RED method targets: Rate (requests/sec), Errors (error rate %), Duration (latency histogram); apply per service endpoint
- USE method targets: Utilization (% busy), Saturation (queue/wait depth), Errors (hardware/driver errors); apply per resource (CPU, disk, network, DB connection pool)
- Cardinality explosion: each unique combination of label values creates a new time series; unbounded labels (user_id, request_id) can exhaust Prometheus memory; as a rough rule of thumb, keep the number of series per metric under ~1000
- Exemplars (OpenMetrics extension) attach a trace_id to a specific histogram observation, enabling direct navigation from a p99 spike to the offending trace
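The counter, histogram, and cardinality points above can be sketched in plain Python. This is a simplified illustration, not the Prometheus implementation: `counter_rate` skips the window-boundary extrapolation that PromQL's rate() performs, `histogram_quantile` assumes the lowest bucket starts at 0, and all function names here are hypothetical.

```python
import math


def counter_rate(samples):
    """Per-second increase over a list of (timestamp_s, value) samples.

    A drop in value is treated as a counter reset (process restart),
    so the post-reset value counts as the increase for that step.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Interpolates linearly inside the bucket that contains the target rank.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            # No interpolation is possible into +Inf or an empty bucket,
            # so fall back to the last finite bound reached.
            if math.isinf(upper_bound) or in_bucket == 0:
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


def series_count(label_values):
    """Worst-case series for one metric: product of distinct values per label."""
    return math.prod(len(values) for values in label_values.values())
```

Because the quantile is interpolated within a bucket, its accuracy depends on how well the bucket boundaries bracket the latencies of interest. The `series_count` product shows why a single unbounded label dominates the total: adding a label with 10,000 distinct values multiplies the series count by 10,000.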
| Aspect | RED Method | USE Method |
|---|---|---|
| Focus | Request-driven microservices | Infrastructure resources (CPU, disk, network, pools) |
| R / U | Rate — requests per second per endpoint | Utilization — % time resource is busy |
| E | Errors — % of requests returning 5xx / failures | Errors — device errors, dropped packets, driver faults |
| D / S | Duration — latency distribution (p50/p95/p99) | Saturation — queue depth, wait time, pending tasks |
| Primary Tool | Prometheus + Grafana (http_requests_total counter, request-duration histogram) | node_exporter, cAdvisor, cloud-native metrics |
| Alert Example | Error rate > 1% for 5 min on /api/checkout | CPU utilization > 80% sustained for 10 min |
| Blind Spot | Does not reveal why (resource constraint) | Does not reveal user-facing impact directly |
| Best Used By | On-call SRE during active incident | Capacity planners and infrastructure engineers |
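As a rough illustration of the two columns, the sketch below derives RED values per endpoint from raw request records and a USE snapshot for a single resource. The `Request` record, the function names, and the choice to count only 5xx responses as errors are assumptions made for this example; in practice these values would come from a metrics backend rather than from in-memory lists.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Request:
    endpoint: str
    status: int        # HTTP status code
    duration_s: float  # latency in seconds


def nearest_rank(sorted_values, percent):
    """Nearest-rank percentile (integer percent 1..100) of a sorted, non-empty list."""
    rank = (percent * len(sorted_values) + 99) // 100  # integer ceiling
    return sorted_values[max(rank, 1) - 1]


def red_by_endpoint(requests, window_s):
    """Rate, error ratio, and p95 duration per endpoint for one time window."""
    grouped = defaultdict(list)
    for req in requests:
        grouped[req.endpoint].append(req)
    summary = {}
    for endpoint, reqs in grouped.items():
        errors = sum(1 for r in reqs if r.status >= 500)
        durations = sorted(r.duration_s for r in reqs)
        summary[endpoint] = {
            "rate": len(reqs) / window_s,        # requests per second
            "error_ratio": errors / len(reqs),   # fraction of 5xx responses
            "p95_s": nearest_rank(durations, 95),
        }
    return summary


def use_snapshot(busy_s, window_s, queue_depth, error_count):
    """Utilization, saturation, and errors for one resource over one window."""
    return {
        "utilization": busy_s / window_s,  # fraction of time busy
        "saturation": queue_depth,         # work waiting for the resource
        "errors": error_count,
    }


if __name__ == "__main__":
    reqs = [Request("/api/checkout", 500 if i < 2 else 200, 0.01 * (i + 1))
            for i in range(100)]
    stats = red_by_endpoint(reqs, window_s=50)["/api/checkout"]
    # Thresholds taken from the alert examples in the table above.
    print("error alert:", stats["error_ratio"] > 0.01)
    print("cpu alert:", use_snapshot(50.0, 60.0, 4, 0)["utilization"] > 0.80)
```

A real alert would also require the condition to hold for the stated duration (5 or 10 minutes) instead of firing on a single window.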
Real-World Example
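Google's Site Reliability Engineering book popularized the Four Golden Signals (latency, traffic, errors, saturation) as the baseline for monitoring a user-facing service. The RED method, coined by Tom Wilkie, covers a closely related set of signals for request-driven services, while the USE method was introduced separately by Brendan Gregg for resource performance analysis. A common practice is to build a service dashboard around these signals first and add custom metrics afterward.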