Dashboard Design

Effective observability dashboards are purpose-built for their audience: an on-call engineer needs actionable signal during a 2am incident; an executive needs SLO trend and error budget remaining; a developer needs service-level throughput, latency, and error breakdown by endpoint. The golden signals framework (latency, traffic, errors, saturation) provides the universal starting point for any service dashboard. Every dashboard panel should link to the relevant runbook, and every alerting rule should link to the dashboard that provides context. Dashboard-as-code (Grafonnet, Jsonnet, Terraform Grafana provider) ensures dashboards are version-controlled and reproducible.
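As a rough illustration of the dashboard-as-code idea, the sketch below builds a golden-signals dashboard as a plain Python dictionary and prints it as JSON. The field names follow the general shape of Grafana's dashboard JSON model, but they should be checked against the Grafana version in use, and a real dashboard may need further fields. The metric names, label names, SLO thresholds, service names, and runbook URL are all hypothetical placeholders.

```python
import json

# Hypothetical runbook location; replace with the real one for your service.
RUNBOOK_URL = "https://runbooks.example.com/checkout-api"

# Label selector that uses dashboard template variables ($service, $env).
SEL = '{service="$service",env="$env"}'
SEL_5XX = '{service="$service",env="$env",status=~"5.."}'


def panel(panel_id, title, expr, unit, slo_threshold, x):
    """Build one time-series panel with a runbook link and an optional SLO line."""
    steps = [{"color": "green", "value": None}]
    if slo_threshold is not None:
        # Values at or above the SLO boundary are drawn in red.
        steps.append({"color": "red", "value": slo_threshold})
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        # The description carries the runbook link for on-call engineers.
        "description": f"Runbook: {RUNBOOK_URL}",
        "gridPos": {"h": 8, "w": 6, "x": x, "y": 0},
        "targets": [{"refId": "A", "expr": expr}],
        "fieldConfig": {
            "defaults": {
                "unit": unit,
                "thresholds": {"mode": "absolute", "steps": steps},
            }
        },
    }


def golden_signals_dashboard():
    """Return a dashboard dict whose top row shows the four golden signals."""
    # (title, query, unit, hypothetical SLO threshold) for each signal.
    specs = [
        ("Traffic (req/s)",
         f"sum(rate(http_requests_total{SEL}[5m]))", "reqps", None),
        ("HTTP Error Rate (%)",
         f"100 * sum(rate(http_requests_total{SEL_5XX}[5m]))"
         f" / sum(rate(http_requests_total{SEL}[5m]))", "percent", 1),
        ("p99 Latency (ms)",
         "1000 * histogram_quantile(0.99, sum by (le) "
         f"(rate(http_request_duration_seconds_bucket{SEL}[5m])))", "ms", 300),
        ("CPU Saturation (%)",
         f"100 * max(cpu_utilization_ratio{SEL})", "percent", 80),
    ]
    panels = [
        panel(i + 1, title, expr, unit, slo, x=i * 6)
        for i, (title, expr, unit, slo) in enumerate(specs)
    ]
    return {
        "uid": "golden-signals",
        "title": "Service Golden Signals",
        # On-call default: last hour.
        "time": {"from": "now-1h", "to": "now"},
        "templating": {
            "list": [
                {"name": "env", "type": "custom", "query": "prod,staging"},
                {"name": "service", "type": "custom",
                 "query": "checkout-api,search-api"},
            ]
        },
        "panels": panels,
    }


if __name__ == "__main__":
    print(json.dumps(golden_signals_dashboard(), indent=2))
```

Because the output is ordinary JSON generated by a script, it can be committed to Git, reviewed in a pull request, and applied by a CI/CD job. Grafonnet or the Terraform provider would play the same role with their own syntax.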

Key Points

Golden signals layout: top row = traffic (req/s) + error rate (%) + p99 latency (ms) + saturation (%); each panel includes a threshold line at the SLO boundary and a "worst in period" stat
Audience-aware design: executive dashboard = SLO status, error budget remaining, uptime trend (monthly view); on-call dashboard = real-time golden signals, active alerts, recent deploys timeline
Runbook links: embed a link to the relevant runbook in every dashboard panel description; on-call engineers should never have to search for next steps during an incident
Time range defaults: on-call dashboards default to Last 1 hour with 1-minute resolution; capacity planning dashboards default to Last 30 days; always expose a time range selector
Template variables: parameterize dashboards by environment (prod/staging), cluster, region, and service to avoid dashboard sprawl; one dashboard per service type, not per instance
Avoid dashboard decay: assign an owner to every dashboard; review quarterly; retire dashboards that have gone unused for 60 days; Grafana usage insights (available in Grafana Cloud and Grafana Enterprise) report per-dashboard views, which can be compared against incident windows to see which dashboards are actually opened
Dashboard-as-code: generate dashboards from code with Grafonnet (a Jsonnet library) or the Grafana Terraform provider; store the definitions in Git, review them in PRs, and deploy them via CI/CD alongside the service
Panel naming conventions for USE (utilization, saturation, errors) and RED (rate, errors, duration) dashboards: label each panel with the signal it represents ("HTTP Error Rate %", not a raw metric name); include units in panel titles (ms, req/s, %); use consistent color coding (green = good, red = bad)
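The 60-day retirement rule from the key points above can be checked mechanically once last-viewed timestamps are available. The following is a minimal sketch that assumes those timestamps have already been exported from whatever usage data is on hand. The `stale_dashboards` function, the dashboard titles, and the dates are hypothetical and only illustrate the rule.

```python
from datetime import datetime, timedelta, timezone


def stale_dashboards(last_viewed, now, max_idle_days=60):
    """Return sorted titles of dashboards not viewed within max_idle_days.

    last_viewed maps a dashboard title to its last-viewed datetime,
    or to None if the dashboard has never been opened.
    """
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(
        title
        for title, viewed_at in last_viewed.items()
        if viewed_at is None or viewed_at < cutoff
    )


# Hypothetical usage data for three dashboards.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
last_viewed = {
    "checkout-api golden signals": datetime(2024, 5, 30, tzinfo=timezone.utc),
    "legacy batch jobs": datetime(2024, 2, 10, tzinfo=timezone.utc),
    "experiment scratchpad": None,  # never opened
}

print(stale_dashboards(last_viewed, now))
# prints ['experiment scratchpad', 'legacy batch jobs']
```

A quarterly review could start from this list and ask each owner either to justify the dashboard or to retire it.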

Real-World Example

Grafana Labs published the "Grafana Dashboard Best Practices" guide based on analysis of thousands of community dashboards; its key finding was that 78% of on-call engineers open fewer than 3 dashboards during an incident, and the ones they open are always the ones linked directly from the alert.
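If the dashboards that get opened are the ones linked from the alert, it may be worth checking in CI that every alert rule actually carries those links. The sketch below assumes alert rules have been loaded into dictionaries with an `alert` name and an optional `annotations` mapping. The annotation keys `dashboard_url` and `runbook_url` are a naming convention assumed for illustration, not a requirement of any particular alerting tool, and the rule names and URLs are hypothetical.

```python
# Annotation keys every alert rule is expected to carry (assumed convention).
REQUIRED_ANNOTATIONS = ("dashboard_url", "runbook_url")


def missing_links(rules):
    """Map each alert name to the required annotations it lacks."""
    problems = {}
    for rule in rules:
        annotations = rule.get("annotations", {})
        # An absent key and an empty value both count as missing.
        missing = [key for key in REQUIRED_ANNOTATIONS if not annotations.get(key)]
        if missing:
            problems[rule["alert"]] = missing
    return problems


# Hypothetical rules: one complete, one without a dashboard link, one with no links.
rules = [
    {
        "alert": "HighErrorRate",
        "annotations": {
            "dashboard_url": "https://grafana.example.com/d/golden-signals",
            "runbook_url": "https://runbooks.example.com/checkout-api",
        },
    },
    {
        "alert": "HighLatencyP99",
        "annotations": {"runbook_url": "https://runbooks.example.com/checkout-api"},
    },
    {"alert": "DiskAlmostFull"},
]

print(missing_links(rules))
# prints {'HighLatencyP99': ['dashboard_url'], 'DiskAlmostFull': ['dashboard_url', 'runbook_url']}
```

Failing the build when the returned mapping is non-empty keeps the alert-to-dashboard path intact as rules are added or renamed.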
