Prometheus & Grafana | Observability & Operations | System Design

Prometheus is a pull-based monitoring system and time-series database that scrapes metrics from HTTP endpoints (/metrics) at a configurable interval (1 minute by default, though 15 seconds is a common setting) and keeps data on local disk for 15 days by default. Its query language, PromQL, enables ad-hoc computation of rates, ratios, and quantiles without pre-aggregation. Recording rules pre-compute expensive PromQL expressions and store the result as new time series, reducing query latency for dashboards. Grafana connects to Prometheus (and 50+ other data sources) to render panels, while Alertmanager handles deduplication, grouping, silencing, and routing of alerts to PagerDuty, Slack, or OpsGenie.
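To make the scrape model concrete, the following Python sketch renders a counter in the Prometheus text exposition format, which is what a /metrics endpoint returns on each scrape. The helper names (escape_label_value, render_counter) and the sample metric are illustrative assumptions; a real service would normally use an official client library rather than hand-rolled formatting.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline, as the text format requires."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_counter(name, help_text, samples):
    """Render one counter family as exposition-format text.

    samples is a list of (labels_dict, value) pairs; both the function and
    its argument shape are hypothetical, chosen only for this illustration.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        series = f"{name}{{{label_str}}}" if label_str else name
        lines.append(f"{series} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(
        render_counter(
            "http_requests_total",
            "Total HTTP requests.",
            [
                ({"method": "GET", "status": "200"}, 1027),
                ({"method": "GET", "status": "500"}, 3),
            ],
        ),
        end="",
    )
```

Prometheus stores each distinct label combination as its own time series, which is why label choices drive cardinality.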

Key Points

Scrape model: Prometheus pulls from /metrics endpoints; target discovery via static config, Kubernetes SD (or ServiceMonitor CRDs when the Prometheus Operator is used, as in kube-prometheus-stack), Consul, or EC2
PromQL functions: rate() for per-second counter growth over a range, increase() for total increase, histogram_quantile() for percentiles, topk() for top-N series, label_replace() for rewriting labels at query time
Recording rules cut query time and the number of series each dashboard query has to touch: pre-compute job:http_requests:rate5m once per rule evaluation interval, then query it cheaply in all panels
Alerting rules in Prometheus evaluate PromQL expressions at each rule evaluation interval (not on each scrape); an alert fires only after its condition has held for the "for" duration (which avoids flapping), then routes through Alertmanager
Remote write: Prometheus can stream samples to Cortex, Thanos, Mimir, or VictoriaMetrics for long-term retention and global query across clusters
Thanos / Grafana Mimir enable horizontally scalable, multi-cluster Prometheus with object-storage-backed long-term retention (years at low cost)
Grafana best practices: use template variables for environment/cluster/service, avoid queries returning >10k time series per panel, set panel refresh to match data volatility
Operator-based deployment: kube-prometheus-stack Helm chart deploys Prometheus Operator, Grafana, node-exporter, kube-state-metrics, and pre-built dashboards in one command
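The "for" clause in the alerting bullet above can be pictured as a small state machine: an alert is pending while its expression has been true for less than the "for" duration, firing once it has held long enough, and inactive again as soon as one evaluation comes back false. The Python sketch below is a simplified model under those assumptions; the function name alert_states and the tuple layout are hypothetical, and it does not model Alertmanager or options such as keep_firing_for.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations: list of (timestamp_seconds, condition_is_true) pairs in
    time order, one pair per rule evaluation (hypothetical input shape).
    for_seconds: the rule's "for" duration in seconds.
    """
    states = []
    active_since = None  # timestamp at which the condition first became true
    for timestamp, condition in evaluations:
        if not condition:
            # A single false evaluation resets the alert.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # Evaluations every 60 s with a 2-minute "for" duration.
    evals = [(0, True), (60, True), (120, False),
             (180, True), (240, True), (300, True)]
    for (ts, _), state in zip(evals, alert_states(evals, 120)):
        print(ts, state)
```

In this model the brief recovery at t=120 restarts the clock, so the alert only fires at t=300. That is the flapping protection the bullet refers to.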

PromQL: error ratio for SLO burn-rate alerting, p99 latency from a histogram, and a recording rule to pre-aggregate request rate by job

# HTTP error ratio over 5 minutes (the input to an SLO burn-rate alert)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p99 latency across all series (sum by le before taking the quantile)
histogram_quantile(0.99,
  sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Recording rule for efficiency (an entry under groups[].rules in a rule file)
- record: job:http_requests:rate5m
  expr: sum by (job) (rate(http_requests_total[5m]))
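As a rough illustration of what histogram_quantile() does with the bucket values in the p99 query above, the Python sketch below finds the bucket that contains the target rank and interpolates linearly inside it. It is a simplified model that assumes classic cumulative buckets with a +Inf bucket and 0 < q <= 1; the function name and input shape are assumptions for illustration, not Prometheus's actual implementation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, including a
    final bucket whose upper bound is math.inf (hypothetical input shape).
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    buckets = sorted(buckets)
    if not buckets or not math.isinf(buckets[-1][0]):
        return math.nan  # a +Inf bucket is required
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for index, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if index == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0] if len(buckets) > 1 else math.nan
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


if __name__ == "__main__":
    latency_buckets = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print(histogram_quantile(0.95, latency_buckets))
```

Because the estimate is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.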

Real-World Example

The Prometheus Operator was originally developed at CoreOS, and the kube-prometheus-stack Helm chart is now maintained by the Prometheus community; Spotify reportedly runs Thanos-backed Prometheus at petabyte scale, querying metrics across 1,000+ Kubernetes clusters through a single Grafana instance.
