Observability & Operations
Logging, metrics, tracing, incident management, and chaos engineering
Three Pillars: Logs, metrics, distributed traces; correlation IDs across pillars
Structured Logging: JSON logs, log levels, contextual fields, PII redaction
Log Aggregation: ELK Stack, Loki + Grafana, CloudWatch Logs, Datadog
Metrics: Counters, gauges, histograms, summaries; RED method, USE method
Prometheus & Grafana: Scrape model, PromQL, recording rules, alerting rules, dashboard design
Cloud-Native Monitoring: CloudWatch, Azure Monitor, Google Cloud Operations Suite
Distributed Tracing: OpenTelemetry, Jaeger, Zipkin, AWS X-Ray, Datadog APM; sampling strategies
SLI / SLO / SLA: Defining meaningful SLIs, setting realistic SLOs, error budget burn rate alerts
Alerting Strategy: SLO-based alerting, alert fatigue reduction, runbooks, on-call rotation
Health Checks & Synthetic: Liveness vs readiness probes, uptime monitoring, canary transactions
Incident Management: On-call tooling (PagerDuty, OpsGenie), severity classification, war rooms
Blameless Postmortems: Timeline reconstruction, contributing factors, action items, cultural safety
Chaos Engineering: Fault injection, Chaos Monkey, GameDays, blast radius control
FinOps & Cost Observability: Cost allocation tags, anomaly detection, rightsizing, budget alerts
APM: End-to-end transaction tracing, dependency maps, error tracking
Dashboard Design: Golden signals, runbook links, audience-aware views (on-call vs exec vs dev)
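
The "error budget burn rate alerts" item under SLI / SLO / SLA comes down to a small piece of arithmetic, sketched here in Python. This is a minimal illustration, not a prescribed implementation: the function names burn_rate and should_page are hypothetical, and the default threshold of 14.4 is an assumption based on the common convention of paging when about 2% of a 30-day budget is consumed within one hour (0.02 * 720 hours = 14.4). Requiring both a long and a short window to exceed the threshold is one way to reduce alert fatigue, since the alert clears soon after the errors stop.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are occurring.

    error_ratio: observed fraction of bad events in the window (e.g. 0.01 = 1%).
    slo_target:  SLO as a fraction (e.g. 0.999 for 99.9%).
    A result of 1.0 means the budget would be exhausted exactly at the end
    of the SLO period; higher values exhaust it sooner.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    error_budget = 1.0 - slo_target  # allowed fraction of bad events
    return error_ratio / error_budget


def should_page(long_window_ratio: float,
                short_window_ratio: float,
                slo_target: float,
                threshold: float = 14.4) -> bool:
    """Multi-window check: page only if BOTH windows burn faster than threshold.

    The 14.4 default is an assumed value (2% of a 30-day budget in 1 hour).
    """
    return (burn_rate(long_window_ratio, slo_target) >= threshold
            and burn_rate(short_window_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    # 1% errors against a 99.9% SLO spends budget roughly 10x too fast.
    print(round(burn_rate(0.01, 0.999), 2))
    # Sustained 2% errors in both windows would page under these assumptions.
    print(should_page(0.02, 0.02, 0.999))
```

In a Prometheus setup the two error ratios would typically come from recording rules over different time ranges; the window lengths and threshold should be tuned to the service's own SLO period.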