Health Checks & Synthetic | Observability & Operations | System Design

Health checks enable orchestration platforms to manage service lifecycle intelligently, distinguishing between a process that is alive, one that is ready to serve traffic, and one that is still initializing. Kubernetes defines three probe types: liveness (restart if dead), readiness (remove from load balancer if not ready), and startup (disable liveness/readiness during slow initialization). Beyond Kubernetes, synthetic monitoring (uptime monitoring via Pingdom, Checkly, or CloudWatch Synthetics) tests actual user flows from external locations, catching DNS, CDN, and certificate issues that internal probes miss. Canary transactions continuously execute critical business operations (e.g., a test checkout) to validate end-to-end functionality.

Key Points

Liveness probe failure causes Kubernetes to kill and restart the container; use it to detect deadlocks and infinite loops, NOT dependency failures
Readiness probe failure removes the pod from Service endpoints (load balancer); use when a dependency (DB, cache) is unavailable and the pod cannot serve traffic
Startup probe holds off liveness and readiness checks until the startup probe itself succeeds; if it never succeeds within its failure budget, the container is restarted; essential for JVM-based services with 30–90 second startup times
Health endpoint (/health or /healthz) should return 200 OK with a JSON body listing dependency states; when a critical dependency is down, the endpoint used for readiness should return a non-2xx status such as 503, never 200
Shallow vs deep health checks: shallow = process alive; deep = database connected, cache reachable, message bus available; expose both at /health/live and /health/ready
Uptime monitoring: external agents (Pingdom, Checkly, UptimeRobot) test from multiple global regions every 30–60 seconds; catches regional DNS and CDN outages
Canary transactions: synthetic scripts that perform real (but low-impact) business operations on production at 1–5 minute intervals; the gold standard for user-journey availability
Health check timeouts: set the probe timeout shorter than the probe period; with a 5-second timeout, a 10-second period, and a failure threshold of 3, a restart is triggered after roughly 3 × 10 = 30 seconds of consecutive failures
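The shallow/deep split described above can be sketched with Python's standard library. This is a minimal illustration, not a production implementation: the dependency checks (`check_database`, `check_cache`) are hypothetical stubs, and the port and JSON field names are assumptions rather than anything the text prescribes.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple


# Hypothetical dependency checks; real ones would ping the DB, cache, etc.
def check_database() -> bool:
    return True


def check_cache() -> bool:
    return True


DEPENDENCY_CHECKS: Dict[str, Callable[[], bool]] = {
    "database": check_database,
    "cache": check_cache,
}


def liveness() -> Tuple[int, dict]:
    """Shallow check: the process can answer a request, nothing more."""
    return 200, {"status": "alive"}


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, dict]:
    """Deep check: every critical dependency must be reachable."""
    states = {}
    for name, check in checks.items():
        try:
            states[name] = "up" if check() else "down"
        except Exception:
            # A check that raises is treated the same as a failed check.
            states[name] = "down"
    ready = all(state == "up" for state in states.values())
    body = {"status": "ready" if ready else "not_ready", "dependencies": states}
    return (200 if ready else 503), body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/health/live":
            code, body = liveness()
        elif self.path == "/health/ready":
            code, body = readiness(DEPENDENCY_CHECKS)
        else:
            code, body = 404, {"error": "not found"}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(code)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port: int = 8080) -> None:
    """Start the health server; the port number is an arbitrary choice."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` exposes both endpoints. The liveness handler deliberately ignores `DEPENDENCY_CHECKS`, so a database outage takes the pod out of rotation through the readiness path without triggering restarts.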

Probe Type	Purpose	Failure Action	Typical Check	Common Mistake
Liveness	Is the process alive and not deadlocked?	Kubernetes restarts the container	GET /healthz returns 200; simple in-process check	Including DB connectivity: causes restart cascades when DB is down
Readiness	Can the pod safely serve production traffic?	Pod removed from Service endpoints (no restarts)	GET /ready; verifies DB, cache, and feature-flag service are reachable	Omitting the readiness probe: pods that are not yet ready receive traffic during startup
Startup	Has the application finished its initialization sequence?	Liveness/readiness disabled until startup succeeds; container restarted if the startup failure budget is exhausted	GET /healthz; polled every 5s for up to 5 minutes (failureThreshold × periodSeconds = 60 × 5s)	Not using startup probe for slow JVM apps: liveness probe kills them before init completes
External Uptime	Is the service reachable from external networks (user perspective)?	PagerDuty alert; status page incident created	HTTPS GET to public endpoint from 5 global PoPs every 60s	Testing only internal endpoints: misses CDN, DNS, and TLS certificate expiry failures
Canary Transaction	Is the full business flow working end-to-end?	P1/P2 alert; error budget consumed	Automated checkout flow with test card every 5 minutes; payment confirmation verified	Running against staging only: production canaries catch config and data-specific failures
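The probe timing arithmetic used above (failureThreshold × periodSeconds) can be made explicit with a small helper. This is a sketch under stated assumptions: the `ProbeConfig` class is hypothetical, its fields only mirror the Kubernetes probe fields `periodSeconds`, `timeoutSeconds`, and `failureThreshold`, and the 1-second startup timeout is an assumed value. The result is an approximation, since the final timeout and scheduling jitter can add a few seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeConfig:
    """Hypothetical container for the three timing fields of a probe."""

    period_seconds: int
    timeout_seconds: int
    failure_threshold: int

    def validate(self) -> None:
        # Guideline from the text: keep the timeout shorter than the period.
        if self.timeout_seconds >= self.period_seconds:
            raise ValueError("timeout should be shorter than the period")

    def failure_window_seconds(self) -> int:
        """Approximate span of consecutive failures before the action fires."""
        return self.failure_threshold * self.period_seconds


# Liveness example from the key points: 3 failures x 10 s period.
liveness_probe = ProbeConfig(period_seconds=10, timeout_seconds=5, failure_threshold=3)

# Startup example from the table: 60 failures x 5 s period.
startup_probe = ProbeConfig(period_seconds=5, timeout_seconds=1, failure_threshold=60)

for probe in (liveness_probe, startup_probe):
    probe.validate()

print(liveness_probe.failure_window_seconds())  # 30
print(startup_probe.failure_window_seconds())   # 300
```

The startup window (300 seconds, or 5 minutes) is the budget a slow service has to finish initializing before it is restarted, so it should comfortably exceed the worst observed startup time.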

Real-World Example

Amazon uses synthetic canary transactions (CloudWatch Synthetics) for every customer-facing API; each canary runs a scripted Selenium/Puppeteer workflow against production every minute and is the primary mechanism for detecting regressions that unit tests miss.
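For illustration only, a much simpler external check than a scripted browser workflow can be sketched in plain Python. This is not CloudWatch Synthetics and not a description of Amazon's tooling; the function names, the latency budget, and the three-consecutive-failures alert rule are all assumptions chosen for the example.

```python
import http.client
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CheckResult:
    ok: bool
    status: Optional[int]
    latency_ms: float
    error: Optional[str] = None


def run_check(url: str, timeout_s: float = 5.0,
              max_latency_ms: float = 2000.0) -> CheckResult:
    """Fetch the URL once and judge it on status code and latency."""
    started = time.monotonic()
    status: Optional[int] = None
    error: Optional[str] = None
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            status = response.status
            response.read()
    except urllib.error.HTTPError as exc:
        # urlopen raises HTTPError for 4xx and 5xx responses.
        status, error = exc.code, f"HTTP {exc.code}"
    except (OSError, http.client.HTTPException) as exc:
        # Covers DNS failures, refused connections, TLS errors, and timeouts.
        error = str(exc)
    latency_ms = (time.monotonic() - started) * 1000.0
    if error is None and latency_ms > max_latency_ms:
        error = f"slow response: {latency_ms:.0f} ms"
    return CheckResult(ok=error is None, status=status,
                       latency_ms=latency_ms, error=error)


def should_alert(history: List[CheckResult], threshold: int = 3) -> bool:
    """Alert only when the last `threshold` checks all failed."""
    recent = history[-threshold:]
    return len(recent) == threshold and not any(r.ok for r in recent)
```

A scheduler would call `run_check("https://example.com/")` (a placeholder URL) at a fixed interval from several regions, append each result to a history list, and page when `should_alert` returns true. Requiring consecutive failures trades a slower alert for fewer false positives from transient network blips.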
