Health Checks & Synthetic Monitoring
Liveness vs readiness probes, uptime monitoring, canary transactions
Health checks enable orchestration platforms to manage service lifecycle intelligently, distinguishing between a process that is alive, one that is ready to serve traffic, and one that is still initializing. Kubernetes defines three probe types: liveness (restart if dead), readiness (remove from load balancer if not ready), and startup (disable liveness/readiness during slow initialization). Beyond Kubernetes, synthetic monitoring (uptime monitoring via Pingdom, Checkly, or CloudWatch Synthetics) tests actual user flows from external locations, catching DNS, CDN, and certificate issues that internal probes miss. Canary transactions continuously execute critical business operations (e.g., a test checkout) to validate end-to-end functionality.
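The alive-versus-ready distinction can be sketched as two separate endpoints. This is a minimal illustration using only the Python standard library; the paths `/health/live` and `/health/ready`, the `make_handler` helper, and the shape of the dependency checks are assumptions for the example, not a prescribed layout.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Callable, Dict, Tuple

# Hypothetical dependency checks: each callable returns True when the
# dependency (DB, cache, ...) is reachable. Real checks would use short timeouts.
DependencyChecks = Dict[str, Callable[[], bool]]


def liveness() -> Tuple[int, dict]:
    """Shallow check: if this code runs, the process is alive. No dependencies."""
    return 200, {"status": "alive"}


def readiness(checks: DependencyChecks) -> Tuple[int, dict]:
    """Deep check: 200 only when every critical dependency responds."""
    states = {}
    for name, check in checks.items():
        try:
            states[name] = "up" if check() else "down"
        except Exception:
            # A check that raises is treated the same as a failed check.
            states[name] = "down"
    ok = all(state == "up" for state in states.values())
    body = {"status": "ready" if ok else "not_ready", "dependencies": states}
    return (200 if ok else 503), body


def make_handler(checks: DependencyChecks):
    """Build a request handler class bound to the given dependency checks."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/health/live":
                code, body = liveness()
            elif self.path == "/health/ready":
                code, body = readiness(checks)
            else:
                code, body = 404, {"error": "not found"}
            payload = json.dumps(body).encode()
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

    return HealthHandler
```

To serve it, pass `make_handler(checks)` to `http.server.ThreadingHTTPServer(("0.0.0.0", 8080), ...)` and call `serve_forever()`; the port is an arbitrary choice. Keeping `liveness()` free of dependency calls means a database outage flips readiness to 503 without triggering restarts.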
Key Points
- Liveness probe failure causes Kubernetes to kill and restart the container; use for deadlock detection, infinite loop detection — NOT for dependency failures
- Readiness probe failure removes the pod from Service endpoints (load balancer); use when a dependency (DB, cache) is unavailable and the pod cannot serve traffic
- Startup probe disables liveness and readiness checks until the application signals it has started; essential for JVM-based services with 30–90 second startup times
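- Health endpoint (/health or /healthz) should return 200 OK with a JSON body listing dependency states; the readiness endpoint should return a failing status (typically 503) when a critical dependency is down, while the liveness endpoint keeps returning 200 so the pod is not restarted for a dependency failure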
- Shallow vs deep health checks: shallow = process alive; deep = database connected, cache reachable, message bus available; expose both at /health/live and /health/ready
- Uptime monitoring: external agents (Pingdom, Checkly, UptimeRobot) test from multiple global regions every 30–60 seconds; catches regional DNS and CDN outages
- Canary transactions: synthetic scripts that perform real (but low-impact) business operations on production at 1–5 minute intervals; the gold standard for user-journey availability
- Health check timeouts: set the probe timeout shorter than the probe period; a 5-second timeout with a 10-second period and a failure threshold of 3 means roughly 30 seconds (3 × 10 s) of consecutive failures before a restart is triggered
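The timing arithmetic from the last bullet can be checked with a few lines of code. This is a rough sketch: the `ProbeTiming` class is hypothetical, its field names simply mirror the Kubernetes settings `periodSeconds`, `timeoutSeconds`, and `failureThreshold`, and the 2-second startup timeout is an assumed value. The result is the approximate window `failureThreshold × periodSeconds`, not an exact bound.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTiming:
    """Hypothetical helper; fields mirror periodSeconds, timeoutSeconds, failureThreshold."""

    period_seconds: int
    timeout_seconds: int
    failure_threshold: int

    def __post_init__(self) -> None:
        # Enforce the rule of thumb: the timeout must be shorter than the period.
        if self.timeout_seconds >= self.period_seconds:
            raise ValueError("timeout should be shorter than the period")

    @property
    def failure_window_seconds(self) -> int:
        """Approximate time of consecutive failures before the failure action fires."""
        return self.period_seconds * self.failure_threshold


# Liveness example from the bullet: 5 s timeout, 10 s period, threshold 3.
liveness_timing = ProbeTiming(period_seconds=10, timeout_seconds=5, failure_threshold=3)
# Startup budget: polling every 5 s with threshold 60 (timeout value is assumed).
startup_timing = ProbeTiming(period_seconds=5, timeout_seconds=2, failure_threshold=60)

print(liveness_timing.failure_window_seconds)  # 30
print(startup_timing.failure_window_seconds)   # 300
```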
| Probe Type | Purpose | Failure Action | Typical Check | Common Mistake |
|---|---|---|---|---|
| Liveness | Is the process alive and not deadlocked? | Kubernetes restarts the container | GET /healthz returns 200; simple in-process check | Including DB connectivity — causes restart cascades when DB is down |
| Readiness | Can the pod safely serve production traffic? | Pod removed from Service endpoints (no restarts) | GET /ready; verifies DB, cache, and feature-flag service are reachable | Forgetting to add readiness probe — unhealthy pods receive traffic during startup |
| Startup | Has the application finished its initialization sequence? | Liveness/readiness held off until startup succeeds; container restarted if failureThreshold is exhausted | GET /healthz; polled every 5s for up to 5 minutes (failureThreshold 60 × periodSeconds 5 = 300s) | Not using a startup probe for slow JVM apps; the liveness probe kills them before init completes |
| External Uptime | Is the service reachable from external networks (user perspective)? | PagerDuty alert; status page incident created | HTTPS GET to public endpoint from 5 global PoPs every 60s | Testing only internal endpoints — misses CDN, DNS, and TLS certificate expiry failures |
| Canary Transaction | Is the full business flow working end-to-end? | P1/P2 alert; error budget consumed | Automated checkout flow with test card every 5 minutes; payment confirmation verified | Running against staging only — production canaries catch config and data-specific failures |
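The TLS certificate expiry failure mentioned in the External Uptime row is one an external check can catch early. The sketch below uses only the Python standard library; the function names and the 14-day warning threshold are assumptions for illustration, and a real monitor would run this from several regions alongside an HTTPS GET.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect over TLS and return the certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` and the certificate's notAfter timestamp."""
    # ssl.cert_time_to_seconds parses strings like 'Jan 11 00:00:00 2030 GMT'.
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    return (expires_epoch - now.timestamp()) / 86400


def should_alert(not_after: str, now: datetime, warn_days: int = 14) -> bool:
    """True when the certificate expires within the warning window."""
    return days_until_expiry(not_after, now) < warn_days
```

A scheduled job could call `should_alert(fetch_cert_not_after("example.com"), datetime.now(timezone.utc))` and page when it returns `True`; `example.com` is a placeholder host.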
Real-World Example
Amazon uses synthetic canary transactions (CloudWatch Synthetics) for every customer-facing API; each canary runs a scripted Selenium/Puppeteer workflow against production every minute and is the primary mechanism for detecting regressions that unit tests miss.
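A scripted canary of this kind can be reduced to an ordered list of steps that stops at the first failure. The sketch below is an illustration only, not Amazon's or CloudWatch Synthetics' implementation; `run_canary`, `CanaryResult`, and the step names are hypothetical, and each step is assumed to raise an exception when it fails.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A step is a (name, action) pair; the action raises on failure.
Step = Tuple[str, Callable[[], None]]


@dataclass
class CanaryResult:
    ok: bool
    failed_step: Optional[str]
    error: Optional[str]
    duration_seconds: float


def run_canary(steps: List[Step]) -> CanaryResult:
    """Run steps in order; stop at the first failure and report which step broke."""
    start = time.monotonic()
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            return CanaryResult(False, name, str(exc), time.monotonic() - start)
    return CanaryResult(True, None, None, time.monotonic() - start)


def _declined() -> None:
    # Stand-in for a payment step that fails in this demo run.
    raise RuntimeError("test card declined")


# Hypothetical checkout flow; real steps would drive a browser or call APIs.
demo_steps: List[Step] = [
    ("add_to_cart", lambda: None),
    ("pay_with_test_card", _declined),
    ("verify_confirmation", lambda: None),
]

result = run_canary(demo_steps)
print(result.ok, result.failed_step)  # False pay_with_test_card
```

A scheduler would invoke `run_canary` at the chosen interval and raise an alert whenever `ok` is `False`, using `failed_step` to point responders at the broken stage.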