Health checks enable orchestration platforms to manage service lifecycle intelligently, distinguishing between a process that is alive, one that is ready to serve traffic, and one that is still initializing. Kubernetes defines three probe types: liveness (restart if dead), readiness (remove from load balancer if not ready), and startup (disable liveness/readiness during slow initialization). Beyond Kubernetes, synthetic monitoring (uptime monitoring via Pingdom, Checkly, or CloudWatch Synthetics) tests actual user flows from external locations, catching DNS, CDN, and certificate issues that internal probes miss. Canary transactions continuously execute critical business operations (e.g., a test checkout) to validate end-to-end functionality.

Key Points

  • Liveness probe failure causes Kubernetes to kill and restart the container; use for deadlock detection, infinite loop detection — NOT for dependency failures
  • Readiness probe failure removes the pod from Service endpoints (load balancer); use when a dependency (DB, cache) is unavailable and the pod cannot serve traffic
  • Startup probe disables liveness and readiness checks until the application signals it has started; essential for JVM-based services with 30–90 second startup times
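  • Health endpoint (/health or /healthz) should return 200 OK with a JSON body listing dependency states; an endpoint used for readiness should never return 200 if a critical dependency is down, while the liveness endpoint should leave dependencies out entirely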
  • Shallow vs deep health checks: shallow = process alive; deep = database connected, cache reachable, message bus available; expose both at /health/live and /health/ready
  • Uptime monitoring: external agents (Pingdom, Checkly, UptimeRobot) test from multiple global regions every 30–60 seconds; catches regional DNS and CDN outages
  • Canary transactions: synthetic scripts that perform real (but low-impact) business operations on production at 1–5 minute intervals; the gold standard for user-journey availability
  • Health check timeouts: set the probe timeout shorter than the probe period; a 5-second timeout with a 10-second period and a failure threshold of 3 means roughly 30 seconds of consecutive failures before a restart is triggered
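
The shallow/deep split described in the points above can be sketched with Python's standard library. This is a minimal illustration rather than a production implementation: the dependency names, the `checks` dictionary, the `serve` helper, and port 8080 are all assumptions made for the example, and real checks would ping the actual database or cache.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Dict, Tuple

# Hypothetical dependency checks: each callable returns True when the
# dependency is reachable. Names and checks are placeholders.
DependencyChecks = Dict[str, Callable[[], bool]]


def liveness() -> Tuple[int, dict]:
    """Shallow check: being able to answer at all shows the process is alive."""
    return 200, {"status": "alive"}


def readiness(checks: DependencyChecks) -> Tuple[int, dict]:
    """Deep check: 200 only when every critical dependency reports up."""
    states = {}
    for name, check in checks.items():
        try:
            states[name] = "up" if check() else "down"
        except Exception:
            # A check that raises is treated as a failed dependency.
            states[name] = "down"
    ready = all(state == "up" for state in states.values())
    body = {"status": "ready" if ready else "not_ready", "dependencies": states}
    return (200 if ready else 503), body


def make_handler(checks: DependencyChecks):
    """Build a request handler that serves /health/live and /health/ready."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path == "/health/live":
                code, body = liveness()
            elif self.path == "/health/ready":
                code, body = readiness(checks)
            else:
                code, body = 404, {"error": "not found"}
            payload = json.dumps(body).encode("utf-8")
            self.send_response(code)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(payload)))
            self.end_headers()
            self.wfile.write(payload)

        def log_message(self, format, *args):
            pass  # keep frequent probe requests out of the access log

    return HealthHandler


def serve(checks: DependencyChecks, port: int = 8080) -> None:
    """Blocking helper that starts the health server (port is an assumption)."""
    HTTPServer(("0.0.0.0", port), make_handler(checks)).serve_forever()
```

Calling `serve({"database": db_ping, "cache": cache_ping})` with your own check functions would expose both endpoints. Keeping dependency checks out of `liveness()` is deliberate: a database outage then takes pods out of rotation through the readiness endpoint without triggering restarts.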

| Probe Type | Purpose | Failure Action | Typical Check | Common Mistake |
| --- | --- | --- | --- | --- |
| Liveness | Is the process alive and not deadlocked? | Kubernetes restarts the container | GET /healthz returns 200; simple in-process check | Including DB connectivity, which causes restart cascades when the DB is down |
| Readiness | Can the pod safely serve production traffic? | Pod removed from Service endpoints (no restarts) | GET /ready; verifies DB, cache, and feature-flag service are reachable | Forgetting to add a readiness probe, so unhealthy pods receive traffic during startup |
| Startup | Has the application finished its initialization sequence? | Liveness/readiness disabled until startup succeeds; container restarted if the failure threshold is exhausted | GET /healthz; polled every 5s for up to 5 minutes (failureThreshold × periodSeconds) | Not using a startup probe for slow JVM apps, so the liveness probe kills them before init completes |
| External Uptime | Is the service reachable from external networks (user perspective)? | PagerDuty alert; status page incident created | HTTPS GET to public endpoint from 5 global PoPs every 60s | Testing only internal endpoints, which misses CDN, DNS, and TLS certificate expiry failures |
| Canary Transaction | Is the full business flow working end-to-end? | P1/P2 alert; error budget consumed | Automated checkout flow with test card every 5 minutes; payment confirmation verified | Running against staging only, when production canaries catch config and data-specific failures |
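
The timing figures quoted for the liveness and startup probes both come from the same product, failureThreshold × periodSeconds. The sketch below works through that arithmetic; the `ProbeConfig` class and its method names are hypothetical, the fields simply mirror the Kubernetes probe settings, and the result is a rough estimate rather than an exact kubelet guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeConfig:
    """Hypothetical holder for the three timing settings of a probe."""

    period_seconds: int
    timeout_seconds: int
    failure_threshold: int

    def validate(self) -> None:
        # Guidance from the text: keep the timeout shorter than the period.
        if self.timeout_seconds >= self.period_seconds:
            raise ValueError("timeout should be shorter than the probe period")

    def failure_window_seconds(self) -> int:
        """Approximate time of consecutive failures before the probe acts."""
        return self.failure_threshold * self.period_seconds


# Liveness example from the text: 10s period, 5s timeout, threshold 3.
liveness = ProbeConfig(period_seconds=10, timeout_seconds=5, failure_threshold=3)
liveness.validate()
print(liveness.failure_window_seconds())  # 30

# Startup example: polled every 5s; a threshold of 60 gives a 5-minute budget.
startup = ProbeConfig(period_seconds=5, timeout_seconds=1, failure_threshold=60)
startup.validate()
print(startup.failure_window_seconds())  # 300
```

The startup timeout of 1 second and threshold of 60 are assumed values chosen to reproduce the 5-minute budget; the text only fixes the 5-second period and the total.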

Real-World Example

Amazon uses synthetic canary transactions (CloudWatch Synthetics) for every customer-facing API; each canary runs a scripted Selenium/Puppeteer workflow against production every minute and is the primary mechanism for detecting regressions that unit tests miss.
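
A canary transaction of the kind described above is, at its core, an ordered list of steps that stops at the first failure and reports where it broke. The following is a minimal sketch of that idea only: it is not the CloudWatch Synthetics API, and `run_canary`, `CanaryResult`, and the step names are hypothetical. Real steps would drive a browser or call production APIs with test credentials.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A step is a label plus a callable that raises an exception on failure.
Step = Tuple[str, Callable[[], None]]


@dataclass
class CanaryResult:
    ok: bool
    failed_step: Optional[str]
    duration_seconds: float


def run_canary(steps: List[Step]) -> CanaryResult:
    """Run steps in order and stop at the first one that raises."""
    start = time.monotonic()
    for name, action in steps:
        try:
            action()
        except Exception:
            return CanaryResult(False, name, time.monotonic() - start)
    return CanaryResult(True, None, time.monotonic() - start)


def _decline_payment() -> None:
    # Placeholder failure used to show how a broken step is reported.
    raise RuntimeError("payment declined")


# Hypothetical checkout flow: the third step fails, so the fourth never runs.
checkout_steps: List[Step] = [
    ("load_product_page", lambda: None),
    ("add_to_cart", lambda: None),
    ("pay_with_test_card", _decline_payment),
    ("verify_confirmation", lambda: None),
]

result = run_canary(checkout_steps)
print(result.ok, result.failed_step)  # False pay_with_test_card
```

A scheduler would call `run_canary` at the chosen interval and raise an alert whenever `ok` is false; recording `failed_step` and `duration_seconds` makes it easier to tell a broken payment step from a slow page load.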