Chaos Engineering

Chaos engineering is the practice of deliberately injecting failures into a system in a controlled way to build confidence in its ability to withstand turbulent production conditions. Netflix pioneered the field with Chaos Monkey (which randomly terminates EC2 instances in production) and later the Simian Army. The discipline is now formalized by the Principles of Chaos Engineering (principlesofchaos.org), which call for a steady-state hypothesis, a minimal blast radius, and experiments run in production (not just staging). GameDays are structured exercises in which teams intentionally trigger failure scenarios to practice incident response.
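The experiment loop implied above (check the steady state, inject one fault, check again, always roll back) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any tool's actual API: `SteadyState`, `run_experiment`, and `FakeService` are hypothetical names, the thresholds are the example figures used on this page, and the metrics come from a toy in-memory service rather than a real monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Metrics = Dict[str, float]


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical thresholds: p99 latency < 200 ms, error rate < 0.1%."""
    max_p99_ms: float = 200.0
    max_error_rate: float = 0.001  # 0.1%

    def holds(self, m: Metrics) -> bool:
        return m["p99_ms"] < self.max_p99_ms and m["error_rate"] < self.max_error_rate


def run_experiment(
    hypothesis: SteadyState,
    read_metrics: Callable[[], Metrics],
    inject: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Inject one fault and report whether the steady state survived it."""
    # Never start from an unhealthy baseline: the result would be meaningless.
    if not hypothesis.holds(read_metrics()):
        return "aborted: baseline unhealthy"
    inject()
    try:
        held = hypothesis.holds(read_metrics())
    finally:
        rollback()  # always undo the fault, even if reading metrics fails
    return "hypothesis held" if held else "weakness found"


class FakeService:
    """Toy stand-in for a real system: an active fault raises p99 latency."""

    def __init__(self, degraded_p99_ms: float) -> None:
        self.degraded_p99_ms = degraded_p99_ms
        self.fault_active = False

    def metrics(self) -> Metrics:
        p99 = self.degraded_p99_ms if self.fault_active else 120.0
        return {"p99_ms": p99, "error_rate": 0.0002}

    def inject(self) -> None:
        self.fault_active = True

    def rollback(self) -> None:
        self.fault_active = False


def demo(degraded_p99_ms: float) -> str:
    svc = FakeService(degraded_p99_ms)
    return run_experiment(SteadyState(), svc.metrics, svc.inject, svc.rollback)
```

Calling `demo(150.0)` models a service that absorbs the fault, while `demo(450.0)` models one that breaches the latency threshold. In a real setup, `read_metrics` would presumably query the monitoring stack and `inject` would terminate an instance or add latency, but the control flow stays the same.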

Key Points

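Steady-state hypothesis: define measurable normal behavior (e.g., p99 latency < 200ms, error rate < 0.1%) before injecting failure; if the hypothesis still holds after injection, that is evidence the system tolerates this particular failure, and if it does not, the experiment has exposed a weakness to fix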
Chaos Monkey: Netflix's open-source tool that randomly kills virtual machine instances during business hours, ensuring services are designed to survive instance loss
Failure injection targets: instance termination, network latency injection (tc netem), packet loss, disk full, CPU stress, memory pressure, dependency unavailability
Blast radius control: start with a single pod in a non-peak hour, then expand to a full availability zone; use feature flags to gate chaos experiments
Chaos platforms: AWS Fault Injection Service (FIS, formerly Fault Injection Simulator), Gremlin, and LitmusChaos (an open-source CNCF project) provide safety guards, automatic rollback, and experiment templates
GameDays: quarterly structured drills where engineering teams simulate major failure scenarios (AZ failure, database failover, CDN outage) with incident response practice; tabletop version for teams not yet ready for production injection
Chaos in CI: inject latency and errors into integration test suites to verify circuit breakers, retry budgets, and fallback behaviors without touching production
Measuring chaos success: reduction in MTTR for real incidents, increased confidence in DR plans, discovery of previously unknown failure modes before customers experience them
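The "Chaos in CI" point above can be illustrated with a small, deterministic check. The sketch below wraps a dependency call in a fault-injecting function and verifies that a circuit breaker opens and serves a fallback. Everything here is hypothetical and written for illustration: `with_faults`, `CircuitBreaker`, `run_ci_check`, and the `recommendations` dependency are not taken from any library, and the breaker is deliberately minimal (no half-open state or reset timer).

```python
import time
from typing import Callable, List, Tuple


class InjectedFault(Exception):
    """Raised by the chaos wrapper in place of a real dependency error."""


def with_faults(call: Callable[[], str], fail: bool, delay_s: float = 0.0) -> Callable[[], str]:
    """Wrap a dependency call so a test can add latency and/or force errors."""
    def wrapped() -> str:
        if delay_s:
            time.sleep(delay_s)  # injected latency
        if fail:
            raise InjectedFault("injected dependency failure")
        return call()
    return wrapped


class CircuitBreaker:
    """Minimal breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self) -> bool:
        return self.failures >= self.threshold

    def call(self, fn: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.open:
            return fallback()  # fail fast without touching the dependency
        try:
            result = fn()
        except Exception:
            self.failures += 1
            return fallback()
        self.failures = 0  # any success resets the failure streak
        return result


def run_ci_check(requests: int = 5) -> Tuple[List[str], int, bool]:
    """Send requests through the breaker while the dependency is failing."""
    attempts = 0

    def recommendations() -> str:  # hypothetical downstream dependency
        return "personalized"

    faulty = with_faults(recommendations, fail=True)

    def counted() -> str:
        nonlocal attempts
        attempts += 1  # count calls that actually reach the dependency
        return faulty()

    breaker = CircuitBreaker(threshold=3)
    responses = [breaker.call(counted, lambda: "default") for _ in range(requests)]
    return responses, attempts, breaker.open
```

With five requests and a threshold of three, every caller receives the fallback, and only the first three requests reach the failing dependency before the breaker opens. A test suite could assert exactly that, which checks the fallback path and the fail-fast behavior without touching production.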

Real-World Example

Netflix runs Chaos Monkey against its production environment every weekday. After years of this forced resilience engineering, its engineering culture treats instance failure as routine, and that mindset is widely credited with helping the service ride out several large-scale AWS outages with little or no customer-visible degradation.
