Distributed systems fail in ways that single-process applications do not: network partitions, partial failures, message loss, and duplicate delivery all require dedicated patterns to handle gracefully. The essential distributed patterns are: Circuit Breaker (fail fast on degraded dependencies), Bulkhead (isolate failure domains), Retry with backoff (handle transient failures), Saga (distributed transactions without 2PC), Outbox (guaranteed event publication alongside database writes), and Idempotency keys (safe retries for non-idempotent operations). These patterns are implemented by libraries (Resilience4j, Polly, Hystrix) and by service meshes (Envoy/Istio). Applying them correctly largely determines whether a distributed system is resilient or fragile.

Key Points

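  • Circuit Breaker states: Closed (normal, requests pass through) → Open (failures exceeded the threshold; all requests fail fast without calling the dependency) → Half-Open (entered after a cool-down timeout; a limited number of probe requests are let through, and the breaker closes if they succeed or reopens if they fail).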
  • Bulkhead: dedicate separate thread pools or connection pools to critical vs. non-critical paths — prevents a slow third-party API from exhausting the shared thread pool and blocking your critical checkout flow.
  • Retry with exponential backoff + jitter: wait = min(base × 2^attempt + random_jitter, max_wait) — jitter prevents synchronized retry storms when a dependency recovers and all clients retry simultaneously.
  • Saga (choreography): each service publishes an event when its local transaction succeeds, and each service has a compensating transaction that is triggered by a failure event. There is no central coordinator, but failure chains that span several services are hard to trace and debug.
  • Transactional Outbox: write to DB and an outbox table in one transaction; a separate relay process reads the outbox and publishes events to Kafka — prevents the dual write problem (DB written, Kafka publish fails).
  • Idempotency keys: client generates a UUID for each operation; server stores (key, result) and returns cached result for duplicate requests — essential for payment APIs to prevent double charges on retry.
  • Rate limiting patterns: Token Bucket (smooth burst absorption), Leaky Bucket (strict output rate), Fixed Window (simple but allows burst at window boundaries), Sliding Window (accurate, higher memory cost).
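  • Backpressure: when a consumer is slower than a producer, explicitly signal the producer to slow down (TCP flow control, Reactive Streams demand signalling). In pull-based systems such as Kafka the consumer sets its own pace, so consumer lag is monitored as the indicator that consumers are falling behind.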
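
The Circuit Breaker state machine described above can be sketched in a few lines. This is a minimal illustration, not how Resilience4j, Polly, or Hystrix implement it: the class name CircuitBreaker, the parameters failure_threshold and reset_timeout, and the injectable clock are all assumptions made for the example. It counts consecutive failures (real libraries usually track a failure rate over a sliding window) and it is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without invoking the dependency."""


class CircuitBreaker:
    """Hypothetical single-threaded breaker: closed -> open -> half_open."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_timeout = reset_timeout          # seconds to stay open before probing
        self.clock = clock                          # injectable so tests can control time
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            self.state = "half_open"  # cool-down elapsed: let this call through as a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        # Any success (including a half-open probe) closes the breaker.
        self.failures = 0
        self.state = "closed"

    def _on_failure(self):
        self.failures += 1
        # A failed probe reopens immediately; otherwise open at the threshold.
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = self.clock()
```

A caller would wrap each outbound request, for example breaker.call(fetch_profile, user_id) where fetch_profile is a hypothetical client function, and treat CircuitOpenError as the cue to serve a fallback.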
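
The Bulkhead bullet describes dedicated thread or connection pools. The sketch below shows a lighter variant of the same idea, a semaphore that caps how many calls may be in flight for one dependency and rejects the rest. The names Bulkhead, BulkheadFullError, and max_concurrent are hypothetical, and a pool-based bulkhead would differ in that it also moves the work onto separate threads.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slot."""


class Bulkhead:
    """Hypothetical semaphore bulkhead: at most max_concurrent calls in flight."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Reject immediately instead of queueing, so a slow dependency
        # cannot tie up callers that belong to other compartments.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("compartment is full; rejecting call")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()
```

One Bulkhead instance per dependency (for example one for the third-party API and another for checkout) keeps saturation of the first from consuming capacity needed by the second.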
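
The backoff formula above, wait = min(base × 2^attempt + random_jitter, max_wait), translates directly into code. In this sketch the function names backoff_delay and retry, the default values, and the choice to draw jitter uniformly from [0, base) are assumptions for illustration; other jitter strategies exist.

```python
import random
import time


def backoff_delay(attempt, base=0.1, max_wait=10.0, rng=random.random):
    """Delay in seconds before retry number `attempt` (0-based)."""
    exponential = base * (2 ** attempt)
    jitter = rng() * base  # random_jitter, drawn uniformly from [0, base)
    return min(exponential + jitter, max_wait)


def retry(fn, max_attempts=5, retry_on=(ConnectionError, TimeoutError),
          sleep=time.sleep, **backoff):
    """Call fn until it succeeds, retrying only on the listed transient errors."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the last error
            sleep(backoff_delay(attempt, **backoff))
```

Restricting retry_on to transient error types matters: retrying a validation error or an authorization failure only adds load without any chance of success.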
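
The Transactional Outbox flow can be sketched with SQLite standing in for the service database and a plain callback standing in for the Kafka producer. The table layout, the function names place_order and relay_once, and the OrderPlaced event shape are all hypothetical. The point of the sketch is that the business row and the outbox row commit in one transaction, and that the relay marks a row as published only after the publish call returns.

```python
import json
import sqlite3


def create_schema(conn):
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, item TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox (id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "payload TEXT NOT NULL, published INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn, item):
    # `with conn` commits both inserts together, or rolls both back on error.
    with conn:
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (item,))
        event = {"type": "OrderPlaced", "order_id": cur.lastrowid, "item": item}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))
    return cur.lastrowid


def relay_once(conn, publish):
    """Publish pending outbox rows in order; `publish` stands in for a broker client."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, payload in rows:
        publish(json.loads(payload))  # if this raises, the row stays unpublished
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)
```

If the relay crashes after publishing but before marking the row, the event is sent again on the next pass, so this design gives at-least-once delivery and consumers should deduplicate.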
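
The idempotency-key flow can be illustrated with an in-memory store. PaymentService, its charge method, and the response fields are hypothetical names for this sketch. A production server would keep the (key, result) pairs in durable storage, make the check-and-store step atomic (for example with a unique constraint on the key), and expire old keys.

```python
import uuid


class PaymentService:
    """Hypothetical server that deduplicates requests by idempotency key."""

    def __init__(self):
        self._results = {}  # idempotency key -> stored response
        self.charges = []   # side effects that were actually performed

    def charge(self, idempotency_key, account, amount_cents):
        # Duplicate request: return the stored response, perform no new charge.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges.append((account, amount_cents))
        result = {
            "charge_id": len(self.charges),
            "account": account,
            "amount_cents": amount_cents,
        }
        self._results[idempotency_key] = result
        return result


# The client generates one key per logical operation and reuses it on every retry.
service = PaymentService()
key = str(uuid.uuid4())
first = service.charge(key, "acct-1", 500)
retried = service.charge(key, "acct-1", 500)  # same key: no second charge
```

The key identifies the operation, not the HTTP request, so a client that times out and retries must send the same key rather than generate a fresh one.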
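
Of the rate limiting patterns listed, Token Bucket is the easiest to show in code. The sketch below assumes a single-threaded caller; the class name TokenBucket and the parameters rate, capacity, and clock are chosen for illustration. Tokens refill continuously at rate per second up to capacity, which is what allows short bursts while bounding the long-run average.

```python
import time


class TokenBucket:
    """Hypothetical single-threaded token bucket rate limiter."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.clock = clock        # injectable so tests can control time
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

With rate=1.0 and capacity=2, two back-to-back requests are admitted, a third is rejected, and one more is admitted after a second has passed.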

Real-World Example

Netflix's Hystrix library (now in maintenance mode, with Resilience4j the commonly recommended alternative for new projects) implemented the Circuit Breaker, Bulkhead, and Timeout patterns. It is credited with allowing Netflix to maintain 99.99% availability despite regular failures in one or more of the hundreds of services its applications depend on. With Hystrix in place, a single dependency failure that would previously have cascaded system-wide was contained to the specific feature powered by that dependency.