Bottleneck Identification
Profiling (CPU, memory, I/O), flame graphs, distributed trace analysis
Bottleneck identification is the systematic process of locating the constraint that limits system performance: the resource (CPU, memory, I/O, network, lock contention) or code path that saturates first under load. Amdahl's Law explains why the search matters: if 90% of the work can be parallelized and 10% stays sequential, the maximum speedup is 10x no matter how much parallelism is added, so effort spent anywhere other than the limiting part yields little. Profiling (CPU, memory, I/O), flame graphs, and distributed tracing (Jaeger, Zipkin, Datadog APM) are the primary investigative tools and are widely used by performance engineers.
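The Amdahl's Law figure above can be checked with a few lines of Python. This is a minimal sketch: the function names `amdahl_speedup` and `amdahl_limit` are illustrative and not part of any library, and the worker counts are arbitrary.

```python
def amdahl_speedup(parallel_fraction: float, workers: float) -> float:
    """Speedup predicted by Amdahl's Law when only part of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time; the parallel part is divided among workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def amdahl_limit(parallel_fraction: float) -> float:
    """Speedup ceiling as the number of workers grows without bound."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0 else 1.0 / serial_fraction


if __name__ == "__main__":
    for workers in (2, 8, 64, 1024):
        print(f"{workers:>5} workers: {amdahl_speedup(0.9, workers):.2f}x")
    print(f"ceiling: {amdahl_limit(0.9):.2f}x")
```

With 90% of the work parallelizable, the speedup approaches but never reaches 10x, which is why the sequential 10% is the part worth profiling.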
Key Points
- CPU profiling: sampling profilers (Linux perf, Java VisualVM, py-spy) capture stack traces at regular intervals — visualized as flame graphs showing where CPU time is spent.
- Flame graphs: Brendan Gregg's visualization; width = CPU time proportion, height = call stack depth — wide plateaus are optimization targets.
- Memory profiling: heap allocation profiling (Java JProfiler, Go pprof, Python memory_profiler) identifies allocation hot paths and objects holding references preventing GC.
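- I/O bottlenecks: identify with `iostat -x 1` (per-device utilization and queue depth), `ss -s` (socket state summary), `netstat -s` (per-protocol counters, including errors and retransmits); look for high %iowait, %util near 100%, and a growing request queue.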
- Distributed tracing: tags each request with a trace ID and records every service call as a span; Jaeger and Zipkin render the end-to-end request as a waterfall, which shows which service contributes the most latency.
- Lock contention: Java thread dumps, Go goroutine dumps, PostgreSQL `pg_locks` view — identify threads blocked waiting for mutexes or row-level locks.
- Database slow query analysis: `pg_stat_statements` in PostgreSQL records cumulative statistics per query; sort by `total_exec_time` DESC (named `total_time` before PostgreSQL 13) to find the highest-impact optimization targets.
- USE Method (Brendan Gregg): for every resource, check Utilization (busy %), Saturation (queue length), Errors — systematic elimination of bottleneck candidates.
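To make the sampling and flame graph points above concrete, the sketch below collapses sampled stacks into the semicolon-joined "folded" form that flame graph tools commonly take as input, then computes how often each frame appears. The sample data and the function names (`fold_stacks`, `inclusive_share`, `self_share`) are hypothetical; a real profiler such as perf or py-spy would supply the stacks.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse sampled stacks (root-first lists of frame names) into 'a;b;c' -> count."""
    folded = Counter()
    for stack in samples:
        folded[";".join(stack)] += 1
    return folded


def inclusive_share(folded):
    """Fraction of samples containing each frame; this corresponds to frame width."""
    total = sum(folded.values())
    seen = Counter()
    for stack, count in folded.items():
        # Use a set so a recursive frame is counted once per sample.
        for frame in set(stack.split(";")):
            seen[frame] += count
    return {frame: n / total for frame, n in seen.items()}


def self_share(folded):
    """Fraction of samples where each frame is the leaf, i.e. a plateau on top."""
    total = sum(folded.values())
    leaf = Counter()
    for stack, count in folded.items():
        leaf[stack.rsplit(";", 1)[-1]] += count
    return {frame: n / total for frame, n in leaf.items()}


if __name__ == "__main__":
    # Hypothetical samples, one stack per sampling tick.
    samples = [
        ["main", "handle_request", "parse_json"],
        ["main", "handle_request", "parse_json"],
        ["main", "handle_request", "parse_json"],
        ["main", "handle_request", "query_db"],
        ["main", "gc"],
    ]
    folded = fold_stacks(samples)
    for frame, share in sorted(self_share(folded).items(), key=lambda kv: -kv[1]):
        print(f"{frame:<12} self={share:.0%} inclusive={inclusive_share(folded)[frame]:.0%}")
```

Frames with a high self share are the wide plateaus: they were on-CPU themselves when sampled, so they are the first candidates for optimization. A frame with high inclusive but low self share is only a path to the hot code.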
Real-World Example
Netflix's performance team reportedly used CPU flame graphs to discover that a JVM JIT compilation hot path was consuming 15% of CPU in their Zuul API gateway under sustained load. A single JVM flag change (`-XX:+UseStringDeduplication`, which lets the G1 garbage collector share the backing arrays of duplicate strings) reportedly reduced memory pressure by 20% and improved p99 latency by 35 ms.