System Design Basics
Scalability, reliability, availability, maintainability fundamentals
System design is the process of defining the architecture, components, modules, interfaces, and data flow of a system to satisfy specified requirements. The four pillars every senior engineer must balance are scalability (handling growth), reliability (surviving failures), availability (staying accessible), and maintainability (supporting long-term evolution). Every design decision is a trade-off — adding a cache improves performance but adds consistency complexity; adding replication improves availability but adds write overhead. Strong system design begins with clarifying requirements (functional vs. non-functional), estimating scale (QPS, storage, bandwidth), and then choosing architectural primitives that satisfy the dominant constraints.
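The cache trade-off mentioned above can be made concrete with a small sketch. It is a toy cache-aside reader under stated assumptions: plain dicts stand in for a real database and cache, and the class name `CacheAsideStore`, the `invalidate` flag, and the TTL value are all hypothetical names chosen for illustration. It is not a production pattern.

```python
import time


class CacheAsideStore:
    """Toy cache-aside reader; dicts stand in for a real database and cache."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._db = {}          # source of truth (hypothetical stand-in)
        self._cache = {}       # key -> (value, expires_at)
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so expiry can be tested
        self.db_reads = 0      # counts how often the "database" is hit

    def write(self, key, value, invalidate=True):
        # Write to the source of truth; optionally drop the cached copy.
        self._db[key] = value
        if invalidate:
            self._cache.pop(key, None)

    def read(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]    # cache hit: cheap, but possibly stale
        # Cache miss or expired entry: fall back to the database.
        self.db_reads += 1
        value = self._db.get(key)
        self._cache[key] = (value, self._clock() + self._ttl)
        return value


if __name__ == "__main__":
    store = CacheAsideStore(ttl_seconds=60)
    store.write("user:1", "alice")
    store.read("user:1")                      # miss, loads from the database
    store.read("user:1")                      # hit, no database read
    store.write("user:1", "bob", invalidate=False)
    print(store.read("user:1"))               # still "alice": stale until TTL or invalidation
    print(store.db_reads)                     # 1
```

The second read is served without touching the database, which is the performance gain. The write that skips invalidation leaves readers seeing the old value until the entry expires, which is the consistency cost.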
Key Points
- Clarify scale requirements first: 100 RPS and 1M RPS call for fundamentally different architectures. A single PostgreSQL instance can typically serve the former; the latter usually needs a sharded cluster with caching.
- Back-of-envelope estimation: storage = daily writes × record size × retention days; bandwidth = QPS × average response size. Run these estimates before drawing boxes.
- Functional requirements define what the system does; non-functional requirements (NFRs) define how well it does it. Interviewers penalize candidates who jump to solutions without extracting both.
- Single Responsibility at the system level: each service should have one reason to change — mixing user authentication and payment processing in one service is a design smell.
- Horizontal scalability requires stateless application layers, where any server can handle any request, with shared state externalized to caches (Redis) and databases (Cassandra, PostgreSQL).
- Reliability requires eliminating single points of failure (SPOFs) at every layer: no database primary without a replica ready to take over, no lone load balancer, no single DNS server.
- Start with a simple architecture and complicate only when scale demands it — premature distribution (microservices before you need them) creates operational overhead without benefit.
- Draw data flow diagrams, not just component diagrams — tracing a write operation from client to storage reveals consistency boundaries, failure modes, and latency sources.
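The back-of-envelope formulas from the key points can be sketched in a few lines. This is a minimal illustration, not a sizing tool: the `Workload` class, its field names, and every number in the example are assumptions made up for the arithmetic, and decimal units (1 TB = 10^12 bytes) are used for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical workload parameters; all figures are illustrative assumptions."""
    daily_writes: int         # records written per day
    record_size_bytes: int    # average stored size of one record
    retention_days: int       # how long records are kept
    qps: int                  # queries per second to size bandwidth for
    avg_response_bytes: int   # average response payload size


def storage_bytes(w: Workload) -> int:
    # storage = daily writes x record size x retention days
    return w.daily_writes * w.record_size_bytes * w.retention_days


def bandwidth_bytes_per_sec(w: Workload) -> int:
    # bandwidth = QPS x average response size
    return w.qps * w.avg_response_bytes


def human(n: float, unit: str = "B") -> str:
    """Format a byte count with decimal (power-of-1000) prefixes."""
    for prefix in ("", "K", "M", "G", "T", "P"):
        if n < 1000:
            return f"{n:.2f} {prefix}{unit}"
        n /= 1000
    return f"{n:.2f} E{unit}"


if __name__ == "__main__":
    # Assumed example: 10M writes/day of 1 KB records kept for one year,
    # served at 5,000 QPS with 20 KB average responses.
    w = Workload(
        daily_writes=10_000_000,
        record_size_bytes=1_000,
        retention_days=365,
        qps=5_000,
        avg_response_bytes=20_000,
    )
    print(human(storage_bytes(w)))                    # 3.65 TB
    print(human(bandwidth_bytes_per_sec(w), "B/s"))   # 100.00 MB/s
```

These figures cover raw data only. Replication, indexes, and peak-to-average traffic ratios would multiply them, so the result is best treated as a lower bound when choosing an architecture.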
Real-World Example
Instagram, at its 2012 acquisition by Facebook, had 13 employees and roughly 30M users, running on a deliberately simple architecture: EC2 + PostgreSQL with PostGIS + Redis + Solr. The team added complexity (sharding, custom photo storage) only when specific bottlenecks arose, a textbook example of evolutionary architecture driven by measured constraints.