System Design Basics

System design is the process of defining the architecture, components, modules, interfaces, and data flow of a system to satisfy specified requirements. The four pillars every senior engineer must balance are scalability (handling growth), reliability (surviving failures), availability (staying accessible), and maintainability (supporting long-term evolution). Every design decision is a trade-off: adding a cache improves read latency but means cached data must be kept consistent with the source of truth; adding replication improves availability but adds write overhead. Strong system design begins with clarifying requirements (functional vs. non-functional), estimating scale (QPS, storage, bandwidth), and then choosing architectural primitives that satisfy the dominant constraints.

Key Points

Clarify scale requirements first: 100 RPS vs. 1M RPS require fundamentally different architectures — a single Postgres instance handles the former; a sharded cluster with caching handles the latter.
Back-of-envelope estimation: storage = daily writes × record size × retention days; bandwidth = QPS × average response size; start estimates before drawing boxes.
Functional requirements define what the system does; non-functional requirements (NFRs) such as latency, availability, and durability define how well it does it. Interviewers penalize candidates who jump to solutions without extracting both.
Single Responsibility at the system level: each service should have one reason to change — mixing user authentication and payment processing in one service is a design smell.
Horizontal scalability requires a stateless application layer, so that any server can handle any request, with shared state externalized to caches (Redis) and databases (Cassandra, PostgreSQL).
Reliability requires eliminating single points of failure (SPOFs) at every layer: no database primary without a replica ready to take over, no lone load balancer, no single authoritative DNS nameserver.
Start with a simple architecture and complicate only when scale demands it — premature distribution (microservices before you need them) creates operational overhead without benefit.
Draw data flow diagrams, not just component diagrams — tracing a write operation from client to storage reveals consistency boundaries, failure modes, and latency sources.
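The estimation formulas from the key points reduce to a few lines of arithmetic. The sketch below is illustrative only: the `estimate` function and every input figure are hypothetical, units are decimal (1 KB = 1,000 bytes, 1 TB = 10^12 bytes), and average write QPS is assumed to be daily writes spread evenly over 86,400 seconds.

```python
from __future__ import annotations

SECONDS_PER_DAY = 86_400


def estimate(daily_writes: int, record_bytes: int, retention_days: int,
             read_qps: float, avg_response_bytes: int) -> dict[str, float]:
    """Back-of-envelope sizing using decimal units (1 TB = 10**12 bytes)."""
    # storage = daily writes x record size x retention days
    storage_bytes = daily_writes * record_bytes * retention_days
    # average write rate: daily writes spread evenly over one day
    write_qps = daily_writes / SECONDS_PER_DAY
    # bandwidth = QPS x average response size
    egress_bytes_per_s = read_qps * avg_response_bytes
    return {
        "write_qps": write_qps,
        "storage_tb": storage_bytes / 10**12,
        "egress_mb_per_s": egress_bytes_per_s / 10**6,
    }


# Hypothetical inputs: 10M writes/day, 1 KB records, 5-year retention,
# 10,000 read QPS returning 2 KB on average.
est = estimate(10_000_000, 1_000, 5 * 365, 10_000, 2_000)
print(f"{est['write_qps']:.0f} write QPS, "
      f"{est['storage_tb']:.2f} TB stored, "
      f"{est['egress_mb_per_s']:.0f} MB/s egress")
# → 116 write QPS, 18.25 TB stored, 20 MB/s egress
```

Average QPS understates peak load, so a sizing exercise would typically scale these figures by a peak multiplier taken from observed or expected traffic patterns.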
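A stateless application tier can be sketched as follows. `InMemoryStore` is a hypothetical stand-in for an external store such as Redis (it serializes values to JSON to mimic a network round trip); `AppServer`, `add_to_cart`, and the round-robin loop are likewise illustrative names, not any particular framework's API.

```python
from __future__ import annotations

import itertools
import json
from typing import Protocol


class SessionStore(Protocol):
    """Interface for shared state kept outside the application servers."""
    def get(self, key: str) -> dict | None: ...
    def set(self, key: str, value: dict) -> None: ...


class InMemoryStore:
    """Hypothetical stand-in for an external store such as Redis.
    Values are stored as JSON strings, as they would be over the network."""
    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> dict | None:
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else None

    def set(self, key: str, value: dict) -> None:
        self._data[key] = json.dumps(value)


class AppServer:
    """Holds no per-user state: every request reads and writes the shared store."""
    def __init__(self, name: str, store: SessionStore) -> None:
        self.name = name
        self.store = store

    def add_to_cart(self, session_id: str, item: str) -> list[str]:
        session = self.store.get(session_id) or {"cart": []}
        session["cart"].append(item)
        self.store.set(session_id, session)
        return session["cart"]


store = InMemoryStore()
# Round-robin "load balancer" over two interchangeable servers.
balancer = itertools.cycle([AppServer("app-1", store), AppServer("app-2", store)])

for item in ["book", "pen", "lamp"]:
    cart = next(balancer).add_to_cart("session-42", item)

print(cart)  # ['book', 'pen', 'lamp']
```

Because neither server keeps the cart in its own memory, consecutive requests can land on different servers, and servers can be added or removed without losing sessions. A real deployment would also have to handle concurrent writes to the same session (this read-modify-write sketch ignores that race).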
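The cost of a single point of failure can be estimated with standard availability arithmetic: components in a chain multiply their availabilities, while redundant copies fail only when all copies fail. This sketch assumes independent failures and instant, perfect failover, which real systems only approximate (a shared network or region outage breaks the model); the 99.9% figures and the helper names `series` and `parallel` are hypothetical.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Components in a chain (every one must be up): availabilities multiply."""
    return prod(availabilities)


def parallel(*availabilities: float) -> float:
    """Redundant copies of one component: the tier fails only if all copies fail.
    Assumes independent failures and instant, perfect failover."""
    return 1 - prod(1 - a for a in availabilities)


# Hypothetical 99.9% availability for each load balancer, app server, and DB node.
A = 0.999

# One of each: every component is a single point of failure.
single = series(A, A, A)
# Two load balancers, three app servers, and a primary plus standby database.
redundant = series(parallel(A, A), parallel(A, A, A), parallel(A, A))

print(f"single:    {single:.4%}")     # single:    99.7003%
print(f"redundant: {redundant:.4%}")  # redundant: 99.9998%
```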

Real-World Example

At the time of its 2012 acquisition by Facebook, Instagram had 13 employees and 30M users and ran on a deliberately simple architecture: EC2 + PostgreSQL with PostGIS + Redis + Solr. The team added complexity (sharding, custom photo storage) only when specific bottlenecks arose, a textbook example of evolutionary architecture driven by measured constraints.
