Indexing Strategies | Databases & Data Architecture | System Design

Database indexes are auxiliary data structures that trade write amplification and storage for read acceleration. The B-tree index (balanced tree, O(log n) lookup) is the universal default for equality and range queries. Hash indexes give O(1) equality but cannot serve range or prefix queries. Advanced index types — covering, partial, GIN/GiST, and HNSW vector indexes — solve specialized problems but each carries a write-overhead cost that must be justified by read-path SLOs.

Key Points

B-tree pages are typically 8 KB (PostgreSQL) or 16 KB (InnoDB); the tree height is log_f(N) where f is the fanout (~100–400); a 100M-row index is only 3–4 levels deep.
Composite index column order matters: (a, b, c) index satisfies queries filtering on a, (a, b), or (a, b, c) — never on b alone. Put high-cardinality, equality-first columns leftmost.
Covering index: index includes all columns needed by the query (SELECT + WHERE + JOIN), enabling index-only scans that never touch the heap/data page.
Partial index: CREATE INDEX idx ON orders (customer_id) WHERE status = 'pending' — much smaller, faster, and only maintained for rows matching the predicate.
GIN (Generalized Inverted Index) indexes array elements and JSONB keys in PostgreSQL; GiST supports geometric types and full-text tsvector; BRIN (Block Range INdex) is compact for append-only tables like logs.
Too many indexes hurt write throughput: each INSERT/UPDATE/DELETE must maintain every index; a table with 15 indexes can have 3–5x write overhead vs a table with 3 indexes.
Index bloat in PostgreSQL: dead tuples in indexes are not cleaned by VACUUM by default; use pg_stat_user_indexes to identify indexes with high idx_scan=0 (unused).
Vector indexes (HNSW, IVFFlat) in pgvector enable approximate nearest-neighbor search on float32 embedding vectors (768–3072 dimensions for modern LLMs); HNSW trades memory for query accuracy.

Real-World Example

Cloudflare uses partial indexes on their PostgreSQL clusters to index only "active" rows, keeping index size 80% smaller. GitHub uses covering indexes on their MySQL repositories table so hot repository metadata queries are served entirely from the index buffer pool, eliminating table lookups.

←PreviousData Modeling NextQuery Optimization→