System Design: Concepts, Patterns, Technologies
My reference for system design interviews. Three sections: core concepts (the building blocks), common patterns (recurring solutions), and key technologies (when to reach for what).
Scan the tables to refresh. Read the notes beneath each table for the key decisions and tradeoffs.
Core Concepts
Networking Essentials
| Topic | Key Points |
|---|---|
| TCP vs UDP | TCP: reliable, ordered, connection-oriented (HTTP, databases). UDP: fast, no guarantees (video streaming, DNS). |
| HTTP/HTTPS | Request-response over TCP. HTTP/2 adds multiplexing. HTTP/3 uses QUIC (UDP-based). TLS for encryption. |
| WebSockets | Persistent bidirectional connection over TCP. Use for real-time (chat, live updates). Initiated via HTTP upgrade. |
| DNS | Domain to IP resolution. TTL controls caching. Can route by geography for latency. |
| Load Balancers | L4 (transport): routes by IP/port, fast, no inspection. L7 (application): routes by URL/headers, can do auth and rate limiting. |
The key distinction in interviews is L4 vs L7. L4 is faster but can only see IPs and ports. L7 is slower but can make smart routing decisions based on request content (URLs and headers).
API Design
| Style | Use When | Tradeoffs |
|---|---|---|
| REST | Public APIs, CRUD, simple resource models | Well-understood, cacheable. Overfetching/underfetching common. |
| gRPC | Service-to-service, low latency, streaming | Binary protobuf, fast, typed contracts. Not browser-friendly without proxy. |
| GraphQL | Client-driven queries, mobile apps needing flexible data | Single endpoint, no overfetching. Complexity on server, caching harder. |
Key decisions: pagination (cursor-based for real-time data, offset for static), idempotency (POST with client-generated ID), versioning (URL path is simplest: /v1/resource).
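A minimal sketch of cursor-based pagination, assuming an opaque base64 cursor that encodes the last-seen position; the field names here are illustrative:

```python
import base64
import json

def encode_cursor(created_at: str, item_id: int) -> str:
    """Pack the last-seen position into an opaque, URL-safe cursor."""
    raw = json.dumps({"created_at": created_at, "id": item_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()

def decode_cursor(cursor: str) -> dict:
    return json.loads(base64.urlsafe_b64decode(cursor))

# A page response carries the cursor of its last item, so the client asks for
# "items after this position" instead of "page 7" -- stable even when new
# rows are inserted between requests.
last_item = {"created_at": "2024-05-01T12:00:00Z", "id": 9182}
next_cursor = encode_cursor(last_item["created_at"], last_item["id"])
print(decode_cursor(next_cursor))  # {'created_at': '2024-05-01T12:00:00Z', 'id': 9182}
```

Because the cursor points at a position rather than an offset, newly inserted rows don't shift later pages, which is why it suits real-time data.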
Data Modeling
| Approach | Strengths | Weaknesses |
|---|---|---|
| Relational (PostgreSQL) | ACID, joins, complex queries, strong consistency | Schema rigidity, harder to shard |
| Document (MongoDB) | Flexible schema, nested data, horizontal scaling | No joins, denormalized data can diverge |
| Wide-column (Cassandra) | Massive write throughput, time-series, multi-DC | Limited query patterns, must design around partition key |
| Key-value (Redis, DynamoDB) | Sub-ms latency, simple access patterns | No complex queries |
Design your schema around how data is read, not how it's logically organized. If reads far outnumber writes, denormalize. If you need joins, use relational.
Caching
| Strategy | How It Works | Use When |
|---|---|---|
| Cache-aside | App checks cache first, falls back to DB on miss, fills cache after | General purpose, most common |
| Write-through | Write to cache and DB synchronously on every write | Need cache and DB always in sync |
| Write-back | Write to cache only, async flush to DB | Write-heavy, can tolerate data loss risk |
| Read-through | Cache itself fetches from DB on miss | Simplify app logic, cache acts as proxy |
Cache-aside read path:
App → Cache → HIT → return
App → Cache → MISS → read DB → fill cache → return
Eviction: LRU (most common), TTL (simplest to reason about), LFU (frequency-based, good for skewed access).
Cache invalidation is the hard part. TTL is simplest: accept staleness up to N seconds. Event-driven invalidation (publish on write, subscribers evict) is more precise but more complex. When in doubt, start with TTL.
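A minimal cache-aside sketch, assuming the redis-py client and a hypothetical `db_lookup` function standing in for the real database read; TTL handles invalidation as described above:

```python
import json
import redis  # assumes the redis-py client is installed and Redis is running locally

cache = redis.Redis(host="localhost", port=6379)
CACHE_TTL_SECONDS = 300  # accept up to 5 minutes of staleness

def db_lookup(user_id: str) -> dict:
    """Stand-in for the real database read on a cache miss."""
    return {"id": user_id, "name": "example"}

def get_user(user_id: str) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)          # 1. check the cache first
    if cached is not None:
        return json.loads(cached)    # HIT: return without touching the DB
    user = db_lookup(user_id)        # 2. MISS: fall back to the DB
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(user))  # 3. fill cache with a TTL
    return user
```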
Sharding
| Strategy | How It Works | Tradeoffs |
|---|---|---|
| Hash-based | Hash the partition key, mod by shard count | Even distribution, but range queries hit all shards |
| Range-based | Assign contiguous key ranges to shards | Efficient range queries, but hot ranges cause imbalance |
Choose a partition key with high cardinality and even distribution. Bad key: country (skewed). Good key: user ID (uniform). Cross-shard queries are expensive. Resharding is painful without consistent hashing.
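A hash-based shard lookup sketch; the shard count is illustrative, and a stable hash is used because Python's built-in `hash()` is salted per process:

```python
import hashlib

NUM_SHARDS = 8  # illustrative; plain modulo means resharding moves most keys

def shard_for(partition_key: str) -> int:
    # Use a stable hash: the built-in hash() would route the same key to
    # different shards on different hosts or restarts.
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

print(shard_for("user:42"))   # the same key always lands on the same shard
print(shard_for("user:43"))   # a high-cardinality key spreads evenly
```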
Replication
| Strategy | How It Works | Tradeoffs |
|---|---|---|
| Leader-Follower | One leader handles writes, followers replicate and serve reads | Simple, but leader is a bottleneck. Replication lag means stale reads. |
| Multi-Leader | Multiple nodes accept writes, replicate to each other | Better write availability across regions. Conflict resolution is hard. |
| Leaderless (Quorum) | Read/write to multiple nodes. Consistency when W + R > N. | High availability, tunable consistency. More complex client logic. |
Leader-follower is the default. Most SQL databases use it. Go multi-leader for multi-region writes. Go leaderless (Dynamo-style) when you need high availability and can handle eventual consistency.
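A quick illustration of the quorum rule; the N, W, and R values here are just examples:

```python
def quorum_is_consistent(n: int, w: int, r: int) -> bool:
    """W + R > N guarantees the read set and write set overlap in at least one node."""
    return w + r > n

# N=3 with W=2, R=2: every read overlaps every write, so reads see the latest value.
print(quorum_is_consistent(3, 2, 2))  # True
# N=3 with W=1, R=1: fast, but a read may miss the one node that took the write.
print(quorum_is_consistent(3, 1, 1))  # False
```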
Consistent Hashing
Hash Ring (linear view):
0 ── NA ── NB ── NC ── ND ── 0
   ↑     ↑     ↑
   k1    k2    k3
Keys route to the next node clockwise:
k1 → NA   k2 → NB   k3 → NC
| Concept | Detail |
|---|---|
| Hash ring | Both nodes and keys are hashed to positions on a ring. Each key routes to the next node clockwise. |
| Adding a node | Only keys between the new node and its predecessor move. Minimal redistribution. |
| Virtual nodes | Each physical node maps to multiple ring positions. Improves balance. |
| Used in | Distributed caches, DynamoDB partitioning, Cassandra, CDN routing. |
The point is minimal disruption. Adding or removing a node only affects its immediate neighbors, not a full reshuffle.
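A sketch of a hash ring with virtual nodes, using `bisect` for the clockwise lookup; the node names and vnode count are illustrative:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Each physical node gets `vnodes` positions on the ring to improve balance.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # Route to the first ring position at or after the key's hash (clockwise),
        # wrapping back to the start of the ring if we fall off the end.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["NA", "NB", "NC", "ND"])
print(ring.node_for("k1"), ring.node_for("k2"), ring.node_for("k3"))
```

Removing a node only re-routes the keys that mapped to its ring positions; every other key stays where it was.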
CAP Theorem
| Choice | Behavior During Partition | Examples |
|---|---|---|
| CP (Consistency) | Reject requests rather than serve stale data | ZooKeeper, HBase, etcd |
| AP (Availability) | Serve requests, accept eventual consistency | Cassandra, DynamoDB (default), CouchDB |
CAP only forces a choice during network partitions. When the network is healthy, you get both C and A. Most production systems choose AP with tunable consistency: DynamoDB lets you choose strong or eventual per read, Cassandra lets you set consistency level per query.
Rate Limiting
| Algorithm | How It Works | Tradeoffs |
|---|---|---|
| Token bucket | Tokens added at fixed rate, each request costs one | Allows bursts up to bucket size. Most common. |
| Sliding window log | Store timestamp of each request, count within window | Precise, but high memory at scale |
| Sliding window counter | Weighted count from current + previous window | Memory-efficient approximation. Good enough for most cases. |
| Fixed window counter | Count requests per fixed time window (e.g., per minute) | Simplest. Spike at window boundaries (double rate across boundary). |
Token bucket is the standard choice. It handles bursts naturally and is what most API gateways implement. Rate limit by IP for anonymous traffic, by API key or user ID for authenticated traffic.
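A minimal in-process token bucket sketch; production limiters usually live in the gateway or in Redis so all instances share state. The rate and capacity are illustrative:

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate               # tokens added per second
        self.capacity = capacity       # maximum burst size
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill based on elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1           # each request costs one token
            return True
        return False

limiter = TokenBucket(rate=5, capacity=10)   # 5 req/s steady, bursts up to 10
print([limiter.allow() for _ in range(12)])  # typically the first 10 pass, then rejections
```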
Database Indexing
| Index Type | Best For | How It Works |
|---|---|---|
| B-tree | Range queries, sorted access | Balanced tree, O(log n). Default in most databases. |
| Hash | Exact-match lookups | O(1) lookup. No range queries. |
| Composite | Multi-column queries | Leftmost prefix rule: index on (a, b, c) supports (a), (a, b), (a, b, c). |
| Covering | Avoiding table lookups | Index includes all columns the query needs, no row fetch required. |
Indexes speed reads but slow writes. Every insert and update must update every relevant index. Don't over-index. Start with the queries you need to optimize, add indexes for those.
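A small demonstration of the leftmost prefix rule, using SQLite (stdlib `sqlite3`) as a stand-in; the table and column names are illustrative, and the exact `EXPLAIN QUERY PLAN` wording varies by SQLite version:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (user_id INT, status TEXT, created_at TEXT)")
conn.execute("CREATE INDEX idx_user_status ON orders (user_id, status)")

def plan(sql: str) -> str:
    # The last column of each EXPLAIN QUERY PLAN row is the plan detail.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Leftmost prefix (user_id) and full prefix (user_id, status) can use the index:
print(plan("SELECT * FROM orders WHERE user_id = 1"))
print(plan("SELECT * FROM orders WHERE user_id = 1 AND status = 'paid'"))
# Filtering on status alone skips the leading column, so it falls back to a scan:
print(plan("SELECT * FROM orders WHERE status = 'paid'"))
```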
Numbers to Know
Latency:
| Operation | Time |
|---|---|
| L1 cache reference | ~1 ns |
| L2 cache reference | ~4 ns |
| RAM access | ~100 ns |
| SSD random read | ~100 ΞΌs |
| HDD seek | ~10 ms |
| Same-datacenter round trip | ~0.5 ms |
| Cross-continent round trip | ~150 ms |
Throughput (order of magnitude):
| Component | Ballpark |
|---|---|
| Single web server | ~10K QPS |
| Single SQL database | ~10K QPS |
| Redis | ~100K QPS |
| Kafka (per broker) | ~100K msgs/sec |
Storage:
| Unit | Size |
|---|---|
| 1 tweet (text + metadata) | ~1 KB |
| 1 image (compressed) | ~200 KB |
| 1 minute of video (720p) | ~5 MB |
Time conversions for estimation:
| Period | Seconds |
|---|---|
| 1 day | ~100K |
| 1 month | ~2.5M |
| 1 year | ~30M |
Back-of-Envelope Estimation
The goal is order of magnitude, not precision. 2x off is fine. 100x off means you picked the wrong architecture.
Method:
- Start with users or requests (DAU, writes/day, reads/day)
- Estimate per-unit size or per-unit cost
- Multiply out: per second, per day, per year
- Round aggressively
Worked example: URL shortener storage
- 100M new URLs per month
- Each record: short code (7 B) + long URL (~200 B) + metadata (~50 B) ≈ 250 B
- Monthly storage: 100M x 250 B = 25 GB/month
- 5-year retention: 25 GB x 60 = 1.5 TB total
- Read QPS: 10:1 read/write → 1B reads/month → 1B / 2.5M sec ≈ 400 QPS
1.5 TB fits on a single machine. 400 QPS is trivially handled. This tells you the bottleneck isn't storage or throughput, it's latency (caching helps) and availability (replication helps).
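The same estimate as straight arithmetic, mirroring the rounded numbers in the bullets above:

```python
# URL shortener estimate, mirroring the worked example.
new_urls_per_month = 100_000_000
bytes_per_record = 250                                # ~7 + ~200 + ~50, rounded

monthly_storage_gb = new_urls_per_month * bytes_per_record / 1e9   # 25 GB/month
five_year_tb = monthly_storage_gb * 60 / 1e3                        # 1.5 TB total

reads_per_month = new_urls_per_month * 10             # 10:1 read/write ratio
read_qps = reads_per_month / 2_500_000                # ~2.5M seconds per month

print(f"{monthly_storage_gb:.0f} GB/month, {five_year_tb:.1f} TB over 5 years, ~{read_qps:.0f} read QPS")
```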
Common Patterns
Real-time Updates
| Approach | How It Works | Use When |
|---|---|---|
| WebSockets | Persistent bidirectional connection | Chat, collaborative editing, gaming |
| Server-Sent Events | Server pushes over HTTP, one-directional | Live feeds, notifications, dashboards |
| Long polling | Client sends request, server holds until data available | Fallback when WebSockets not supported |
For feeds and timelines, the key decision is fan-out strategy:
| Strategy | How It Works | Use When |
|---|---|---|
| Fan-out on write | Push updates to all subscriber inboxes at write time | Most users have small follower counts |
| Fan-out on read | Pull and merge updates at read time | Some users have millions of followers (celebrities) |
| Hybrid | Fan-out on write for normal users, fan-out on read for high-follower users | Twitter-scale systems |
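A toy sketch of both fan-out strategies, using in-memory dicts as stand-ins for the follower lists, per-user inboxes, and post store (a real system would use something like Redis lists or a wide-column store):

```python
from collections import defaultdict

followers = {"alice": ["bob", "carol"], "celebrity": [f"fan{i}" for i in range(5)]}
inboxes = defaultdict(list)          # precomputed timelines (fan-out on write)
posts_by_author = defaultdict(list)  # raw posts, pulled at read time

def post_fanout_on_write(author: str, post: str) -> None:
    # Push to every follower's inbox at write time: O(followers) per post,
    # but reading a timeline is a single inbox lookup.
    for follower in followers[author]:
        inboxes[follower].append(post)

def timeline_fanout_on_read(user: str, following: list[str]) -> list[str]:
    # Pull and merge at read time: cheap writes, expensive reads --
    # used for accounts with huge follower counts.
    return [p for author in following for p in posts_by_author[author]]

post_fanout_on_write("alice", "alice: hello")
posts_by_author["celebrity"].append("celebrity: big announcement")
print(inboxes["bob"])                                 # ['alice: hello']
print(timeline_fanout_on_read("bob", ["celebrity"]))  # ['celebrity: big announcement']
```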
Dealing with Contention
| Approach | How It Works | Use When |
|---|---|---|
| Optimistic locking | Read version, write with version check, retry on conflict | Low contention (most writes succeed) |
| Pessimistic locking | Acquire lock before read-modify-write | High contention (conflicts are expensive) |
| CAS (Compare-and-Swap) | Atomic conditional update at the DB level | Counters, inventory, simple state transitions |
| Queue writes | Serialize concurrent writes through a queue | Ordering matters, or writes need complex processing |
Default to optimistic locking. Switch to pessimistic or queuing when conflict rates are high enough that retries become wasteful.
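An optimistic-locking sketch with a version column, using SQLite (stdlib `sqlite3`) as a stand-in; the table and retry count are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (sku TEXT PRIMARY KEY, qty INT, version INT)")
conn.execute("INSERT INTO inventory VALUES ('widget', 10, 0)")

def reserve_one(sku: str, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        qty, version = conn.execute(
            "SELECT qty, version FROM inventory WHERE sku = ?", (sku,)).fetchone()
        if qty < 1:
            return False
        # Conditional update: succeeds only if nobody changed the row since we read it.
        cur = conn.execute(
            "UPDATE inventory SET qty = ?, version = ? WHERE sku = ? AND version = ?",
            (qty - 1, version + 1, sku, version))
        if cur.rowcount == 1:
            return True   # our version check held; the write committed
        # Another writer won the race; re-read and retry.
    return False

print(reserve_one("widget"))  # True: qty is now 9, version is 1
```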
Multi-step Processes
| Pattern | How It Works | Use When |
|---|---|---|
| Saga (choreography) | Each service emits events, next service reacts | Loosely coupled services, simple flows |
| Saga (orchestration) | Central coordinator directs each step | Complex flows, need visibility into process state |
| Two-phase commit | Coordinator asks all to prepare, then commit/abort | Strong consistency across services. Avoid if possible (slow, fragile). |
Idempotency is the foundation. Every step must be safe to retry. Use client-generated UUIDs as idempotency keys so retries don't create duplicates. Every compensating action (undo) must also be idempotent.
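A server-side idempotency-key sketch; the in-memory dict stands in for a persistent store, and `charge_card` is a hypothetical side-effecting step:

```python
import uuid

processed = {}  # idempotency_key -> stored result; a real system persists this

def charge_card(amount_cents: int) -> dict:
    """Stand-in for a side-effecting step (payment, email, shipment...)."""
    return {"charge_id": str(uuid.uuid4()), "amount": amount_cents}

def handle_payment(idempotency_key: str, amount_cents: int) -> dict:
    # If this key was already processed, return the saved result instead of
    # charging again -- a retried request becomes a harmless no-op.
    if idempotency_key in processed:
        return processed[idempotency_key]
    result = charge_card(amount_cents)
    processed[idempotency_key] = result
    return result

key = str(uuid.uuid4())            # client-generated, reused on every retry
first = handle_payment(key, 500)
retry = handle_payment(key, 500)
print(first == retry)              # True: no duplicate charge
```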
Scaling Reads
| Technique | How It Works | Tradeoff |
|---|---|---|
| Read replicas | Route reads to follower replicas | Replication lag (stale reads) |
| Caching | Cache hot data in Redis or Memcached | Invalidation complexity |
| CDN | Cache static/semi-static content at the edge | Only for cacheable content |
| Denormalization | Pre-join data at write time | Faster reads, harder writes |
| Materialized views | Precomputed query results, refreshed periodically | Stale between refreshes |
Scaling Writes
| Technique | How It Works | Tradeoff |
|---|---|---|
| Sharding | Partition data across nodes | Cross-shard queries, resharding pain |
| Write-ahead log | Append-only log, apply changes asynchronously | Sequential I/O is fast, crash recovery |
| Batching | Buffer writes, flush in bulk | Higher throughput, higher per-write latency |
| Async processing | Accept write into queue, process later | Fast ack to client, eventual consistency |
| Event sourcing | Store events as source of truth, derive state | Full audit trail, complex to query current state |
Handling Large Blobs
| Concern | Approach |
|---|---|
| Storage | Object store (S3, GCS). Never store blobs in your database. |
| Uploads | Presigned URLs for direct client-to-S3 upload. Chunked uploads for large files (resumable). |
| Serving | CDN in front of object store. Signed URLs for access control. |
| Processing | Async pipeline triggered by upload event (thumbnails, transcoding, virus scan). |
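A presigned-upload sketch with boto3; the bucket name, key, and expiry are illustrative, and AWS credentials must already be configured in the environment:

```python
import boto3  # assumes AWS credentials are available (env vars, profile, or role)

s3 = boto3.client("s3")

def presigned_upload_url(bucket: str, key: str, expires_seconds: int = 900) -> str:
    # The server signs the URL; the client then PUTs the file bytes straight
    # to S3, so large blobs never pass through the application servers.
    return s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_seconds,
    )

url = presigned_upload_url("my-media-bucket", "uploads/video-123.mp4")
# Client side: PUT the raw bytes to `url` before it expires (15 minutes here).
```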
Managing Long Running Tasks
| Concern | Approach |
|---|---|
| Dispatch | Message queue (SQS, Kafka) decouples producer from worker |
| Execution | Worker pool pulls from queue, processes independently |
| Reliability | Checkpointing for progress. Idempotent retries on failure. |
| Failure | Dead letter queue for messages that repeatedly fail. Alert on DLQ depth. |
| Visibility | Status tracking in DB. Status endpoint for clients to poll. |
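A worker-loop sketch with bounded retries and a dead-letter queue, using `queue.Queue` as a stand-in for SQS or Kafka; `process` is a hypothetical, idempotent handler:

```python
import queue

tasks: queue.Queue = queue.Queue()        # stand-in for SQS/Kafka
dead_letter: queue.Queue = queue.Queue()  # messages that repeatedly fail land here
MAX_ATTEMPTS = 3

def process(task: dict) -> None:
    """Hypothetical handler; must be idempotent so retries are safe."""
    if task.get("poison"):
        raise RuntimeError("cannot process this message")

def worker_loop() -> None:
    while not tasks.empty():
        task = tasks.get()
        try:
            process(task)
        except Exception:
            task["attempts"] = task.get("attempts", 0) + 1
            if task["attempts"] >= MAX_ATTEMPTS:
                dead_letter.put(task)     # alert on dead-letter queue depth
            else:
                tasks.put(task)           # requeue for another attempt
        finally:
            tasks.task_done()

tasks.put({"id": 1})
tasks.put({"id": 2, "poison": True})
worker_loop()
print(dead_letter.qsize())  # 1 -- the poison message ended up in the DLQ
```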
Key Technologies
| Technology | What It Is | Reach For It When |
|---|---|---|
| Redis | In-memory key-value store | Caching, rate limiting, leaderboards, pub/sub, session storage. Sub-ms reads. |
| Elasticsearch | Distributed search engine | Full-text search, log aggregation (ELK), faceted search, autocomplete. |
| Kafka | Distributed event streaming | Event-driven architecture, decoupling services, high-throughput messaging, replay. |
| API Gateway | Reverse proxy at the edge | Rate limiting, auth, routing, SSL termination, versioning. |
| Cassandra | Wide-column distributed DB | High write throughput, time-series, multi-DC replication. |
| DynamoDB | Managed key-value / document DB | Predictable single-digit ms latency at any scale, serverless backends. |
| PostgreSQL | Relational database | ACID, complex queries, joins. Default choice until you outgrow it. |
| Flink | Stream processing framework | Real-time aggregations, windowed computations, complex event processing. |
| ZooKeeper | Distributed coordination | Leader election, distributed locks, config management. Being replaced by etcd in newer systems. |
Resources:
- Hello Interview - System Design - Structured course with question breakdowns
- System Design Primer - Comprehensive open-source reference
- Designing Data-Intensive Applications - The foundational book on distributed systems