System Design Fundamentals
Design for Scale & Reliability
Scaling and reliability are what separate toy projects from production systems. Interviewers expect you to discuss these topics naturally, especially at L4+ levels.
Horizontal vs Vertical Scaling
| Approach | How | Pros | Cons |
|---|---|---|---|
| Vertical (scale up) | Bigger machine (more CPU, RAM) | Simple, no code changes | Hardware limits, single point of failure |
| Horizontal (scale out) | More machines | No hardware limit, fault tolerant | Code complexity, data consistency |
Rule of thumb: Start vertical for simplicity, go horizontal when you hit limits. In interviews, always design for horizontal scaling.
Database Sharding Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Range-based | Partition by value range (A-M, N-Z) | Time-series data, sequential access |
| Hash-based | hash(key) % num_shards | Even distribution |
| Consistent hashing | Virtual ring, minimize redistribution | Dynamic scaling |
| Geographic | Partition by region | Global applications |
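To make the routing concrete, here is a minimal sketch of hash-based and range-based shard selection. The shard count, letter ranges, and key format are illustrative only, not tied to any particular database.

```python
# Sketch of two shard-routing strategies; NUM_SHARDS and RANGES are illustrative.
import hashlib

NUM_SHARDS = 4

def hash_shard(key: str) -> int:
    """Hash-based: even distribution, but naive resharding moves most keys."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

RANGES = [("a", "f"), ("g", "m"), ("n", "s"), ("t", "z")]

def range_shard(key: str) -> int:
    """Range-based: keeps adjacent keys together, good for sequential scans."""
    first = key[0].lower()
    for i, (lo, hi) in enumerate(RANGES):
        if lo <= first <= hi:
            return i
    return 0  # fallback for keys outside a-z

print(hash_shard("user:42"), range_shard("alice"))
```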
Consistent Hashing Deep Dive:
Traditional hashing (hash % N) breaks when you add/remove servers -- almost all keys get remapped. Consistent hashing uses a virtual ring where only K/N keys are redistributed (K = total keys, N = servers).
Virtual Ring:
```
    Server A ─────────── Server B
        │                    │
        │   keys mapped      │
        │   to nearest       │
        │   server (CW)      │
        │                    │
    Server D ─────────── Server C
```
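Below is a minimal consistent-hash ring in Python using virtual nodes. The replica count, MD5 hashing, and class name are assumptions made for illustration, not a specific library's API.

```python
# A minimal consistent-hash ring with virtual nodes (a sketch, not production code).
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, replicas: int = 100):
        self.replicas = replicas      # virtual nodes per server, smooths distribution
        self.ring = {}                # hash position -> server name
        self.sorted_hashes = []       # sorted ring positions for binary search

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_server(self, server: str) -> None:
        for i in range(self.replicas):
            h = self._hash(f"{server}#{i}")
            self.ring[h] = server
            bisect.insort(self.sorted_hashes, h)

    def remove_server(self, server: str) -> None:
        for i in range(self.replicas):
            h = self._hash(f"{server}#{i}")
            del self.ring[h]
            self.sorted_hashes.remove(h)

    def get_server(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's hash.
        h = self._hash(key)
        idx = bisect.bisect(self.sorted_hashes, h) % len(self.sorted_hashes)
        return self.ring[self.sorted_hashes[idx]]

ring = ConsistentHashRing()
for name in ["A", "B", "C", "D"]:
    ring.add_server(name)
print(ring.get_server("user:42"))  # adding/removing one server remaps only ~K/N keys
```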
Event-Driven Architecture
Instead of synchronous request-response, use events for loose coupling:
```
User Action → Event Producer → Event Bus (Kafka) → Event Consumers
                                                     ├── Analytics Service
                                                     ├── Notification Service
                                                     └── Search Indexer
```
Benefits: Services are independent, can scale separately, easier to add new consumers.
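To illustrate the decoupling, here is a tiny in-memory event bus sketch. In production the bus would be Kafka or similar; the topic name and handlers here are made up for the example.

```python
# In-memory stand-in for the event bus above; Kafka would play this role in production.
from collections import defaultdict
from typing import Callable

class EventBus:
    def __init__(self):
        self.subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[dict], None]) -> None:
        self.subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> None:
        # The producer never calls consumers directly -- that is the loose coupling.
        for handler in self.subscribers[topic]:
            handler(event)

bus = EventBus()
bus.subscribe("order.created", lambda e: print("analytics:", e))
bus.subscribe("order.created", lambda e: print("notify user", e["user_id"]))
bus.publish("order.created", {"order_id": 1, "user_id": 42})
```

Adding a new consumer is just another `subscribe` call; the producer code never changes.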
Reliability Patterns
Circuit Breaker
Prevents cascading failures when a downstream service is unhealthy:
| State | Behavior |
|---|---|
| Closed | Normal operation, requests pass through |
| Open | Service detected as down, requests fail immediately |
| Half-Open | Allow limited requests to test recovery |
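A minimal sketch of the three states in code, assuming a simple failure-count threshold and a fixed recovery timeout (both values are illustrative):

```python
# Minimal three-state circuit breaker; thresholds and timeouts are illustrative.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.recovery_timeout:
                self.state = "half-open"          # allow a probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"               # trip: fail fast until the timeout
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"                 # success: resume normal operation
            return result
```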
Health Checks
| Type | What It Checks | Frequency |
|---|---|---|
| Liveness | Is the process running? | Every 10s |
| Readiness | Can it handle requests? | Every 5s |
| Deep | Are dependencies healthy? | Every 30s |
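As a sketch, the three check types might look like the functions below (typically exposed behind HTTP endpoints such as `/healthz`). The database host, port, and startup flag are hypothetical placeholders.

```python
# Sketch of the three check types; host, port, and STARTUP_COMPLETE are hypothetical.
import socket

STARTUP_COMPLETE = True  # set once caches are warmed and config is loaded

def liveness() -> bool:
    """Is the process alive? Kept trivial so a slow dependency never kills the instance."""
    return True

def readiness(startup_complete: bool = STARTUP_COMPLETE) -> bool:
    """Can this instance take traffic right now (startup finished, not draining)?"""
    return startup_complete

def deep_check(db_host: str = "db.internal", db_port: int = 5432) -> bool:
    """Are critical dependencies reachable? Run sparingly -- it is the most expensive check."""
    try:
        with socket.create_connection((db_host, db_port), timeout=1):
            return True
    except OSError:
        return False
```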
Retry with Exponential Backoff
```
Attempt 1: Wait 1 second
Attempt 2: Wait 2 seconds
Attempt 3: Wait 4 seconds
Attempt 4: Wait 8 seconds (+ random jitter)
Give up after max retries → Dead Letter Queue
```
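A small retry helper matching that schedule might look like the sketch below. The delays mirror the list above; the blanket exception handling and helper name are simplifications for illustration.

```python
# Retry helper following the schedule above: 1s, 2s, 4s, 8s plus random jitter.
import random
import time

def retry_with_backoff(fn, max_retries: int = 4, base_delay: float = 1.0):
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # caller forwards the failure to a dead letter queue
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)  # exponential + jitter
            time.sleep(delay)
```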
SLOs, SLIs, and SLAs
| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | Metric you measure | 99.2% of requests complete in <200ms |
| SLO (Service Level Objective) | Target you aim for | 99.9% availability per month |
| SLA (Service Level Agreement) | Contract with customers | 99.95% uptime or credits issued |
The Nines:
| Availability | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
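The table is just arithmetic on the availability target. A quick sketch of the calculation, using an average month of year/12 to match the figures above:

```python
# Downtime budget = (1 - availability) × period; the table uses an average month (year / 12).
def downtime_budget(availability: float) -> tuple[float, float]:
    """Return (hours per year, minutes per average month) allowed by an availability target."""
    per_year_hours = (1 - availability) * 365 * 24
    per_month_minutes = per_year_hours * 60 / 12
    return per_year_hours, per_month_minutes

print(downtime_budget(0.999))   # ≈ (8.76 hours/year, 43.8 minutes/month)
```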
CQRS (Command Query Responsibility Segregation)
Separate read and write models when they have different requirements:
```
Write Path: API → Command Handler → Write DB (PostgreSQL)
                                          ↓ (Event)
Read Path:  API → Query Handler   → Read DB (Elasticsearch/Redis)
```
When to use: Read and write patterns are very different (e.g., social media: few writes, many reads with complex queries).
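A toy sketch of the split, with in-memory stand-ins for the write store and read model (all class and field names are invented for illustration):

```python
# Toy CQRS sketch: a command writes to one store, and an event projects into a read model.
class WriteStore:                       # stands in for the write DB (e.g., PostgreSQL)
    def __init__(self):
        self.posts = {}

class ReadModel:                        # stands in for the read DB (e.g., Elasticsearch/Redis)
    def __init__(self):
        self.by_author = {}

    def apply(self, event: dict) -> None:
        # Project the write-side event into a shape optimized for queries.
        self.by_author.setdefault(event["author"], []).append(event["title"])

def handle_create_post(write: WriteStore, read: ReadModel,
                       post_id: int, author: str, title: str) -> None:
    write.posts[post_id] = {"author": author, "title": title}            # command path
    read.apply({"post_id": post_id, "author": author, "title": title})   # event to read path

def titles_by_author(read: ReadModel, author: str) -> list[str]:
    return read.by_author.get(author, [])                                # query path
```

In a real system the projection would be asynchronous (via the event bus), so the read model is eventually consistent with the write store.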
Interview Tip: When discussing any design, always mention: "What happens when this component fails?" This shows you think about reliability proactively.
With system design covered, let's move to coding round mastery -- the practical skills for performing under pressure. :::