System Design Fundamentals

Design for Scale & Reliability

Scaling and reliability are what separate toy projects from production systems. Interviewers expect you to discuss these topics naturally, especially at L4+ levels.

Horizontal vs Vertical Scaling

| Approach | How | Pros | Cons |
|---|---|---|---|
| Vertical (scale up) | Bigger machine (more CPU, RAM) | Simple, no code changes | Hardware limits, single point of failure |
| Horizontal (scale out) | More machines | No hardware limit, fault tolerant | Code complexity, data consistency |

Rule of thumb: Start vertical for simplicity, go horizontal when you hit limits. In interviews, always design for horizontal scaling.

Database Sharding Strategies

| Strategy | How It Works | Best For |
|---|---|---|
| Range-based | Partition by value range (A-M, N-Z) | Time-series data, sequential access |
| Hash-based | `hash(key) % num_shards` | Even distribution |
| Consistent hashing | Virtual ring, minimal redistribution | Dynamic scaling |
| Geographic | Partition by region | Global applications |
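The hash-based row can be made concrete. A minimal sketch, assuming a fixed shard count and string keys; a stable digest is used instead of Python's built-in `hash()`, which is randomized per process for strings:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count for the example

def shard_for(key: str) -> int:
    """Route a key to a shard (hash-based sharding).

    MD5 is used only as a stable, well-distributed hash, not for security.
    """
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

The same key always routes to the same shard, and keys spread roughly evenly; the catch (shown next) is that changing `NUM_SHARDS` remaps almost every key.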

Consistent Hashing Deep Dive:

Traditional hashing (`hash(key) % N`) breaks when you add or remove servers: almost every key gets remapped. Consistent hashing places servers on a virtual ring, so when a server joins or leaves only about K/N keys are redistributed (K = total keys, N = servers).

Virtual Ring:

```
  Server A ──── Server B
  │                    │
  │    keys mapped     │
  │    to nearest      │
  │    server          │
  │    (clockwise)     │
  Server D ──── Server C
```
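The ring fits in a short class. This is an illustrative sketch, not a production implementation: MD5 as the point hash and 100 virtual nodes per server are arbitrary choices for the example.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, servers, vnodes=100):
        # Each server owns many points on the ring ("virtual nodes"),
        # which smooths out the key distribution.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        # Stable hash; MD5 is used only for its distribution, not security.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def lookup(self, key: str) -> str:
        # Walk clockwise to the nearest server point on the ring.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[idx][1]
```

Adding a fifth server only claims the keys that fall just before its points, so roughly K/N keys move rather than nearly all of them.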

Event-Driven Architecture

Instead of synchronous request-response, use events for loose coupling:

```
User Action → Event Producer → Event Bus (Kafka) → Event Consumers
                                                    ├── Analytics Service
                                                    ├── Notification Service
                                                    └── Search Indexer
```

Benefits: services are decoupled, each consumer scales independently, and new consumers can be added without changing producers.
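A toy in-memory bus shows the shape of the pattern; in practice the bus would be Kafka or similar and each consumer would be a separate service. The topic name and handlers here are made up:

```python
from collections import defaultdict

class EventBus:
    """Tiny in-memory stand-in for a real event bus like Kafka."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer doesn't know or care who is listening
        # (loose coupling); each consumer handles the event on its own.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
seen = []
bus.subscribe("user.signup", lambda e: seen.append(("analytics", e)))
bus.subscribe("user.signup", lambda e: seen.append(("notification", e)))
bus.publish("user.signup", {"user_id": 42})
```

Adding a search indexer later is just one more `subscribe` call; the producer is untouched.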

Reliability Patterns

Circuit Breaker

Prevents cascading failures when a downstream service is unhealthy:

| State | Behavior |
|---|---|
| Closed | Normal operation; requests pass through |
| Open | Downstream detected as unhealthy; requests fail immediately |
| Half-Open | Limited requests allowed through to test recovery |
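The state machine fits in a small class. A sketch with made-up thresholds (3 consecutive failures to open, 30 seconds before probing):

```python
import time

class CircuitBreaker:
    """Three-state circuit breaker: closed → open → half-open → closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # let a probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"
            return result
```

Failing fast while open is the point: callers get an immediate error instead of piling up requests against a dead dependency.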

Health Checks

| Type | What It Checks | Frequency |
|---|---|---|
| Liveness | Is the process running? | Every 10s |
| Readiness | Can it handle requests? | Every 5s |
| Deep | Are dependencies healthy? | Every 30s |
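A sketch of how the three levels might combine into one report. The probe callables and the "ready means all dependencies up" policy are assumptions for the example, not a standard:

```python
def health(deps):
    """Combine liveness, readiness, and deep checks into one report.

    `deps` maps dependency names to zero-argument callables that
    return True when that dependency is reachable (hypothetical probes).
    """
    dep_status = {name: probe() for name, probe in deps.items()}
    return {
        "live": True,                       # we got here, so the process runs
        "ready": all(dep_status.values()),  # serve traffic only when deps are up
        "deps": dep_status,                 # the deep-check detail
    }

report = health({"db": lambda: True, "cache": lambda: False})
```

A load balancer would poll something like this over HTTP and pull the instance out of rotation while `ready` is false.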

Retry with Exponential Backoff

```
Attempt 1: Wait 1 second
Attempt 2: Wait 2 seconds
Attempt 3: Wait 4 seconds
Attempt 4: Wait 8 seconds (+ random jitter)
Give up after max retries → Dead Letter Queue
```
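The schedule is `base * 2^attempt` plus jitter. A sketch with an injectable `sleep` so the delays can be inspected without actually waiting:

```python
import random
import time

def retry_with_backoff(fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call `fn`, retrying with exponential backoff plus random jitter.

    Delays follow base_delay * 2**attempt (1s, 2s, 4s, ...); jitter keeps
    many failing clients from retrying in lockstep ("thundering herd").
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up; in production, route to a dead letter queue
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))  # add jitter
```

Injecting `sleep` also makes the policy unit-testable, which is why it is a parameter rather than a hard-coded call.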

SLOs, SLIs, and SLAs

| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | Metric you measure | 99.2% of requests complete in <200ms |
| SLO (Service Level Objective) | Target you aim for | 99.9% availability per month |
| SLA (Service Level Agreement) | Contract with customers | 99.95% uptime or credits issued |

The Nines:

| Availability | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
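The table rows come straight from arithmetic on the unavailability fraction; a quick sketch of the calculation (using a 365-day year, month = year/12):

```python
def downtime_budget(availability: float) -> dict:
    """Convert an availability target into downtime (error) budgets."""
    unavailable = 1.0 - availability
    year_seconds = 365 * 24 * 3600
    month_seconds = year_seconds / 12
    return {
        "per_year_hours": unavailable * year_seconds / 3600,
        "per_month_minutes": unavailable * month_seconds / 60,
    }

budget = downtime_budget(0.999)  # "three nines"
```

This framing is useful in interviews: every extra nine divides the error budget by ten, which is why 99.999% is so much harder than 99.9%.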

CQRS (Command Query Responsibility Segregation)

Separate read and write models when they have different requirements:

```
Write Path: API → Command Handler → Write DB (PostgreSQL)
                                         ↓ (Event)
Read Path:  API → Query Handler → Read DB (Elasticsearch/Redis)
```

When to use: Read and write patterns are very different (e.g., social media: few writes, many reads with complex queries).
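A minimal sketch of the split, with an in-memory list standing in for the write DB and a toy keyword index standing in for the read store; all names are hypothetical:

```python
class PostService:
    """CQRS sketch: commands hit the write store, an event updates
    a denormalized read model optimized for queries."""

    def __init__(self):
        self.write_db = []    # stands in for PostgreSQL
        self.read_index = {}  # stands in for Elasticsearch/Redis

    def create_post(self, post_id, text):
        # Command (write path): append to the source of truth,
        # then propagate an event to the read side.
        self.write_db.append({"id": post_id, "text": text})
        self._on_post_created(post_id, text)

    def _on_post_created(self, post_id, text):
        # Projector: build a query-optimized view (toy keyword index).
        for word in text.lower().split():
            self.read_index.setdefault(word, set()).add(post_id)

    def search(self, word):
        # Query (read path): never touches the write store.
        return self.read_index.get(word.lower(), set())
```

In a real system the event would flow through a bus and the projector would run in another service, so reads and writes can scale and fail independently.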

Interview Tip: When discussing any design, always mention: "What happens when this component fails?" This shows you think about reliability proactively.


With system design covered, let's move to coding round mastery: the practical skills for performing under pressure.

Quiz

Module 4 Quiz: System Design Fundamentals
