System Design Fundamentals

Design for Scale & Reliability

Scaling and reliability are what separate toy projects from production systems. Interviewers expect you to discuss these topics naturally, especially at L4+ levels.

Horizontal vs Vertical Scaling

| Approach | How | Pros | Cons |
|---|---|---|---|
| Vertical (scale up) | Bigger machine (more CPU, RAM) | Simple, no code changes | Hardware limits, single point of failure |
| Horizontal (scale out) | More machines | No hardware limit, fault tolerant | Code complexity, data consistency |

Rule of thumb: Start vertical for simplicity, go horizontal when you hit limits. In interviews, always design for horizontal scaling.

Database Sharding Strategies

| Strategy | How It Works | Best For |
|---|---|---|
| Range-based | Partition by value range (A-M, N-Z) | Time-series data, sequential access |
| Hash-based | `hash(key) % num_shards` | Even distribution |
| Consistent hashing | Virtual ring, minimal redistribution | Dynamic scaling |
| Geographic | Partition by region | Global applications |
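The hash-based row can be made concrete. A minimal sketch, assuming a fixed shard count and string keys; a stable digest is used instead of Python's built-in `hash()`, which is randomized per process for strings:

```python
import hashlib

NUM_SHARDS = 4  # hypothetical shard count for the example

def shard_for(key: str) -> int:
    """Route a key to a shard (hash-based sharding).

    MD5 is used only as a stable, well-distributed hash, not for security.
    """
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```

The same key always routes to the same shard, and keys spread roughly evenly; the catch (shown next) is that changing `NUM_SHARDS` remaps almost every key.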

Consistent Hashing Deep Dive:

Traditional hashing (`hash(key) % N`) breaks when you add or remove servers: almost every key gets remapped. Consistent hashing places servers on a virtual ring, so when a server joins or leaves only about K/N keys are redistributed (K = total keys, N = servers).

Virtual Ring:

```
  Server A ──── Server B
  │                    │
  │    keys mapped     │
  │    to nearest      │
  │    server          │
  │    (clockwise)     │
  Server D ──── Server C
```
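The ring fits in a short class. This is an illustrative sketch, not a production implementation: MD5 as the point hash and 100 virtual nodes per server are arbitrary choices for the example.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes."""

    def __init__(self, servers, vnodes=100):
        # Each server owns many points on the ring ("virtual nodes"),
        # which smooths out the key distribution.
        self._ring = sorted(
            (self._hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        # Stable hash; MD5 is used only for its distribution, not security.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def lookup(self, key: str) -> str:
        # Walk clockwise to the nearest server point on the ring.
        idx = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._ring[idx][1]
```

Adding a fifth server only claims the keys that fall just before its points, so roughly K/N keys move rather than nearly all of them.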

Event-Driven Architecture

Instead of synchronous request-response, use events for loose coupling:

```
User Action → Event Producer → Event Bus (Kafka) → Event Consumers
                                                    ├── Analytics Service
                                                    ├── Notification Service
                                                    └── Search Indexer
```

Benefits: services are decoupled, each consumer scales independently, and new consumers can be added without changing producers.
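A toy in-memory bus shows the shape of the pattern; in practice the bus would be Kafka or similar and each consumer would be a separate service. The topic name and handlers here are made up:

```python
from collections import defaultdict

class EventBus:
    """Tiny in-memory stand-in for a real event bus like Kafka."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # The producer doesn't know or care who is listening
        # (loose coupling); each consumer handles the event on its own.
        for handler in self._subscribers[topic]:
            handler(event)

bus = EventBus()
seen = []
bus.subscribe("user.signup", lambda e: seen.append(("analytics", e)))
bus.subscribe("user.signup", lambda e: seen.append(("notification", e)))
bus.publish("user.signup", {"user_id": 42})
```

Adding a search indexer later is just one more `subscribe` call; the producer is untouched.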

Reliability Patterns

Circuit Breaker

Prevents cascading failures when a downstream service is unhealthy:

| State | Behavior |
|---|---|
| Closed | Normal operation; requests pass through |
| Open | Downstream detected as unhealthy; requests fail immediately |
| Half-Open | Limited requests allowed through to test recovery |
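The state machine fits in a small class. A sketch with made-up thresholds (3 consecutive failures to open, 30 seconds before probing):

```python
import time

class CircuitBreaker:
    """Three-state circuit breaker: closed → open → half-open → closed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "open":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "half-open"  # let a probe request through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "closed"
            return result
```

Failing fast while open is the point: callers get an immediate error instead of piling up requests against a dead dependency.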

Health Checks

| Type | What It Checks | Frequency |
|---|---|---|
| Liveness | Is the process running? | Every 10s |
| Readiness | Can it handle requests? | Every 5s |
| Deep | Are dependencies healthy? | Every 30s |
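A sketch of how the three levels might combine into one report. The probe callables and the "ready means all dependencies up" policy are assumptions for the example, not a standard:

```python
def health(deps):
    """Combine liveness, readiness, and deep checks into one report.

    `deps` maps dependency names to zero-argument callables that
    return True when that dependency is reachable (hypothetical probes).
    """
    dep_status = {name: probe() for name, probe in deps.items()}
    return {
        "live": True,                       # we got here, so the process runs
        "ready": all(dep_status.values()),  # serve traffic only when deps are up
        "deps": dep_status,                 # the deep-check detail
    }

report = health({"db": lambda: True, "cache": lambda: False})
```

A load balancer would poll something like this over HTTP and pull the instance out of rotation while `ready` is false.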

Retry with Exponential Backoff

```
Attempt 1: Wait 1 second
Attempt 2: Wait 2 seconds
Attempt 3: Wait 4 seconds
Attempt 4: Wait 8 seconds (+ random jitter)
Give up after max retries → Dead Letter Queue
```
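The schedule is `base * 2^attempt` plus jitter. A sketch with an injectable `sleep` so the delays can be inspected without actually waiting:

```python
import random
import time

def retry_with_backoff(fn, max_retries=4, base_delay=1.0, sleep=time.sleep):
    """Call `fn`, retrying with exponential backoff plus random jitter.

    Delays follow base_delay * 2**attempt (1s, 2s, 4s, ...); jitter keeps
    many failing clients from retrying in lockstep ("thundering herd").
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up; in production, route to a dead letter queue
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))  # add jitter
```

Injecting `sleep` also makes the policy unit-testable, which is why it is a parameter rather than a hard-coded call.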

SLOs, SLIs, and SLAs

| Term | Definition | Example |
|---|---|---|
| SLI (Service Level Indicator) | Metric you measure | 99.2% of requests complete in <200ms |
| SLO (Service Level Objective) | Target you aim for | 99.9% availability per month |
| SLA (Service Level Agreement) | Contract with customers | 99.95% uptime or credits issued |

The Nines:

| Availability | Downtime/Year | Downtime/Month |
|---|---|---|
| 99% | 3.65 days | 7.3 hours |
| 99.9% | 8.76 hours | 43.8 minutes |
| 99.99% | 52.6 minutes | 4.38 minutes |
| 99.999% | 5.26 minutes | 26.3 seconds |
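The table rows come straight from arithmetic on the unavailability fraction; a quick sketch of the calculation (using a 365-day year, month = year/12):

```python
def downtime_budget(availability: float) -> dict:
    """Convert an availability target into downtime (error) budgets."""
    unavailable = 1.0 - availability
    year_seconds = 365 * 24 * 3600
    month_seconds = year_seconds / 12
    return {
        "per_year_hours": unavailable * year_seconds / 3600,
        "per_month_minutes": unavailable * month_seconds / 60,
    }

budget = downtime_budget(0.999)  # "three nines"
```

This framing is useful in interviews: every extra nine divides the error budget by ten, which is why 99.999% is so much harder than 99.9%.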

CQRS (Command Query Responsibility Segregation)

Separate read and write models when they have different requirements:

```
Write Path: API → Command Handler → Write DB (PostgreSQL)
                                         ↓ (Event)
Read Path:  API → Query Handler → Read DB (Elasticsearch/Redis)
```

When to use: Read and write patterns are very different (e.g., social media: few writes, many reads with complex queries).
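A minimal sketch of the split, with an in-memory list standing in for the write DB and a toy keyword index standing in for the read store; all names are hypothetical:

```python
class PostService:
    """CQRS sketch: commands hit the write store, an event updates
    a denormalized read model optimized for queries."""

    def __init__(self):
        self.write_db = []    # stands in for PostgreSQL
        self.read_index = {}  # stands in for Elasticsearch/Redis

    def create_post(self, post_id, text):
        # Command (write path): append to the source of truth,
        # then propagate an event to the read side.
        self.write_db.append({"id": post_id, "text": text})
        self._on_post_created(post_id, text)

    def _on_post_created(self, post_id, text):
        # Projector: build a query-optimized view (toy keyword index).
        for word in text.lower().split():
            self.read_index.setdefault(word, set()).add(post_id)

    def search(self, word):
        # Query (read path): never touches the write store.
        return self.read_index.get(word.lower(), set())
```

In a real system the event would flow through a bus and the projector would run in another service, so reads and writes can scale and fail independently.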

Interview Tip: When discussing any design, always mention: "What happens when this component fails?" This shows you think about reliability proactively.


With system design covered, let's move to coding round mastery: the practical skills for performing under pressure.

Quiz

Module 4 Quiz: System Design Fundamentals
