Backend System Design

Scaling Patterns & Data Infrastructure

Every system design interview eventually reaches the question: "How does this scale to 10x or 100x traffic?" This lesson covers the scaling patterns, database strategies, and specialized data infrastructure you need to answer confidently.

Horizontal vs Vertical Scaling

| Aspect | Vertical (Scale Up) | Horizontal (Scale Out) |
|---|---|---|
| Method | Bigger machine (more CPU, RAM, disk) | More machines |
| Limit | Hardware ceiling (largest available instance) | Virtually unlimited |
| Complexity | Simple (no code changes) | Complex (distributed state, coordination) |
| Cost curve | Exponential (2x CPU != 2x cost) | Linear (2x servers ≈ 2x cost) |
| Downtime | Requires restart to resize | Zero-downtime with load balancer |
| Best for | Databases (before sharding), small services | Stateless API servers, read replicas |

Interview rule of thumb: Start vertical for simplicity, then go horizontal when you hit limits. Stateless services (API servers) scale horizontally trivially. Stateful services (databases) require careful planning.

Database Read Replicas

The simplest scaling pattern for read-heavy workloads. One primary (master) handles all writes, and multiple replicas handle reads.

                    ┌──────────────┐
                    │   Primary    │
        Writes ────▶│   (Master)   │
                    └──────┬───────┘
                           │ Replication
              ┌────────────┼────────────┐
              ▼            ▼            ▼
       ┌──────────┐ ┌──────────┐ ┌──────────┐
       │ Replica 1│ │ Replica 2│ │ Replica 3│
       └──────────┘ └──────────┘ └──────────┘
              ▲            ▲            ▲
              └────────────┼────────────┘
                    Reads (load balanced)

Replication Lag

Replicas are eventually consistent. Typical lag is 10-100ms, but can spike under heavy write load. Handle this in your application:

# primary_db and replica_db are assumed to be connection handles configured elsewhere.
def get_user_profile(user_id: str, just_updated: bool = False) -> dict:
    if just_updated:
        # Read from primary to guarantee seeing the latest write
        return primary_db.query("SELECT * FROM users WHERE id = %s", user_id)
    else:
        # Normal reads go to replica
        return replica_db.query("SELECT * FROM users WHERE id = %s", user_id)

Write-Ahead Logging (WAL)

WAL is the mechanism databases use to ensure durability and crash recovery. Understanding it shows deep database knowledge in interviews.

How WAL works:

  1. Before modifying a data page, the database appends a record describing the change to a sequential log file (the WAL)
  2. The WAL record is flushed to disk (fsync) before the transaction is acknowledged as committed
  3. The data page is then modified in memory (the buffer pool)
  4. Dirty data pages are flushed to disk later in batches (checkpoints)

If the database crashes:

  • Uncommitted changes in memory are lost, but the WAL is on disk
  • On restart, the database replays the WAL to reconstruct the state
  • No data that was committed (WAL flushed) is ever lost

PostgreSQL uses its WAL for crash recovery, streaming replication to replicas, point-in-time recovery (PITR), and logical decoding (the mechanism behind Debezium CDC).
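
To make the mechanism concrete, here is a minimal, hypothetical sketch of the WAL idea in Python: an append-only log that is fsynced before the in-memory state changes, and replayed on restart. Real databases add binary record formats, page management, and checkpointing on top of this.

import json
import os

class TinyWAL:
    """Toy write-ahead log: append + fsync the change, then apply it in memory."""

    def __init__(self, path: str = "wal.log"):
        self.path = path
        self.state = {}      # stands in for the in-memory data pages
        self._recover()

    def _recover(self) -> None:
        # On restart, replay every logged change to rebuild the in-memory state.
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            for line in f:
                record = json.loads(line)
                self.state[record["key"]] = record["value"]

    def put(self, key: str, value: str) -> None:
        record = {"key": key, "value": value}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())   # durability point: the change is now on disk
        self.state[key] = value    # only then mutate the in-memory state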

Sharding Strategies

When a single database server cannot handle the write volume or data size, you split data across multiple servers (shards). This is one of the most complex decisions in system design.

Strategy Comparison

| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Range-based | Shard by key range (e.g., A-M on shard 1, N-Z on shard 2) | Simple range queries, easy to understand | Hot partitions (some ranges much busier) |
| Hash-based | shard = hash(key) % num_shards | Even distribution | Range queries broken (must scatter-gather) |
| Directory-based | Lookup table maps each key to its shard | Flexible reassignment | Lookup adds latency, directory is SPOF |

Hash-Based Sharding Example

import hashlib

NUM_SHARDS = 8

def get_shard(user_id: str) -> int:
    """Determine which shard holds this user's data."""
    hash_value = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return hash_value % NUM_SHARDS

def query_user(user_id: str) -> dict:
    shard_id = get_shard(user_id)
    # shard_connections: shard id -> database connection mapping, assumed configured elsewhere
    connection = shard_connections[shard_id]
    return connection.query("SELECT * FROM users WHERE id = %s", user_id)

Hot Partition Detection and Mitigation

Some keys receive disproportionate traffic (a celebrity's profile, a viral post). Strategies to mitigate:

  1. Consistent hashing with virtual nodes — spreads load more evenly across physical servers (see the sketch after this list)
  2. Salting hot keys — append a random suffix (celebrity_123_0, celebrity_123_1, ... celebrity_123_9) and scatter across 10 shards, then aggregate reads
  3. Dedicated shard for hot data — move detected hot keys to an isolated, beefier shard
  4. Caching layer — cache hot keys in Redis so they never hit the database
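
As an illustration of strategy 1, here is a minimal consistent-hashing sketch; the node names and virtual-node count are made up for the example, and production systems usually rely on a library or the datastore's own ring implementation.

import bisect
import hashlib

NODES = ["db-1", "db-2", "db-3"]   # hypothetical physical shard servers
VNODES_PER_NODE = 100              # more virtual nodes = smoother distribution

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# Build the ring: each physical node appears many times at pseudo-random positions.
_ring = sorted(
    (_hash(f"{node}#{i}"), node)
    for node in NODES
    for i in range(VNODES_PER_NODE)
)
_ring_keys = [position for position, _ in _ring]

def get_node(key: str) -> str:
    """Walk clockwise from the key's hash to the next virtual node on the ring."""
    idx = bisect.bisect(_ring_keys, _hash(key)) % len(_ring)
    return _ring[idx][1]

Adding or removing a physical node now remaps only the keys that land on that node's virtual-node segments, instead of reshuffling almost every key the way hash(key) % num_shards does.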

Multi-Tenant Data Partitioning

SaaS applications serve multiple tenants (customers). The isolation strategy depends on scale, compliance, and cost:

| Strategy | Description | Isolation | Cost | Best For |
|---|---|---|---|---|
| Shared DB, shared schema | All tenants in same tables, tenant_id column | Lowest | Cheapest | Small tenants, startups |
| Shared DB, separate schemas | Each tenant gets their own schema in one DB | Medium | Low | Medium tenants, moderate data |
| Dedicated schema per tenant | Separate schema, shared server | Good | Medium | Regulated industries |
| Dedicated DB per tenant | Each tenant gets their own database server | Highest | Most expensive | Enterprise, strict compliance |

-- Shared schema approach: every query includes tenant_id
-- Use Row Level Security (RLS) in PostgreSQL for enforcement

ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

CREATE POLICY tenant_isolation ON orders
    USING (tenant_id = current_setting('app.current_tenant')::uuid);

-- Application sets tenant context per request
SET app.current_tenant = 'tenant-abc-123';
SELECT * FROM orders;  -- Only sees tenant-abc-123's orders
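
On the application side, one way to set that per-request tenant context is sketched below. This assumes psycopg2 and an illustrative connection string, and uses PostgreSQL's set_config() so the tenant id can be bound as an ordinary query parameter.

import psycopg2

def run_tenant_query(tenant_id: str, sql: str, params: tuple = ()):
    """Run a query with the RLS tenant context set for the current transaction."""
    conn = psycopg2.connect("dbname=app")   # connection details are illustrative
    try:
        with conn, conn.cursor() as cur:
            # The third argument 'true' scopes the setting to this transaction only.
            cur.execute("SELECT set_config('app.current_tenant', %s, true)", (tenant_id,))
            cur.execute(sql, params)
            return cur.fetchall()
    finally:
        conn.close()

orders = run_tenant_query("tenant-abc-123", "SELECT * FROM orders")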

Blob Storage

Large binary objects (images, videos, documents) should never live in your relational database. Use object storage.

| Service | Provider | Key Features |
|---|---|---|
| S3 | AWS | 11 nines durability, lifecycle rules, event notifications |
| Cloud Storage | GCP | Nearline/Coldline tiers, strong consistency |
| Azure Blob | Azure | Hot/Cool/Archive tiers, immutable storage |

Pre-Signed URLs Pattern

import boto3

def generate_upload_url(file_key: str, content_type: str) -> str:
    """Generate a pre-signed URL for direct client-to-S3 upload."""
    s3 = boto3.client("s3")
    url = s3.generate_presigned_url(
        "put_object",
        Params={
            "Bucket": "my-app-uploads",
            "Key": file_key,
            "ContentType": content_type,
        },
        ExpiresIn=300,  # 5 minutes
    )
    return url

# Flow:
# 1. Client requests upload URL from API server
# 2. API server generates pre-signed URL and returns it
# 3. Client uploads directly to S3 (bypasses API server)
# 4. S3 triggers event notification → process/thumbnail/CDN

This pattern keeps large files off your API servers, preventing bandwidth bottlenecks and memory pressure.
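
On the client side, the returned URL can be used directly with any HTTP library; this short sketch assumes the requests library and an example image file.

import requests

upload_url = generate_upload_url("avatars/user-123.png", "image/png")

with open("avatar.png", "rb") as f:
    # The Content-Type must match what the URL was signed with, or S3 rejects the upload.
    response = requests.put(upload_url, data=f, headers={"Content-Type": "image/png"})

response.raise_for_status()   # a 2xx status means the object is now in S3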

Search Infrastructure

When SQL LIKE '%search_term%' is too slow (at scale it always is, because a leading wildcard prevents index use), you need a dedicated search engine.

Elasticsearch Core Concepts

| Concept | Description |
|---|---|
| Inverted index | Maps every unique word to the list of documents containing it — O(1) lookup |
| Mapping | Schema definition: which fields are text (full-text searchable) vs keyword (exact match) |
| Analyzer | Tokenizer + filters: "Running quickly" → ["run", "quick"] (stemming + lowercasing) |
| Relevance scoring | TF-IDF or BM25: documents with rare matching terms score higher |
| Near-real-time | Documents are searchable within ~1 second of indexing (refresh interval) |
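
The inverted index is the core idea. A toy version in Python, without the stemming, relevance scoring, or persistence a real engine adds, looks roughly like this:

from collections import defaultdict

documents = {
    1: "horizontal scaling adds more machines",
    2: "vertical scaling means a bigger machine",
    3: "read replicas help scale read heavy workloads",
}

# Inverted index: token -> set of document ids containing that token.
index = defaultdict(set)
for doc_id, text in documents.items():
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query: str) -> set:
    """Return ids of documents containing every query token (AND semantics)."""
    hits = [index.get(token, set()) for token in query.lower().split()]
    return set.intersection(*hits) if hits else set()

print(search("scaling machine"))   # {2}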

When to Use Elasticsearch vs SQL

| Use Elasticsearch When | Stick with SQL When |
|---|---|
| Full-text search across multiple fields | Exact lookups by primary key |
| Fuzzy matching, synonyms, stemming | Simple WHERE column = value queries |
| Faceted search (filter by category + price + rating) | Data fits in memory, < 1M records |
| Autocomplete/typeahead suggestions | Strong consistency required |
| Log aggregation and analysis | No budget for additional infrastructure |

Time-Series Data

Metrics, IoT sensor data, application logs, and financial ticks all share the pattern: write-heavy, append-only, queried by time range.

| Database | Type | Key Feature |
|---|---|---|
| InfluxDB | Purpose-built time-series | InfluxQL/Flux query language, built-in downsampling |
| TimescaleDB | PostgreSQL extension | Full SQL support, hypertables auto-partition by time |
| ClickHouse | Column-oriented analytics | Extremely fast aggregation queries |

Retention and Downsampling

Raw data at 1-second granularity consumes enormous storage. Use tiered retention:

Raw data (1-sec resolution):     keep 7 days
5-minute aggregates:             keep 30 days
1-hour aggregates:               keep 1 year
1-day aggregates:                keep forever

-- TimescaleDB: create a continuous aggregate for 5-minute rollups
CREATE MATERIALIZED VIEW metrics_5min
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('5 minutes', timestamp) AS bucket,
    sensor_id,
    AVG(value) AS avg_value,
    MAX(value) AS max_value,
    MIN(value) AS min_value,
    COUNT(*) AS sample_count
FROM raw_metrics
GROUP BY bucket, sensor_id;

-- Add a retention policy to drop raw data after 7 days
SELECT add_retention_policy('raw_metrics', INTERVAL '7 days');

ETL Pipeline Design

ETL (Extract, Transform, Load) pipelines move data from operational databases to data warehouses for analytics.

┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│   Source DB  │───▶│   Extract    │───▶│  Transform   │───▶│    Load      │
│ (PostgreSQL) │    │  (Debezium   │    │  (Spark /    │    │(BigQuery /   │
│              │    │   CDC)       │    │   dbt)       │    │ Snowflake /  │
└──────────────┘    └──────────────┘    └──────────────┘    │ Redshift)    │
                                                            └──────────────┘

CDC (Change Data Capture) with Debezium

Instead of periodic batch exports (which miss data between runs and strain the source DB), CDC captures every INSERT, UPDATE, and DELETE as it happens by reading the database's WAL.

Source DB WAL → Debezium Connector → Kafka Topic → Consumer → Data Warehouse

Benefits:
- Near-real-time (seconds, not hours)
- Minimal load on the source database (it reads the WAL, not the tables)
- Captures deletes (batch export misses these)
- Maintains exact ordering of changes
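
A downstream consumer of those change events might look roughly like this sketch. It assumes the kafka-python client, a topic named dbserver1.public.users following Debezium's <server>.<schema>.<table> convention, and prints changes instead of loading them into a warehouse.

import json
from kafka import KafkaConsumer   # kafka-python

consumer = KafkaConsumer(
    "dbserver1.public.users",
    bootstrap_servers=["localhost:9092"],
    value_deserializer=lambda raw: json.loads(raw) if raw else None,
)

for message in consumer:
    if message.value is None:        # tombstone record after a delete; skip it here
        continue
    payload = message.value.get("payload", message.value)
    op = payload["op"]               # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    if op == "d":
        print("delete:", payload["before"])   # stand-in for loading into the warehouse
    else:
        print("upsert:", payload["after"])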

Transform Layer

| Tool | Type | Best For |
|---|---|---|
| dbt | SQL-first transforms | Analysts who know SQL, declarative transformations |
| Apache Spark | Distributed compute | Large-scale data (TBs+), complex transformations |
| Apache Flink | Stream processing | Real-time transformations, windowed aggregations |

Interview tip: When asked about data pipelines, mention CDC over batch ETL to show modern thinking. Highlight the trade-off: CDC is more complex to set up but provides fresher data and lower source database load.

You have completed all three lessons in this module. Take the quiz to test your understanding of system design frameworks, classic problems, and scaling patterns.
