AI System Design Fundamentals

Scalability Concepts


AI systems face unique scaling challenges. LLM calls are slow, expensive, and rate-limited. Understanding how to scale effectively is crucial for system design interviews.

Horizontal vs. Vertical Scaling

| Scaling Type | How It Works | AI System Example |
|---|---|---|
| Vertical | Bigger machines | Larger GPU for faster inference |
| Horizontal | More machines | Multiple API instances behind a load balancer |

For AI systems, horizontal scaling is usually preferred because:

  • LLM providers have rate limits per API key
  • You can use multiple providers simultaneously
  • Failures are isolated to single instances
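
As a rough illustration of the last two points, the sketch below rotates requests across several backends and skips any that fail. `RoundRobinPool`, `healthy`, and `broken` are hypothetical names made up for this example; in a real system each backend would wrap a provider client or API key.

```python
import itertools
from typing import Callable, Sequence


class RoundRobinPool:
    """Spreads calls across several backends and skips ones that fail."""

    def __init__(self, backends: Sequence[Callable[[str], str]]):
        if not backends:
            raise ValueError("need at least one backend")
        self._backends = list(backends)
        self._cycle = itertools.cycle(range(len(self._backends)))

    def complete(self, prompt: str) -> str:
        last_error = None
        # Try each backend at most once per request.
        for _ in range(len(self._backends)):
            backend = self._backends[next(self._cycle)]
            try:
                return backend(prompt)
            except Exception as exc:  # a real system would catch narrower errors
                last_error = exc
        raise RuntimeError("all backends failed") from last_error


# Hypothetical stand-ins for two API instances or providers.
def healthy(prompt: str) -> str:
    return f"ok: {prompt}"


def broken(prompt: str) -> str:
    raise ConnectionError("instance down")


pool = RoundRobinPool([broken, healthy])
print(pool.complete("hello"))  # the failure is isolated; the healthy backend answers
```

The failing backend only costs one extra attempt per request, which is the isolation property the list above describes.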

Key Metrics

Latency

Time from request to response.

User Request → API Gateway → LLM Call → Post-processing → Response
    │              │            │              │            │
    └──────────────┴────────────┴──────────────┴────────────┘
                        Total Latency

Typical AI latency breakdown:

  • API overhead: 10-50ms
  • LLM inference: 500ms-5s (depends on model and tokens)
  • Post-processing: 10-100ms
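
To see where the time goes, the ranges above can be summed into a best-case and worst-case budget. This is a small sketch; the dictionary keys are names chosen for this example.

```python
# Stage latencies in milliseconds as (low, high) ranges, taken from the list above.
stages_ms = {
    "api_overhead": (10, 50),
    "llm_inference": (500, 5000),
    "post_processing": (10, 100),
}

best_case_ms = sum(low for low, _ in stages_ms.values())     # 520
worst_case_ms = sum(high for _, high in stages_ms.values())  # 5150

# Share of total latency spent in LLM inference at each end of the range.
llm_share_best = stages_ms["llm_inference"][0] / best_case_ms
llm_share_worst = stages_ms["llm_inference"][1] / worst_case_ms

print(f"Total latency: {best_case_ms} ms to {worst_case_ms} ms")
print(f"LLM share: {llm_share_best:.0%} to {llm_share_worst:.0%}")
```

In both cases the LLM call accounts for more than 95% of the total, so that is usually where optimization effort pays off first.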

Throughput

Requests handled per second.

# Calculate throughput limits
llm_latency_seconds = 2  # Average LLM response time
concurrent_requests = 10  # Parallel requests allowed

max_throughput = concurrent_requests / llm_latency_seconds
# = 5 requests per second per instance
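
The same formula can be turned around to estimate how many instances a target load needs. This is a sketch: `instances_needed` is a hypothetical helper, and the target of 100 requests per second is an assumed figure, not one from the lesson.

```python
import math


def instances_needed(target_rps: float, concurrent_requests: int,
                     llm_latency_seconds: float) -> int:
    """Number of instances required to sustain target_rps."""
    per_instance_rps = concurrent_requests / llm_latency_seconds
    return math.ceil(target_rps / per_instance_rps)


# With 10 concurrent requests and 2 s latency, each instance handles 5 req/s.
print(instances_needed(100, 10, 2))  # 20
print(instances_needed(12, 10, 2))   # 3 (2.4 rounded up)
```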

Cost per Query

Often the most important metric for AI systems.

| Component | Cost Factor |
|---|---|
| Input tokens | $0.003 per 1K tokens (GPT-5.4) |
| Output tokens | $0.012 per 1K tokens (GPT-5.4) |
| Embedding | $0.0001 per 1K tokens |
| Vector DB | $0.05-0.20 per million vectors/month |
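
A worked example using the input and output prices from the table. The token counts (1,500 in, 500 out) and the volume of 100,000 queries per month are assumptions chosen for illustration.

```python
INPUT_PRICE_PER_1K = 0.003   # dollars per 1K input tokens (from the table)
OUTPUT_PRICE_PER_1K = 0.012  # dollars per 1K output tokens (from the table)


def cost_per_query(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single LLM call."""
    return (input_tokens / 1000 * INPUT_PRICE_PER_1K
            + output_tokens / 1000 * OUTPUT_PRICE_PER_1K)


# Assumed workload: 1,500 input tokens and 500 output tokens per query.
query_cost = cost_per_query(1500, 500)
monthly_cost = query_cost * 100_000  # assumed 100K queries per month

print(f"Per query: ${query_cost:.4f}")
print(f"Per month: ${monthly_cost:,.2f}")
```

Note that output tokens cost four times as much as input tokens here, so limiting response length is often the cheapest optimization.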

Scaling Strategies

1. Request Batching

Group multiple requests into one LLM call.

# Instead of 10 separate calls
responses = [llm.complete(query) for query in queries]

# Batch into one call
combined_prompt = "\n---\n".join(
    f"Query {i}: {q}" for i, q in enumerate(queries, start=1)
)
batch_response = llm.complete(combined_prompt)
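
A batched call only helps if the combined response can be split back into per-query answers. The sketch below shows one possible approach, assuming the model is instructed to separate answers with a line containing only `---`. `build_batch_prompt`, `split_batch_response`, and `fake_llm` are hypothetical names, and `fake_llm` stands in for a real client.

```python
import re
from typing import List


def build_batch_prompt(queries: List[str]) -> str:
    """Combine queries into one prompt with explicit output instructions."""
    numbered = "\n---\n".join(
        f"Query {i}: {q}" for i, q in enumerate(queries, start=1)
    )
    return (
        "Answer each query below in order. Separate the answers with a line "
        "containing only '---'.\n\n" + numbered
    )


def split_batch_response(response: str, expected: int) -> List[str]:
    """Split on delimiter lines; raise if the answer count does not match."""
    parts = re.split(r"^[ \t]*---[ \t]*$", response, flags=re.MULTILINE)
    answers = [part.strip() for part in parts]
    if len(answers) != expected:
        raise ValueError(f"expected {expected} answers, got {len(answers)}")
    return answers


def fake_llm(prompt: str) -> str:
    # Stand-in for a real LLM call; returns a canned, well-formed response.
    return "Paris\n---\n4\n---\nBlue"


queries = ["Capital of France?", "What is 2 + 2?", "Color of the sky?"]
answers = split_batch_response(fake_llm(build_batch_prompt(queries)), len(queries))
print(answers)
```

When the count check fails, a caller could fall back to individual calls. Models do not always follow formatting instructions, so that fallback path matters in practice.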

2. Model Routing

Use cheaper models for simple tasks.

def route_to_model(query: str) -> str:
    complexity = estimate_complexity(query)

    if complexity == "simple":
        return "gpt-5.4-mini"   # Fast, cheap
    elif complexity == "medium":
        return "gpt-5.4"        # Balanced
    else:
        return "gpt-5.4-pro"    # Best quality
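
The router above relies on `estimate_complexity`, which the lesson does not define. One possible stand-in is a cheap keyword and length heuristic like the sketch below. The hint words and thresholds are arbitrary assumptions; production routers often use a small classifier model instead.

```python
# Hypothetical hint words suggesting the query needs multi-step reasoning.
REASONING_HINTS = {"why", "explain", "compare", "analyze", "design", "prove"}


def estimate_complexity(query: str) -> str:
    """Classify a query as 'simple', 'medium', or 'complex' (rough heuristic)."""
    words = [w.strip("?,.!:;") for w in query.lower().split()]
    has_hint = any(w in REASONING_HINTS for w in words)

    if len(words) > 100 or (has_hint and len(words) > 30):
        return "complex"
    if has_hint or len(words) > 30:
        return "medium"
    return "simple"


print(estimate_complexity("What is the capital of France?"))
print(estimate_complexity("Explain how a load balancer works"))
```

Keeping the estimator itself cheap matters: if routing costs as much as the call it saves, the strategy stops paying for itself.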

3. Caching

Cache identical or similar queries.

Rough cache hit rates in production (these depend heavily on how repetitive the workload is):

  • Exact match: 20-40% hit rate
  • Semantic cache: 40-60% hit rate
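
A minimal sketch of the exact-match variant, keyed on a hash of the model name and a normalized prompt. `ExactMatchCache` and `counting_llm` are hypothetical names, the in-memory dict stands in for a store such as Redis, and lowercasing the prompt is an assumption that would be wrong for case-sensitive inputs.

```python
import hashlib
from typing import Callable, Dict


class ExactMatchCache:
    """Caches LLM responses for identical (model, prompt) pairs."""

    def __init__(self, llm_call: Callable[[str, str], str]):
        self._llm_call = llm_call
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        # Collapse whitespace and case so trivially different prompts match.
        normalized = " ".join(prompt.lower().split())
        return hashlib.sha256(f"{model}\n{normalized}".encode("utf-8")).hexdigest()

    def complete(self, model: str, prompt: str) -> str:
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._llm_call(model, prompt)
        self._store[key] = result
        return result


calls = []


def counting_llm(model: str, prompt: str) -> str:
    # Stand-in for a real LLM client; records each call it receives.
    calls.append(prompt)
    return f"[{model}] answer"


cache = ExactMatchCache(counting_llm)
cache.complete("gpt-5.4-mini", "What is caching?")
cache.complete("gpt-5.4-mini", "what is   caching?")  # served from cache
print(cache.hits, cache.misses, len(calls))
```

A semantic cache would replace the hash lookup with a nearest-neighbor search over prompt embeddings, trading a higher hit rate for the risk of returning an answer to a subtly different question.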

4. Async Processing

Don't block on slow operations.

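# Inline call: the request handler waits for the LLM
async def handle_inline(prompt):
    result = await llm.complete(prompt)  # Caller waits ~2s
    return {"status": "done", "result": result}

# Queue-based: enqueue the job and return immediately
def handle_queued(prompt):
    job_id = queue.enqueue(llm.complete, prompt)
    return {"status": "processing", "job_id": job_id}
    # Client polls for the result using job_id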
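
To make the enqueue-and-poll flow concrete, here is a self-contained sketch that uses a thread pool as the queue. `JobQueue` and `slow_llm` are hypothetical names; a production system would more likely use a dedicated job queue backed by a broker, with results stored outside the process.

```python
import time
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict


class JobQueue:
    """In-memory job queue: enqueue returns an ID, status is polled later."""

    def __init__(self, max_workers: int = 4):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._jobs: Dict[str, Future] = {}

    def enqueue(self, fn: Callable, *args) -> str:
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._executor.submit(fn, *args)
        return job_id

    def status(self, job_id: str) -> dict:
        future = self._jobs.get(job_id)
        if future is None:
            return {"status": "not_found"}
        if not future.done():
            return {"status": "processing", "job_id": job_id}
        if future.exception() is not None:
            return {"status": "failed", "job_id": job_id,
                    "error": str(future.exception())}
        return {"status": "done", "job_id": job_id, "result": future.result()}


def slow_llm(prompt: str) -> str:
    time.sleep(0.1)  # stand-in for a slow LLM call
    return prompt.upper()


queue = JobQueue()
job_id = queue.enqueue(slow_llm, "hello")  # returns immediately

# The client polls until the job leaves the "processing" state.
while queue.status(job_id)["status"] == "processing":
    time.sleep(0.02)
print(queue.status(job_id))
```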

Bottleneck Identification

In interviews, always identify the bottleneck:

| Bottleneck | Symptom | Solution |
|---|---|---|
| LLM rate limits | 429 errors | Multiple API keys, provider fallback |
| LLM latency | Slow responses | Streaming, caching, smaller models |
| Vector DB | Slow retrieval | Index optimization, sharding |
| Memory | OOM errors | Streaming, pagination |
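
For the rate-limit row, a common pattern is to retry with exponential backoff and then fall back to another provider. The sketch below assumes a hypothetical `RateLimitError` standing in for a provider's HTTP 429 exception; `call_with_fallback` and `make_flaky_provider` are also made-up names for this example.

```python
import random
import time
from typing import Callable, Sequence


class RateLimitError(Exception):
    """Stand-in for a provider's HTTP 429 error."""


def call_with_fallback(providers: Sequence[Callable[[str], str]], prompt: str,
                       max_retries: int = 3, base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry each provider with exponential backoff, then try the next one."""
    for provider in providers:
        for attempt in range(max_retries):
            try:
                return provider(prompt)
            except RateLimitError:
                # Delay doubles each attempt; jitter avoids synchronized retries.
                delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
                sleep(delay)
    raise RateLimitError("all providers are rate limited")


def make_flaky_provider(failures: int) -> Callable[[str], str]:
    """Builds a fake provider that raises RateLimitError `failures` times."""
    state = {"remaining": failures}

    def provider(prompt: str) -> str:
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RateLimitError("429 Too Many Requests")
        return f"answer to: {prompt}"

    return provider


primary = make_flaky_provider(failures=5)   # stays limited through all retries
fallback = make_flaky_provider(failures=0)  # always available
result = call_with_fallback([primary, fallback], "hi", sleep=lambda s: None)
print(result)
```

The `sleep` parameter is injectable so the demo and tests run instantly; real callers would leave the default.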

Next, we'll learn a structured framework for approaching any design problem.
