Lesson 3 of 23

AI System Design Fundamentals

Scalability Concepts

AI systems face unique scaling challenges. LLM calls are slow, expensive, and rate-limited. Understanding how to scale effectively is crucial for system design interviews.

Horizontal vs. Vertical Scaling

Scaling Type   How It Works      AI System Example
Vertical       Bigger machines   Larger GPU for faster inference
Horizontal     More machines     Multiple API instances behind a load balancer

For AI systems, horizontal scaling is usually preferred (see the key-rotation sketch after this list) because:

  • LLM providers have rate limits per API key
  • You can use multiple providers simultaneously
  • Failures are isolated to single instances
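
A minimal sketch of the "multiple API keys" point, assuming a hypothetical LLMClient wrapper; real code would wrap your provider's SDK and add error handling:

import itertools

# Hypothetical wrapper around a provider SDK; each instance uses its own API key
class LLMClient:
    def __init__(self, api_key: str):
        self.api_key = api_key

    def complete(self, prompt: str) -> str:
        return f"[response via key {self.api_key}]"  # real provider call goes here

# Round-robin over several keys so no single key hits its rate limit
clients = itertools.cycle([LLMClient(k) for k in ("KEY_A", "KEY_B", "KEY_C")])

def complete_with_rotation(prompt: str) -> str:
    return next(clients).complete(prompt)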

Key Metrics

Latency

Time from request to response.

User Request → API Gateway → LLM Call → Post-processing → Response
    │              │            │              │            │
    └──────────────┴────────────┴──────────────┴────────────┘
                        Total Latency

Typical AI latency breakdown (a timing sketch follows the list):

  • API overhead: 10-50ms
  • LLM inference: 500ms-5s (depends on model and tokens)
  • Post-processing: 10-100ms
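
To see where the time goes, time each stage separately. A self-contained sketch with stand-in functions (call_llm and post_process are placeholders, and the 1.5 s sleep just simulates inference):

import time

def call_llm(prompt: str) -> str:       # stand-in for the real provider call
    time.sleep(1.5)                     # simulate ~1.5s of inference
    return "model output"

def post_process(text: str) -> str:     # stand-in for parsing/formatting
    return text.strip()

start = time.perf_counter()
raw = call_llm("Summarize this document.")
llm_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
final = post_process(raw)
post_ms = (time.perf_counter() - start) * 1000

print(f"LLM inference: {llm_ms:.0f} ms, post-processing: {post_ms:.0f} ms")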

Throughput

Requests handled per second.

# Calculate throughput limits
llm_latency_seconds = 2  # Average LLM response time
concurrent_requests = 10  # Parallel requests allowed

max_throughput = concurrent_requests / llm_latency_seconds
# = 5 requests per second per instance
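
The same arithmetic, with an assumed peak load, tells you how many instances to run:

import math

per_instance_throughput = 5   # requests/second, from the calculation above
target_throughput = 50        # assumed peak load in requests/second

instances_needed = math.ceil(target_throughput / per_instance_throughput)
# = 10 instances behind the load balancer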

Cost per Query

Often the most important metric for AI systems; a worked example follows the table.

Component       Cost Factor
Input tokens    $0.01-0.03 per 1K tokens (GPT-4)
Output tokens   $0.03-0.06 per 1K tokens (GPT-4)
Embedding       $0.0001 per 1K tokens
Vector DB       $0.05-0.20 per million vectors/month
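
A back-of-the-envelope cost-per-query calculation using the GPT-4 figures above; the token counts are assumptions:

input_tokens = 1_500    # assumed: prompt plus retrieved context
output_tokens = 500     # assumed: generated answer

input_cost = (input_tokens / 1000) * 0.03    # GPT-4 input rate from the table
output_cost = (output_tokens / 1000) * 0.06  # GPT-4 output rate from the table

cost_per_query = input_cost + output_cost
# = 0.045 + 0.030 = $0.075 per query, or ~$7,500/day at 100K queries/day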

Scaling Strategies

1. Request Batching

Group multiple requests into one LLM call, then split the combined response back into per-query answers. This trades a little per-request latency for fewer calls and less overhead.

# Instead of 10 separate calls
responses = [llm.complete(query) for query in queries]

# Batch into one call; ask for numbered answers so the output can be split back up
instruction = "Answer each query separately, numbered to match.\n---\n"
combined_prompt = instruction + "\n---\n".join(
    f"Query {i}: {q}" for i, q in enumerate(queries, start=1)
)
batch_response = llm.complete(combined_prompt)

2. Model Routing

Use cheaper models for simple tasks.

def route_to_model(query: str) -> str:
    complexity = estimate_complexity(query)

    if complexity == "simple":
        return "gpt-3.5-turbo"  # Fast, cheap
    elif complexity == "medium":
        return "gpt-4-turbo"    # Balanced
    else:
        return "gpt-4"          # Best quality

3. Caching

Cache identical or similar queries; a minimal exact-match sketch follows the hit-rate figures below.

Cache hit rates in production:

  • Exact match: 20-40% hit rate
  • Semantic cache: 40-60% hit rate
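
An exact-match cache is just a dictionary keyed on a hash of the prompt; a semantic cache instead embeds the query and accepts nearest neighbours above a similarity threshold. A minimal exact-match sketch, assuming the same llm object used in the other examples:

import hashlib

cache: dict[str, str] = {}

def cached_complete(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in cache:
        return cache[key]            # hit: no LLM call, no token cost
    response = llm.complete(prompt)  # miss: pay for the call, then store it
    cache[key] = response
    return response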

4. Async Processing

Don't block on slow operations.

# Synchronous: the caller blocks for the full LLM latency
result = llm.complete(prompt)  # waits ~2s

# Async with a job queue: return immediately, let a background worker do the call
job_id = queue.enqueue(llm.complete, prompt)
return {"status": "processing", "job_id": job_id}
# Client polls for the result (see the sketch below)
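
The other half of that pattern is a status endpoint the client polls. A rough sketch; the queue and job objects here are hypothetical, though job-queue libraries such as RQ or Celery expose similar calls:

def get_result(job_id: str) -> dict:
    job = queue.fetch(job_id)     # hypothetical lookup by job id
    if job.is_finished:
        return {"status": "done", "result": job.result}
    return {"status": "processing", "job_id": job_id}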

Bottleneck Identification

In interviews, always identify the bottleneck (a rate-limit fallback sketch follows the table):

Bottleneck        Symptom          Solution
LLM rate limits   429 errors       Multiple API keys, provider fallback
LLM latency       Slow responses   Streaming, caching, smaller models
Vector DB         Slow retrieval   Index optimization, sharding
Memory            OOM errors       Streaming, pagination
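
For the rate-limit row, a common pattern is exponential backoff plus a fallback provider. A sketch with placeholder clients (primary_llm, fallback_llm) and a placeholder RateLimitError standing in for the provider's 429 exception:

import time

class RateLimitError(Exception):      # stand-in for the provider's 429 error
    pass

def complete_with_fallback(prompt: str, max_retries: int = 3) -> str:
    for attempt in range(max_retries):
        try:
            return primary_llm.complete(prompt)
        except RateLimitError:
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s
    return fallback_llm.complete(prompt)  # different key or provider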

Next, we'll learn a structured framework for approaching any design problem.
