AI System Design Fundamentals
Scalability Concepts
4 min read
AI systems face unique scaling challenges. LLM calls are slow, expensive, and rate-limited. Understanding how to scale effectively is crucial for system design interviews.
Horizontal vs. Vertical Scaling
| Scaling Type | How It Works | AI System Example |
|---|---|---|
| Vertical | Bigger machines | Larger GPU for faster inference |
| Horizontal | More machines | Multiple API instances behind load balancer |
For AI systems, horizontal scaling is usually preferred because:
- LLM providers have rate limits per API key
- You can use multiple providers or API keys simultaneously (sketched below)
- Failures are isolated to single instances
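A minimal sketch of how one instance can spread traffic across several keys or providers; the PROVIDERS pool and its entries are illustrative placeholders, not any particular vendor SDK:
import itertools

# Hypothetical pool: each entry is an independent rate-limit bucket.
PROVIDERS = [
    {"name": "provider-a-key-1", "api_key": "KEY_A1"},
    {"name": "provider-a-key-2", "api_key": "KEY_A2"},
    {"name": "provider-b-key-1", "api_key": "KEY_B1"},
]

_rotation = itertools.cycle(PROVIDERS)  # simple round-robin over the pool

def pick_provider() -> dict:
    """Return the next provider, spreading load (and rate limits) across keys."""
    return next(_rotation)

# Every API instance behind the load balancer runs the same rotation, so adding
# instances multiplies both throughput and rate-limit headroom, and a failing
# key only affects the requests routed to it.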
Key Metrics
Latency
Time from request to response.
User Request → API Gateway → LLM Call → Post-processing → Response
      │             │           │              │             │
      └─────────────┴───────────┴──────────────┴─────────────┘
                           Total Latency
Typical AI latency breakdown (a timing sketch follows the list):
- API overhead: 10-50ms
- LLM inference: 500ms-5s (depends on model and tokens)
- Post-processing: 10-100ms
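To see where the time actually goes in a specific system, time each stage separately. A minimal sketch using time.perf_counter, where call_gateway, call_llm, and post_process are hypothetical stand-ins for the stages above:
import time

def timed(stage_fn, *args):
    """Run one pipeline stage and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = stage_fn(*args)
    return result, time.perf_counter() - start

# payload, t_gateway = timed(call_gateway, request)
# raw, t_llm = timed(call_llm, payload)
# response, t_post = timed(post_process, raw)
# total_latency = t_gateway + t_llm + t_post   # usually dominated by t_llm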
Throughput
Requests handled per second.
# Calculate throughput limits
llm_latency_seconds = 2 # Average LLM response time
concurrent_requests = 10 # Parallel requests allowed
max_throughput = concurrent_requests / llm_latency_seconds
# = 5 requests per second per instance
Cost per Query
Often the most important metric for AI systems (a back-of-the-envelope estimate follows the table).
| Component | Approximate Cost |
|---|---|
| Input tokens | $0.01-0.03 per 1K tokens (GPT-4) |
| Output tokens | $0.03-0.06 per 1K tokens (GPT-4) |
| Embedding | $0.0001 per 1K tokens |
| Vector DB | $0.05-0.20 per million vectors/month |
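Putting the table to work, a back-of-the-envelope cost estimate per query; the prices below mirror the GPT-4 figures above and are illustrative, not a current price sheet:
# Illustrative per-1K-token prices (see table above); adjust to your provider's price sheet.
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06
EMBEDDING_PRICE_PER_1K = 0.0001

def cost_per_query(input_tokens: int, output_tokens: int, embedding_tokens: int = 0) -> float:
    """Estimate the dollar cost of a single query."""
    return (
        input_tokens / 1000 * INPUT_PRICE_PER_1K
        + output_tokens / 1000 * OUTPUT_PRICE_PER_1K
        + embedding_tokens / 1000 * EMBEDDING_PRICE_PER_1K
    )

# Example: a 2K-token prompt, a 500-token answer, and 100 tokens of embeddings
print(cost_per_query(2000, 500, embedding_tokens=100))  # ≈ $0.09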
Scaling Strategies
1. Request Batching
Group multiple requests into one LLM call.
# Instead of 10 separate calls:
for query in queries:
    response = llm.complete(query)

# Batch into one call:
combined_prompt = "\n---\n".join([
    f"Query {i}: {q}" for i, q in enumerate(queries)
])
batch_response = llm.complete(combined_prompt)
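One detail the snippet glosses over: the prompt must tell the model to answer each query with a predictable separator, otherwise the combined response cannot be split back into per-query answers. A minimal sketch, assuming the model was instructed to reuse the same "---" delimiter:
# Assumes the prompt asked for answers separated by "---"
answers = [part.strip() for part in batch_response.split("---")]
assert len(answers) == len(queries), "model did not respect the delimiter"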
2. Model Routing
Use cheaper models for simple tasks.
def route_to_model(query: str) -> str:
    complexity = estimate_complexity(query)
    if complexity == "simple":
        return "gpt-3.5-turbo"  # Fast, cheap
    elif complexity == "medium":
        return "gpt-4-turbo"    # Balanced
    else:
        return "gpt-4"          # Best quality
3. Caching
Cache identical or similar queries (a minimal exact-match cache is sketched below).
Cache hit rates in production:
- Exact match: 20-40% hit rate
- Semantic cache: 40-60% hit rate
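An exact-match cache is just a lookup keyed on the normalized prompt, while a semantic cache compares query embeddings and reuses an answer above a similarity threshold. A minimal exact-match sketch, reusing the llm.complete pseudocode from this lesson:
import hashlib

_cache: dict[str, str] = {}

def cached_complete(prompt: str) -> str:
    """Return a cached answer for an identical prompt, otherwise call the LLM."""
    key = hashlib.sha256(prompt.strip().lower().encode()).hexdigest()
    if key in _cache:
        return _cache[key]            # cache hit: no LLM cost, near-zero latency
    result = llm.complete(prompt)     # cache miss: pay full latency and cost
    _cache[key] = result
    return result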
4. Async Processing
Don't block on slow operations.
# Synchronous: the handler blocks for the full LLM call (~2s)
result = llm.complete(prompt)

# Async with a job queue: return immediately, process in the background
job_id = queue.enqueue(llm.complete, prompt)
return {"status": "processing", "job_id": job_id}
# Client polls for the result
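Wired end to end, this looks roughly like the sketch below, with an in-process thread pool standing in for a real job queue (Celery, RQ, and similar) and llm.complete still being the lesson's pseudocode:
import uuid
from concurrent.futures import Future, ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=10)   # matches the concurrency limit above
_jobs: dict[str, Future] = {}

def submit(prompt: str) -> dict:
    """Enqueue the LLM call and return immediately with a job id."""
    job_id = str(uuid.uuid4())
    _jobs[job_id] = _executor.submit(llm.complete, prompt)
    return {"status": "processing", "job_id": job_id}

def poll(job_id: str) -> dict:
    """The client polls this until the job finishes."""
    future = _jobs[job_id]
    if future.done():
        return {"status": "done", "result": future.result()}
    return {"status": "processing"}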
Bottleneck Identification
In interviews, always identify the bottleneck:
| Bottleneck | Symptom | Solution |
|---|---|---|
| LLM rate limits | 429 errors | Multiple API keys, provider fallback |
| LLM latency | Slow responses | Streaming, caching, smaller models |
| Vector DB | Slow retrieval | Index optimization, sharding |
| Memory | OOM errors | Streaming, pagination |
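For the rate-limit row, the usual pattern is retry-with-fallback: catch the 429 and re-issue the request against the next key or provider. A minimal sketch, where call_provider is a hypothetical dispatch function and RateLimitError stands in for whatever exception your client library raises on a 429:
class RateLimitError(Exception):
    """Stand-in for the 429 error your LLM client raises."""

def complete_with_fallback(prompt: str, providers: list) -> str:
    """Try each provider in order, moving on when one is rate-limited."""
    last_error = None
    for provider in providers:
        try:
            return call_provider(provider, prompt)   # hypothetical dispatch
        except RateLimitError as err:
            last_error = err                         # 429: try the next provider
    raise last_error or RuntimeError("no providers configured")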
Next, we'll learn a structured framework for approaching any design problem.