AI System Design Fundamentals
Design Framework
A structured approach is what separates great candidates from good ones. The RADIO framework provides a systematic way to tackle any AI system design problem.
The RADIO Framework
| Step | Focus | Time |
|---|---|---|
| Requirements | What are we building? | 5-10 min |
| Architecture | High-level components | 10-15 min |
| Data | Storage, flow, models | 5-10 min |
| Infrastructure | Scaling, deployment | 5-10 min |
| Operations | Monitoring, safety | 5 min |
R - Requirements
Always start here. Ask clarifying questions:
Functional Requirements:
- What is the primary user interaction?
- What inputs and outputs are expected?
- Are there accuracy requirements?
Non-Functional Requirements:
- Expected QPS (queries per second)?
- Acceptable latency (p50, p99)?
- Budget constraints?
- Compliance requirements (GDPR, HIPAA)?
Example dialogue:
Interviewer: "Design a document Q&A system."
You: "Before I start, I'd like to clarify a few things:
- How large are the documents? Single pages or hundreds of pages?
- Do we need to cite sources in our answers?
- What's the expected latency? Sub-second or is 5-10 seconds acceptable?
- How many concurrent users should we support?"
A - Architecture
Draw the high-level system:
┌──────────────────────────────────────────────────────────┐
│ Document Q&A System │
├──────────────────────────────────────────────────────────┤
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────────────────┐ │
│ │ User │───▶│ API │───▶│ Query Processor │ │
│ └─────────┘ └─────────┘ └─────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────────────────┐ │
│ │ Vector │◀───│Embedding│◀───│ Retriever │ │
│ │ DB │ │ Model │ └─────────────────────┘ │
│ └─────────┘ └─────────┘ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ LLM (Generation) │ │
│ └─────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────┐ │
│ │ Response + Citations │ │
│ └─────────────────────────┘ │
└──────────────────────────────────────────────────────────┘
D - Data
Define data models and flow:
```python
from __future__ import annotations  # allows the forward reference to Chunk

from typing import List

# Core data models
class Document:
    id: str
    content: str
    metadata: dict  # source, date, author
    chunks: List[Chunk]

class Chunk:
    id: str
    document_id: str
    content: str
    embedding: List[float]
    position: int  # for citation

class Query:
    id: str
    text: str
    embedding: List[float]
    retrieved_chunks: List[Chunk]
    response: str
```
Data flow:
- Documents ingested → chunked → embedded → stored
- Query received → embedded → similar chunks retrieved
- Chunks + query → LLM → response with citations
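The ingest-and-retrieve flow above can be sketched end to end. This is a minimal illustration only: the character-frequency `embed` function is a toy stand-in for a real embedding model, and all names (`ingest`, `retrieve`, `cosine`) are hypothetical.

```python
import math
from dataclasses import dataclass

def embed(text: str) -> list[float]:
    # Toy embedding: normalized letter-frequency vector.
    # A real system would call an embedding model here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-length, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

@dataclass
class Chunk:
    id: str
    document_id: str
    content: str
    position: int
    embedding: list[float]

def ingest(doc_id: str, content: str, chunk_size: int = 100) -> list[Chunk]:
    # Documents ingested -> chunked -> embedded -> "stored" (returned here).
    return [
        Chunk(f"{doc_id}-{i}", doc_id, content[i:i + chunk_size], i,
              embed(content[i:i + chunk_size]))
        for i in range(0, len(content), chunk_size)
    ]

def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[Chunk]:
    # Query embedded -> most similar chunks retrieved.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, c.embedding), reverse=True)[:k]
```

In a real pipeline the retrieved chunks and the query would then be assembled into a prompt for the generation step.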
I - Infrastructure
Discuss scaling and deployment:
| Component | Scaling Strategy |
|---|---|
| API Layer | Horizontal, auto-scale on CPU |
| Vector DB | Sharding by document collection |
| LLM Calls | Multiple API keys, provider fallback |
| Cache | Redis cluster, replicated |
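The "provider fallback" row can be made concrete with a short sketch: try each configured provider in order and move to the next on failure. The provider callables here are hypothetical placeholders for real SDK calls.

```python
from typing import Callable

class ProviderError(Exception):
    """Raised when an LLM provider call fails (timeout, rate limit, outage)."""

def call_with_fallback(prompt: str,
                       providers: list[tuple[str, Callable[[str], str]]]) -> str:
    # Try each (name, call_fn) provider in order; fall back on failure.
    errors = []
    for name, call_fn in providers:
        try:
            return call_fn(prompt)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All providers failed: " + "; ".join(errors))
```

In production you would add per-provider timeouts and retry budgets, but the ordering-with-fallback shape stays the same.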
Cost estimation (assuming $0.01 per 1K input tokens and $0.03 per 1K output tokens):
Daily queries: 100,000
Avg tokens per query: 2,000 (input) + 500 (output)
LLM cost: 100K × ($0.01 × 2 + $0.03 × 0.5) = $3,500/day
Vector DB: $100/month
Infrastructure: $500/month
Total: ~$106,000/month ($3,500/day × 30 + $600)
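It's worth being able to reproduce this kind of back-of-the-envelope arithmetic on the spot. A small helper (hypothetical function name, rates as assumed above) makes the calculation explicit:

```python
def llm_cost_per_day(queries: int, in_tokens: int, out_tokens: int,
                     in_rate_per_1k: float, out_rate_per_1k: float) -> float:
    # Cost = queries x (input tokens + output tokens, each at its per-1K rate)
    per_query = (in_tokens / 1000) * in_rate_per_1k + (out_tokens / 1000) * out_rate_per_1k
    return queries * per_query

daily = llm_cost_per_day(100_000, 2_000, 500, 0.01, 0.03)  # ~ $3,500/day
monthly = daily * 30 + 100 + 500                           # ~ $105,600/month
```

Note that LLM API calls dominate the bill; the fixed vector DB and infrastructure costs are rounding error by comparison, which is why caching and prompt trimming are the highest-leverage cost optimizations.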
O - Operations
Cover monitoring and safety:
Monitoring:
- Latency percentiles (p50, p95, p99)
- Error rates by type
- Cache hit rates
- Cost per query
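Latency percentiles are simple to compute from raw samples. This is a nearest-rank sketch; production systems typically use histogram-based estimators rather than sorting every sample.

```python
import math

def percentile(latencies_ms: list[float], p: float) -> float:
    # Nearest-rank percentile: the smallest sample such that at least
    # p% of all samples are <= it.
    s = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(s)))
    return s[rank - 1]
```

Reporting p50, p95, and p99 together surfaces tail latency that an average would hide.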
Safety:
- Input validation (length, content filtering)
- Output guardrails (PII detection, harmful content)
- Rate limiting per user
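Per-user rate limiting can be sketched with a sliding-window counter. The class name and interface here are illustrative, not a specific library's API.

```python
import time
from collections import defaultdict, deque
from typing import Optional

class SlidingWindowLimiter:
    """Per-user rate limiter: at most max_requests per window_seconds."""

    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window = window_seconds
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        q = self._hits[user_id]
        while q and q[0] <= now - self.window:  # evict hits outside the window
            q.popleft()
        if len(q) < self.max_requests:
            q.append(now)
            return True
        return False
```

At scale this state would live in a shared store such as Redis so that all API replicas enforce one consistent limit.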
Evaluation:
- Automated metrics (retrieval accuracy, response relevance)
- Human evaluation sampling
- A/B testing for prompt changes
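One common automated retrieval metric is recall@k: the fraction of known-relevant chunks that appear in the top-k retrieved results. A minimal sketch (hypothetical function name):

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    # Fraction of relevant chunks found in the top-k retrieved results.
    if not relevant_ids:
        return 0.0
    hits = set(retrieved_ids[:k]) & relevant_ids
    return len(hits) / len(relevant_ids)
```

Tracked over a fixed evaluation set, this catches retrieval regressions before they surface as bad answers.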
Framework in Action
Practice: When given a design problem, write down RADIO vertically and fill in each section. This keeps you organized and ensures you don't miss critical aspects.
Now that you have the fundamentals, let's dive into LLM application architecture.