Interview Case Studies
Case Study: AI Customer Support
Let's walk through a complete system design interview for an AI customer support agent. This demonstrates how to apply the RADIO framework to a real problem.
The Interview Question
"Design an AI-powered customer support system for an e-commerce company that handles 100,000 support tickets per day. The system should resolve simple issues automatically while escalating complex ones to human agents."
Step 1: Requirements (R)
Clarifying questions to ask:
- What types of tickets? (orders, returns, account issues, product questions)
- What's the target automation rate? (assume 70%)
- What languages? (English + Spanish initially)
- SLA requirements? (first response < 30 seconds)
- Budget constraints? (assume $50k/month for AI)
Functional Requirements:
- Classify incoming tickets by type and urgency
- Auto-resolve common issues (order status, return policy)
- Escalate complex issues with context summary
- Support multi-turn conversations
- Learn from human agent resolutions
Non-Functional Requirements:
- 99.9% uptime
- < 2 second response time
- Handle 100k tickets/day (~1.2 tickets/second average, 5x peak)
- Cost under $0.50 per resolved ticket (sanity-checked in the sketch below)
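Before moving on, it is worth sanity-checking the volume and budget numbers out loud. A minimal back-of-envelope sketch in Python, using only the figures stated above:

# Back-of-envelope capacity and cost check (figures from the requirements above)
TICKETS_PER_DAY = 100_000
PEAK_MULTIPLIER = 5
MONTHLY_AI_BUDGET_USD = 50_000        # from the clarifying questions
MAX_COST_PER_TICKET_USD = 0.50

avg_tps = TICKETS_PER_DAY / 86_400                                   # ~1.2 tickets/second on average
peak_tps = avg_tps * PEAK_MULTIPLIER                                 # ~6 tickets/second at peak
budget_per_ticket = MONTHLY_AI_BUDGET_USD / (TICKETS_PER_DAY * 30)   # ~$0.017 per ticket

print(f"avg {avg_tps:.1f} tickets/s, peak {peak_tps:.1f} tickets/s")
print(f"budget allows ${budget_per_ticket:.3f}/ticket (ceiling ${MAX_COST_PER_TICKET_USD:.2f})")

Note that the monthly budget, not the $0.50 ceiling, is the binding constraint: $50k/month works out to under two cents per ticket.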
Step 2: Architecture (A)
┌─────────────────────────────────────────────────────────────────────┐
│                          Customer Channels                          │
│                  (Chat Widget, Email, Mobile App)                   │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                     API Gateway + Rate Limiting                     │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                        Ticket Router Service                        │
│   - Intent Classification (fine-tuned classifier)                   │
│   - Priority Assignment                                             │
│   - Language Detection                                              │
└─────────────────────────────────────────────────────────────────────┘
             │                                     │
  (Automated)│                            (Complex)│
             ▼                                     ▼
┌─────────────────────────┐      ┌────────────────────────────────────┐
│     Auto-Resolution     │      │          Human Escalation          │
│          Agent          │      │                                    │
│                         │      │  - Context Summary Generator       │
│ - RAG for policies      │      │  - Suggested Responses             │
│ - Order lookup tool     │      │  - Human Agent Queue               │
│ - Return initiation     │      │                                    │
└─────────────────────────┘      └────────────────────────────────────┘
             │                                     │
             └─────────────────────┬───────────────┘
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                          Response Delivery                          │
│                        + Feedback Collection                        │
└─────────────────────────────────────────────────────────────────────┘
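The Ticket Router is the first real decision point. Here is a minimal sketch of its routing logic; classify_intent stands in for the fine-tuned classifier, and the 0.8 confidence threshold and intent labels are illustrative assumptions rather than fixed requirements:

from dataclasses import dataclass

# Intents the classifier is assumed to emit that the agent can safely automate (illustrative labels)
AUTOMATABLE_INTENTS = {"order_status", "return_policy", "shipping_question"}

@dataclass
class RoutingDecision:
    route: str          # "auto_resolve" or "human_escalation"
    intent: str
    priority: str

def route_ticket(text: str, classify_intent) -> RoutingDecision:
    """Decide whether a ticket goes to the auto-resolution agent or a human.

    `classify_intent` is a stand-in for the fine-tuned classifier; it is assumed
    to return (intent_label, confidence, priority).
    """
    intent, confidence, priority = classify_intent(text)

    # Low-confidence classifications and urgent tickets go straight to humans.
    if confidence < 0.8 or priority == "urgent":
        return RoutingDecision("human_escalation", intent, priority)
    if intent in AUTOMATABLE_INTENTS:
        return RoutingDecision("auto_resolve", intent, priority)
    return RoutingDecision("human_escalation", intent, priority)

Keeping the "low confidence or urgent goes to a human" rule outside the model makes the escalation behaviour easy to audit and tune.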
Step 3: Data (D)
Knowledge Base (RAG):
knowledge_sources = {
    "policies": {
        "source": "internal_docs",
        "update_frequency": "daily",
        "chunks": 5000,
        "examples": ["return policy", "shipping times", "warranty"]
    },
    "product_catalog": {
        "source": "product_db",
        "update_frequency": "real-time",
        "chunks": 50000,
        "examples": ["product specs", "compatibility", "availability"]
    },
    "past_resolutions": {
        "source": "ticket_history",
        "update_frequency": "weekly",
        "chunks": 100000,
        "examples": ["similar ticket resolutions", "agent responses"]
    }
}
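At query time the agent retrieves from these sources. The sketch below uses naive keyword overlap as a stand-in for the embedding search a vector database would actually perform; the chunk fields are assumptions for illustration:

def score_chunk(query: str, chunk_text: str) -> float:
    """Crude relevance score: fraction of query words present in the chunk.
    A stand-in for cosine similarity over embeddings in the real system."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk_text.lower().split())
    return len(query_words & chunk_words) / max(len(query_words), 1)

def retrieve(query: str, chunks: list[dict], source_filter: str | None = None, k: int = 3):
    """Return the top-k chunks, optionally restricted to one knowledge source."""
    candidates = [c for c in chunks if source_filter is None or c["source"] == source_filter]
    return sorted(candidates, key=lambda c: score_chunk(query, c["text"]), reverse=True)[:k]

# Example: answer a return-policy question from the "policies" source only.
chunks = [
    {"source": "policies", "text": "Items can be returned within 30 days of delivery."},
    {"source": "product_catalog", "text": "The X200 speaker supports Bluetooth 5.3."},
]
print(retrieve("what is the return window", chunks, source_filter="policies", k=1))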
Tools for Agent:
tools = [
    {
        "name": "lookup_order",
        "description": "Get order status, tracking, items",
        "requires": ["order_id or email"]
    },
    {
        "name": "initiate_return",
        "description": "Start return process for eligible items",
        "requires": ["order_id", "item_id", "reason"],
        "side_effects": True
    },
    {
        "name": "search_knowledge_base",
        "description": "Search policies and product info",
        "requires": ["query"]
    },
    {
        "name": "escalate_to_human",
        "description": "Transfer to human with context",
        "requires": ["reason", "urgency"]
    }
]
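The auto-resolution agent is essentially a loop: the LLM picks one of these tools, the system executes it, and the result goes back into the conversation. A minimal dispatch sketch, with hypothetical handler functions standing in for the real order-system and knowledge-base integrations:

# Hypothetical handlers standing in for real integrations (illustrative return values)
def lookup_order(order_id=None, email=None):
    return {"status": "shipped", "tracking_available": True}

def initiate_return(order_id, item_id, reason):
    return {"return_started": True}     # has side effects in the real system

def search_knowledge_base(query):
    return ["Items can be returned within 30 days of delivery."]

def escalate_to_human(reason, urgency):
    return {"queued": True, "urgency": urgency}

TOOL_HANDLERS = {
    "lookup_order": lookup_order,
    "initiate_return": initiate_return,
    "search_knowledge_base": search_knowledge_base,
    "escalate_to_human": escalate_to_human,
}

def execute_tool_call(name: str, arguments: dict):
    """Dispatch a tool call requested by the LLM; unknown tools escalate."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        return escalate_to_human(reason=f"unknown tool {name}", urgency="normal")
    return handler(**arguments)

Because initiate_return has side effects, in practice its handler would also run the guardrail checks from Step 5 before committing anything.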
Step 4: Infrastructure (I)
Scaling Strategy:
scaling_config = {
    "ticket_router": {
        "type": "kubernetes_deployment",
        "min_replicas": 3,
        "max_replicas": 20,
        "scale_metric": "requests_per_second",
        "target": 100  # requests per pod
    },
    "auto_resolution_agent": {
        "type": "kubernetes_deployment",
        "min_replicas": 5,
        "max_replicas": 50,
        "scale_metric": "queue_depth",
        "target": 10  # tickets per pod
    },
    "vector_database": {
        "type": "managed_pinecone",
        "pods": 3,
        "replicas": 2
    },
    "llm_api": {
        "primary": "gpt-4",
        "fallback": "gpt-3.5-turbo",
        "rate_limit": 10000  # requests per minute
    }
}
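The llm_api entry implies a failover path from the primary to the fallback model. A minimal sketch of that behaviour; call_model is a hypothetical wrapper around whatever LLM client is actually used, assumed to raise on rate limits or timeouts:

import time

def call_with_fallback(prompt: str, call_model, primary="gpt-4", fallback="gpt-3.5-turbo",
                       retries: int = 2, backoff_s: float = 0.5):
    """Try the primary model with brief retries, then degrade to the fallback model.

    `call_model(model, prompt)` is a stand-in for the real API client.
    """
    for attempt in range(retries):
        try:
            return call_model(primary, prompt)
        except Exception:
            time.sleep(backoff_s * (2 ** attempt))   # simple exponential backoff
    # Degrade gracefully rather than dropping the ticket.
    return call_model(fallback, prompt)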
Cost Estimation:
cost_breakdown = {
    "llm_costs": {
        "auto_resolve": "70k tickets × $0.10 = $7,000/day",
        "summarization": "30k tickets × $0.05 = $1,500/day",
        "daily_total": "$8,500",
        "monthly_total": "$255,000"  # Far over the $50k/month budget!
    },
    "optimization_needed": {
        "caching": "Cache responses to the most common questions → ~30% reduction",
        "smaller_model": "Use GPT-3.5 for classification and routine resolutions, GPT-4 only for hard cases",
        "optimized_budget": "≈ $45,000/month"  # ~82% total reduction, back under the $50k budget
    }
}
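The single biggest saving above is caching. A minimal sketch of a normalized-query response cache for common questions; the normalization scheme and the 24-hour TTL are illustrative assumptions (the TTL matches the daily policy refresh):

import hashlib
import time

CACHE_TTL_SECONDS = 24 * 3600      # policies update daily, so cached answers expire daily
_response_cache: dict[str, tuple[float, str]] = {}

def _cache_key(query: str) -> str:
    """Normalize lightly so trivially different phrasings of common questions collide."""
    normalized = " ".join(query.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def cached_answer(query: str, generate_answer):
    """Return a cached answer when available; otherwise call the LLM and store the result.

    `generate_answer(query)` is a stand-in for the auto-resolution agent's LLM call.
    """
    key = _cache_key(query)
    hit = _response_cache.get(key)
    if hit and time.time() - hit[0] < CACHE_TTL_SECONDS:
        return hit[1]                       # cache hit: no LLM cost
    answer = generate_answer(query)
    _response_cache[key] = (time.time(), answer)
    return answer

In production this would more likely be semantic caching over embeddings, but the cost argument is the same: every cache hit is an LLM call you do not pay for.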
Step 5: Operations (O)
Key Metrics:
metrics_to_track = {
    "automation_rate": {
        "target": 0.70,
        "alert_threshold": 0.60
    },
    "customer_satisfaction": {
        "target": 4.5,  # out of 5
        "alert_threshold": 4.0
    },
    "resolution_time": {
        "automated_target": "30 seconds",
        "escalated_target": "4 hours"
    },
    "escalation_accuracy": {
        "target": 0.95,  # % of escalations that truly needed a human
        "false_escalation_cost": "$5 per ticket"
    }
}
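These targets only matter if something checks them. A minimal sketch of a daily metrics job; the ticket fields it reads are assumptions about what the ticket store records:

def daily_metric_checks(tickets: list[dict]) -> list[str]:
    """Compute the key metrics from a day's tickets and return any alerts.

    Each ticket dict is assumed to carry `resolved_automatically`, `escalated`,
    `escalation_was_needed`, and `csat` (1-5, may be None) fields.
    """
    alerts = []
    total = len(tickets)

    automated = sum(t["resolved_automatically"] for t in tickets)
    automation_rate = automated / total if total else 0.0
    if automation_rate < 0.60:                      # alert threshold from metrics_to_track
        alerts.append(f"automation_rate {automation_rate:.0%} below 60%")

    rated = [t["csat"] for t in tickets if t.get("csat") is not None]
    if rated and sum(rated) / len(rated) < 4.0:
        alerts.append("customer_satisfaction below 4.0")

    escalated = [t for t in tickets if t["escalated"]]
    if escalated:
        accuracy = sum(t["escalation_was_needed"] for t in escalated) / len(escalated)
        if accuracy < 0.95:
            alerts.append(f"escalation_accuracy {accuracy:.0%} below 95%")

    return alerts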
Safety Guardrails:
guardrails = {
    "prohibited_actions": [
        "Refund over $500 without approval",
        "Access payment information",
        "Make promises about delivery dates"
    ],
    "required_escalation": [
        "Customer mentions legal action",
        "Sentiment is very negative (score < 0.2)",
        "Issue involves safety concerns"
    ],
    "human_review_queue": [
        "First resolution for new issue types",
        "Random 5% sample for quality"
    ]
}
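In code, these guardrails sit in front of every automated action. A minimal pre-action check; the sentiment score and refund amount are assumed to come from a separate sentiment model and the proposed resolution, and the keyword list is illustrative:

LEGAL_KEYWORDS = ("lawsuit", "attorney", "legal action", "sue")

def requires_human(ticket_text: str, sentiment_score: float,
                   proposed_action: str, refund_amount: float = 0.0) -> bool:
    """Return True when the guardrails above say a human must handle the ticket.

    `sentiment_score` is 0-1 (lower is more negative); `refund_amount` comes from
    the agent's proposed resolution.
    """
    text = ticket_text.lower()
    if any(keyword in text for keyword in LEGAL_KEYWORDS):
        return True                                   # mentions of legal action
    if sentiment_score < 0.2:
        return True                                   # very negative sentiment
    if proposed_action == "refund" and refund_amount > 500:
        return True                                   # refunds over $500 need approval
    return False

Anything that returns True here is routed to the human escalation path with a context summary, just like a ticket the router marked as complex.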
Trade-offs Discussed
| Decision | Option A | Option B | Choice | Reason |
|---|---|---|---|---|
| Model | GPT-4 | GPT-3.5 | Hybrid | GPT-3.5 for routing, GPT-4 for resolution |
| RAG vs Fine-tune | RAG | Fine-tune | RAG | Policies change frequently |
| Sync vs Async | Sync | Queue | Sync | User expects immediate response |
Interview Tip
This case study demonstrates:
- Structured approach - the RADIO framework keeps you organized
- Trade-off analysis - show that you understand the constraints
- Cost awareness - the initial estimate was over budget, so we optimized
- Safety first - guardrails for every automated action
Next, let's explore a code review agent case study.