Interview Case Studies
Case Study: Code Review Agent
This case study works through a technical problem with three distinct challenges: evaluating the accuracy of AI-generated reviews, handling context from large codebases, and earning developer trust.
The Interview Question
"Design an AI code review system for a company with 500 engineers. The system should automatically review pull requests, identify bugs and security issues, and suggest improvements while integrating with existing workflows."
Step 1: Requirements (R)
Clarifying questions:
- Primary languages? (Python, TypeScript, Go)
- Existing tooling? (GitHub, CI/CD with Jenkins)
- What aspects to review? (bugs, security, style, performance)
- Human-in-the-loop requirement? (AI assists, human approves)
- Latency requirements? (review within 5 minutes of PR creation)
Functional Requirements:
- Analyze PR diffs for bugs, security issues, style violations
- Provide line-by-line comments with explanations
- Suggest fixes with code snippets
- Learn from accepted/rejected suggestions
- Support multi-file context (understand cross-file changes)
Non-Functional Requirements:
- Process 200 PRs/day with an average of 500 changed lines (see the rough token estimate after this list)
- < 5 minute review time for most PRs
- False positive rate < 20% (otherwise developers learn to ignore the tool)
- Integration with GitHub PR workflow
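The throughput targets above translate into a modest LLM budget; a quick back-of-envelope sketch follows (the context multiplier and tokens-per-line figures are assumptions for illustration, not measured values):

# Back-of-envelope token budget (illustrative assumptions, not measured values)
prs_per_day = 200
avg_changed_lines = 500
context_multiplier = 4      # assume full files + related code ≈ 4x the diff itself
tokens_per_line = 12        # assumed average tokens per line of code

tokens_per_pr = avg_changed_lines * context_multiplier * tokens_per_line   # ~24,000 input tokens
tokens_per_day = prs_per_day * tokens_per_pr                               # ~4.8M input tokens
print(f"~{tokens_per_pr:,} input tokens per PR, ~{tokens_per_day / 1e6:.1f}M per day")

A few million input tokens per day, even multiplied across several analyzers, is why real-time, per-PR review is feasible here (see the trade-offs table at the end).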
Step 2: Architecture (A)
┌──────────────────────────────────────────────────────┐
│                    GitHub Webhook                     │
│            (PR Created / Updated Events)              │
└──────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                 PR Processing Queue                   │
│                    (Redis / SQS)                      │
└──────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│               Code Review Orchestrator                │
│                                                       │
│  1. Fetch PR diff + context files                     │
│  2. Chunk large diffs                                 │
│  3. Dispatch to specialized analyzers                 │
│  4. Aggregate and deduplicate findings                │
└──────────────────────────────────────────────────────┘
                           │
       ┌───────────────┬───┴───────────┬───────────────┐
       ▼               ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│     Bug     │ │  Security   │ │    Style    │ │ Performance │
│  Detector   │ │   Scanner   │ │   Checker   │ │  Analyzer   │
│             │ │             │ │             │ │             │
│ (LLM-based) │ │ (Rules+LLM) │ │(Linter+LLM) │ │ (LLM-based) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
       └───────────────┴───┬───────────┴───────────────┘
                           ▼
┌──────────────────────────────────────────────────────┐
│                  Finding Aggregator                   │
│  - Deduplicate similar findings                       │
│  - Rank by severity and confidence                    │
│  - Apply repository-specific rules                    │
└──────────────────────────────────────────────────────┘
                           │
                           ▼
┌──────────────────────────────────────────────────────┐
│                 GitHub Comment Poster                 │
│  - Post inline comments on specific lines             │
│  - Post summary comment on PR                         │
└──────────────────────────────────────────────────────┘
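A minimal sketch of the orchestrator's dispatch step, assuming the context retriever and analyzer classes defined in the later steps; the aggregator and poster interfaces, plus the "raw_diff" and "pr_number" keys, are hypothetical. Running the analyzers concurrently means the 5-minute budget is spent on the slowest analyzer, not the sum of all four:

import asyncio

class ReviewOrchestrator:
    def __init__(self, context_retriever, analyzers, aggregator, poster):
        self.context_retriever = context_retriever
        self.analyzers = analyzers      # e.g. [BugDetector(...), SecurityScanner(...), ...]
        self.aggregator = aggregator    # hypothetical Finding Aggregator with .aggregate()
        self.poster = poster            # hypothetical GitHub Comment Poster with .post_findings()

    async def review(self, pr_diff: dict):
        # 1. Fetch PR diff + context files
        context = await self.context_retriever.get_review_context(pr_diff)

        # 3. Dispatch to specialized analyzers concurrently
        #    (step 2, chunking large diffs, is omitted here for brevity)
        results = await asyncio.gather(
            *(a.analyze(pr_diff["raw_diff"], context) for a in self.analyzers),
            return_exceptions=True,
        )
        findings = [f for r in results if not isinstance(r, Exception) for f in r]

        # 4. Aggregate, deduplicate, and post back to GitHub
        ranked = self.aggregator.aggregate(findings)
        await self.poster.post_findings(pr_diff["pr_number"], ranked)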
Step 3: Data (D)
Context Retrieval Strategy:
class CodeContextRetriever:
    def __init__(self, repo_indexer):
        self.indexer = repo_indexer

    async def get_review_context(self, pr_diff: dict) -> dict:
        context = {
            "changed_files": [],
            "related_files": [],
            "function_definitions": [],
            "type_definitions": [],
            "test_files": []
        }
        for file_change in pr_diff["files"]:
            # Get the full file (not just diff)
            context["changed_files"].append(
                await self.indexer.get_file(file_change["path"])
            )

            # Find related files (imports, callers)
            related = await self.indexer.find_related(
                file_change["path"],
                max_files=5
            )
            context["related_files"].extend(related)

            # Get function/class definitions for changed code
            symbols = await self.indexer.get_symbols(
                file_change["path"],
                lines=file_change["changed_lines"]
            )
            context["function_definitions"].extend(symbols)

            # Find corresponding test files
            test_file = await self.indexer.find_test_file(
                file_change["path"]
            )
            if test_file:
                context["test_files"].append(test_file)

        return context
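The retriever assumes a `repo_indexer` with roughly this shape. The Protocol below is hypothetical, written here only to make the dependency explicit; it is not part of the original design:

from typing import Optional, Protocol

class RepoIndexer(Protocol):
    """Hypothetical interface that CodeContextRetriever depends on."""
    async def get_file(self, path: str) -> str: ...
    async def find_related(self, path: str, max_files: int = 5) -> list[str]: ...
    async def get_symbols(self, path: str, lines: list[int]) -> list[dict]: ...
    async def find_test_file(self, path: str) -> Optional[str]: ...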
Codebase Indexing:
indexing_strategy = {
    "symbol_index": {
        "tool": "tree-sitter",
        "stores": ["function_definitions", "class_definitions", "imports"],
        "update": "on_merge_to_main"
    },
    "semantic_index": {
        "tool": "embeddings",
        "model": "code-embedding-model",
        "chunk_by": "function",
        "update": "daily"
    },
    "dependency_graph": {
        "tool": "custom_analyzer",
        "stores": ["imports", "function_calls", "inheritance"],
        "update": "on_merge_to_main"
    }
}
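As one illustration of what the dependency-graph "custom_analyzer" might collect, here is a minimal sketch that extracts imports and definitions from a Python file using the standard library's `ast` module (a real index would also need language-specific parsers, e.g. tree-sitter grammars, for TypeScript and Go):

import ast

def extract_symbols(source: str) -> dict:
    """Collect imports plus function and class definitions from one Python file."""
    tree = ast.parse(source)
    symbols = {"imports": [], "functions": [], "classes": []}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            symbols["imports"].extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            symbols["imports"].append(node.module or "")
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            symbols["functions"].append(node.name)
        elif isinstance(node, ast.ClassDef):
            symbols["classes"].append(node.name)
    return symbols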
Step 4: Specialized Analyzers
class BugDetector:
    def __init__(self, llm):
        self.llm = llm
        self.prompt = """
        Analyze this code change for potential bugs.

        Changed code:
        {diff}

        Full file context:
        {file_context}

        Related code:
        {related_context}

        For each potential bug found, provide:
        1. Line number
        2. Bug type (null_reference, off_by_one, race_condition, etc.)
        3. Severity (critical, high, medium, low)
        4. Explanation
        5. Suggested fix

        Only report issues you are confident about. Avoid false positives.
        """

    async def analyze(self, diff: str, context: dict) -> list:
        response = await self.llm.complete(
            self.prompt.format(
                diff=diff,
                file_context=context["changed_files"],
                related_context=context["related_files"]
            )
        )
        return self._parse_findings(response)
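`_parse_findings` is left abstract above. A minimal sketch of the method, assuming the prompt is extended to ask for a JSON array of findings (the field names are illustrative, not fixed by the design):

import json

def _parse_findings(self, response: str) -> list:
    """Parse the LLM response into finding dicts; drop anything malformed."""
    try:
        # Tolerate prose around the JSON by slicing to the outermost brackets
        start, end = response.index("["), response.rindex("]") + 1
        findings = json.loads(response[start:end])
    except ValueError:
        # Treat unparseable output as "no findings" rather than failing the review
        return []
    required = {"line", "type", "severity", "explanation", "suggested_fix"}
    return [f for f in findings if isinstance(f, dict) and required.issubset(f)]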
class SecurityScanner:
    def __init__(self, llm, rule_engine):
        self.llm = llm
        self.rules = rule_engine  # Semgrep, CodeQL

    async def analyze(self, diff: str, context: dict) -> list:
        findings = []

        # Rule-based scanning (fast, low false positive)
        rule_findings = await self.rules.scan(diff)
        findings.extend(rule_findings)

        # LLM for complex patterns
        llm_findings = await self._llm_security_check(diff, context)
        findings.extend(llm_findings)

        return self._deduplicate(findings)

    async def _llm_security_check(self, diff: str, context: dict) -> list:
        prompt = """
        Security review for this code change.

        Focus on:
        - SQL injection
        - XSS vulnerabilities
        - Authentication/authorization issues
        - Sensitive data exposure
        - Insecure dependencies

        Code:
        {diff}

        Only report high-confidence security issues.
        """
        response = await self.llm.complete(prompt.format(diff=diff))
        return self._parse_security_findings(response)
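Rule-based and LLM findings often overlap on obvious issues (e.g. string-built SQL), so `_deduplicate` matters. A sketch, assuming each finding is a dict carrying at least a file path, line number, issue type, and severity:

def _deduplicate(self, findings: list) -> list:
    """Keep one finding per (file, line, issue type), preferring the higher severity."""
    severity_rank = {"critical": 3, "high": 2, "medium": 1, "low": 0}
    best = {}
    for f in findings:
        key = (f.get("file"), f.get("line"), f.get("type"))
        current = best.get(key)
        new_rank = severity_rank.get(f.get("severity"), 0)
        old_rank = severity_rank.get(current.get("severity"), 0) if current else -1
        if new_rank > old_rank:
            best[key] = f
    return list(best.values())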
Step 5: Evaluation & Feedback Loop
Accuracy Measurement:
from datetime import datetime

class ReviewAccuracyTracker:
    def __init__(self, db):
        self.db = db

    async def track_suggestion(self, suggestion_id: str, pr_id: str):
        """Track each suggestion made by the system."""
        await self.db.insert("suggestions", {
            "id": suggestion_id,
            "pr_id": pr_id,
            "timestamp": datetime.utcnow(),
            "status": "pending"
        })

    async def record_outcome(self, suggestion_id: str, outcome: str):
        """Record the developer's response to a suggestion."""
        # outcome: "accepted", "rejected", "modified", "ignored"
        await self.db.update("suggestions", suggestion_id, {
            "status": outcome,
            "resolved_at": datetime.utcnow()
        })

    async def get_metrics(self, time_range: str = "7d") -> dict:
        suggestions = await self.db.query(
            "suggestions",
            time_range=time_range
        )
        total = len(suggestions)
        accepted = sum(1 for s in suggestions if s["status"] == "accepted")
        rejected = sum(1 for s in suggestions if s["status"] == "rejected")
        return {
            "total_suggestions": total,
            "acceptance_rate": accepted / total if total > 0 else 0,
            "rejection_rate": rejected / total if total > 0 else 0,
            "precision": accepted / (accepted + rejected) if (accepted + rejected) > 0 else 0
        }
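Paired with the false-positive target from Step 1, the tracker becomes an early-warning signal; a small illustrative check, using rejection rate as a proxy for false positives (the `alert` object and its `send` method are hypothetical):

async def check_review_quality(tracker: ReviewAccuracyTracker, alert) -> dict:
    """Alert when the rejection rate drifts above the < 20% false-positive target."""
    metrics = await tracker.get_metrics(time_range="7d")
    if metrics["total_suggestions"] >= 50 and metrics["rejection_rate"] > 0.20:
        await alert.send(
            f"Code review agent rejection rate at {metrics['rejection_rate']:.0%} "
            f"over the last 7 days (target < 20%)"
        )
    return metrics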
Continuous Improvement:
improvement_pipeline = {
    "data_collection": {
        "accepted_suggestions": "High-quality training examples",
        "rejected_with_explanation": "Negative examples",
        "human_comments_not_caught": "Missing patterns"
    },
    "weekly_review": {
        "false_positive_analysis": "Why did we flag this incorrectly?",
        "false_negative_analysis": "What did we miss?",
        "prompt_refinement": "Adjust prompts based on patterns"
    },
    "monthly_fine_tuning": {
        "collect_examples": "Accepted suggestions + human reviews",
        "fine_tune_model": "Improve domain-specific accuracy",
        "a_b_test": "Compare against baseline"
    }
}
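One concrete form of "prompt_refinement" is to fold recently rejected findings back into the analyzer prompt as explicit "do not flag" examples. A minimal sketch, assuming the stored suggestion records also keep the finding's type and explanation (not shown in the tracker above):

def build_negative_examples_block(rejected_suggestions: list, max_examples: int = 5) -> str:
    """Turn recently rejected findings into a 'do not flag' section for an analyzer prompt."""
    lines = ["Known false positives - do NOT report patterns like these:"]
    for s in rejected_suggestions[:max_examples]:
        lines.append(f"- {s.get('type', 'unknown')}: {s.get('explanation', '')[:200]}")
    return "\n".join(lines)

# Appended before each review, e.g.:
# prompt = detector.prompt + "\n" + build_negative_examples_block(recent_rejections)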
Trade-offs Analysis
| Decision | Trade-off | Choice | Reason |
|---|---|---|---|
| Full file vs. diff only | Context vs. token cost | Full file | Better accuracy worth the cost |
| Single model vs. specialized | Simplicity vs. accuracy | Specialized | Different tasks need different prompts |
| Block PR vs. advisory | Friction vs. safety | Advisory | Build trust first |
| Real-time vs. batch | Latency vs. efficiency | Real-time | Developer workflow expectation |
Interview Tip
Key points for this case study:
- Accuracy is critical - Developers will ignore a noisy tool
- Context matters - Code review needs cross-file understanding
- Feedback loop - Track acceptance/rejection for improvement
- Hybrid approach - Combine rules (Semgrep) with LLM
Finally, let's cover interview tips and common mistakes to avoid.