Lesson 22 of 23

Interview Case Studies

Case Study: Code Review Agent

4 min read

This case study focuses on a deeply technical problem with distinct challenges: evaluating AI accuracy, handling large codebases, and earning developer trust.

The Interview Question

"Design an AI code review system for a company with 500 engineers. The system should automatically review pull requests, identify bugs and security issues, and suggest improvements while integrating with existing workflows."

Step 1: Requirements (R)

Clarifying questions:

  • Primary languages? (Python, TypeScript, Go)
  • Existing tooling? (GitHub, CI/CD with Jenkins)
  • What aspects to review? (bugs, security, style, performance)
  • Human-in-the-loop requirement? (AI assists, human approves)
  • Latency requirements? (review within 5 minutes of PR creation)

Functional Requirements:

  • Analyze PR diffs for bugs, security issues, style violations
  • Provide line-by-line comments with explanations
  • Suggest fixes with code snippets
  • Learn from accepted/rejected suggestions
  • Support multi-file context (understand cross-file changes)

Non-Functional Requirements:

  • Process 200 PRs/day with an average of 500 lines changed (see the sizing sketch below)
  • < 5 minute review time for most PRs
  • False positive rate < 20% (or developers ignore it)
  • Integration with GitHub PR workflow
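
These throughput numbers are worth sanity-checking out loud in the interview. A rough sizing sketch in Python (the token-per-line and context figures are illustrative assumptions, not measurements):

# Back-of-envelope load estimate; all constants are illustrative assumptions.
PRS_PER_DAY = 200
LINES_PER_PR = 500
TOKENS_PER_LINE = 10          # assumption: rough average for code
CONTEXT_MULTIPLIER = 4        # assumption: full files + related files vs. diff alone
ANALYZERS = 4                 # bug, security, style, performance

tokens_per_pr = LINES_PER_PR * TOKENS_PER_LINE * CONTEXT_MULTIPLIER
daily_tokens = PRS_PER_DAY * tokens_per_pr * ANALYZERS

print(f"~{tokens_per_pr:,} input tokens per PR per analyzer")   # ~20,000
print(f"~{daily_tokens:,} input tokens per day")                # ~16,000,000

A few tens of millions of input tokens per day is modest at this scale, which is one reason the "full file vs. diff only" trade-off later in this design favors full files.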

Step 2: Architecture (A)

┌─────────────────────────────────────────────────────────────────────┐
│                     GitHub Webhook                                   │
│                (PR Created / Updated Events)                         │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                     PR Processing Queue                              │
│                      (Redis / SQS)                                   │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                     Code Review Orchestrator                         │
│                                                                      │
│  1. Fetch PR diff + context files                                   │
│  2. Chunk large diffs                                               │
│  3. Dispatch to specialized analyzers                               │
│  4. Aggregate and deduplicate findings                              │
└─────────────────────────────────────────────────────────────────────┘
       ┌───────────────┬───────────────┬───────────────┐
       ▼               ▼               ▼               ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│   Bug       │ │  Security   │ │   Style     │ │ Performance │
│  Detector   │ │  Scanner    │ │  Checker    │ │  Analyzer   │
│             │ │             │ │             │ │             │
│ (LLM-based) │ │(Rules+LLM)  │ │(Linter+LLM) │ │ (LLM-based) │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                    Finding Aggregator                                │
│  - Deduplicate similar findings                                     │
│  - Rank by severity and confidence                                  │
│  - Apply repository-specific rules                                  │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│                    GitHub Comment Poster                             │
│  - Post inline comments on specific lines                           │
│  - Post summary comment on PR                                       │
└─────────────────────────────────────────────────────────────────────┘
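
A minimal sketch of the orchestrator's fan-out step, assuming the analyzer interface from Step 4 and hypothetical github_client and aggregator helpers:

import asyncio

class ReviewOrchestrator:
    """Fans a PR out to the specialized analyzers and aggregates the results."""

    def __init__(self, analyzers, aggregator, github_client):
        self.analyzers = analyzers      # e.g. [BugDetector, SecurityScanner, ...]
        self.aggregator = aggregator
        self.github = github_client

    async def review(self, pr_id: str) -> None:
        diff = await self.github.get_diff(pr_id)
        context = await self.github.get_context(pr_id)

        # Run analyzers concurrently; return_exceptions=True keeps one failing
        # analyzer from sinking the whole review.
        results = await asyncio.gather(
            *(a.analyze(diff, context) for a in self.analyzers),
            return_exceptions=True,
        )

        findings = []
        for result in results:
            if isinstance(result, Exception):
                continue  # log and move on rather than failing the review
            findings.extend(result)

        await self.github.post_comments(pr_id, self.aggregator.rank(findings))

Running the analyzers concurrently rather than sequentially is what makes the < 5 minute review target realistic.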

Step 3: Data (D)

Context Retrieval Strategy:

class CodeContextRetriever:
    def __init__(self, repo_indexer):
        self.indexer = repo_indexer

    async def get_review_context(self, pr_diff: dict) -> dict:
        context = {
            "changed_files": [],
            "related_files": [],
            "function_definitions": [],
            "type_definitions": [],
            "test_files": []
        }

        for file_change in pr_diff["files"]:
            # Get the full file (not just diff)
            context["changed_files"].append(
                await self.indexer.get_file(file_change["path"])
            )

            # Find related files (imports, callers)
            related = await self.indexer.find_related(
                file_change["path"],
                max_files=5
            )
            context["related_files"].extend(related)

            # Get function/class definitions for changed code
            symbols = await self.indexer.get_symbols(
                file_change["path"],
                lines=file_change["changed_lines"]
            )
            context["function_definitions"].extend(symbols)

            # Find corresponding test files
            test_file = await self.indexer.find_test_file(
                file_change["path"]
            )
            if test_file:
                context["test_files"].append(test_file)

        return context
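
The indexer methods above do the heavy lifting. find_test_file is often just naming conventions; a minimal sketch covering the three languages from the requirements (the conventions are assumptions about repo layout):

from pathlib import Path

def find_test_file(source_path: str, repo_root: str = ".") -> str | None:
    """Map a source file to its test file via common naming conventions."""
    src = Path(source_path)
    candidates = [
        src.with_name(f"test_{src.name}"),                # pytest: tests beside code
        Path("tests") / src.parent / f"test_{src.name}",  # pytest: mirrored tests/ tree
        src.with_name(f"{src.stem}_test{src.suffix}"),    # Go: foo_test.go
        src.with_name(f"{src.stem}.test{src.suffix}"),    # TS/Jest: foo.test.ts
    ]
    for candidate in candidates:
        if (Path(repo_root) / candidate).exists():
            return str(candidate)
    return None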

Codebase Indexing:

indexing_strategy = {
    "symbol_index": {
        "tool": "tree-sitter",
        "stores": ["function_definitions", "class_definitions", "imports"],
        "update": "on_merge_to_main"
    },
    "semantic_index": {
        "tool": "embeddings",
        "model": "code-embedding-model",
        "chunk_by": "function",
        "update": "daily"
    },
    "dependency_graph": {
        "tool": "custom_analyzer",
        "stores": ["imports", "function_calls", "inheritance"],
        "update": "on_merge_to_main"
    }
}
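
For the symbol_index, tree-sitter parsing is straightforward. A minimal sketch using the Python bindings (assumes the tree-sitter and tree-sitter-python packages; constructor signatures vary slightly between binding versions):

from tree_sitter import Language, Parser
import tree_sitter_python

PY_LANGUAGE = Language(tree_sitter_python.language())
parser = Parser(PY_LANGUAGE)

def extract_symbols(source: bytes) -> list[dict]:
    """Collect function and class definitions with their line ranges."""
    tree = parser.parse(source)
    symbols = []
    stack = [tree.root_node]
    while stack:
        node = stack.pop()
        if node.type in ("function_definition", "class_definition"):
            name = node.child_by_field_name("name")
            symbols.append({
                "name": name.text.decode() if name else "<anonymous>",
                "kind": node.type,
                "start_line": node.start_point[0] + 1,  # tree-sitter rows are 0-based
                "end_line": node.end_point[0] + 1,
            })
        stack.extend(node.children)
    return symbols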

Step 4: Specialized Analyzers

class BugDetector:
    def __init__(self, llm):
        self.llm = llm
        self.prompt = """
        Analyze this code change for potential bugs.

        Changed code:
        {diff}

        Full file context:
        {file_context}

        Related code:
        {related_context}

        For each potential bug found, provide:
        1. Line number
        2. Bug type (null_reference, off_by_one, race_condition, etc.)
        3. Severity (critical, high, medium, low)
        4. Explanation
        5. Suggested fix

        Only report issues you are confident about. Avoid false positives.
        """

    async def analyze(self, diff: str, context: dict) -> list:
        response = await self.llm.complete(
            self.prompt.format(
                diff=diff,
                file_context=context["changed_files"],
                related_context=context["related_files"]
            )
        )
        return self._parse_findings(response)
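
The prompt above asks for a numbered list, but parsing is far more reliable if you ask the model for JSON instead. A minimal sketch of what _parse_findings could look like under that assumption:

import json

def parse_findings(response: str) -> list[dict]:
    """Parse LLM output into finding dicts; assumes the prompt requested a
    JSON array of objects. Malformed output yields no findings, not a crash."""
    try:
        raw = json.loads(response)
    except json.JSONDecodeError:
        return []
    if not isinstance(raw, list):
        return []
    required = {"line", "type", "severity", "explanation"}
    return [f for f in raw if isinstance(f, dict) and required <= f.keys()]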


class SecurityScanner:
    def __init__(self, llm, rule_engine):
        self.llm = llm
        self.rules = rule_engine  # e.g., Semgrep or CodeQL

    async def analyze(self, diff: str, context: dict) -> list:
        findings = []

        # Rule-based scanning (fast, low false-positive rate)
        rule_findings = await self.rules.scan(diff)
        findings.extend(rule_findings)

        # LLM for complex patterns
        llm_findings = await self._llm_security_check(diff, context)
        findings.extend(llm_findings)

        return self._deduplicate(findings)

    async def _llm_security_check(self, diff: str, context: dict) -> list:
        prompt = """
        Security review for this code change.

        Focus on:
        - SQL injection
        - XSS vulnerabilities
        - Authentication/authorization issues
        - Sensitive data exposure
        - Insecure dependencies

        Code:
        {diff}

        Only report high-confidence security issues.
        """
        response = await self.llm.complete(prompt.format(diff=diff))
        return self._parse_security_findings(response)
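
Both analyzers hand off to deduplication (self._deduplicate above, and the Finding Aggregator in the architecture). A minimal sketch of that step; the file and confidence fields are assumptions:

def deduplicate(findings: list[dict]) -> list[dict]:
    """Collapse findings that flag the same file/line/type, keeping the most
    confident one (rule-based hits typically carry higher confidence than
    LLM hits)."""
    best: dict[tuple, dict] = {}
    for finding in findings:
        key = (finding["file"], finding["line"], finding["type"])
        kept = best.get(key)
        if kept is None or finding["confidence"] > kept["confidence"]:
            best[key] = finding

    # Rank what survives by severity first, then confidence.
    severity_rank = {"critical": 0, "high": 1, "medium": 2, "low": 3}
    return sorted(
        best.values(),
        key=lambda f: (severity_rank.get(f["severity"], 4), -f["confidence"]),
    )

Keying on (file, line, type) means a Semgrep hit and an LLM hit on the same line collapse into one comment, which directly protects the false-positive budget.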

Step 5: Evaluation & Feedback Loop

Accuracy Measurement:

from datetime import datetime, timezone

class ReviewAccuracyTracker:
    def __init__(self, db):
        self.db = db

    async def track_suggestion(self, suggestion_id: str, pr_id: str):
        """Track each suggestion made by the system."""
        await self.db.insert("suggestions", {
            "id": suggestion_id,
            "pr_id": pr_id,
            "timestamp": datetime.utcnow(),
            "status": "pending"
        })

    async def record_outcome(self, suggestion_id: str, outcome: str):
        """Record developer response to suggestion."""
        # outcome: "accepted", "rejected", "modified", "ignored"
        await self.db.update("suggestions", suggestion_id, {
            "status": outcome,
            "resolved_at": datetime.utcnow()
        })

    async def get_metrics(self, time_range: str = "7d") -> dict:
        suggestions = await self.db.query(
            "suggestions",
            time_range=time_range
        )

        total = len(suggestions)
        accepted = sum(1 for s in suggestions if s["status"] == "accepted")
        rejected = sum(1 for s in suggestions if s["status"] == "rejected")

        return {
            "total_suggestions": total,
            "acceptance_rate": accepted / total if total > 0 else 0,
            "rejection_rate": rejected / total if total > 0 else 0,
            "precision": accepted / (accepted + rejected) if (accepted + rejected) > 0 else 0
        }
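
How do outcomes get recorded in the first place? Explicit signals are ideal (a developer applying a GitHub suggested change is a clear "accepted"), but a fallback heuristic helps. A hedged sketch, with field names that are assumptions:

def infer_outcome(suggestion: dict, merged_diff: str) -> str:
    """Fallback heuristic when no explicit signal exists: if the suggested
    snippet made it into the merged diff, count the suggestion as accepted."""
    snippet = suggestion.get("suggested_code", "").strip()
    if snippet and snippet in merged_diff:
        return "accepted"
    if suggestion.get("thread_resolved"):
        return "modified"  # resolved without taking the snippet verbatim
    return "ignored"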

Continuous Improvement:

improvement_pipeline = {
    "data_collection": {
        "accepted_suggestions": "High-quality training examples",
        "rejected_with_explanation": "Negative examples",
        "human_comments_not_caught": "Missing patterns"
    },
    "weekly_review": {
        "false_positive_analysis": "Why did we flag this incorrectly?",
        "false_negative_analysis": "What did we miss?",
        "prompt_refinement": "Adjust prompts based on patterns"
    },
    "monthly_fine_tuning": {
        "collect_examples": "Accepted suggestions + human reviews",
        "fine_tune_model": "Improve domain-specific accuracy",
        "a_b_test": "Compare against baseline"
    }
}
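
A minimal sketch of the collect_examples step, assuming the suggestion records above and a chat-style JSONL format (field names and message format are assumptions; adapt them to your fine-tuning API):

import json

def build_finetune_dataset(suggestions: list[dict], out_path: str) -> int:
    """Write accepted suggestions as chat-style JSONL training examples."""
    count = 0
    with open(out_path, "w") as f:
        for s in suggestions:
            if s["status"] != "accepted":
                continue
            example = {
                "messages": [
                    {"role": "user", "content": f"Review this diff:\n{s['diff']}"},
                    {"role": "assistant", "content": s["suggestion_text"]},
                ]
            }
            f.write(json.dumps(example) + "\n")
            count += 1
    return count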

Trade-offs Analysis

Decision                      Trade-off                Choice       Reason
----------------------------  -----------------------  -----------  --------------------------------------
Full file vs. diff only       Context vs. token cost   Full file    Better accuracy worth the cost
Single model vs. specialized  Simplicity vs. accuracy  Specialized  Different tasks need different prompts
Block PR vs. advisory         Friction vs. safety      Advisory     Build trust first
Real-time vs. batch           Latency vs. efficiency   Real-time    Developer workflow expectation

Interview Tip

Key points for this case study:

  1. Accuracy is critical - Developers will ignore a noisy tool
  2. Context matters - Code review needs cross-file understanding
  3. Feedback loop - Track acceptance/rejection for improvement
  4. Hybrid approach - Combine rules (Semgrep) with LLM

Finally, let's cover interview tips and common mistakes to avoid.
