# Root Cause Analysis with AI

## Beyond Simple Debugging

Complex bugs often have symptoms far from their causes. AI assistants excel at:

- Correlating symptoms across files and systems
- Recognizing patterns from similar bugs
- Tracing execution paths systematically
- Suggesting hypotheses to test

## The AI-Assisted Debugging Process

### Step 1: Symptom Collection

```bash
claude "I'm seeing an issue where orders sometimes fail
silently. Help me collect relevant information:
1. Recent error logs
2. Related code sections
3. Database state during failures
4. External API responses"
```

AI creates a diagnostic checklist:

## Diagnostic Collection Plan

### Error Logs (run these commands)
```bash
grep -i "order" logs/app.log | grep -i "error" | tail -100
grep "OrderService" logs/app.log | grep -E "(WARN|ERROR)" | tail -50
```

### Code to Review

- src/services/order.ts (order creation)
- src/services/payment.ts (payment processing)
- src/middleware/error.ts (error handling)

### Database Queries

```sql
SELECT * FROM orders WHERE status = 'failed'
  AND created_at > NOW() - INTERVAL '24 hours';
SELECT * FROM payment_intents WHERE order_id IN (
  SELECT id FROM orders WHERE status = 'failed'
);
```

### API Response Logging

Check if Stripe webhook responses are logged at: `logs/webhooks/stripe/*.log`


### Step 2: Pattern Recognition

```bash
claude "Here are the error logs from failed orders.
Identify patterns:
- Common timestamps or intervals
- Shared user attributes
- Payment method correlations
- Request path similarities"
```

AI identifies patterns:

## Pattern Analysis

### Temporal Pattern
- 80% of failures occur between 2:00-4:00 AM UTC
- Correlates with nightly batch job timing
- Peak failure rate: 3:15 AM (database backup window)

### User Pattern
- No correlation with user type or location
- Affects both new and returning customers

### Payment Pattern
- 95% of failures involve 3D Secure cards
- European banks disproportionately affected
- Failures cluster in specific BIN ranges; test cards (4242**, 5555**) work fine

### Request Pattern
- All failed orders have response time > 25s
- Normal orders: avg 2.3s response time

### Hypothesis
The 30-second timeout on payment confirmation conflicts
with 3D Secure flow, which can take 45+ seconds for
European banks, especially during backup window when
DB is under load.
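
A temporal pattern like the one above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming the failure timestamps have already been extracted from the logs as ISO-8601 strings; `failuresByUtcHour` and `peakHour` are hypothetical helper names, not part of any library.

```typescript
// Count failures per UTC hour to surface time-of-day clustering.
// Assumes each timestamp is an ISO-8601 string that Date can parse.
function failuresByUtcHour(timestamps: string[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const ts of timestamps) {
    const time = new Date(ts).getTime();
    if (Number.isNaN(time)) continue; // skip unparseable entries
    const hour = new Date(time).getUTCHours();
    counts.set(hour, (counts.get(hour) ?? 0) + 1);
  }
  return counts;
}

// Return the busiest hour and its share of all failures,
// or undefined when there is nothing to analyse.
function peakHour(
  counts: Map<number, number>,
): { hour: number; share: number } | undefined {
  let total = 0;
  let bestHour = -1;
  let bestCount = 0;
  counts.forEach((count, hour) => {
    total += count;
    if (count > bestCount) {
      bestCount = count;
      bestHour = hour;
    }
  });
  return total === 0 ? undefined : { hour: bestHour, share: bestCount / total };
}
```

A high `share` for a single hour is a prompt to look at what else runs then (batch jobs, backups); it is evidence for a hypothesis, not proof of a cause.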

### Step 3: Hypothesis Testing

```bash
claude "Your hypothesis about the 30s timeout seems likely.
Help me verify:
1. Find where the timeout is set
2. Check if there's retry logic
3. Look for related issues in our history
4. Suggest a test to confirm"
```

```typescript
// AI finds the culprit
// src/services/payment.ts:145

const paymentResult = await stripe.paymentIntents.confirm(
  intentId,
  { timeout: 30000 } // 30 second timeout - TOO SHORT!
);

// No retry logic found
// Related commits show this was added for "performance"
// in commit abc123 on 2025-08-15
```

### Step 4: Fix Verification

```bash
claude "I've increased the timeout to 60s and added retry
logic. Help me:
1. Write a test that reproduces the original issue
2. Verify the fix handles the edge case
3. Check for any side effects of the change"
```
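
For the first item, one option is to reproduce the slow confirmation with a stand-in function instead of calling Stripe. The following is a minimal sketch, assuming the fix wraps the confirmation in a per-attempt timeout plus a bounded retry. `withTimeout`, `retry`, `slowConfirm`, and `reproduce` are hypothetical names, the delays are scaled down to milliseconds, and retrying a real payment confirmation is only safe if the call is idempotent.

```typescript
// Reject if `op` does not settle within `ms` milliseconds.
function withTimeout<T>(op: () => Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(
      () => reject(new Error(`timed out after ${ms}ms`)),
      ms,
    );
    op().then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Run `op` up to `attempts` times and rethrow the last error.
async function retry<T>(op: () => Promise<T>, attempts: number): Promise<T> {
  let lastError: unknown;
  for (let i = 0; i < attempts; i++) {
    try {
      return await op();
    } catch (err) {
      lastError = err;
    }
  }
  throw lastError;
}

// Stand-in for a slow 3D Secure confirmation (hypothetical; not the Stripe API).
function slowConfirm(delayMs: number): () => Promise<string> {
  return () =>
    new Promise<string>((resolve) =>
      setTimeout(() => resolve("succeeded"), delayMs),
    );
}

// `before` uses a timeout shorter than the confirmation (the original bug);
// `after` uses a longer timeout with one retry available (the fix).
async function reproduce(): Promise<{ before: string; after: string }> {
  const before = await retry(() => withTimeout(slowConfirm(50), 20), 1).catch(
    (err: Error) => err.message,
  );
  const after = await retry(() => withTimeout(slowConfirm(50), 200), 2);
  return { before, after };
}
```

Keeping the timeout and retry logic generic makes the failure reproducible in milliseconds, so the regression test can run on every commit without touching a payment provider.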

## Common Debugging Scenarios

### Race Condition Detection

```bash
claude "Users report duplicate entries appearing randomly.
Analyze this code for race conditions:
[paste code]"
```

AI identifies race condition:

## Race Condition Found

Location: src/services/inventory.ts:78

```typescript
// PROBLEM: Check-then-act race condition
async function reserveItem(itemId: string) {
  const item = await db.items.findUnique({ where: { id: itemId }});

  if (item.quantity > 0) {  // Check
    // ⚠️ Another request can modify between check and update
    await db.items.update({  // Act
      where: { id: itemId },
      data: { quantity: item.quantity - 1 }
    });
  }
}
```

### Fix: Use atomic operation

```typescript
async function reserveItem(itemId: string) {
  const result = await db.items.updateMany({
    where: {
      id: itemId,
      quantity: { gt: 0 }  // Atomic check-and-update
    },
    data: { quantity: { decrement: 1 } }
  });

  if (result.count === 0) {
    throw new OutOfStockError(itemId);
  }
}
```
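
To see why a check-then-act sequence oversells, the race can be reproduced without a database. The following is a minimal sketch using an in-memory stand-in for the items table; the `FakeStore` class and its methods are hypothetical and are not Prisma's API. Each `await` marks a point where another request can interleave.

```typescript
// In-memory stand-in for the items table (hypothetical; not Prisma).
class FakeStore {
  quantity: number;

  constructor(quantity: number) {
    this.quantity = quantity;
  }

  async read(): Promise<number> {
    return this.quantity;
  }

  async write(quantity: number): Promise<void> {
    this.quantity = quantity;
  }

  // Check and decrement in one step, with no await in between.
  async decrementIfPositive(): Promise<boolean> {
    if (this.quantity <= 0) return false;
    this.quantity -= 1;
    return true;
  }
}

// Check-then-act: the read and the write are separate awaited steps.
async function reserveRacy(store: FakeStore): Promise<boolean> {
  const quantity = await store.read();
  if (quantity <= 0) return false;
  await store.write(quantity - 1);
  return true;
}

// Atomic variant: the condition and the update happen together.
async function reserveAtomic(store: FakeStore): Promise<boolean> {
  return store.decrementIfPositive();
}

// Fire `n` concurrent reservations against `stock` units and
// count how many of them reported success.
async function countReservations(
  reserve: (store: FakeStore) => Promise<boolean>,
  stock: number,
  n: number,
): Promise<number> {
  const store = new FakeStore(stock);
  const results = await Promise.all(
    Array.from({ length: n }, () => reserve(store)),
  );
  return results.filter(Boolean).length;
}
```

With a stock of 1 and two concurrent requests, `reserveRacy` lets both succeed because each reads the quantity before either writes, while `reserveAtomic` lets only one through. The same shape works as a regression test against a real database if the stand-in is replaced with actual queries.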


### Memory Leak Investigation

```bash
claude "Our Node.js server memory grows over time.
Analyze these heap snapshots and identify the leak."
```

## Memory Leak Analysis

### Growing Objects
1. EventEmitter listeners: +500 every hour
2. Cached database connections: not being released
3. Request context objects: held in closure

### Root Cause
src/middleware/context.ts:23

```typescript
const requestContexts = new Map();  // Never cleaned!

function createContext(req) {
  const ctx = { user: req.user, timestamp: Date.now() };
  requestContexts.set(req.id, ctx);  // Added
  return ctx;
  // Missing: cleanup after request completes
}
```

### Fix

```typescript
function createContext(req, res) {
  const ctx = { user: req.user, timestamp: Date.now() };
  requestContexts.set(req.id, ctx);

  // Clean up when the response ends. 'close' is used instead of 'finish'
  // because it also fires when the client disconnects early.
  res.once('close', () => requestContexts.delete(req.id));

  return ctx;
}
```
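
One way to confirm a fix like this is to assert that the map drains once responses close. The following is a minimal sketch with a hand-rolled stand-in for the response object; `FakeResponse`, `trackContext`, `liveContexts`, and `simulateRequests` are hypothetical names, and a real test would use the framework's own request and response objects. The sketch cleans up on a close notification, which on current Node.js versions is emitted for both completed and aborted responses.

```typescript
// Minimal stand-in for an HTTP response (hypothetical).
// onClose plays the role of res.once('close', listener).
class FakeResponse {
  private listeners: Array<() => void> = [];

  onClose(listener: () => void): void {
    this.listeners.push(listener);
  }

  // Fire each registered listener exactly once.
  close(): void {
    const pending = this.listeners;
    this.listeners = [];
    pending.forEach((listener) => listener());
  }
}

interface RequestContext {
  user: string;
  timestamp: number;
}

const liveContexts = new Map<string, RequestContext>();

// Register a context and remove it again when the response closes.
function trackContext(
  requestId: string,
  user: string,
  res: FakeResponse,
): RequestContext {
  const ctx: RequestContext = { user, timestamp: Date.now() };
  liveContexts.set(requestId, ctx);
  res.onClose(() => {
    liveContexts.delete(requestId);
  });
  return ctx;
}

// Simulate `count` requests that all complete, then report how many
// contexts are still held.
function simulateRequests(count: number): number {
  const responses: FakeResponse[] = [];
  for (let i = 0; i < count; i++) {
    const res = new FakeResponse();
    trackContext(`req-${i}`, `user-${i}`, res);
    responses.push(res);
  }
  responses.forEach((res) => res.close());
  return liveContexts.size;
}
```

If the cleanup listener is removed from `trackContext`, `simulateRequests` returns the request count instead of zero, which is the leak in miniature.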


### Deadlock Analysis

```bash
claude "Database operations are hanging intermittently.
Check for potential deadlocks in our transaction code."
```
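
A frequent cause of intermittent hangs is two transactions locking the same rows or tables in opposite orders. The following is a minimal sketch of a static check, assuming the lock acquisition order of each transaction has already been listed, by hand or by the assistant. It builds a "locked before" graph and searches it for a cycle, which signals a potential deadlock. `buildLockGraph` and `findLockCycle` are hypothetical names, and the table names in the note below are made up.

```typescript
// Map each resource to the set of resources that some transaction locks after it.
// Assumes each resource appears at most once per transaction.
function buildLockGraph(lockOrders: string[][]): Map<string, Set<string>> {
  const graph = new Map<string, Set<string>>();
  for (const order of lockOrders) {
    order.forEach((resource, i) => {
      const later = graph.get(resource) ?? new Set<string>();
      order.slice(i + 1).forEach((next) => later.add(next));
      graph.set(resource, later);
    });
  }
  return graph;
}

// Depth-first search for a cycle. Returns the resources on the cycle,
// or null if every transaction is consistent with one global lock order.
function findLockCycle(lockOrders: string[][]): string[] | null {
  const graph = buildLockGraph(lockOrders);
  const done = new Set<string>();
  const path: string[] = [];

  const visit = (node: string): string[] | null => {
    const start = path.indexOf(node);
    if (start !== -1) return path.slice(start); // already on the current path
    if (done.has(node)) return null; // fully explored earlier
    path.push(node);
    const neighbours = graph.get(node);
    if (neighbours !== undefined) {
      for (const next of Array.from(neighbours)) {
        const cycle = visit(next);
        if (cycle !== null) return cycle;
      }
    }
    path.pop();
    done.add(node);
    return null;
  };

  for (const node of Array.from(graph.keys())) {
    const cycle = visit(node);
    if (cycle !== null) return cycle;
  }
  return null;
}
```

For example, `findLockCycle([["orders", "inventory"], ["inventory", "orders"]])` reports a cycle through both tables, while two transactions that both lock `orders` before `inventory` yield `null`. The usual remedy is to acquire locks in one agreed order everywhere.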

## Building a Debugging Toolkit

### Create Reusable Diagnostic Scripts

```bash
claude "Create a diagnostic script that:
1. Checks service health
2. Reports database connection pool status
3. Shows recent error rates
4. Identifies slow queries"
```

AI generates scripts/diagnose.sh:

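```bash
#!/bin/bash
echo "=== Service Health ==="
curl -s localhost:3000/health | jq .

echo "=== DB Connection Pool ==="
psql "$DATABASE_URL" -c "SELECT * FROM pg_stat_activity
  WHERE application_name = 'myapp';"

echo "=== Error Rate (last hour) ==="
# Assumes each log line starts with an ISO-8601 UTC timestamp
# and that GNU date is available; adjust for your log format.
since=$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)
awk -v since="$since" '$1 >= since && /ERROR/ { n++ }
  END { printf "%d errors/hour\n", n }' logs/app.log

echo "=== Slow Queries (>1s) ==="
# Assumes the duration in milliseconds is the last field;
# "+ 0" forces a numeric comparison even with a unit suffix.
grep "duration:" logs/db.log | awk '($NF + 0) > 1000' | tail -10
```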

## Best Practices for AI-Assisted Debugging

- Collect Before Hypothesizing: Gather data first
- Share Full Context: Include logs, stack traces, reproduction steps
- Test Hypotheses Systematically: One variable at a time
- Document Findings: Create post-mortems for complex bugs
- Build Institutional Memory: Save diagnostic patterns

## Next Steps

In the next lesson, we'll cover performance optimization techniques using AI analysis.