# Production Monitoring and Incident Response
## Observability for AI-Generated Code
AI-generated code requires enhanced observability to:
- Detect unexpected behaviors quickly
- Understand complex interactions
- Trace issues through the system
- Learn from production patterns
## The Three Pillars of Observability

### 1. Structured Logging

```typescript
// lib/logger.ts
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label })
},
timestamp: pino.stdTimeFunctions.isoTime,
base: {
service: process.env.SERVICE_NAME,
version: process.env.APP_VERSION,
environment: process.env.NODE_ENV
}
});
// Contextual logging for tracing
export function createRequestLogger(requestId: string, userId?: string) {
return logger.child({
requestId,
userId,
traceId: getTraceId()
});
}
// Usage in handlers
async function handleOrder(req: Request) {
const log = createRequestLogger(req.id, req.user?.id);
const order = await req.json(); // parse the order payload from the request body
log.info({ orderId: order.id }, 'Processing order');
try {
const result = await processOrder(order);
log.info({ orderId: order.id, result: 'success' }, 'Order completed');
return result;
} catch (error) {
log.error({
orderId: order.id,
error: error.message,
stack: error.stack
}, 'Order processing failed');
throw error;
}
}
```
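The `getTraceId()` helper used in `createRequestLogger` is not defined in the snippet. If the service runs OpenTelemetry (see pillar 3 below), one straightforward implementation, sketched here under that assumption, reads the id of the currently active span:

```typescript
// lib/trace-id.ts
// One possible implementation of the getTraceId() helper used above:
// pull the trace id from the active OpenTelemetry span, if there is one.
import { trace } from '@opentelemetry/api';

export function getTraceId(): string | undefined {
  return trace.getActiveSpan()?.spanContext().traceId;
}
```

Logging the trace id alongside `requestId` is what lets you jump from a single log line straight to the corresponding distributed trace.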
### 2. Metrics Collection

```typescript
// lib/metrics.ts
import { Counter, Histogram, Registry } from 'prom-client';
const registry = new Registry();
// Request metrics
const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});
const httpRequestTotal = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
// Business metrics
const ordersProcessed = new Counter({
name: 'orders_processed_total',
help: 'Total orders processed',
labelNames: ['status', 'payment_method']
});
const paymentAmount = new Histogram({
name: 'payment_amount_dollars',
help: 'Payment amounts in dollars',
labelNames: ['currency'],
buckets: [10, 50, 100, 500, 1000, 5000]
});
registry.registerMetric(httpRequestDuration);
registry.registerMetric(httpRequestTotal);
registry.registerMetric(ordersProcessed);
registry.registerMetric(paymentAmount);
```
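Registering metrics is only half the job; they also need to be recorded around real requests and exposed for scraping. A rough sketch of one way to do that, assuming the metrics module exports `httpRequestDuration`, `httpRequestTotal`, and `registry` (the wrapper and handler below are illustrative, not part of prom-client):

```typescript
// lib/metrics-usage.ts
// Sketch: observe the HTTP metrics around a handler and expose /metrics.
import { httpRequestDuration, httpRequestTotal, registry } from './metrics';

export async function withHttpMetrics(
  method: string,
  route: string,
  handler: () => Promise<Response>
): Promise<Response> {
  const endTimer = httpRequestDuration.startTimer({ method, route });
  let statusCode = 500; // assume failure unless the handler returns
  try {
    const res = await handler();
    statusCode = res.status;
    return res;
  } finally {
    endTimer({ status_code: String(statusCode) });
    httpRequestTotal.inc({ method, route, status_code: String(statusCode) });
  }
}

// Handler for the /metrics endpoint that Prometheus scrapes
export async function metricsHandler(): Promise<Response> {
  return new Response(await registry.metrics(), {
    headers: { 'Content-Type': registry.contentType }
  });
}
```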
### 3. Distributed Tracing

```typescript
// lib/tracing.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('my-service');
export async function withTracing<T>(
name: string,
attributes: Record<string, string>,
fn: () => Promise<T>
): Promise<T> {
const span = tracer.startSpan(name, { attributes });
try {
const result = await fn();
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message
});
span.recordException(error);
throw error;
} finally {
span.end();
}
}
// Usage
async function processPayment(orderId: string, amount: number) {
return withTracing(
'process-payment',
{ orderId, amount: String(amount) },
async () => {
// Payment logic here
}
);
}
```
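`trace.getTracer()` only produces spans that go anywhere if an OpenTelemetry SDK has been initialized at process startup. A minimal bootstrap sketch, assuming `@opentelemetry/sdk-node` and the OTLP HTTP trace exporter are installed and the collector endpoint is provided via the standard `OTEL_EXPORTER_OTLP_ENDPOINT` variable:

```typescript
// lib/otel.ts
// Sketch: start the OpenTelemetry Node SDK before the app begins serving.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: process.env.SERVICE_NAME, // option name may vary by SDK version
  traceExporter: new OTLPTraceExporter()  // reads OTEL_EXPORTER_OTLP_ENDPOINT
});

sdk.start();

// Flush any buffered spans when the process is asked to stop
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```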
## AI-Specific Monitoring

### Detecting Anomalous Patterns

```typescript
// monitoring/anomaly-detection.ts
interface MetricWindow {
mean: number;
stdDev: number;
samples: number[];
}
class AnomalyDetector {
private windows: Map<string, MetricWindow> = new Map();
private threshold = 3; // Standard deviations
record(metric: string, value: number): void {
const window = this.windows.get(metric) || {
mean: value,
stdDev: 0,
samples: []
};
window.samples.push(value);
if (window.samples.length > 100) {
window.samples.shift();
}
// Update statistics
window.mean = this.mean(window.samples);
window.stdDev = this.stdDev(window.samples, window.mean);
this.windows.set(metric, window);
// Check for anomaly once there is enough history and non-zero variance
if (window.samples.length > 50 && window.stdDev > 0) {
const zScore = Math.abs((value - window.mean) / window.stdDev);
if (zScore > this.threshold) {
this.alertAnomaly(metric, value, zScore);
}
}
}
private mean(values: number[]): number {
return values.reduce((sum, v) => sum + v, 0) / values.length;
}
private stdDev(values: number[], mean: number): number {
const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
return Math.sqrt(variance);
}
private alertAnomaly(metric: string, value: number, zScore: number): void {
logger.warn({
metric,
value,
zScore,
mean: this.windows.get(metric)?.mean
}, 'Anomaly detected');
// Trigger alert
alertManager.trigger({
severity: 'warning',
title: `Anomaly in ${metric}`,
description: `Value ${value} is ${zScore.toFixed(1)} std devs from mean`
});
}
}
```
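The detector is only as useful as the signals fed into it. A hypothetical wiring, assuming `AnomalyDetector` is exported from `monitoring/anomaly-detection.ts`, that tracks per-route latency and payment amounts alongside the Prometheus metrics defined earlier:

```typescript
// monitoring/wire-anomalies.ts
// Sketch: feed latency and business signals into a shared AnomalyDetector.
import { AnomalyDetector } from './anomaly-detection';

export const anomalies = new AnomalyDetector();

// Call from the request handler once the response has finished
export function recordRequestLatency(route: string, durationMs: number): void {
  anomalies.record(`latency:${route}`, durationMs);
}

// Call from the payment pipeline after a charge settles
export function recordPaymentAmount(currency: string, amount: number): void {
  anomalies.record(`payment:${currency}`, amount);
}
```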
## Incident Response Procedures

### Runbook Template
# Incident Response Runbook: High Error Rate
## Detection
- Alert: Error rate > 1% for 5 minutes
- Dashboard: https://grafana.example.com/errors
- Affected services: API, Payment
## Severity Classification
- P1 (Critical): >5% error rate, user-facing impact
- P2 (High): 1-5% error rate, degraded experience
- P3 (Medium): <1% error rate, limited impact
## Immediate Actions
### Step 1: Assess Impact
```bash
# Check current error rate
curl -s localhost:9090/api/v1/query \
-d 'query=rate(http_requests_total{status=~"5.."}[5m])' | jq
# Check affected endpoints
curl -s localhost:9090/api/v1/query \
-d 'query=topk(5, rate(http_requests_total{status=~"5.."}[5m]))'
```

### Step 2: Check Recent Deployments

```bash
# List recent deployments
kubectl rollout history deployment/api
# Check deployment timing vs error spike
# If correlated, consider rollback
```

### Step 3: Roll Back if Needed

```bash
# Roll back to the previous version
kubectl rollout undo deployment/api
# Verify rollback
kubectl rollout status deployment/api
```

### Step 4: Investigate Root Cause

```bash
# Get error logs
kubectl logs -l app=api --since=30m | grep ERROR
# Check for patterns
kubectl logs -l app=api --since=30m | grep ERROR | sort | uniq -c | sort -rn
```

## Post-Incident
- Document timeline in incident report
- Identify root cause
- Create tickets for fixes
- Schedule post-mortem
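The severity thresholds from the runbook can also live in code, so the automation below classifies incidents the same way the runbook does. A minimal sketch (the module path and function name are illustrative):

```typescript
// incidents/severity.ts
// Sketch: map an observed error rate onto the runbook's severity levels.
export type Severity = 'P1' | 'P2' | 'P3';

export function classifySeverity(errorRate: number): Severity {
  if (errorRate > 0.05) return 'P1';  // >5%: critical, user-facing impact
  if (errorRate >= 0.01) return 'P2'; // 1-5%: degraded experience
  return 'P3';                        // <1%: limited impact
}
```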
### Automated Incident Creation
```typescript
// incidents/auto-create.ts
interface IncidentData {
title: string;
severity: 'P1' | 'P2' | 'P3';
service: string;
description: string;
runbookUrl: string;
}
async function createIncident(data: IncidentData): Promise<string> {
// Create PagerDuty incident
const incident = await pagerduty.createIncident({
title: data.title,
service_key: process.env.PD_SERVICE_KEY,
severity: data.severity,
details: {
service: data.service,
description: data.description,
runbook: data.runbookUrl,
dashboard: `https://grafana.example.com/d/${data.service}`
}
});
// Create Slack channel
const channel = await slack.createChannel(`inc-${incident.id}`);
// Post initial context
await slack.postMessage(channel.id, {
blocks: [
{
type: 'header',
text: { type: 'plain_text', text: `🚨 ${data.title}` }
},
{
type: 'section',
text: {
type: 'mrkdwn',
text: `*Severity:* ${data.severity}\n*Service:* ${data.service}\n*Runbook:* ${data.runbookUrl}`
}
}
]
});
return incident.id;
}
```
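To close the loop, monitoring alerts can call `createIncident` directly. A rough sketch, assuming `createIncident` is exported from `incidents/auto-create.ts` and reusing the severity classifier sketched earlier (the `Alert` shape and runbook URL are assumptions):

```typescript
// incidents/from-alerts.ts
// Sketch: turn a monitoring alert into a tracked incident.
import { createIncident } from './auto-create';
import { classifySeverity } from './severity';

interface Alert {
  title: string;
  service: string;
  description: string;
  errorRate: number;
}

export async function onAlert(alert: Alert): Promise<void> {
  const severity = classifySeverity(alert.errorRate);

  // Only page for P1/P2; P3 alerts are not escalated here
  if (severity === 'P3') return;

  await createIncident({
    title: alert.title,
    severity,
    service: alert.service,
    description: alert.description,
    runbookUrl: `https://runbooks.example.com/${alert.service}/high-error-rate`
  });
}
```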
## Post-Incident Learning

### AI-Assisted Post-Mortem

```bash
claude "Analyze this incident and generate a post-mortem:
Timeline:
- 14:23 Alert fired for high error rate
- 14:25 On-call engineer acknowledged
- 14:32 Root cause identified (null pointer in payment handler)
- 14:35 Rollback initiated
- 14:38 Service recovered
Create a blameless post-mortem with:
1. Executive summary
2. Impact assessment
3. Timeline
4. Root cause analysis
5. Contributing factors
6. Action items to prevent recurrence"
```
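The same analysis can be run programmatically so a post-mortem draft lands in the incident channel automatically. A sketch using the Anthropic TypeScript SDK, assuming `ANTHROPIC_API_KEY` is set in the environment (the model name is a placeholder):

```typescript
// incidents/draft-postmortem.ts
// Sketch: generate a blameless post-mortem draft from an incident timeline.
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function draftPostMortem(timeline: string): Promise<string> {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5', // placeholder; use your preferred model
    max_tokens: 2000,
    messages: [{
      role: 'user',
      content:
        'Analyze this incident and generate a blameless post-mortem with an ' +
        'executive summary, impact assessment, timeline, root cause analysis, ' +
        'contributing factors, and action items.\n\nTimeline:\n' + timeline
    }]
  });

  const first = response.content[0];
  return first.type === 'text' ? first.text : '';
}
```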
### Learning from Production

```typescript
// monitoring/production-learner.ts
// Track patterns in production errors
interface ErrorPattern {
signature: string;
count: number;
firstSeen: Date;
lastSeen: Date;
contexts: object[];
}
async function analyzeProductionPatterns(): Promise<void> {
const errors = await getRecentErrors(24 * 60); // Last 24 hours
const patterns = groupBySignature(errors);
// Generate insights
const prompt = `
Analyze these error patterns from production:
${JSON.stringify(patterns)}
Identify:
1. Common root causes
2. Patterns that suggest code issues
3. Patterns that suggest infrastructure issues
4. Recommendations for code improvements
`;
const insights = await claude.analyze(prompt);
// Create tickets for actionable insights
for (const insight of insights.actionItems) {
await createJiraTicket({
title: insight.title,
description: insight.description,
priority: insight.severity
});
}
}
```
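`getRecentErrors` and `groupBySignature` are referenced above but not shown. One plausible implementation of the grouping step normalizes messages so volatile details such as ids and numbers don't split one underlying failure into many patterns (the `ProductionError` shape is an assumption about what `getRecentErrors` returns):

```typescript
// monitoring/production-learner.ts (continued)
// Sketch: collapse raw errors into ErrorPattern buckets by normalized signature.
interface ProductionError {
  message: string;
  timestamp: Date;
  context: object;
}

function groupBySignature(errors: ProductionError[]): ErrorPattern[] {
  const patterns = new Map<string, ErrorPattern>();

  for (const err of errors) {
    // Replace UUIDs and numbers so similar errors share one signature
    const signature = err.message
      .replace(/[0-9a-f]{8}-[0-9a-f-]{27}/gi, '<uuid>')
      .replace(/\d+/g, '<n>');

    const existing = patterns.get(signature);
    if (existing) {
      existing.count += 1;
      if (err.timestamp > existing.lastSeen) existing.lastSeen = err.timestamp;
      if (err.timestamp < existing.firstSeen) existing.firstSeen = err.timestamp;
      existing.contexts.push(err.context);
    } else {
      patterns.set(signature, {
        signature,
        count: 1,
        firstSeen: err.timestamp,
        lastSeen: err.timestamp,
        contexts: [err.context]
      });
    }
  }

  return [...patterns.values()];
}
```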
## Module Summary
You've learned to:
- Configure CI/CD pipelines for AI-generated code
- Implement safe rollout strategies
- Monitor and respond to production incidents
Next, we'll explore real-world case studies of advanced vibe coding in action.