# Production Monitoring and Incident Response
## Observability for AI-Generated Code
AI-generated code requires enhanced observability to:
- Detect unexpected behaviors quickly
- Understand complex interactions
- Trace issues through the system
- Learn from production patterns
## The Three Pillars of Observability

### 1. Structured Logging

```typescript
// lib/logger.ts
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL || 'info',
formatters: {
level: (label) => ({ level: label })
},
timestamp: pino.stdTimeFunctions.isoTime,
base: {
service: process.env.SERVICE_NAME,
version: process.env.APP_VERSION,
environment: process.env.NODE_ENV
}
});
// Contextual logging for tracing
export function createRequestLogger(requestId: string, userId?: string) {
return logger.child({
requestId,
userId,
traceId: getTraceId()
});
}
// Usage in handlers
async function handleOrder(req: Request) {
const log = createRequestLogger(req.id, req.user?.id);
const order = await req.json(); // parse the order payload from the request body
log.info({ orderId: order.id }, 'Processing order');
try {
const result = await processOrder(order);
log.info({ orderId: order.id, result: 'success' }, 'Order completed');
return result;
} catch (error) {
log.error({
orderId: order.id,
error: error.message,
stack: error.stack
}, 'Order processing failed');
throw error;
}
}
```
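The `getTraceId()` helper used in `createRequestLogger` is not defined in the snippet. If the service runs OpenTelemetry (see pillar 3 below), one straightforward implementation, sketched here under that assumption, reads the id of the currently active span:

```typescript
// lib/trace-id.ts
// One possible implementation of the getTraceId() helper used above:
// pull the trace id from the active OpenTelemetry span, if there is one.
import { trace } from '@opentelemetry/api';

export function getTraceId(): string | undefined {
  return trace.getActiveSpan()?.spanContext().traceId;
}
```

Logging the trace id alongside `requestId` is what lets you jump from a single log line straight to the corresponding distributed trace.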
### 2. Metrics Collection

```typescript
// lib/metrics.ts
import { Counter, Histogram, Registry } from 'prom-client';
const registry = new Registry();
// Request metrics
const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]
});
const httpRequestTotal = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'route', 'status_code']
});
// Business metrics
const ordersProcessed = new Counter({
name: 'orders_processed_total',
help: 'Total orders processed',
labelNames: ['status', 'payment_method']
});
const paymentAmount = new Histogram({
name: 'payment_amount_dollars',
help: 'Payment amounts in dollars',
labelNames: ['currency'],
buckets: [10, 50, 100, 500, 1000, 5000]
});
registry.registerMetric(httpRequestDuration);
registry.registerMetric(httpRequestTotal);
registry.registerMetric(ordersProcessed);
registry.registerMetric(paymentAmount);
```
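Registering metrics is only half the job; they also need to be recorded around real requests and exposed for scraping. A rough sketch of one way to do that, assuming the metrics module exports `httpRequestDuration`, `httpRequestTotal`, and `registry` (the wrapper and handler below are illustrative, not part of prom-client):

```typescript
// lib/metrics-usage.ts
// Sketch: observe the HTTP metrics around a handler and expose /metrics.
import { httpRequestDuration, httpRequestTotal, registry } from './metrics';

export async function withHttpMetrics(
  method: string,
  route: string,
  handler: () => Promise<Response>
): Promise<Response> {
  const endTimer = httpRequestDuration.startTimer({ method, route });
  let statusCode = 500; // assume failure unless the handler returns
  try {
    const res = await handler();
    statusCode = res.status;
    return res;
  } finally {
    endTimer({ status_code: String(statusCode) });
    httpRequestTotal.inc({ method, route, status_code: String(statusCode) });
  }
}

// Handler for the /metrics endpoint that Prometheus scrapes
export async function metricsHandler(): Promise<Response> {
  return new Response(await registry.metrics(), {
    headers: { 'Content-Type': registry.contentType }
  });
}
```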
### 3. Distributed Tracing

```typescript
// lib/tracing.ts
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('my-service');
export async function withTracing<T>(
name: string,
attributes: Record<string, string>,
fn: () => Promise<T>
): Promise<T> {
const span = tracer.startSpan(name, { attributes });
try {
const result = await fn();
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message
});
span.recordException(error);
throw error;
} finally {
span.end();
}
}
// Usage
async function processPayment(orderId: string, amount: number) {
return withTracing(
'process-payment',
{ orderId, amount: String(amount) },
async () => {
// Payment logic here
}
);
}
```
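`trace.getTracer()` only produces spans that go anywhere if an OpenTelemetry SDK has been initialized at process startup. A minimal bootstrap sketch, assuming `@opentelemetry/sdk-node` and the OTLP HTTP trace exporter are installed and the collector endpoint is provided via the standard `OTEL_EXPORTER_OTLP_ENDPOINT` variable:

```typescript
// lib/otel.ts
// Sketch: start the OpenTelemetry Node SDK before the app begins serving.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  serviceName: process.env.SERVICE_NAME, // option name may vary by SDK version
  traceExporter: new OTLPTraceExporter()  // reads OTEL_EXPORTER_OTLP_ENDPOINT
});

sdk.start();

// Flush any buffered spans when the process is asked to stop
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
```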
## AI-Specific Monitoring

### Detecting Anomalous Patterns

```typescript
// monitoring/anomaly-detection.ts
interface MetricWindow {
mean: number;
stdDev: number;
samples: number[];
}
class AnomalyDetector {
private windows: Map<string, MetricWindow> = new Map();
private threshold = 3; // Standard deviations
record(metric: string, value: number): void {
const window = this.windows.get(metric) || {
mean: value,
stdDev: 0,
samples: []
};
window.samples.push(value);
if (window.samples.length > 100) {
window.samples.shift();
}
// Update statistics
window.mean = this.mean(window.samples);
window.stdDev = this.stdDev(window.samples, window.mean);
this.windows.set(metric, window);
// Check for anomaly once there is enough history and non-zero variance
if (window.samples.length > 50 && window.stdDev > 0) {
const zScore = Math.abs((value - window.mean) / window.stdDev);
if (zScore > this.threshold) {
this.alertAnomaly(metric, value, zScore);
}
}
}
private mean(values: number[]): number {
return values.reduce((sum, v) => sum + v, 0) / values.length;
}
private stdDev(values: number[], mean: number): number {
const variance = values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
return Math.sqrt(variance);
}
private alertAnomaly(metric: string, value: number, zScore: number): void {
logger.warn({
metric,
value,
zScore,
mean: this.windows.get(metric)?.mean
}, 'Anomaly detected');
// Trigger alert
alertManager.trigger({
severity: 'warning',
title: `Anomaly in ${metric}`,
description: `Value ${value} is ${zScore.toFixed(1)} std devs from mean`
});
}
}
```
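The detector is only as useful as the signals fed into it. A hypothetical wiring, assuming `AnomalyDetector` is exported from `monitoring/anomaly-detection.ts`, that tracks per-route latency and payment amounts alongside the Prometheus metrics defined earlier:

```typescript
// monitoring/wire-anomalies.ts
// Sketch: feed latency and business signals into a shared AnomalyDetector.
import { AnomalyDetector } from './anomaly-detection';

export const anomalies = new AnomalyDetector();

// Call from the request handler once the response has finished
export function recordRequestLatency(route: string, durationMs: number): void {
  anomalies.record(`latency:${route}`, durationMs);
}

// Call from the payment pipeline after a charge settles
export function recordPaymentAmount(currency: string, amount: number): void {
  anomalies.record(`payment:${currency}`, amount);
}
```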
## Incident Response Procedures

### Runbook Template
# Incident Response Runbook: High Error Rate
## Detection
- Alert: Error rate > 1% for 5 minutes
- Dashboard: https://grafana.example.com/errors
- Affected services: API, Payment
## Severity Classification
- P1 (Critical): >5% error rate, user-facing impact
- P2 (High): 1-5% error rate, degraded experience
- P3 (Medium): <1% error rate, limited impact
## Immediate Actions
### Step 1: Assess Impact
```bash
# Check current error rate
curl -s localhost:9090/api/v1/query \
-d 'query=rate(http_requests_total{status=~"5.."}[5m])' | jq
# Check affected endpoints
curl -s localhost:9090/api/v1/query \
-d 'query=topk(5, rate(http_requests_total{status=~"5.."}[5m]))'
```

### Step 2: Check Recent Deployments

```bash
# List recent deployments
kubectl rollout history deployment/api
# Check deployment timing vs error spike
# If correlated, consider rollback
```

### Step 3: Roll Back if Needed

```bash
# Roll back to the previous version
kubectl rollout undo deployment/api
# Verify rollback
kubectl rollout status deployment/api
```

### Step 4: Investigate Root Cause

```bash
# Get error logs
kubectl logs -l app=api --since=30m | grep ERROR
# Check for patterns
kubectl logs -l app=api --since=30m | grep ERROR | sort | uniq -c | sort -rn
```

## Post-Incident
- Document timeline in incident report
- Identify root cause
- Create tickets for fixes
- Schedule post-mortem
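The severity thresholds from the runbook can also live in code, so the automation below classifies incidents the same way the runbook does. A minimal sketch (the module path and function name are illustrative):

```typescript
// incidents/severity.ts
// Sketch: map an observed error rate onto the runbook's severity levels.
export type Severity = 'P1' | 'P2' | 'P3';

export function classifySeverity(errorRate: number): Severity {
  if (errorRate > 0.05) return 'P1';  // >5%: critical, user-facing impact
  if (errorRate >= 0.01) return 'P2'; // 1-5%: degraded experience
  return 'P3';                        // <1%: limited impact
}
```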
### Automated Incident Creation
```typescript
// incidents/auto-create.ts
interface IncidentData {
title: string;
severity: 'P1' | 'P2' | 'P3';
service: string;
description: string;
runbookUrl: string;
}
async function createIncident(data: IncidentData): Promise<string> {
// Create PagerDuty incident
const incident = await pagerduty.createIncident({
title: data.title,
service_key: process.env.PD_SERVICE_KEY,
severity: data.severity,
details: {
service: data.service,
description: data.description,
runbook: data.runbookUrl,
dashboard: `https://grafana.example.com/d/${data.service}`
}
});
// Create Slack channel
const channel = await slack.createChannel(`inc-${incident.id}`);
// Post initial context
await slack.postMessage(channel.id, {
blocks: [
{
type: 'header',
text: { type: 'plain_text', text: `🚨 ${data.title}` }
},
{
type: 'section',
text: {
type: 'mrkdwn',
text: `*Severity:* ${data.severity}\n*Service:* ${data.service}\n*Runbook:* ${data.runbookUrl}`
}
}
]
});
return incident.id;
}
```
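To close the loop, monitoring alerts can call `createIncident` directly. A rough sketch, assuming `createIncident` is exported from `incidents/auto-create.ts` and reusing the severity classifier sketched earlier (the `Alert` shape and runbook URL are assumptions):

```typescript
// incidents/from-alerts.ts
// Sketch: turn a monitoring alert into a tracked incident.
import { createIncident } from './auto-create';
import { classifySeverity } from './severity';

interface Alert {
  title: string;
  service: string;
  description: string;
  errorRate: number;
}

export async function onAlert(alert: Alert): Promise<void> {
  const severity = classifySeverity(alert.errorRate);

  // Only page for P1/P2; P3 alerts are not escalated here
  if (severity === 'P3') return;

  await createIncident({
    title: alert.title,
    severity,
    service: alert.service,
    description: alert.description,
    runbookUrl: `https://runbooks.example.com/${alert.service}/high-error-rate`
  });
}
```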
## Post-Incident Learning

### AI-Assisted Post-Mortem

```bash
claude "Analyze this incident and generate a post-mortem:
Timeline:
- 14:23 Alert fired for high error rate
- 14:25 On-call engineer acknowledged
- 14:32 Root cause identified (null pointer in payment handler)
- 14:35 Rollback initiated
- 14:38 Service recovered
Create a blameless post-mortem with:
1. Executive summary
2. Impact assessment
3. Timeline
4. Root cause analysis
5. Contributing factors
6. Action items to prevent recurrence"
```
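The same analysis can be run programmatically so a post-mortem draft lands in the incident channel automatically. A sketch using the Anthropic TypeScript SDK, assuming `ANTHROPIC_API_KEY` is set in the environment (the model name is a placeholder):

```typescript
// incidents/draft-postmortem.ts
// Sketch: generate a blameless post-mortem draft from an incident timeline.
import Anthropic from '@anthropic-ai/sdk';

const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

export async function draftPostMortem(timeline: string): Promise<string> {
  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5', // placeholder; use your preferred model
    max_tokens: 2000,
    messages: [{
      role: 'user',
      content:
        'Analyze this incident and generate a blameless post-mortem with an ' +
        'executive summary, impact assessment, timeline, root cause analysis, ' +
        'contributing factors, and action items.\n\nTimeline:\n' + timeline
    }]
  });

  const first = response.content[0];
  return first.type === 'text' ? first.text : '';
}
```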
### Learning from Production

```typescript
// monitoring/production-learner.ts
// Track patterns in production errors
interface ErrorPattern {
signature: string;
count: number;
firstSeen: Date;
lastSeen: Date;
contexts: object[];
}
async function analyzeProductionPatterns(): Promise<void> {
const errors = await getRecentErrors(24 * 60); // Last 24 hours
const patterns = groupBySignature(errors);
// Generate insights
const prompt = `
Analyze these error patterns from production:
${JSON.stringify(patterns)}
Identify:
1. Common root causes
2. Patterns that suggest code issues
3. Patterns that suggest infrastructure issues
4. Recommendations for code improvements
`;
const insights = await claude.analyze(prompt);
// Create tickets for actionable insights
for (const insight of insights.actionItems) {
await createJiraTicket({
title: insight.title,
description: insight.description,
priority: insight.severity
});
}
}
```
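`getRecentErrors` and `groupBySignature` are referenced above but not shown. One plausible implementation of the grouping step normalizes messages so volatile details such as ids and numbers don't split one underlying failure into many patterns (the `ProductionError` shape is an assumption about what `getRecentErrors` returns):

```typescript
// monitoring/production-learner.ts (continued)
// Sketch: collapse raw errors into ErrorPattern buckets by normalized signature.
interface ProductionError {
  message: string;
  timestamp: Date;
  context: object;
}

function groupBySignature(errors: ProductionError[]): ErrorPattern[] {
  const patterns = new Map<string, ErrorPattern>();

  for (const err of errors) {
    // Replace UUIDs and numbers so similar errors share one signature
    const signature = err.message
      .replace(/[0-9a-f]{8}-[0-9a-f-]{27}/gi, '<uuid>')
      .replace(/\d+/g, '<n>');

    const existing = patterns.get(signature);
    if (existing) {
      existing.count += 1;
      if (err.timestamp > existing.lastSeen) existing.lastSeen = err.timestamp;
      if (err.timestamp < existing.firstSeen) existing.firstSeen = err.timestamp;
      existing.contexts.push(err.context);
    } else {
      patterns.set(signature, {
        signature,
        count: 1,
        firstSeen: err.timestamp,
        lastSeen: err.timestamp,
        contexts: [err.context]
      });
    }
  }

  return [...patterns.values()];
}
```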
## Module Summary
You've learned to:
- Configure CI/CD pipelines for AI-generated code
- Implement safe rollout strategies
- Monitor and respond to production incidents
Next, we'll explore real-world case studies of advanced vibe coding in action.