Monitoring, Observability & Incident Response

Incident Response and Postmortems

Incident response questions reveal your production experience. Let's master the process.

Incident Severity Levels

| Severity | Impact | Response | Example |
|----------|--------|----------|---------|
| SEV1/P1 | Full outage | All hands, exec comms | Payment system down |
| SEV2/P2 | Major degradation | Team response | 50% error rate |
| SEV3/P3 | Minor impact | Business hours | Slow for some users |
| SEV4/P4 | Low impact | When convenient | Cosmetic issues |
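Interviewers sometimes ask how a severity matrix like this gets applied in practice. Here is a minimal sketch in Python; the thresholds, the `classify` heuristic, and the `ResponsePolicy` fields are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass

@dataclass
class ResponsePolicy:
    page_now: bool        # page immediately vs. wait for business hours
    notify_execs: bool    # SEV1 typically requires exec/stakeholder comms
    response_window: str  # expected time to first responder

# Policy table mirroring the severity matrix above (values are illustrative).
SEVERITY_POLICIES = {
    "SEV1": ResponsePolicy(True,  True,  "5 min"),
    "SEV2": ResponsePolicy(True,  False, "15 min"),
    "SEV3": ResponsePolicy(False, False, "next business day"),
    "SEV4": ResponsePolicy(False, False, "when convenient"),
}

def classify(error_rate: float, users_affected_pct: float) -> str:
    """Rough severity heuristic; every team tunes these thresholds differently."""
    if error_rate >= 0.90 or users_affected_pct >= 90:
        return "SEV1"  # effectively a full outage
    if error_rate >= 0.25 or users_affected_pct >= 25:
        return "SEV2"  # major degradation (e.g. the 50% error rate above)
    if error_rate >= 0.01 or users_affected_pct >= 1:
        return "SEV3"  # minor impact
    return "SEV4"

print(classify(error_rate=0.5, users_affected_pct=50))  # SEV2
print(SEVERITY_POLICIES[classify(0.5, 50)].page_now)    # True
```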

Incident Response Process

┌─────────────────────────────────────────────────────┐
│                INCIDENT LIFECYCLE                    │
├─────────────────────────────────────────────────────┤
│  DETECT → TRIAGE → MITIGATE → RESOLVE → POSTMORTEM │
│    │        │         │          │           │      │
│  Alert   Severity   Stop the   Fix root    Learn   │
│  fires   assigned   bleeding   cause       & share │
└─────────────────────────────────────────────────────┘

Phase 1: Detection

Sources:
- Monitoring alerts
- Customer reports
- Automated health checks
- Team observations

Goal: Minimize MTTD (Mean Time To Detect)
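MTTD is simply the average gap between when impact started and when it was detected. A minimal sketch, assuming you can recover both timestamps for each incident record:

```python
from datetime import datetime, timedelta

def mean_time_to_detect(incidents: list[dict]) -> timedelta:
    """MTTD = average of (detected_at - impact_started_at) across incidents."""
    gaps = [i["detected_at"] - i["impact_started_at"] for i in incidents]
    return sum(gaps, timedelta()) / len(gaps)

incidents = [  # illustrative incident records
    {"impact_started_at": datetime(2025, 1, 3, 10, 0), "detected_at": datetime(2025, 1, 3, 10, 2)},
    {"impact_started_at": datetime(2025, 1, 9, 14, 0), "detected_at": datetime(2025, 1, 9, 14, 8)},
]
print(mean_time_to_detect(incidents))  # 0:05:00
```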

Phase 2: Triage

First 5 minutes checklist:
□ Acknowledge the alert
□ Assess severity and scope
□ Determine whether this is new or related to an existing incident
□ Page appropriate team members
□ Start incident communication channel

Phase 3: Mitigation

Focus on stopping the bleeding, not on finding the root cause (a feature-flag kill-switch sketch follows the table):

| Mitigation | When to Use |
|------------|-------------|
| Rollback deployment | Recent change suspected |
| Scale up resources | Capacity issue |
| Failover to backup | Primary system failing |
| Feature flag disable | New feature causing issues |
| Traffic redirect | Regional problem |
| Restart services | Quick fix for stuck processes |
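A feature-flag disable only mitigates anything if the flag is checked at runtime and defaults to the known-good path. A minimal sketch of that pattern, assuming a hypothetical file-based flag store (`/etc/myapp/flags.json`) and an invented `new_pricing_engine` flag; real teams usually use a dedicated flag service:

```python
import json
from pathlib import Path

# Hypothetical runtime-reloadable flag store.
FLAG_FILE = Path("/etc/myapp/flags.json")

def flag_enabled(name: str, default: bool = False) -> bool:
    """Re-read the flag on every call so an operator can flip it without a deploy."""
    try:
        flags = json.loads(FLAG_FILE.read_text())
        return bool(flags.get(name, default))
    except (OSError, json.JSONDecodeError):
        return default  # fail safe: if the store is unreadable, fall back to the default

def price_cart(cart: list[float]) -> float:
    if flag_enabled("new_pricing_engine", default=False):
        return sum(cart) * 0.9   # new code path behind the flag
    return sum(cart)             # known-good fallback during an incident

print(price_cart([10.0, 5.0]))   # 15.0 until someone enables the flag
```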

Phase 4: Resolution

1. Verify the mitigation is working (see the verification sketch after this list)
2. Investigate root cause (can be async)
3. Implement proper fix
4. Deploy fix with careful monitoring
5. Declare incident resolved
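Step 1 deserves more than a single glance at a dashboard: a metric that dips briefly and recovers is not the same as sustained health. A minimal verification-loop sketch, assuming a hypothetical `current_error_rate()` reader in place of a real metrics query:

```python
import time

def current_error_rate() -> float:
    """Hypothetical stand-in for querying your metrics backend."""
    return 0.002

def confirm_recovery(threshold: float = 0.01, window_s: int = 300, interval_s: int = 30) -> bool:
    """Return True only if the error rate stays below the threshold for the whole window."""
    start = time.monotonic()
    while time.monotonic() - start < window_s:
        if current_error_rate() > threshold:
            return False  # still unhealthy; keep mitigating
        time.sleep(interval_s)
    return True           # sustained recovery; safe to move toward resolution
```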

Incident Commander Role

The IC coordinates the response:

Incident Commander responsibilities:
- Assign roles (comms, technical lead, scribe)
- Coordinate investigation streams
- Make decisions on mitigation approach
- Keep stakeholders updated
- Decide when to escalate
- Declare incident resolved

IC does NOT:
- Debug code (unless no one else is available)
- Write the postmortem during incident
- Make decisions in isolation

On-Call Best Practices

Healthy On-Call

| Practice | Why |
|----------|-----|
| Primary + secondary rotation | Backup for coverage |
| Maximum 1 week shifts | Prevent burnout |
| Follow-the-sun (if global) | Night pages are rare |
| Handoff documentation | Context transfer |
| Page limits (SLO for on-call) | Reduce alert fatigue |
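A primary + secondary weekly rotation is simple enough to generate in a few lines. A minimal sketch with a hypothetical roster; a real schedule also needs overrides, holidays, and follow-the-sun handoffs:

```python
from datetime import date, timedelta

ENGINEERS = ["alice", "bob", "carol", "dev"]  # hypothetical roster

def weekly_rotation(start: date, weeks: int):
    """Yield (week_start, primary, secondary); shifts are capped at one week."""
    for w in range(weeks):
        primary = ENGINEERS[w % len(ENGINEERS)]
        secondary = ENGINEERS[(w + 1) % len(ENGINEERS)]  # the next person backs up the primary
        yield start + timedelta(weeks=w), primary, secondary

for week_start, primary, secondary in weekly_rotation(date(2025, 1, 6), weeks=4):
    print(week_start, "primary:", primary, "secondary:", secondary)
```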

Alert Response

# When paged:
1. Acknowledge alert (stops escalation)
2. Check dashboards for context
3. Review recent deployments (see the deploy-correlation sketch below)
4. Start investigation or escalate
5. Update status page if customer-facing

# Don't:
- Ignore and hope it resolves
- Escalate without investigating
- Fix without understanding
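Step 3 is worth doing systematically, since a large share of incidents follow a recent change. A minimal sketch that flags deploys landing shortly before an alert, assuming deploy records are available as simple dicts; the field names here are illustrative:

```python
from datetime import datetime, timedelta

def suspect_deploys(deploys: list[dict], alert_time: datetime,
                    lookback: timedelta = timedelta(hours=2)) -> list[dict]:
    """Return deploys that landed shortly before the alert, newest first."""
    recent = [d for d in deploys if alert_time - lookback <= d["deployed_at"] <= alert_time]
    return sorted(recent, key=lambda d: d["deployed_at"], reverse=True)

deploys = [  # illustrative deploy log entries
    {"service": "checkout", "version": "v231", "deployed_at": datetime(2025, 1, 3, 9, 40)},
    {"service": "search",   "version": "v87",  "deployed_at": datetime(2025, 1, 2, 16, 0)},
]
for d in suspect_deploys(deploys, alert_time=datetime(2025, 1, 3, 10, 0)):
    print(d["service"], d["version"], d["deployed_at"])  # checkout v231 is the prime suspect
```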

Postmortem Writing

Blameless Culture

"How did our system allow this to happen?" not "Who caused this?"

Blameless postmortem principles:
- Focus on systems, not individuals
- Assume everyone acted with best intentions
- Identify process improvements
- Share learnings widely

Postmortem Template

# Incident: [Title] - [Date]

## Summary
Brief description of what happened, impact, and duration.

## Impact
- Users affected: X%
- Duration: X hours
- Revenue impact: $X (if applicable)
- Error budget consumed: X%

## Timeline (all times UTC)
- 10:00 - Alert fired for high error rate
- 10:05 - On-call engineer acknowledged
- 10:15 - Root cause identified: bad database migration
- 10:30 - Rollback initiated
- 10:45 - Service recovered
- 11:00 - Incident declared resolved

## Root Cause
Database migration locked critical tables, causing connection pool exhaustion.

## What Went Well
- Alert fired within 2 minutes of issue
- Quick identification of root cause
- Effective communication

## What Went Wrong
- Migration wasn't tested with production-scale data
- No rollback plan documented
- Status page update was delayed

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add migration testing with prod data | @alice | 2025-01-15 | TODO |
| Create rollback playbook | @bob | 2025-01-10 | TODO |
| Automate status page updates | @carol | 2025-01-20 | TODO |

## Lessons Learned
Database migrations should be treated as high-risk changes with mandatory rollback plans and production-scale testing.
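The "error budget consumed" line in the Impact section is a quick calculation worth knowing cold in interviews. A minimal sketch, assuming full unavailability for the outage duration (a partial outage consumes budget in proportion to the fraction of failed requests):

```python
def error_budget_consumed(slo: float, outage_minutes: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget burned by a full outage of the given length."""
    window_minutes = window_days * 24 * 60
    budget_minutes = (1 - slo) * window_minutes   # e.g. 99.9% over 30 days ≈ 43.2 min
    return outage_minutes / budget_minutes

# The 45-minute outage in the timeline above, against a 99.9% 30-day SLO:
print(f"{error_budget_consumed(0.999, 45):.0%}")  # ≈ 104%: the whole month's budget, and then some
```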

Interview Questions

Q: "Tell me about an incident you handled. What went well and what would you do differently?"

Use the STAR-L format:

  • Situation: Context and severity
  • Task: Your role in the response
  • Action: Steps you took
  • Result: Outcome and metrics
  • Learning: What changed because of it

Q: "You're on-call and get paged at 3 AM. Walk me through your process."

1. ACKNOWLEDGE (2 min)
   - Stop alert escalation
   - Check: is this real or false positive?

2. ASSESS (5 min)
   - Dashboard review
   - Severity determination
   - Scope: who's affected?

3. MITIGATE (variable)
   - Can I fix quickly?
   - Do I need to escalate?
   - What's the safest path to restore service?

4. COMMUNICATE (throughout)
   - Update incident channel
   - Status page if customer-facing
   - Escalate if stuck > 30 min

5. FOLLOW UP (next day)
   - Ensure proper resolution
   - Create postmortem if warranted
   - Hand off to day team

You've completed the SRE skills foundation. Final module: Behavioral interviews and salary negotiation.
