Monitoring, Observability & Incident Response
Incident Response and Postmortems
4 min read
Incident response questions reveal your production experience. Let's master the process.
Incident Severity Levels
| Severity | Impact | Response | Example |
|---|---|---|---|
| SEV1/P1 | Full outage | All hands, exec comms | Payment system down |
| SEV2/P2 | Major degradation | Team response | 50% error rate |
| SEV3/P3 | Minor impact | Business hours | Slow for some users |
| SEV4/P4 | Low impact | When convenient | Cosmetic issues |
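To make the mapping concrete, here's a minimal sketch of how a paging tool might encode these levels in code. The `Severity` enum and `RESPONSE_POLICY` table are illustrative names, not part of any specific incident-management product, and the thresholds are example values to tune for your org.

```python
from enum import IntEnum

class Severity(IntEnum):
    """Lower number = higher urgency."""
    SEV1 = 1   # full outage
    SEV2 = 2   # major degradation
    SEV3 = 3   # minor impact
    SEV4 = 4   # low impact

# Example routing policy keyed by severity; the values are placeholders.
RESPONSE_POLICY = {
    Severity.SEV1: {"page": "all hands",      "exec_comms": True,  "ack_within_min": 5},
    Severity.SEV2: {"page": "owning team",    "exec_comms": False, "ack_within_min": 15},
    Severity.SEV3: {"page": "business hours", "exec_comms": False, "ack_within_min": 240},
    Severity.SEV4: {"page": "backlog",        "exec_comms": False, "ack_within_min": None},
}

print(RESPONSE_POLICY[Severity.SEV2])   # team response, no exec comms
```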
Incident Response Process
┌─────────────────────────────────────────────────────┐
│                 INCIDENT LIFECYCLE                   │
├─────────────────────────────────────────────────────┤
│ DETECT → TRIAGE → MITIGATE → RESOLVE → POSTMORTEM   │
│   │        │         │          │          │        │
│ Alert   Severity  Stop the  Fix root     Learn      │
│ fires   assigned  bleeding    cause     & share     │
└─────────────────────────────────────────────────────┘
Phase 1: Detection
Sources:
- Monitoring alerts
- Customer reports
- Automated health checks
- Team observations
Goal: Minimize MTTD (Mean Time To Detect)
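As a sketch of the automated-health-check source above, the following probe uses only Python's standard library. The endpoint, check interval, and failure threshold are placeholder values; a real setup would page through your alerting tool instead of printing.

```python
import time
import urllib.error
import urllib.request

HEALTH_URL = "https://example.com/healthz"   # placeholder endpoint
CHECK_INTERVAL_S = 30                        # placeholder cadence
FAILURE_THRESHOLD = 3                        # consecutive failures before alerting

def check_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with a 2xx within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, TimeoutError):
        return False

def run_probe() -> None:
    failures = 0
    while True:
        if check_once(HEALTH_URL):
            failures = 0
        elif (failures := failures + 1) >= FAILURE_THRESHOLD:
            # A real probe would page here (PagerDuty, Opsgenie, ...);
            # printing keeps the sketch self-contained.
            print(f"ALERT: {HEALTH_URL} failed {failures} checks in a row")
        time.sleep(CHECK_INTERVAL_S)

# run_probe()  # blocks forever; run it under a scheduler or as a daemon
```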
Phase 2: Triage
First 5 minutes checklist:
□ Acknowledge the alert
□ Assess severity and scope
□ Determine if this is new or related to an existing incident
□ Page appropriate team members
□ Start incident communication channel
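One part of triage that benefits from tooling is deciding whether an alert is new or belongs to an incident already in flight. A rough sketch, assuming alerts carry a stable fingerprint; the `Incident` class and `OPEN_INCIDENTS` store are hypothetical.

```python
import time
from dataclasses import dataclass, field

@dataclass
class Incident:
    fingerprint: str            # e.g. "checkout-api:5xx-rate"
    severity: int               # 1 = SEV1 ... 4 = SEV4
    opened_at: float = field(default_factory=time.time)

# Open incidents keyed by alert fingerprint (in-memory for the sketch).
OPEN_INCIDENTS: dict[str, Incident] = {}

def triage(fingerprint: str, severity: int) -> Incident:
    """Attach the alert to an existing incident or open a new one."""
    existing = OPEN_INCIDENTS.get(fingerprint)
    if existing is not None:
        # Same failure mode is already being worked: keep the higher urgency.
        existing.severity = min(existing.severity, severity)
        return existing
    incident = Incident(fingerprint, severity)
    OPEN_INCIDENTS[fingerprint] = incident
    return incident

triage("checkout-api:5xx-rate", severity=2)   # opens a new incident
triage("checkout-api:5xx-rate", severity=1)   # same incident, escalated to SEV1
```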
Phase 3: Mitigation
Focus on stopping the bleeding, not on fixing the root cause:
| Mitigation | When to Use |
|---|---|
| Rollback deployment | Recent change suspected |
| Scale up resources | Capacity issue |
| Failover to backup | Primary system failing |
| Feature flag disable | New feature causing issues |
| Traffic redirect | Regional problem |
| Restart services | Quick fix for stuck processes |
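For the feature-flag mitigation, here's a minimal sketch of a file-backed kill switch. Real systems usually sit behind a flag service (LaunchDarkly, Unleash, a config store); the file path and flag name here are placeholders.

```python
import json
import pathlib

# File-backed kill switch; the path and flag name are placeholders.
FLAGS_PATH = pathlib.Path("/etc/myapp/flags.json")

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a boolean flag, failing safe to `default` if the file is missing or invalid."""
    try:
        flags = json.loads(FLAGS_PATH.read_text())
        return bool(flags.get(name, default))
    except (OSError, ValueError):
        return default

# Mitigation during an incident: flip "new_pricing_engine" to false in the
# flags file and the next request takes the known-good path -- no deploy needed.
if flag_enabled("new_pricing_engine"):
    result = "priced with new engine"      # suspect code path
else:
    result = "priced with legacy engine"   # known-good fallback
print(result)
```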
Phase 4: Resolution
1. Verify mitigation is working
2. Investigate root cause (can be async)
3. Implement proper fix
4. Deploy fix with careful monitoring
5. Declare incident resolved
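Step 1, verifying the mitigation, is easy to rush. Here's a sketch of a stability-window check, assuming a `current_error_rate()` helper that would normally query your metrics backend (hard-coded here so the example runs).

```python
import time

def current_error_rate() -> float:
    """Placeholder: a real version queries your metrics backend (Prometheus, Datadog, ...)."""
    return 0.2   # percent

def verify_mitigation(threshold_pct: float = 0.5,
                      stable_for_s: int = 600,
                      poll_every_s: int = 60) -> bool:
    """Return True only once the error rate has stayed under the threshold
    for a full stability window; any regression resets the clock."""
    stable_since = None
    give_up_at = time.time() + 4 * stable_for_s
    while time.time() < give_up_at:
        if current_error_rate() < threshold_pct:
            stable_since = stable_since or time.time()
            if time.time() - stable_since >= stable_for_s:
                return True
        else:
            stable_since = None
        time.sleep(poll_every_s)
    return False

# verify_mitigation(stable_for_s=60, poll_every_s=5)  # shorter window for a dry run
```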
Incident Commander Role
The IC coordinates the response:
Incident Commander responsibilities:
- Assign roles (comms, technical lead, scribe)
- Coordinate investigation streams
- Make decisions on mitigation approach
- Keep stakeholders updated
- Decide when to escalate
- Declare incident resolved
IC does NOT:
- Debug code (unless no one else is available)
- Write the postmortem during incident
- Make decisions in isolation
On-Call Best Practices
Healthy On-Call
| Practice | Why |
|---|---|
| Primary + secondary rotation | Backup if the primary misses a page |
| Maximum 1 week shifts | Prevent burnout |
| Follow-the-sun (if global) | Keeps night pages rare |
| Handoff documentation | Context transfer |
| Page limits (SLO for on-call) | Reduce alert fatigue |
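Page limits only help if someone measures them. Here's a toy sketch that tallies pages per engineer against a per-shift budget; the page log, budget, and night-hours window are made-up examples, and a real version would pull data from your paging tool's API.

```python
from datetime import datetime

# Made-up page log: (timestamp, engineer). A real version pulls this from
# the paging tool's API and groups it by rotation shift.
PAGES = [
    (datetime(2025, 1, 6, 3, 12), "alice"),
    (datetime(2025, 1, 6, 14, 5), "alice"),
    (datetime(2025, 1, 7, 2, 40), "alice"),
]

PAGES_PER_SHIFT_BUDGET = 2    # example on-call SLO
NIGHT_HOURS = range(0, 7)     # example definition of "overnight"

def shift_report(engineer: str) -> None:
    mine = [ts for ts, who in PAGES if who == engineer]
    night = sum(1 for ts in mine if ts.hour in NIGHT_HOURS)
    print(f"{engineer}: {len(mine)} pages, {night} overnight")
    if len(mine) > PAGES_PER_SHIFT_BUDGET:
        print("  over page budget -- review alert quality before the next rotation")

shift_report("alice")   # 3 pages, 2 overnight -> over budget
```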
Alert Response
# When paged:
1. Acknowledge alert (stops escalation)
2. Check dashboards for context
3. Review recent deployments
4. Start investigation or escalate
5. Update status page if customer-facing
# Don't:
- Ignore and hope it resolves
- Escalate without investigating
- Fix without understanding
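For step 3, "review recent deployments", a quick "what changed?" check can be as simple as listing recent commits. A sketch that assumes the service deploys from the repo's main branch (the branch name and time window are assumptions):

```python
import subprocess

def recent_commits(repo_path: str = ".", since: str = "2 hours ago") -> list[str]:
    """One-line summaries of commits landed recently on main."""
    result = subprocess.run(
        ["git", "log", f"--since={since}", "--oneline", "main"],
        cwd=repo_path, capture_output=True, text=True, check=False,
    )
    return result.stdout.splitlines()

for line in recent_commits():
    print(line)   # anything here is a rollback candidate
```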
Postmortem Writing
Blameless Culture
"How did our system allow this to happen?" not "Who caused this?"
Blameless postmortem principles:
- Focus on systems, not individuals
- Assume everyone acted with best intentions
- Identify process improvements
- Share learnings widely
Postmortem Template
# Incident: [Title] - [Date]
## Summary
Brief description of what happened, impact, and duration.
## Impact
- Users affected: X%
- Duration: X hours
- Revenue impact: $X (if applicable)
- Error budget consumed: X%
## Timeline (all times UTC)
- 10:00 - Alert fired for high error rate
- 10:05 - On-call engineer acknowledged
- 10:15 - Root cause identified: bad database migration
- 10:30 - Rollback initiated
- 10:45 - Service recovered
- 11:00 - Incident declared resolved
## Root Cause
Database migration locked critical tables, causing connection pool exhaustion.
## What Went Well
- Alert fired within 2 minutes of issue
- Quick identification of root cause
- Effective communication
## What Went Wrong
- Migration wasn't tested with production-scale data
- No rollback plan documented
- Status page update was delayed
## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add migration testing with prod data | @alice | 2025-01-15 | TODO |
| Create rollback playbook | @bob | 2025-01-10 | TODO |
| Automate status page updates | @carol | 2025-01-20 | TODO |
## Lessons Learned
Database migrations should be treated as high-risk changes with mandatory rollback plans and production-scale testing.
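The "error budget consumed" line in the Impact section is straightforward arithmetic. A worked example using the sample timeline above, assuming a 99.9% monthly availability SLO and treating the 45-minute outage as full downtime:

```python
# 99.9% monthly availability SLO, 30-day month, full outage assumed.
SLO = 0.999
MONTH_MINUTES = 30 * 24 * 60                  # 43,200 minutes
ERROR_BUDGET_MIN = MONTH_MINUTES * (1 - SLO)  # 43.2 minutes of allowed downtime

outage_minutes = 45                           # 10:00 alert fired -> 10:45 recovered
print(f"Error budget consumed: {outage_minutes / ERROR_BUDGET_MIN:.0%}")  # ~104%
```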
Interview Questions
Q: "Tell me about an incident you handled. What went well and what would you do differently?"
Use the STAR-L format:
- Situation: Context and severity
- Task: Your role in the response
- Action: Steps you took
- Result: Outcome and metrics
- Learning: What changed because of it
Q: "You're on-call and get paged at 3 AM. Walk me through your process."
1. ACKNOWLEDGE (2 min)
- Stop alert escalation
- Check: is this real or a false positive?
2. ASSESS (5 min)
- Dashboard review
- Severity determination
- Scope: who's affected?
3. MITIGATE (variable)
- Can I fix quickly?
- Do I need to escalate?
- What's the safest path to restore service?
4. COMMUNICATE (throughout)
- Update incident channel
- Status page if customer-facing
- Escalate if stuck > 30 min
5. FOLLOW UP (next day)
- Ensure proper resolution
- Create postmortem if warranted
- Hand off to day team
You've completed the SRE skills foundation. Final module: Behavioral interviews and salary negotiation.