Monitoring, Observability & Incident Response

Incident Response and Postmortems

Incident response questions reveal your production experience. Let's master the process.

Incident Severity Levels

| Severity | Impact | Response | Example |
|----------|--------|----------|---------|
| SEV1/P1 | Full outage | All hands, exec comms | Payment system down |
| SEV2/P2 | Major degradation | Team response | 50% error rate |
| SEV3/P3 | Minor impact | Business hours | Slow for some users |
| SEV4/P4 | Low impact | When convenient | Cosmetic issues |
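Interviewers sometimes ask how a severity matrix like this gets applied in practice. Here is a minimal sketch in Python; the thresholds, the `classify` heuristic, and the `ResponsePolicy` fields are illustrative assumptions, not a standard.

```python
from dataclasses import dataclass

@dataclass
class ResponsePolicy:
    page_now: bool        # page immediately vs. wait for business hours
    notify_execs: bool    # SEV1 typically requires exec/stakeholder comms
    response_window: str  # expected time to first responder

# Policy table mirroring the severity matrix above (values are illustrative).
SEVERITY_POLICIES = {
    "SEV1": ResponsePolicy(True,  True,  "5 min"),
    "SEV2": ResponsePolicy(True,  False, "15 min"),
    "SEV3": ResponsePolicy(False, False, "next business day"),
    "SEV4": ResponsePolicy(False, False, "when convenient"),
}

def classify(error_rate: float, users_affected_pct: float) -> str:
    """Rough severity heuristic; every team tunes these thresholds differently."""
    if error_rate >= 0.90 or users_affected_pct >= 90:
        return "SEV1"  # effectively a full outage
    if error_rate >= 0.25 or users_affected_pct >= 25:
        return "SEV2"  # major degradation (e.g. the 50% error rate above)
    if error_rate >= 0.01 or users_affected_pct >= 1:
        return "SEV3"  # minor impact
    return "SEV4"

print(classify(error_rate=0.5, users_affected_pct=50))  # SEV2
print(SEVERITY_POLICIES[classify(0.5, 50)].page_now)    # True
```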

Incident Response Process

┌─────────────────────────────────────────────────────┐
│                INCIDENT LIFECYCLE                    │
├─────────────────────────────────────────────────────┤
│  DETECT → TRIAGE → MITIGATE → RESOLVE → POSTMORTEM │
│    │        │         │          │           │      │
│  Alert   Severity   Stop the   Fix root    Learn   │
│  fires   assigned   bleeding   cause       & share │
└─────────────────────────────────────────────────────┘

Phase 1: Detection

Sources:
- Monitoring alerts
- Customer reports
- Automated health checks
- Team observations

Goal: Minimize MTTD (Mean Time To Detect)
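MTTD is simply the average gap between when impact started and when it was detected. A minimal sketch, assuming you can recover both timestamps for each incident record:

```python
from datetime import datetime, timedelta

def mean_time_to_detect(incidents: list[dict]) -> timedelta:
    """MTTD = average of (detected_at - impact_started_at) across incidents."""
    gaps = [i["detected_at"] - i["impact_started_at"] for i in incidents]
    return sum(gaps, timedelta()) / len(gaps)

incidents = [  # illustrative incident records
    {"impact_started_at": datetime(2025, 1, 3, 10, 0), "detected_at": datetime(2025, 1, 3, 10, 2)},
    {"impact_started_at": datetime(2025, 1, 9, 14, 0), "detected_at": datetime(2025, 1, 9, 14, 8)},
]
print(mean_time_to_detect(incidents))  # 0:05:00
```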

Phase 2: Triage

First 5 minutes checklist:
□ Acknowledge the alert
□ Assess severity and scope
□ Determine whether this is new or related to an existing incident
□ Page appropriate team members
□ Start incident communication channel

Phase 3: Mitigation

Focus on stopping the bleeding, not on finding the root cause (a feature-flag kill-switch sketch follows the table):

| Mitigation | When to Use |
|------------|-------------|
| Rollback deployment | Recent change suspected |
| Scale up resources | Capacity issue |
| Failover to backup | Primary system failing |
| Feature flag disable | New feature causing issues |
| Traffic redirect | Regional problem |
| Restart services | Quick fix for stuck processes |
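A feature-flag disable only mitigates anything if the flag is checked at runtime and defaults to the known-good path. A minimal sketch of that pattern, assuming a hypothetical file-based flag store (`/etc/myapp/flags.json`) and an invented `new_pricing_engine` flag; real teams usually use a dedicated flag service:

```python
import json
from pathlib import Path

# Hypothetical runtime-reloadable flag store.
FLAG_FILE = Path("/etc/myapp/flags.json")

def flag_enabled(name: str, default: bool = False) -> bool:
    """Re-read the flag on every call so an operator can flip it without a deploy."""
    try:
        flags = json.loads(FLAG_FILE.read_text())
        return bool(flags.get(name, default))
    except (OSError, json.JSONDecodeError):
        return default  # fail safe: if the store is unreadable, fall back to the default

def price_cart(cart: list[float]) -> float:
    if flag_enabled("new_pricing_engine", default=False):
        return sum(cart) * 0.9   # new code path behind the flag
    return sum(cart)             # known-good fallback during an incident

print(price_cart([10.0, 5.0]))   # 15.0 until someone enables the flag
```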

Phase 4: Resolution

1. Verify the mitigation is working (see the verification sketch after this list)
2. Investigate root cause (can be async)
3. Implement proper fix
4. Deploy fix with careful monitoring
5. Declare incident resolved
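Step 1 deserves more than a single glance at a dashboard: a metric that dips briefly and recovers is not the same as sustained health. A minimal verification-loop sketch, assuming a hypothetical `current_error_rate()` reader in place of a real metrics query:

```python
import time

def current_error_rate() -> float:
    """Hypothetical stand-in for querying your metrics backend."""
    return 0.002

def confirm_recovery(threshold: float = 0.01, window_s: int = 300, interval_s: int = 30) -> bool:
    """Return True only if the error rate stays below the threshold for the whole window."""
    start = time.monotonic()
    while time.monotonic() - start < window_s:
        if current_error_rate() > threshold:
            return False  # still unhealthy; keep mitigating
        time.sleep(interval_s)
    return True           # sustained recovery; safe to move toward resolution
```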

Incident Commander Role

The IC coordinates the response:

Incident Commander responsibilities:
- Assign roles (comms, technical lead, scribe)
- Coordinate investigation streams
- Make decisions on mitigation approach
- Keep stakeholders updated
- Decide when to escalate
- Declare incident resolved

IC does NOT:
- Debug code (unless no one else is available)
- Write the postmortem during incident
- Make decisions in isolation

On-Call Best Practices

Healthy On-Call

| Practice | Why |
|----------|-----|
| Primary + secondary rotation | Backup for coverage |
| Maximum 1 week shifts | Prevent burnout |
| Follow-the-sun (if global) | Night pages are rare |
| Handoff documentation | Context transfer |
| Page limits (SLO for on-call) | Reduce alert fatigue |
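A primary + secondary weekly rotation is simple enough to generate in a few lines. A minimal sketch with a hypothetical roster; a real schedule also needs overrides, holidays, and follow-the-sun handoffs:

```python
from datetime import date, timedelta

ENGINEERS = ["alice", "bob", "carol", "dev"]  # hypothetical roster

def weekly_rotation(start: date, weeks: int):
    """Yield (week_start, primary, secondary); shifts are capped at one week."""
    for w in range(weeks):
        primary = ENGINEERS[w % len(ENGINEERS)]
        secondary = ENGINEERS[(w + 1) % len(ENGINEERS)]  # the next person backs up the primary
        yield start + timedelta(weeks=w), primary, secondary

for week_start, primary, secondary in weekly_rotation(date(2025, 1, 6), weeks=4):
    print(week_start, "primary:", primary, "secondary:", secondary)
```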

Alert Response

# When paged:
1. Acknowledge alert (stops escalation)
2. Check dashboards for context
3. Review recent deployments (see the deploy-correlation sketch below)
4. Start investigation or escalate
5. Update status page if customer-facing

# Don't:
- Ignore and hope it resolves
- Escalate without investigating
- Fix without understanding
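Step 3 is worth doing systematically, since a large share of incidents follow a recent change. A minimal sketch that flags deploys landing shortly before an alert, assuming deploy records are available as simple dicts; the field names here are illustrative:

```python
from datetime import datetime, timedelta

def suspect_deploys(deploys: list[dict], alert_time: datetime,
                    lookback: timedelta = timedelta(hours=2)) -> list[dict]:
    """Return deploys that landed shortly before the alert, newest first."""
    recent = [d for d in deploys if alert_time - lookback <= d["deployed_at"] <= alert_time]
    return sorted(recent, key=lambda d: d["deployed_at"], reverse=True)

deploys = [  # illustrative deploy log entries
    {"service": "checkout", "version": "v231", "deployed_at": datetime(2025, 1, 3, 9, 40)},
    {"service": "search",   "version": "v87",  "deployed_at": datetime(2025, 1, 2, 16, 0)},
]
for d in suspect_deploys(deploys, alert_time=datetime(2025, 1, 3, 10, 0)):
    print(d["service"], d["version"], d["deployed_at"])  # checkout v231 is the prime suspect
```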

Postmortem Writing

Blameless Culture

"How did our system allow this to happen?" not "Who caused this?"

Blameless postmortem principles:
- Focus on systems, not individuals
- Assume everyone acted with best intentions
- Identify process improvements
- Share learnings widely

Postmortem Template

# Incident: [Title] - [Date]

## Summary
Brief description of what happened, impact, and duration.

## Impact
- Users affected: X%
- Duration: X hours
- Revenue impact: $X (if applicable)
- Error budget consumed: X%

## Timeline (all times UTC)
- 10:00 - Alert fired for high error rate
- 10:05 - On-call engineer acknowledged
- 10:15 - Root cause identified: bad database migration
- 10:30 - Rollback initiated
- 10:45 - Service recovered
- 11:00 - Incident declared resolved

## Root Cause
Database migration locked critical tables, causing connection pool exhaustion.

## What Went Well
- Alert fired within 2 minutes of issue
- Quick identification of root cause
- Effective communication

## What Went Wrong
- Migration wasn't tested with production-scale data
- No rollback plan documented
- Status page update was delayed

## Action Items
| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
| Add migration testing with prod data | @alice | 2025-01-15 | TODO |
| Create rollback playbook | @bob | 2025-01-10 | TODO |
| Automate status page updates | @carol | 2025-01-20 | TODO |

## Lessons Learned
Database migrations should be treated as high-risk changes with mandatory rollback plans and production-scale testing.
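The "error budget consumed" line in the Impact section is a quick calculation worth knowing cold in interviews. A minimal sketch, assuming full unavailability for the outage duration (a partial outage consumes budget in proportion to the fraction of failed requests):

```python
def error_budget_consumed(slo: float, outage_minutes: float, window_days: int = 30) -> float:
    """Fraction of the window's error budget burned by a full outage of the given length."""
    window_minutes = window_days * 24 * 60
    budget_minutes = (1 - slo) * window_minutes   # e.g. 99.9% over 30 days ≈ 43.2 min
    return outage_minutes / budget_minutes

# The 45-minute outage in the timeline above, against a 99.9% 30-day SLO:
print(f"{error_budget_consumed(0.999, 45):.0%}")  # ≈ 104%: the whole month's budget, and then some
```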

Interview Questions

Q: "Tell me about an incident you handled. What went well and what would you do differently?"

Use the STAR-L format:

  • Situation: Context and severity
  • Task: Your role in the response
  • Action: Steps you took
  • Result: Outcome and metrics
  • Learning: What changed because of it

Q: "You're on-call and get paged at 3 AM. Walk me through your process."

1. ACKNOWLEDGE (2 min)
   - Stop alert escalation
   - Check: is this real or false positive?

2. ASSESS (5 min)
   - Dashboard review
   - Severity determination
   - Scope: who's affected?

3. MITIGATE (variable)
   - Can I fix quickly?
   - Do I need to escalate?
   - What's the safest path to restore service?

4. COMMUNICATE (throughout)
   - Update incident channel
   - Status page if customer-facing
   - Escalate if stuck > 30 min

5. FOLLOW UP (next day)
   - Ensure proper resolution
   - Create postmortem if warranted
   - Hand off to day team

You've completed the SRE skills foundation. Final module: Behavioral interviews and salary negotiation.
