Monitoring, Observability & Incident Response
SLOs, SLIs, and Error Budgets
SLOs are the cornerstone of SRE. They define reliability targets and guide engineering decisions.
Terminology
| Term | Definition | Example |
|---|---|---|
| SLI | Service Level Indicator - what we measure | Request latency p99 |
| SLO | Service Level Objective - our target | p99 latency < 200ms |
| SLA | Service Level Agreement - contract | 99.9% uptime or refund |
| Error Budget | Allowed unreliability | 0.1% = 43.8 min/month |
The SLI/SLO/SLA Hierarchy
SLA: Contractual obligation (external)
│   "We guarantee 99.9% availability or credits issued"
│
└── SLO: Internal target (stricter than SLA)
    │   "We aim for 99.95% availability"
    │
    └── SLI: What we actually measure
            "Successful requests / Total requests"
Common SLIs
Availability SLI
# Availability = successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))
# Result: 0.9995 = 99.95% availability
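The same ratio can be checked by hand from raw request counts. The following is a minimal Python sketch, not part of any monitoring library; the function name `availability_sli` and the sample counts are assumptions chosen for illustration.

```python
def availability_sli(total_requests: int, server_errors: int) -> float:
    """Fraction of requests that did not end in a 5xx response."""
    if total_requests == 0:
        # No traffic in the window: treating this as fully available is a
        # policy choice made for this sketch, not a universal rule.
        return 1.0
    return (total_requests - server_errors) / total_requests


# Hypothetical 30-day totals: 2,000,000 requests, 1,000 of them 5xx.
sli = availability_sli(total_requests=2_000_000, server_errors=1_000)
print(f"{sli:.4%}")  # 99.9500%
```

Counting only 5xx responses as failures mirrors the `status!~"5.."` selector in the query above; whether 4xx responses should count depends on the service.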
Latency SLI
# Percentage of requests faster than 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))
# Result: 0.95 = 95% of requests under 200ms
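To make the bucket arithmetic concrete, here is a small Python sketch of the same calculation. It assumes Prometheus-style cumulative buckets, where each `le` bound counts every observation at or below it and the `+Inf` bucket equals the total count. The function name `latency_sli` and the sample numbers are illustrative assumptions.

```python
def latency_sli(cumulative_buckets: dict[float, int], threshold_s: float) -> float:
    """Fraction of requests completing within threshold_s seconds.

    cumulative_buckets maps each upper bound (le) to the number of
    observations at or below it; float("inf") holds the total count.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        return 1.0  # no traffic: treated as compliant in this sketch
    if threshold_s not in cumulative_buckets:
        # The threshold must match an existing bucket boundary; otherwise
        # the fraction can only be estimated by interpolation.
        raise KeyError(f"no bucket with le={threshold_s}")
    return cumulative_buckets[threshold_s] / total


# Hypothetical histogram: 950 of 1,000 requests finished within 200ms.
buckets = {0.1: 700, 0.2: 950, 0.5: 990, float("inf"): 1000}
print(f"{latency_sli(buckets, 0.2):.0%}")  # 95%
```

This is why the SLO threshold should line up with a histogram bucket boundary: a 200ms target is only measurable exactly if an `le="0.2"` bucket exists.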
Throughput SLI
# Requests per second
sum(rate(http_requests_total[5m]))
Error Budgets Explained
Error Budget = 100% - SLO Target
| SLO | Allowed Downtime/Month (average 30.44-day month) | Error Budget |
|---|---|---|
| 99.9% | 43.8 minutes | 0.1% |
| 99.95% | 21.9 minutes | 0.05% |
| 99.99% | 4.38 minutes | 0.01% |
| 99.999% | 26 seconds | 0.001% |
Error Budget Calculation
30-day period
SLO: 99.9% availability
Total minutes: 30 × 24 × 60 = 43,200 minutes
Error budget: 43,200 × 0.001 = 43.2 minutes of downtime
If we've had 30 minutes of downtime:
Remaining budget: 43.2 - 30 = 13.2 minutes (about 31% remaining)
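The calculation above can be sketched in a few lines of Python. The function names `error_budget_minutes` and `remaining_fraction` are hypothetical helpers written for this example, and the window defaults to the 30-day period used above.

```python
def error_budget_minutes(slo_percent: float, window_days: float = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def remaining_fraction(budget_minutes: float, downtime_minutes: float) -> float:
    """Share of the error budget still unspent, clamped at zero."""
    if budget_minutes <= 0:
        return 0.0  # a 100% SLO leaves no budget to spend
    return max(0.0, (budget_minutes - downtime_minutes) / budget_minutes)


budget = error_budget_minutes(99.9)           # 30-day window
left = remaining_fraction(budget, downtime_minutes=30)
print(f"{budget:.1f} min budget, {left:.1%} remaining")
# 43.2 min budget, 30.6% remaining
```

Passing `window_days=30.4375` (an average calendar month) instead of 30 gives roughly 43.8 minutes for a 99.9% SLO.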
Using Error Budgets
Error Budget Policy:
  Budget > 50% remaining:
    → Ship features, take risks
    → Can experiment with new deployments
  Budget 25-50% remaining:
    → Proceed with caution
    → Prioritize reliability work
  Budget < 25% remaining:
    → Feature freeze
    → Focus on stability
    → Investigate root causes
  Budget exhausted:
    → No new deployments
    → All hands on reliability
    → Postmortems required
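A policy like this is easy to encode so that dashboards or deploy tooling can apply it consistently. The sketch below is one possible encoding in Python; the function name `budget_policy`, the tier labels, and the handling of the exact 25% and 50% boundaries are assumptions, since the policy above does not specify them.

```python
def budget_policy(remaining: float) -> str:
    """Map the fraction of error budget remaining (0.0 to 1.0) to a tier."""
    if remaining <= 0:
        return "exhausted: no new deployments, all hands on reliability"
    if remaining < 0.25:
        return "freeze: feature freeze, focus on stability"
    if remaining <= 0.50:
        # Boundary choice for this sketch: exactly 50% counts as caution.
        return "caution: proceed carefully, prioritize reliability work"
    return "normal: ship features, take risks"


for remaining in (0.80, 0.31, 0.10, 0.0):
    print(f"{remaining:.0%} -> {budget_policy(remaining)}")
```

With 13.2 of 43.2 minutes left (about 31%), this sketch lands in the caution tier.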
SLO Best Practices
Choosing Good SLOs
| Good SLO | Bad SLO |
|---|---|
| Based on user experience | Based on internal metrics users never see (e.g. CPU usage) |
| Measurable and achievable | Too strict or too loose |
| Has clear consequences | No one pays attention |
| Reviewed quarterly | Set and forget |
Example SLO Document
Service: Payment API
Owner: Payments Team
SLIs:
  - availability:
      description: "Successful payment transactions"
      measurement: "Non-5xx responses / Total responses"
  - latency:
      description: "Payment processing time"
      measurement: "p99 response time"
SLOs:
  - availability:
      target: 99.95%
      window: 30 days rolling
      consequence: "Feature freeze if below 99.9%"
  - latency:
      target: "p99 < 500ms"
      window: 30 days rolling
      consequence: "Performance sprint if exceeded"
Error Budget:
  - monthly_budget: 21.9 minutes
  - alerting: "Page when 50% consumed in < 7 days"
  - escalation: "Manager review at 75% consumed"
Multi-Window SLOs
Detect issues at different timescales:
# Fast burn: Rapid consumption (detect outages)
- window: 1h
  burn_rate: 14.4x  # Would exhaust 30-day budget in 2 days
  alert: page
# Slow burn: Gradual degradation
- window: 6h
  burn_rate: 6x     # Would exhaust budget in 5 days
  alert: page
# Very slow burn: Minor issues
- window: 3d
  burn_rate: 1x     # On track to exhaust budget
  alert: ticket
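The burn rates above follow from simple arithmetic: burn rate is the observed error ratio divided by the error budget ratio (1 - SLO), and a burn rate of 1x spends the budget exactly over the SLO window. The Python sketch below works through the numbers, assuming a 30-day SLO window; the function names are hypothetical helpers, not part of any alerting tool.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'sustainable' the budget is being spent."""
    budget_ratio = 1 - slo
    if budget_ratio <= 0:
        raise ValueError("a 100% SLO has no error budget to burn")
    return error_ratio / budget_ratio


def days_to_exhaustion(rate: float, slo_window_days: float = 30) -> float:
    """Days until the whole budget is gone if this burn rate persists."""
    return slo_window_days / rate


def budget_consumed(rate: float, alert_window_hours: float,
                    slo_window_days: float = 30) -> float:
    """Fraction of the budget spent during one alert window at this rate."""
    return rate * alert_window_hours / (slo_window_days * 24)


# With a 99.9% SLO, a 1.44% error ratio corresponds to a 14.4x burn rate.
print(f"{burn_rate(0.0144, 0.999):.1f}x")

for hours, rate in [(1, 14.4), (6, 6.0), (72, 1.0)]:
    print(f"{hours}h at {rate}x: "
          f"{budget_consumed(rate, hours):.0%} of budget, "
          f"exhausted in {days_to_exhaustion(rate):.1f} days")
```

Under these assumptions the three windows correspond to spending about 2%, 5%, and 10% of the 30-day budget before the alert fires, which is why the faster burns page and the slowest one only opens a ticket.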
Interview Questions
Q: "How do you decide on SLO targets?"
- Start with user expectations: What do users consider "working"?
- Baseline current performance: What are we achieving today?
- Consider business needs: What can we afford to maintain?
- Account for dependencies: A service can't be more reliable than its least reliable hard dependency
- Start conservative: Easier to tighten than loosen
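The dependency point can be made quantitative. If every request needs all hard dependencies to succeed, and failures are assumed to be independent (a simplification), the best achievable availability is the product of the individual availabilities. The sketch below uses a hypothetical helper `serial_availability` and made-up dependency figures.

```python
from math import prod


def serial_availability(availabilities: list[float]) -> float:
    """Upper bound on availability when every dependency must succeed.

    Assumes independent failures, which real systems only approximate.
    """
    return prod(availabilities)


# Hypothetical hard dependencies: database, auth service, payment gateway.
deps = [0.9995, 0.999, 0.9999]
ceiling = serial_availability(deps)
print(f"{ceiling:.4%}")
print(ceiling < min(deps))  # True: the ceiling sits below the weakest dependency
```

A 99.95% SLO would not be achievable on top of these three dependencies without caching, fallbacks, or redundancy.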
Q: "Your team is about to exhaust the error budget. What do you do?"
Immediate actions:
1. Halt non-critical deployments
2. Review recent changes for rollback candidates
3. Increase monitoring sensitivity
4. Staff additional on-call if needed
Short-term:
1. Identify top error contributors
2. Fix the highest-impact issues
3. Add tests to prevent regression
Long-term:
1. Postmortem on budget exhaustion
2. Review SLO appropriateness
3. Invest in reliability improvements
Q: "How do you handle SLOs for dependencies you don't control?"
| Strategy | When to Use |
|---|---|
| Mirror dependency SLO | Dependency is reliable enough |
| Set lower target | Dependency is unreliable |
| Circuit breaker | Fast degradation needed |
| Multi-provider | Critical service, budget available |
| Cache/retry | Transient failures acceptable |
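For the multi-provider row, a rough estimate shows why redundancy helps. If a request succeeds when any one provider is up, and provider failures are assumed to be independent (often optimistic, since providers can share upstream causes), the combined availability is one minus the product of the individual failure probabilities. The helper name `redundant_availability` and the figures are illustrative assumptions.

```python
from math import prod


def redundant_availability(availabilities: list[float]) -> float:
    """Availability when any single provider succeeding is enough.

    Assumes independent failures and instant, perfect failover.
    """
    return 1 - prod(1 - a for a in availabilities)


# Two hypothetical providers, each offering 99% availability.
combined = redundant_availability([0.99, 0.99])
print(f"{combined:.2%}")  # 99.99%
```

Real failover adds detection delay and its own failure modes, so treat this figure as a ceiling rather than a forecast.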
Next, we'll cover incident response, the ultimate SRE skill test.