Monitoring, Observability & Incident Response

SLOs, SLIs, and Error Budgets


SLOs are the cornerstone of SRE. They define reliability targets and guide engineering decisions.

Terminology

Term | Definition | Example
SLI | Service Level Indicator: what we measure | Request latency p99
SLO | Service Level Objective: our target | p99 latency < 200ms
SLA | Service Level Agreement: contract | 99.9% uptime or refund
Error Budget | Allowed unreliability | 0.1% = 43.8 min/month

The SLI/SLO/SLA Hierarchy

SLA: Contractual obligation (external)
 │   "We guarantee 99.9% availability or credits issued"
 └── SLO: Internal target (stricter than SLA)
      │   "We aim for 99.95% availability"
      └── SLI: What we actually measure
           "Successful requests / Total requests"

Common SLIs

Availability SLI

# Availability = successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))

# Result: 0.9995 = 99.95% availability
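
The same ratio can be sketched outside Prometheus. This is a minimal illustration, assuming you already hold raw request counts per status code; the `availability` function and the sample counts are hypothetical and not part of any library.

```python
def availability(status_counts: dict[int, int]) -> float:
    """Return the fraction of requests that did not end in a 5xx status."""
    total = sum(status_counts.values())
    if total == 0:
        # Assumption: no traffic counts as fully available (avoids dividing by zero).
        return 1.0
    failed = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
    return (total - failed) / total


# Hypothetical counts for one measurement window.
counts = {200: 99_900, 404: 50, 500: 30, 503: 20}
print(f"{availability(counts):.2%}")  # 99.95%
```

Note that 4xx responses count as successful here, matching the `status!~"5.."` selector in the query.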

Latency SLI

# Percentage of requests faster than 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))

# Result: 0.95 = 95% of requests under 200ms
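
To make the bucket arithmetic concrete, here is a small sketch that assumes cumulative bucket counts in the style of a Prometheus histogram. The `fraction_within` helper and the sample data are hypothetical.

```python
def fraction_within(buckets: dict[float, int], threshold: float) -> float:
    """Fraction of requests at or below `threshold` seconds.

    `buckets` maps an upper bound (the `le` label) to a cumulative count;
    the float("inf") entry holds the total number of requests.
    """
    if threshold not in buckets:
        # The ratio is only exact at a configured bucket boundary.
        raise KeyError(f"no bucket boundary at {threshold}s")
    total = buckets[float("inf")]
    if total == 0:
        return 1.0  # Assumption: no traffic means the target is met.
    return buckets[threshold] / total


# Hypothetical cumulative counts: 950 of 1000 requests finished within 0.2s.
latency_buckets = {0.1: 800, 0.2: 950, 0.5: 990, float("inf"): 1000}
print(fraction_within(latency_buckets, 0.2))  # 0.95
```

The threshold has to match an existing bucket boundary, which is why the query uses `le="0.2"` rather than an arbitrary value.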

Throughput SLI

# Requests per second
sum(rate(http_requests_total[5m]))

Error Budgets Explained

Error Budget = 100% - SLO Target

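SLO | Allowed Downtime/Month | Error Budget
99.9% | 43.8 minutes | 0.1%
99.95% | 21.9 minutes | 0.05%
99.99% | 4.38 minutes | 0.01%
99.999% | 26 seconds | 0.001%

(Downtime figures assume an average month of about 30.44 days; a strict 30-day window gives slightly smaller numbers.)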
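
The downtime column can be reproduced with a few lines of code. This sketch assumes an average month of 365.25 / 12 days, which is what the table values correspond to; `allowed_downtime` is a hypothetical helper name.

```python
from datetime import timedelta

# Assumption: "month" means an average month, about 30.44 days.
AVERAGE_MONTH = timedelta(days=365.25 / 12)


def allowed_downtime(slo_percent: float, period: timedelta = AVERAGE_MONTH) -> timedelta:
    """Downtime permitted in `period` by an availability SLO given in percent."""
    budget_fraction = (100 - slo_percent) / 100
    return period * budget_fraction


for slo in (99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime(slo).total_seconds() / 60
    print(f"{slo}% -> {minutes:.2f} minutes")
```

Passing `period=timedelta(days=30)` instead yields the figures for a fixed 30-day window.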

Error Budget Calculation

30-day period
SLO: 99.9% availability
Total minutes: 30 × 24 × 60 = 43,200 minutes

Error budget: 43,200 × 0.001 = 43.2 minutes of downtime

If we've had 30 minutes of downtime:
Remaining budget: 43.2 - 30 = 13.2 minutes (about 30.6% remaining)
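
The calculation above can be sketched as a small function. The name `remaining_budget` and its parameters are hypothetical, chosen only for illustration.

```python
def remaining_budget(slo_percent: float, window_days: int, downtime_minutes: float):
    """Return (total budget, remaining minutes, remaining fraction) for a window."""
    total_minutes = window_days * 24 * 60
    budget = total_minutes * (100 - slo_percent) / 100
    remaining = budget - downtime_minutes
    return budget, remaining, remaining / budget


budget, remaining, fraction = remaining_budget(99.9, 30, 30)
print(f"budget={budget:.1f} min, remaining={remaining:.1f} min ({fraction:.1%})")
# budget=43.2 min, remaining=13.2 min (30.6%)
```

A negative remaining value means the budget has been overspent for the window.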

Using Error Budgets

Error Budget Policy:

Budget > 50% remaining:
  → Ship features, take risks
  → Can experiment with new deployments

Budget 25-50% remaining:
  → Proceed with caution
  → Prioritize reliability work

Budget < 25% remaining:
  → Feature freeze
  → Focus on stability
  → Investigate root causes

Budget exhausted:
  → No new deployments
  → All hands on reliability
  → Postmortems required
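
The policy tiers above can be expressed as a simple lookup. This is only a sketch: `policy_for` is a hypothetical helper, and treating exactly 25% and exactly 50% as "proceed with caution" is an assumption, since the policy does not say which tier owns the boundaries.

```python
def policy_for(remaining_fraction: float) -> str:
    """Map the remaining error budget (0.0 to 1.0) to a policy tier."""
    if remaining_fraction <= 0:
        return "exhausted: no new deployments, all hands on reliability"
    if remaining_fraction < 0.25:
        return "freeze: feature freeze, focus on stability"
    if remaining_fraction <= 0.50:
        return "caution: proceed carefully, prioritize reliability work"
    return "ship: ship features, take risks"


for fraction in (0.80, 0.40, 0.10, 0.0):
    print(f"{fraction:.0%} remaining -> {policy_for(fraction)}")
```

Encoding the policy this way makes it straightforward to surface the current tier on a dashboard or in a deployment gate.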

SLO Best Practices

Choosing Good SLOs

Good SLO | Bad SLO
Based on user experience | Based on technical metrics
Measurable and achievable | Too strict or too loose
Has clear consequences | No one pays attention
Reviewed quarterly | Set and forget

Example SLO Document

Service: Payment API
Owner: Payments Team

SLIs:
  - availability:
      description: "Successful payment transactions"
      measurement: "Non-5xx responses / Total responses"
  - latency:
      description: "Payment processing time"
      measurement: "p99 response time"

SLOs:
  - availability:
      target: 99.95%
      window: 30 days rolling
      consequence: "Feature freeze if below 99.9%"
  - latency:
      target: "p99 < 500ms"
      window: 30 days rolling
      consequence: "Performance sprint if exceeded"

Error Budget:
  - monthly_budget: 21.9 minutes  # 0.05% of an average month; 21.6 minutes over a strict 30-day window
  - alerting: "Page when 50% consumed in < 7 days"
  - escalation: "Manager review at 75% consumed"

Multi-Window SLOs

Detect issues at different timescales:

# Fast burn: Rapid consumption (detect outages)
- window: 1h
  burn_rate: 14.4x  # Would exhaust 30-day budget in 2 days
  alert: page

# Slow burn: Gradual degradation
- window: 6h
  burn_rate: 6x     # Would exhaust budget in 5 days
  alert: page

# Very slow burn: Minor issues
- window: 3d
  burn_rate: 1x     # Would exhaust budget in exactly 30 days
  alert: ticket
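
The comments on each burn rate follow from simple arithmetic: a burn rate is the observed error ratio divided by the budgeted error ratio, and the time to exhaustion is the SLO window divided by the burn rate. A minimal sketch, with hypothetical function names:

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """How many times faster than budgeted the service is producing errors."""
    budget_fraction = (100 - slo_percent) / 100
    return error_ratio / budget_fraction


def days_to_exhaustion(rate: float, window_days: float = 30) -> float:
    """Days until the whole window's budget is gone at a constant burn rate."""
    if rate <= 0:
        return float("inf")  # Nothing is being consumed.
    return window_days / rate


for rate in (14.4, 6, 1):
    print(f"{rate}x -> {days_to_exhaustion(rate):.1f} days")
# 14.4x -> 2.1 days
# 6x -> 5.0 days
# 1x -> 30.0 days
```

For example, a 1.44% error ratio against a 99.9% SLO is a burn rate of about 14.4, which would trigger the fast-burn page.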

Interview Questions

Q: "How do you decide on SLO targets?"

  1. Start with user expectations: What do users consider "working"?
  2. Baseline current performance: What are we achieving today?
  3. Consider business needs: What can we afford to maintain?
  4. Account for dependencies: A service can't be more reliable than its least reliable critical dependency
  5. Start conservative: A looser target is easier to tighten later than a strict one is to loosen
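
The dependency point can be illustrated numerically. This sketch assumes every dependency is required for each request and that failures are independent, which real systems only approximate; `serial_availability` and the sample figures are hypothetical.

```python
import math


def serial_availability(availabilities: list[float]) -> float:
    """Best-case availability of a service that needs all of its dependencies.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return math.prod(availabilities)


# Hypothetical availabilities: own code, database, third-party API.
deps = [0.9995, 0.999, 0.9999]
combined = serial_availability(deps)
print(f"combined: {combined:.4%}, weakest dependency: {min(deps):.4%}")
```

Under these assumptions the combined figure is always at or below the weakest dependency, so an SLO target above that ceiling is not achievable without mitigation such as caching or fallbacks.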

Q: "Your team is about to exhaust the error budget. What do you do?"

Immediate actions:
1. Halt non-critical deployments
2. Review recent changes for rollback candidates
3. Increase monitoring sensitivity
4. Staff additional on-call if needed

Short-term:
1. Identify top error contributors
2. Fix the highest-impact issues
3. Add tests to prevent regression

Long-term:
1. Postmortem on budget exhaustion
2. Review SLO appropriateness
3. Invest in reliability improvements

Q: "How do you handle SLOs for dependencies you don't control?"

Strategy | When to Use
Mirror dependency SLO | Dependency is reliable enough
Set lower target | Dependency is unreliable
Circuit breaker | Fast degradation needed
Multi-provider | Critical service, budget available
Cache/retry | Transient failures acceptable

Next, we'll cover incident response, the ultimate SRE skill test.
