SLOs, SLIs, and Error Budgets

SLOs are the cornerstone of SRE. They define reliability targets and guide engineering decisions.

Terminology

Term	Definition	Example
SLI	Service Level Indicator - what we measure	Request latency p99
SLO	Service Level Objective - our target	p99 latency < 200ms
SLA	Service Level Agreement - contract	99.9% uptime or refund
Error Budget	Allowed unreliability	0.1% ≈ 43.8 min in an average month

The SLI/SLO/SLA Hierarchy

SLA: Contractual obligation (external)
 │   "We guarantee 99.9% availability or credits issued"
 │
 └── SLO: Internal target (stricter than SLA)
      │   "We aim for 99.95% availability"
      │
      └── SLI: What we actually measure
           "Successful requests / Total requests"

Common SLIs

Availability SLI

# Availability = successful requests / total requests
sum(rate(http_requests_total{status!~"5.."}[30d]))
/
sum(rate(http_requests_total[30d]))

# Result: 0.9995 = 99.95% availability
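The same ratio can be sketched outside Prometheus. This is a minimal illustration, assuming you already have raw success and total counts for the window; the function name `availability_sli` and the choice to treat a zero-traffic window as fully available are assumptions, not something the query above defines.

```python
def availability_sli(successful: int, total: int) -> float:
    """Return the fraction of requests that succeeded (non-5xx)."""
    if total == 0:
        # Assumption: a window with no traffic counts as fully available.
        return 1.0
    return successful / total


sli = availability_sli(successful=999_500, total=1_000_000)
print(f"{sli:.4%}")  # 99.9500%
```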

Latency SLI

# Percentage of requests faster than 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[30d]))
/
sum(rate(http_request_duration_seconds_count[30d]))

# Result: 0.95 = 95% of requests under 200ms
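To make the bucket arithmetic concrete, here is a small sketch that assumes Prometheus-style cumulative histogram buckets, where each bucket counts all requests at or below its upper bound and the `+Inf` bucket holds the total. The function name `latency_sli` and the sample counts are illustrative only.

```python
from typing import Mapping


def latency_sli(cumulative_buckets: Mapping[float, int], threshold_s: float) -> float:
    """Fraction of requests that completed within threshold_s seconds.

    cumulative_buckets maps each bucket's upper bound (the `le` label, in
    seconds) to its cumulative count; float("inf") must hold the total.
    """
    total = cumulative_buckets[float("inf")]
    if total == 0:
        # Assumption: no traffic means the latency target was not violated.
        return 1.0
    if threshold_s not in cumulative_buckets:
        # The threshold has to match an existing bucket boundary.
        raise KeyError(f"no bucket with le={threshold_s}")
    return cumulative_buckets[threshold_s] / total


# Hypothetical counts for one window.
buckets = {0.1: 700, 0.2: 950, 0.5: 990, float("inf"): 1000}
print(latency_sli(buckets, 0.2))  # 0.95
```

One consequence of this design is that the SLO threshold must line up with a bucket boundary; a 200ms target needs a bucket with `le="0.2"`.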

Throughput SLI

# Requests per second
sum(rate(http_requests_total[5m]))

Error Budgets Explained

Error Budget = 100% - SLO Target

SLO	Allowed Downtime/Month (average 30.44-day month)	Error Budget
99.9%	43.8 minutes	0.1%
99.95%	21.9 minutes	0.05%
99.99%	4.38 minutes	0.01%
99.999%	26 seconds	0.001%
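The table values can be reproduced with a short sketch. It assumes an average month of 365.25 / 12 days (43,830 minutes), which is what yields 43.8 minutes for 99.9%; a 30-day window gives slightly smaller numbers. The helper name `allowed_downtime_minutes` is hypothetical.

```python
AVG_MONTH_MINUTES = 365.25 * 24 * 60 / 12  # 43,830 minutes in an average month


def allowed_downtime_minutes(slo_percent: float,
                             window_minutes: float = AVG_MONTH_MINUTES) -> float:
    """Minutes of downtime permitted by an availability SLO over the window."""
    budget_fraction = (100.0 - slo_percent) / 100.0  # Error Budget = 100% - SLO
    return window_minutes * budget_fraction


for slo in (99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {minutes:.2f} min ({minutes * 60:.0f} s)")
```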

Error Budget Calculation

30-day period
SLO: 99.9% availability
Total minutes: 30 × 24 × 60 = 43,200 minutes

Error budget: 43,200 × 0.001 = 43.2 minutes of downtime

If we've had 30 minutes of downtime:
Remaining budget: 43.2 - 30 = 13.2 minutes (about 31% remaining)
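The worked calculation above can be expressed as a small function. This is a sketch assuming a time-based budget measured in minutes of downtime; `remaining_budget` is an illustrative name, and a negative result simply means the budget is overspent.

```python
def remaining_budget(slo_percent: float, window_minutes: float,
                     downtime_minutes: float) -> tuple:
    """Return (minutes remaining, fraction of the budget remaining)."""
    budget = window_minutes * (100.0 - slo_percent) / 100.0
    remaining = budget - downtime_minutes
    return remaining, remaining / budget


window = 30 * 24 * 60  # 43,200 minutes in a 30-day period
remaining, fraction = remaining_budget(99.9, window, downtime_minutes=30)
print(f"{remaining:.1f} min left ({fraction:.1%})")  # 13.2 min left (30.6%)
```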

Using Error Budgets

Error Budget Policy:

Budget > 50% remaining:
  → Ship features, take risks
  → Can experiment with new deployments

Budget 25-50% remaining:
  → Proceed with caution
  → Prioritize reliability work

Budget < 25% remaining:
  → Feature freeze
  → Focus on stability
  → Investigate root causes

Budget exhausted:
  → No new deployments
  → All hands on reliability
  → Postmortems required
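A policy like this is easy to encode so that dashboards or deploy gates can apply it consistently. The sketch below assumes the thresholds listed above and treats exactly 25% and exactly 50% as part of the "caution" tier; the function name and the returned labels are hypothetical.

```python
def budget_policy(remaining_fraction: float) -> str:
    """Map the fraction of error budget remaining to a policy tier."""
    if remaining_fraction <= 0:
        return "exhausted"   # no new deployments, all hands on reliability
    if remaining_fraction < 0.25:
        return "freeze"      # feature freeze, focus on stability
    if remaining_fraction <= 0.50:
        return "caution"     # proceed with caution, prioritize reliability
    return "ship"            # ship features, take risks


for fraction in (0.80, 0.31, 0.10, 0.0):
    print(f"{fraction:.0%} remaining -> {budget_policy(fraction)}")
```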

SLO Best Practices

Choosing Good SLOs

Good SLO	Bad SLO
Based on user experience	Based on internal metrics users never see (e.g., CPU usage)
Measurable and achievable	Too strict or too loose
Has clear consequences	No one pays attention
Reviewed quarterly	Set and forget

Example SLO Document

Service: Payment API
Owner: Payments Team

SLIs:
  - availability:
      description: "Successful payment transactions"
      measurement: "Non-5xx responses / Total responses"
  - latency:
      description: "Payment processing time"
      measurement: "p99 response time"

SLOs:
  - availability:
      target: 99.95%
      window: 30 days rolling
      consequence: "Feature freeze if below 99.9%"
  - latency:
      target: "p99 < 500ms"
      window: 30 days rolling
      consequence: "Performance sprint if exceeded"

Error Budget:
  - monthly_budget: 21.6 minutes  # 0.05% of the 30-day window (43,200 minutes)
  - alerting: "Page when 50% consumed in < 7 days"
  - escalation: "Manager review at 75% consumed"

Multi-Window SLOs

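Detect issues at different timescales by alerting on burn rate, the speed at which the error budget is being consumed relative to a pace that would use it up exactly at the end of the SLO window: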

# Fast burn: Rapid consumption (detect outages)
- window: 1h
  burn_rate: 14.4x  # Would exhaust 30-day budget in about 2 days
  alert: page

# Slow burn: Gradual degradation
- window: 6h
  burn_rate: 6x     # Would exhaust budget in 5 days
  alert: page

# Very slow burn: Minor issues
- window: 3d
  burn_rate: 1x     # On track to exhaust budget exactly at the end of the 30-day window
  alert: ticket
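The burn-rate figures above follow from simple arithmetic, sketched here. It assumes a 30-day SLO period and defines burn rate as the observed error rate divided by the error rate the SLO allows; all three function names are illustrative.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    return error_rate / (1.0 - slo)


def days_to_exhaustion(rate: float, period_days: float = 30) -> float:
    """Days until the whole budget is gone if this burn rate persists."""
    return period_days / rate


def budget_consumed(rate: float, window_hours: float,
                    period_days: float = 30) -> float:
    """Fraction of the period's budget spent during one alert window."""
    return rate * window_hours / (period_days * 24)


# (alert window in hours, burn-rate threshold) for the three tiers above
for window_hours, rate in ((1, 14.4), (6, 6.0), (72, 1.0)):
    print(f"{rate}x over {window_hours}h: "
          f"exhausts in {days_to_exhaustion(rate):.1f} days, "
          f"uses {budget_consumed(rate, window_hours):.0%} of budget per window")
```

Read this way, the thresholds correspond to spending roughly 2% of the budget in one hour, 5% in six hours, and 10% in three days.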

Interview Questions

Q: "How do you decide on SLO targets?"

Start with user expectations: What do users consider "working"?
Baseline current performance: What are we achieving today?
Consider business needs: What can we afford to maintain?
Account for dependencies: A service can't be more available than its least reliable hard dependency
Start conservative: Easier to tighten than loosen
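The dependency point can be made concrete with a rough calculation. This sketch assumes every dependency is a hard, serial dependency (the request fails if any one of them fails) and that failures are independent; real systems rarely satisfy both exactly, so treat the result as an estimate. The availability figures are made up for illustration.

```python
import math


def serial_availability(availabilities) -> float:
    """Availability of a request path that needs every component to work."""
    return math.prod(availabilities)


# Hypothetical: our own service at 99.99%, plus two hard dependencies.
combined = serial_availability([0.9999, 0.9995, 0.999])
print(f"{combined:.4%}")  # 99.8401%
```

Under these assumptions the combined figure is lower than the weakest component, which is why a 99.95% target on top of a 99.9% dependency is unlikely to hold without caching or fallbacks.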

Q: "Your team is about to exhaust the error budget. What do you do?"

Immediate actions:
1. Halt non-critical deployments
2. Review recent changes for rollback candidates
3. Increase monitoring sensitivity
4. Staff additional on-call if needed

Short-term:
1. Identify top error contributors
2. Fix the highest-impact issues
3. Add tests to prevent regression

Long-term:
1. Postmortem on budget exhaustion
2. Review SLO appropriateness
3. Invest in reliability improvements

Q: "How do you handle SLOs for dependencies you don't control?"

Strategy	When to Use
Mirror dependency SLO	Dependency is reliable enough
Set lower target	Dependency is unreliable
Circuit breaker	Fast degradation needed
Multi-provider	Critical service, budget available
Cache/retry	Transient failures acceptable
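For the multi-provider row, a back-of-the-envelope estimate shows why redundancy helps. The sketch assumes the providers fail independently and that failover itself is perfect, both of which are optimistic; the function name and figures are hypothetical.

```python
import math


def redundant_availability(availabilities) -> float:
    """Availability when any one of several independent providers suffices."""
    all_fail = math.prod(1.0 - a for a in availabilities)
    return 1.0 - all_fail


# Two hypothetical providers, each 99.9% available.
print(f"{redundant_availability([0.999, 0.999]):.4%}")  # 99.9999%
```

Correlated outages (shared regions, shared DNS, a common upstream) reduce the real benefit, so the computed number is an upper bound rather than a target to commit to.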

Next, we'll cover incident response, the ultimate SRE skill test.