Distributed Systems & Reliability
Reliability & Observability
A distributed backend that cannot measure its own health is flying blind. This lesson covers the SRE toolkit that interviewers expect senior candidates to know: SLOs, circuit breakers, retry strategies, distributed tracing, and chaos engineering.
SLOs, SLIs, and SLAs
These three terms form a hierarchy from measurement to promise:
| Term | What It Is | Example |
|---|---|---|
| SLI (Service Level Indicator) | A measured metric | p99 latency = 187ms |
| SLO (Service Level Objective) | A target for the SLI | 99.9% of requests < 200ms |
| SLA (Service Level Agreement) | A contractual commitment with penalties | "If uptime falls below 99.95%, customer gets 10% credit" |
Availability Levels Table
| Target | Yearly Downtime | Monthly Downtime | Typical Use |
|---|---|---|---|
| 99% (two nines) | 3.65 days | 7.3 hours | Internal tools, batch jobs |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | Most SaaS products |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | Payment systems, core APIs |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | DNS, CDN edge, pacemakers |
Error Budgets
The error budget is the allowed amount of unreliability: error budget = 1 - SLO.
With a 99.9% availability SLO:
- Error budget = 0.1% = 8.76 hours/year of downtime
- If you consume 6 hours in Q1 from an incident, you have 2.76 hours left
- If the budget is exhausted, freeze deployments and focus on reliability
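The arithmetic is worth internalizing; here is a minimal sketch of the 99.9% example above (numbers only, no library assumptions):

# Error budget arithmetic for a 99.9% availability SLO
HOURS_PER_YEAR = 365 * 24                        # 8,760 hours

slo = 0.999
error_budget = 1 - slo                           # 0.1% of the time may be down
budget_hours = error_budget * HOURS_PER_YEAR     # ~8.76 hours/year

consumed = 6.0                                   # e.g., a Q1 incident
remaining = budget_hours - consumed              # ~2.76 hours left this year
print(f"budget={budget_hours:.2f}h consumed={consumed:.1f}h remaining={remaining:.2f}h")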
Interview tip: Error budgets align incentives — product teams want to ship fast, SRE teams want stability. The error budget gives both sides a shared number to negotiate around.
Circuit Breaker Pattern
A circuit breaker prevents cascading failures by fast-failing requests to an unhealthy dependency instead of letting them time out and exhaust resources.
State Machine
┌──────────────────────────────────────────────────┐
│ │
│ ┌─────────┐ failures >= threshold ┌──────┐ │
│ │ CLOSED │ ─────────────────────────► │ OPEN │ │
│ │ (normal)│ │(fail │ │
│ │ │ ◄───────────────────────── │ fast)│ │
│ └─────────┘ successes >= threshold └──┬───┘ │
│ ▲ │ │
│ │ │ │
│ │ ┌───────────┐ timeout │ │
│ │ │ HALF-OPEN │ ◄──────────────┘ │
│ │ │ (testing) │ │
│ │ └─────┬─────┘ │
│ │ │ │
│ │ success │ failure │
│ └──────────────┘──────────────► OPEN │
└──────────────────────────────────────────────────┘
States:
- Closed (normal): Requests pass through. Count failures. If failures reach the threshold (e.g., 5 failures in 60 seconds), transition to Open.
- Open (fast-fail): All requests immediately return an error or fallback response. No calls to the dependency. After a timeout period (e.g., 30 seconds), transition to Half-Open.
- Half-Open (testing): Allow a limited number of requests through. If they succeed, transition to Closed. If any fail, transition back to Open.
# Circuit breaker logic (runnable sketch of the pseudocode)
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are being rejected."""

class CircuitBreaker:
    def __init__(self):
        self.state = "CLOSED"
        self.failure_count = 0
        self.success_count = 0
        self.failure_threshold = 5
        self.timeout_duration = 30  # seconds to wait before probing again
        self.success_threshold = 3
        self.last_failure_time = None

    def call(self, func):
        if self.state == "OPEN":
            if time.monotonic() - self.last_failure_time > self.timeout_duration:
                self.state = "HALF_OPEN"   # timeout expired: let test requests through
                self.success_count = 0
            else:
                raise CircuitOpenError("Service unavailable")
        try:
            result = func()
        except Exception:
            self.on_failure()
            raise
        self.on_success()
        return result

    def on_success(self):
        if self.state == "HALF_OPEN":
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = "CLOSED"      # dependency has recovered
                self.failure_count = 0
        else:
            self.failure_count = 0         # a healthy call resets the count

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
            self.state = "OPEN"            # fail fast until the timeout expires
Libraries: resilience4j (Java), gobreaker (Go), pybreaker (Python). Hystrix (Netflix) is in maintenance mode — use resilience4j instead.
Retry Strategies
Exponential Backoff with Jitter
Naive retries cause a thundering herd: thousands of clients retry at the same moment, overwhelming the recovering service.
delay = base * 2^attempt
Attempt 0: 1s
Attempt 1: 2s
Attempt 2: 4s
Attempt 3: 8s
Attempt 4: 16s (capped at max_delay)
Full jitter randomizes the delay to spread retries across time:
# Exponential backoff with full jitter
import random
import time

class RetryableError(Exception):
    """Raised (or mapped from client errors) for failures that are worth retrying."""

def retry_with_backoff(func, max_retries=4, base_delay=1.0, max_delay=30.0):
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_retries:
                raise                                      # out of retries: give up
            exp_delay = min(base_delay * (2 ** attempt), max_delay)
            jittered_delay = random.uniform(0, exp_delay)  # full jitter: anywhere in [0, exp_delay]
            time.sleep(jittered_delay)
Retry + Circuit Breaker Integration
- Retries handle transient failures (network blips, brief overloads)
- Circuit breakers handle sustained failures (dependency is down)
- Together: retry 2-3 times, then if the circuit opens, stop retrying immediately
Interview tip: Always mention both together. Retries without a circuit breaker will bombard a failing service. A circuit breaker without retries gives up too easily on transient errors.
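A rough sketch of the combination, building on the CircuitBreaker and retry_with_backoff snippets above (TransientNetworkError is a hypothetical stand-in for whatever transient exceptions your client library raises):

# Retry wraps the breaker; an open circuit fails fast, transient errors are retried
def call_dependency(breaker, func):
    def guarded():
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise                               # circuit is open: do not retry
        except TransientNetworkError as exc:    # hypothetical transient error type
            raise RetryableError() from exc     # eligible for backoff-and-retry
    return retry_with_backoff(guarded, max_retries=2)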
Health Checks
In container orchestration (Kubernetes), three probe types keep traffic away from unhealthy pods and restart stuck ones:
| Probe | Question | Action on Failure |
|---|---|---|
| Liveness | Is the process alive? | Kill and restart the container |
| Readiness | Can it serve traffic? | Remove from load balancer, stop sending requests |
| Startup | Has it finished initializing? | Do not run liveness/readiness until startup succeeds |
# Kubernetes health check configuration
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3        # restart after 3 consecutive failures
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2        # remove from service after 2 failures
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30       # allow up to 5 min for startup (30 * 10s)
  periodSeconds: 10
Key design decisions:
- Liveness: Check only process health (can respond to HTTP, not stuck in deadlock). Do not check dependencies — if the database is down, restarting the app will not fix it.
- Readiness: Check dependencies (database connection, cache warmth, config loaded). When not ready, the pod stays running but stops receiving traffic.
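A minimal sketch of the two endpoints, using Flask for illustration (database_is_reachable is a hypothetical dependency check):

# Minimal liveness and readiness endpoints
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/healthz")
def healthz():
    # Liveness: only proves the process can answer HTTP; no dependency checks here.
    return jsonify(status="alive"), 200

@app.route("/ready")
def ready():
    # Readiness: verify dependencies before accepting traffic.
    if not database_is_reachable():      # hypothetical dependency check
        return jsonify(status="not ready"), 503
    return jsonify(status="ready"), 200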
Distributed Tracing (OpenTelemetry)
A trace follows a single request across multiple services:
Trace ID: abc-123
Service A ─────────────────────────────────────────
│ Span: HTTP GET /orders (200ms total) │
│ │
│ Service B ────────────────────────── │
│ │ Span: DB Query (50ms) │ │
│ └──────────────────────────────────┘ │
│ │
│ Service C ──────────────────────────────── │
│ │ Span: Cache Lookup (5ms) │ │
│ │ Span: External API call (120ms) │ │
│ └────────────────────────────────────────┘ │
└───────────────────────────────────────────────────┘
Core concepts:
- Trace: The entire journey of a request (one trace ID)
- Span: A single unit of work within a trace (has span_id, parent_span_id, duration, status)
- SpanContext: Metadata propagated across services (trace_id, span_id, trace_flags)
Context propagation: SpanContext is injected into HTTP headers (e.g., traceparent: 00-abc123-def456-01) so downstream services can continue the trace.
Visualization tools: Jaeger, Zipkin, Grafana Tempo
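A minimal sketch of manual span creation with the OpenTelemetry Python API, assuming the SDK and an exporter are configured elsewhere (fetch_order_row and warm_cache are hypothetical helpers):

# Manual span creation with OpenTelemetry
from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def get_order(order_id):
    with tracer.start_as_current_span("HTTP GET /orders") as span:
        span.set_attribute("order.id", order_id)        # searchable span attribute
        with tracer.start_as_current_span("db.query"):  # child span shares the trace ID
            row = fetch_order_row(order_id)             # hypothetical DB helper
        with tracer.start_as_current_span("cache.lookup"):
            warm_cache(row)                             # hypothetical cache helper
        return row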
Structured Logging
Unstructured logs ("Error processing order 123") are almost impossible to query at scale. Structured logging emits JSON that can be indexed and searched.
{
"timestamp": "2026-02-12T10:30:00Z",
"level": "ERROR",
"service": "order-service",
"trace_id": "abc-123",
"span_id": "def-456",
"correlation_id": "req-789",
"message": "Failed to process order",
"order_id": "ORD-123",
"error": "payment gateway timeout",
"duration_ms": 5023
}
Key fields: trace_id (links to distributed trace), correlation_id (links to business transaction), level (DEBUG, INFO, WARN, ERROR).
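A minimal sketch using only the standard library (many teams reach for a library such as structlog instead); log_event and its fields are illustrative:

# Emitting structured JSON log lines with the standard library
import json
import logging
import sys
import time

logging.basicConfig(stream=sys.stdout, level=logging.INFO, format="%(message)s")
logger = logging.getLogger("order-service")

def log_event(level, message, **fields):
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "level": level,
        "service": "order-service",
        "message": message,
        **fields,                      # trace_id, order_id, duration_ms, ...
    }
    logger.log(getattr(logging, level), json.dumps(record))

log_event("ERROR", "Failed to process order",
          trace_id="abc-123", order_id="ORD-123", duration_ms=5023)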
RED Method Metrics
The RED method gives you three metrics that cover most of the day-to-day monitoring a request-driven service needs:
| Metric | What It Measures | Alert Threshold (Example) |
|---|---|---|
| Rate | Requests per second | Drop > 30% from baseline |
| Errors | Error rate (4xx + 5xx) | > 1% of total requests |
| Duration | Latency distribution | p99 > 500ms |
Rate: ████████████████████ 2,400 req/s
Errors: ██ 0.3% (OK)
Duration: p50=12ms p95=89ms p99=210ms (OK)
Alerting principle: Alert on symptoms (high latency, elevated error rate), not causes (high CPU). High CPU is fine if latency is normal. Low CPU is bad if the service is deadlocked.
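A sketch of RED instrumentation with the prometheus_client library; the handler is assumed to return an object with a status_code attribute, as most Python web frameworks provide:

# RED metrics with prometheus_client
import time
from prometheus_client import Counter, Histogram

REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])        # Rate + Errors
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")  # Duration

def instrumented(handler):
    def wrapper(*args, **kwargs):
        start = time.monotonic()
        try:
            response = handler(*args, **kwargs)
            REQUESTS.labels(status=str(response.status_code)).inc()
            return response
        except Exception:
            REQUESTS.labels(status="500").inc()   # unhandled failure counts as a server error
            raise
        finally:
            LATENCY.observe(time.monotonic() - start)
    return wrapper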
Chaos Engineering
Chaos engineering proactively injects failures to find weaknesses before they cause outages.
Principles
- Define steady state: Measure normal behavior (throughput, error rate, latency)
- Hypothesize: "If we kill one database replica, the system will failover within 30 seconds with zero errors"
- Inject failure: Kill the replica in production (or staging)
- Observe: Did the system behave as hypothesized?
- Minimize blast radius: Start small (one instance), expand gradually
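A minimal harness for a pod-kill experiment following these steps, assuming kubectl access and a hypothetical get_error_rate() helper that queries your metrics backend:

# Pod-kill chaos experiment (illustrative)
import random
import subprocess
import time

def run_pod_kill_experiment(pod_names, max_error_rate_increase=0.01):
    baseline = get_error_rate()                        # 1. measure steady state (hypothetical helper)
    victim = random.choice(pod_names)                  # 5. blast radius: a single pod
    subprocess.run(["kubectl", "delete", "pod", victim], check=True)   # 3. inject failure
    time.sleep(60)                                     # let the system react and recover
    observed = get_error_rate()                        # 4. observe
    # 2. hypothesis: the error rate stays close to the baseline during failover
    assert observed - baseline <= max_error_rate_increase, "hypothesis violated"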
Common Experiments
| Experiment | What It Tests |
|---|---|
| Kill a random pod | Auto-restart, load balancing |
| Add 500ms network latency | Timeout settings, circuit breakers |
| Fill disk to 95% | Log rotation, alerting |
| Block access to a dependency | Fallback behavior, graceful degradation |
| Inject clock skew | Time-dependent logic, certificate validation |
Tools: Netflix Chaos Monkey (random instance termination), Gremlin (commercial chaos platform), Litmus (Kubernetes-native chaos), tc (Linux traffic control for network experiments).
Interview framework: When discussing reliability, walk through: "We set SLOs, measure with SLIs, use error budgets to balance velocity and reliability, protect services with circuit breakers and retries, observe with traces and metrics, and validate our assumptions with chaos engineering."
You have completed all three lessons in this module. Take the quiz to test your knowledge of distributed systems, concurrency, and reliability engineering.