Distributed Systems & Reliability

Reliability & Observability

4 min read

A distributed backend that cannot measure its own health is flying blind. This lesson covers the SRE toolkit that interviewers expect senior candidates to know: SLOs, circuit breakers, retry strategies, distributed tracing, and chaos engineering.

SLOs, SLIs, and SLAs

These three terms form a hierarchy from measurement to promise:

Term                           What It Is                                Example
SLI (Service Level Indicator)  A measured metric                         p99 latency = 187ms
SLO (Service Level Objective)  A target for the SLI                      99.9% of requests < 200ms
SLA (Service Level Agreement)  A contractual commitment with penalties   "If uptime falls below 99.95%, customer gets 10% credit"
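
As a concrete illustration, here is a minimal sketch of computing an availability SLI from request samples and checking it against an SLO; the sample data and field names are hypothetical.

# Computing an availability SLI and comparing it to an SLO (sample data is hypothetical)
requests = [
    {"status": 200, "latency_ms": 45},
    {"status": 200, "latency_ms": 187},
    {"status": 503, "latency_ms": 30000},
    {"status": 200, "latency_ms": 92},
]

slo = 0.999                                              # objective: 99.9% of requests succeed
successes = sum(1 for r in requests if r["status"] < 500)
sli = successes / len(requests)                          # the measured indicator

print(f"SLI: {sli:.3%}  SLO: {slo:.1%}  met: {sli >= slo}")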

Availability Levels Table

Target                 Yearly Downtime  Monthly Downtime  Typical Use
99% (two nines)        3.65 days        7.3 hours         Internal tools, batch jobs
99.9% (three nines)    8.76 hours       43.8 minutes      Most SaaS products
99.99% (four nines)    52.6 minutes     4.38 minutes      Payment systems, core APIs
99.999% (five nines)   5.26 minutes     26.3 seconds      DNS, CDN edge, pacemakers

Error Budgets

The error budget is the allowed amount of unreliability: error budget = 1 - SLO.

With a 99.9% availability SLO (the arithmetic is worked through in the sketch after this list):

  • Error budget = 0.1% = 8.76 hours/year of downtime
  • If you consume 6 hours in Q1 from an incident, you have 2.76 hours left
  • If the budget is exhausted, freeze deployments and focus on reliability
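
The same arithmetic in a few lines of code, with the incident figures hardcoded for the example above:

# Error budget arithmetic for a 99.9% availability SLO
HOURS_PER_YEAR = 365 * 24                          # 8,760

slo = 0.999
budget_hours = (1 - slo) * HOURS_PER_YEAR          # ~8.76 hours/year of allowed downtime

consumed_hours = 6.0                               # e.g., one bad incident in Q1
remaining_hours = budget_hours - consumed_hours    # ~2.76 hours left

print(f"Budget: {budget_hours:.2f}h  Consumed: {consumed_hours}h  Remaining: {remaining_hours:.2f}h")
if remaining_hours <= 0:
    print("Budget exhausted: freeze feature deployments, focus on reliability work")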

Interview tip: Error budgets align incentives — product teams want to ship fast, SRE teams want stability. The error budget gives both sides a shared number to negotiate around.

Circuit Breaker Pattern

A circuit breaker prevents cascading failures by fast-failing requests to an unhealthy dependency instead of letting them time out and exhaust resources.

State Machine

  ┌────────────┐
  │   CLOSED   │ ◄────────────────────────────────┐
  │  (normal)  │                                  │
  └─────┬──────┘                                  │
        │ failures >= threshold                   │ successes >= threshold
        ▼                                         │
  ┌────────────┐                           ┌──────┴──────┐
  │    OPEN    │    timeout elapsed        │  HALF-OPEN  │
  │(fail fast) │ ─────────────────────────►│  (testing)  │
  └────────────┘                           └──────┬──────┘
        ▲                                         │
        └────────────── any failure ──────────────┘

States:

  1. Closed (normal): Requests pass through. Count failures. If failures reach the threshold (e.g., 5 failures in 60 seconds), transition to Open.
  2. Open (fast-fail): All requests immediately return an error or fallback response. No calls to the dependency. After a timeout period (e.g., 30 seconds), transition to Half-Open.
  3. Half-Open (testing): Allow a limited number of requests through. If they succeed, transition to Closed. If any fail, transition back to Open.

# Circuit breaker logic (runnable sketch; single-threaded, not production-hardened)
import time


class CircuitOpenError(Exception):
    """Raised when the circuit is OPEN and calls are being fast-failed."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, timeout_duration=30, success_threshold=3):
        self.state = "CLOSED"
        self.failure_count = 0
        self.success_count = 0
        self.failure_threshold = failure_threshold    # failures before opening
        self.timeout_duration = timeout_duration      # seconds to stay OPEN
        self.success_threshold = success_threshold    # successes needed to close again
        self.last_failure_time = None

    def call(self, func):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.timeout_duration:
                # Timeout elapsed: allow a limited number of trial requests through
                self.state = "HALF_OPEN"
                self.success_count = 0
            else:
                raise CircuitOpenError("Service unavailable")

        try:
            result = func()
        except Exception:
            self.on_failure()
            raise
        else:
            self.on_success()
            return result

    def on_success(self):
        if self.state == "HALF_OPEN":
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = "CLOSED"
                self.failure_count = 0
        else:
            self.failure_count = 0    # a healthy call resets the failure streak

    def on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.state == "HALF_OPEN" or self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
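
A quick usage sketch of the class above, wrapping a hypothetical call_payment_service() stand-in:

# Usage sketch (call_payment_service is a hypothetical stand-in for a real network call)
breaker = CircuitBreaker()

def call_payment_service():
    raise ConnectionError("payment gateway timeout")   # simulate a failing dependency

try:
    result = breaker.call(call_payment_service)
except CircuitOpenError:
    result = {"status": "degraded"}    # fast-fail path: serve a fallback response
except ConnectionError:
    result = {"status": "error"}       # the call failed; the breaker counted the failure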

Libraries: resilience4j (Java), gobreaker (Go), pybreaker (Python). Hystrix (Netflix) is in maintenance mode — use resilience4j instead.

Retry Strategies

Exponential Backoff with Jitter

Naive retries cause a thundering herd: thousands of clients retry at the same moment and overwhelm the recovering service.

  delay = base * 2^attempt

  Attempt 0: 1s
  Attempt 1: 2s
  Attempt 2: 4s
  Attempt 3: 8s
  Attempt 4: 16s (capped at max_delay)

Full jitter randomizes the delay to spread retries across time:

# Exponential backoff with full jitter
import random
import time


class RetryableError(Exception):
    """Transient failure that is safe to retry (e.g., timeout, 503)."""


def retry_with_backoff(func, max_retries=4, base_delay=1.0, max_delay=30.0):
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_retries:
                raise
            exp_delay = min(base_delay * (2 ** attempt), max_delay)
            # Full jitter: sleep a random amount in [0, exp_delay)
            time.sleep(random.uniform(0, exp_delay))

Retry + Circuit Breaker Integration

  • Retries handle transient failures (network blips, brief overloads)
  • Circuit breakers handle sustained failures (dependency is down)
  • Together: retry 2-3 times, then if the circuit opens, stop retrying immediately (see the combined sketch below)
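
A minimal sketch of this composition, reusing the CircuitBreaker, call_payment_service, and retry_with_backoff sketches from above; the wiring shown is illustrative, not a library API:

# Retry wrapped around a circuit breaker (illustrative composition of the sketches above)
breaker = CircuitBreaker()

def guarded_call():
    try:
        return breaker.call(call_payment_service)
    except CircuitOpenError:
        raise                                    # circuit open: stop retrying immediately
    except ConnectionError as exc:
        raise RetryableError(str(exc)) from exc  # transient failure: let the retry loop handle it

result = retry_with_backoff(guarded_call, max_retries=2)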

Interview tip: Always mention both together. Retries without a circuit breaker will bombard a failing service. A circuit breaker without retries gives up too easily on transient errors.

Health Checks

In container orchestration (Kubernetes), three types of probes ensure traffic only reaches healthy pods:

Probe      Question                         Action on Failure
Liveness   Is the process alive?            Kill and restart the container
Readiness  Can it serve traffic?            Remove from load balancer, stop sending requests
Startup    Has it finished initializing?    Do not run liveness/readiness until startup succeeds

# Kubernetes health check configuration
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3    # restart after 3 consecutive failures

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2    # remove from service after 2 failures

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # allow up to 5 min for startup (30 * 10s)
  periodSeconds: 10

Key design decisions:

  • Liveness: Check only process health (can respond to HTTP, not stuck in deadlock). Do not check dependencies — if the database is down, restarting the app will not fix it.
  • Readiness: Check dependencies (database connection, cache warmth, config loaded). When not ready, the pod stays running but stops receiving traffic (see the endpoint sketch below).
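
A minimal sketch of the two endpoints, here written with Flask; check_database() is a hypothetical helper standing in for a real dependency check:

# Liveness vs. readiness endpoints (Flask sketch; check_database is hypothetical)
from flask import Flask, jsonify

app = Flask(__name__)

def check_database():
    # Placeholder: e.g., run "SELECT 1" against the primary with a short timeout
    return True

@app.route("/healthz")
def healthz():
    # Liveness: only prove the process can serve HTTP; do not check dependencies here
    return jsonify(status="ok"), 200

@app.route("/ready")
def ready():
    # Readiness: check dependencies; a 503 takes the pod out of the load balancer
    if check_database():
        return jsonify(status="ready"), 200
    return jsonify(status="not ready"), 503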

Distributed Tracing (OpenTelemetry)

A trace follows a single request across multiple services:

  Trace ID: abc-123

  Service A ──────────────────────────────────────────┐
  │ Span: HTTP GET /orders (200ms total)              │
  │                                                   │
  │   Service B ───────────────────────┐              │
  │   │ Span: DB Query (50ms)          │              │
  │   └────────────────────────────────┘              │
  │                                                   │
  │   Service C ───────────────────────────────┐      │
  │   │ Span: Cache Lookup (5ms)               │      │
  │   │ Span: External API call (120ms)        │      │
  │   └────────────────────────────────────────┘      │
  └───────────────────────────────────────────────────┘

Core concepts:

  • Trace: The entire journey of a request (one trace ID)
  • Span: A single unit of work within a trace (has span_id, parent_span_id, duration, status)
  • SpanContext: Metadata propagated across services (trace_id, span_id, trace_flags)

Context propagation: SpanContext is injected into HTTP headers (e.g., traceparent: 00-abc123-def456-01) so downstream services can continue the trace.
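
A minimal sketch with the OpenTelemetry Python API (exporter and SDK setup omitted; the span and attribute names are illustrative):

# Creating a span and propagating its context (OpenTelemetry Python sketch)
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer("order-service")

def get_order(order_id):
    with tracer.start_as_current_span("HTTP GET /orders") as span:
        span.set_attribute("order.id", order_id)
        headers = {}
        inject(headers)   # adds the W3C traceparent header for downstream services
        # e.g., pass `headers` on the outgoing HTTP call so Service B continues the trace
        ...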

Visualization tools: Jaeger, Zipkin, Grafana Tempo

Structured Logging

Unstructured logs ("Error processing order 123") are almost impossible to query at scale. Structured logging emits JSON that can be indexed and searched.

{
  "timestamp": "2026-02-12T10:30:00Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc-123",
  "span_id": "def-456",
  "correlation_id": "req-789",
  "message": "Failed to process order",
  "order_id": "ORD-123",
  "error": "payment gateway timeout",
  "duration_ms": 5023
}

Key fields: trace_id (links to distributed trace), correlation_id (links to business transaction), level (DEBUG, INFO, WARN, ERROR).
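
One minimal way to emit such records is to JSON-serialize a dict through the standard logging module; in practice you would typically use a JSON formatter or a library such as structlog. The field names below mirror the example above:

# Emitting a structured log line as JSON (stdlib-only sketch)
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("order-service")

def log_event(level, message, **fields):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": logging.getLevelName(level),
        "service": "order-service",
        "message": message,
        **fields,   # trace_id, correlation_id, order_id, ...
    }
    logger.log(level, json.dumps(record))

log_event(logging.ERROR, "Failed to process order",
          trace_id="abc-123", order_id="ORD-123", error="payment gateway timeout")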

RED Method Metrics

The RED method gives you three request-level metrics that cover most of what you need to monitor for a request-driven service:

Metric    What It Measures        Alert Threshold (Example)
Rate      Requests per second     Drop > 30% from baseline
Errors    Error rate (4xx + 5xx)  > 1% of total requests
Duration  Latency distribution    p99 > 500ms

  Rate:     ████████████████████ 2,400 req/s
  Errors:   ██                   0.3% (OK)
  Duration: p50=12ms  p95=89ms  p99=210ms  (OK)

Alerting principle: Alert on symptoms (high latency, elevated error rate), not causes (high CPU). High CPU is fine if latency is normal. Low CPU is bad if the service is deadlocked.
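
A sketch of deriving the three numbers, and alerting on symptoms, from a short window of request records; the sample data and thresholds are illustrative:

# Computing RED metrics over a one-minute window (illustrative sketch)
# `window` holds (status_code, latency_ms) pairs for the last 60 seconds.
window = [(200, 12), (200, 15), (500, 89), (200, 210), (404, 8)]
WINDOW_SECONDS = 60

rate = len(window) / WINDOW_SECONDS                                   # requests per second
error_rate = sum(1 for status, _ in window if status >= 400) / len(window)
latencies = sorted(latency for _, latency in window)
p99 = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))]  # crude p99

print(f"Rate: {rate:.2f} req/s  Errors: {error_rate:.1%}  p99: {p99}ms")
if error_rate > 0.01 or p99 > 500:
    print("ALERT: symptom threshold breached")   # alert on symptoms, not on CPU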

Chaos Engineering

Chaos engineering proactively injects failures to find weaknesses before they cause outages.

Principles

  1. Define steady state: Measure normal behavior (throughput, error rate, latency)
  2. Hypothesize: "If we kill one database replica, the system will failover within 30 seconds with zero errors"
  3. Inject failure: Kill the replica in production (or staging)
  4. Observe: Did the system behave as hypothesized?
  5. Minimize blast radius: Start small (one instance), expand gradually

Common Experiments

Experiment                    What It Tests
Kill a random pod             Auto-restart, load balancing
Add 500ms network latency     Timeout settings, circuit breakers
Fill disk to 95%              Log rotation, alerting
Block access to a dependency  Fallback behavior, graceful degradation
Inject clock skew             Time-dependent logic, certificate validation

Tools: Netflix Chaos Monkey (random instance termination), Gremlin (commercial chaos platform), Litmus (Kubernetes-native chaos), tc (Linux traffic control for network experiments).
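
Even without a dedicated tool, a small experiment can be scripted. A minimal sketch against Kubernetes, where steady_state_ok() and the app=order-service label selector are hypothetical:

# Chaos experiment sketch: kill one random pod, then check the hypothesis
# steady_state_ok() and the "app=order-service" label selector are hypothetical.
import random
import subprocess
import time

def steady_state_ok():
    # Placeholder: query your metrics backend for error rate / latency vs. the baseline
    return True

assert steady_state_ok(), "System not in steady state; aborting experiment"

pods = subprocess.run(
    ["kubectl", "get", "pods", "-l", "app=order-service", "-o", "name"],
    capture_output=True, text=True, check=True,
).stdout.split()

victim = random.choice(pods)                  # minimize blast radius: one pod only
subprocess.run(["kubectl", "delete", victim], check=True)

time.sleep(60)                                # give the system time to recover
print("Hypothesis held" if steady_state_ok() else "Weakness found: investigate before expanding")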

Interview framework: When discussing reliability, walk through: "We set SLOs, measure with SLIs, use error budgets to balance velocity and reliability, protect services with circuit breakers and retries, observe with traces and metrics, and validate our assumptions with chaos engineering."

You have completed all three lessons in this module. Take the quiz to test your knowledge of distributed systems, concurrency, and reliability engineering.
