Distributed Systems & Reliability

Reliability & Observability

A distributed backend that cannot measure its own health is flying blind. This lesson covers the SRE toolkit that interviewers expect senior candidates to know: SLOs, circuit breakers, retry strategies, distributed tracing, and chaos engineering.

SLOs, SLIs, and SLAs

These three terms form a hierarchy from measurement to promise:

| Term | What It Is | Example |
| --- | --- | --- |
| SLI (Service Level Indicator) | A measured metric | p99 latency = 187ms |
| SLO (Service Level Objective) | A target for the SLI | 99.9% of requests < 200ms |
| SLA (Service Level Agreement) | A contractual commitment with penalties | "If uptime falls below 99.95%, customer gets 10% credit" |

Availability Levels

| Target | Yearly Downtime | Monthly Downtime | Typical Use |
| --- | --- | --- | --- |
| 99% (two nines) | 3.65 days | 7.3 hours | Internal tools, batch jobs |
| 99.9% (three nines) | 8.76 hours | 43.8 minutes | Most SaaS products |
| 99.99% (four nines) | 52.6 minutes | 4.38 minutes | Payment systems, core APIs |
| 99.999% (five nines) | 5.26 minutes | 26.3 seconds | DNS, CDN edge, pacemakers |

Error Budgets

The error budget is the allowed amount of unreliability: error budget = 1 - SLO.

With a 99.9% availability SLO:

  • Error budget = 0.1% = 8.76 hours/year of downtime
  • If you consume 6 hours in Q1 from an incident, you have 2.76 hours left
  • If the budget is exhausted, freeze deployments and focus on reliability
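The budget arithmetic above can be sketched in a few lines (the helper name is ours, not from any library):

```python
# Yearly downtime budget implied by an availability SLO.
def error_budget_hours(slo: float, hours_per_year: float = 8760.0) -> float:
    return (1.0 - slo) * hours_per_year

budget = error_budget_hours(0.999)   # 8.76 hours/year for three nines
remaining = budget - 6.0             # budget left after a 6-hour incident
```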

Interview tip: Error budgets align incentives — product teams want to ship fast, SRE teams want stability. The error budget gives both sides a shared number to negotiate around.

Circuit Breaker Pattern

A circuit breaker prevents cascading failures by fast-failing requests to an unhealthy dependency instead of letting them time out and exhaust resources.

State Machine

  ┌─────────┐    failures >= threshold     ┌──────┐
  │ CLOSED  │ ────────────────────────────►│ OPEN │
  │ (normal)│                              │(fail │
  └─────────┘                              │ fast)│
       ▲                                   └──┬───┘
       │                                      │ timeout
       │ successes >= success_threshold       ▼
       │                                ┌───────────┐
       └────────────────────────────────│ HALF-OPEN │
                                        │ (testing) │
                                        └─────┬─────┘
                                              │ any failure
                                              └─────────────► OPEN

States:

  1. Closed (normal): Requests pass through. Count failures. If failures reach the threshold (e.g., 5 failures in 60 seconds), transition to Open.
  2. Open (fast-fail): All requests immediately return an error or fallback response. No calls to the dependency. After a timeout period (e.g., 30 seconds), transition to Half-Open.
  3. Half-Open (testing): Allow a limited number of requests through. If they succeed, transition to Closed. If any fail, transition back to Open.

# Pseudocode: circuit breaker logic
class CircuitBreaker:
    def __init__(self):
        self.state = "CLOSED"
        self.failure_count = 0
        self.success_count = 0
        self.failure_threshold = 5
        self.timeout_duration = 30  # seconds
        self.success_threshold = 3
        self.last_failure_time = None

    def call(self, func):
        if self.state == "OPEN":
            if time_since(self.last_failure_time) > self.timeout_duration:
                self.state = "HALF_OPEN"
                self.success_count = 0
            else:
                raise CircuitOpenError("Service unavailable")

        try:
            result = func()
            self.on_success()
            return result
        except Exception:
            self.on_failure()
            raise

    def on_success(self):
        if self.state == "HALF_OPEN":
            self.success_count += 1
            if self.success_count >= self.success_threshold:
                self.state = "CLOSED"
                self.failure_count = 0

    def on_failure(self):
        self.last_failure_time = now()
        if self.state == "HALF_OPEN":
            self.state = "OPEN"  # any failure while testing reopens immediately
            return
        self.failure_count += 1
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"

Libraries: resilience4j (Java), gobreaker (Go), pybreaker (Python). Hystrix (Netflix) is in maintenance mode — use resilience4j instead.

Retry Strategies

Exponential Backoff with Jitter

Naive retries cause thundering herd — thousands of clients retry simultaneously, overwhelming the recovering service.

  delay = base * 2^attempt

  Attempt 0: 1s
  Attempt 1: 2s
  Attempt 2: 4s
  Attempt 3: 8s
  Attempt 4: 16s (capped at max_delay)

Full jitter randomizes the delay to spread retries across time:

# Exponential backoff with full jitter
import random
import time

class RetryableError(Exception):
    """Transient failure worth retrying (timeouts, 503s, ...)."""

def retry_with_backoff(func, max_retries=4, base_delay=1.0, max_delay=30.0):
    for attempt in range(max_retries + 1):
        try:
            return func()
        except RetryableError:
            if attempt == max_retries:
                raise
            exp_delay = min(base_delay * (2 ** attempt), max_delay)
            jittered_delay = random.uniform(0, exp_delay)  # full jitter
            time.sleep(jittered_delay)

Retry + Circuit Breaker Integration

  • Retries handle transient failures (network blips, brief overloads)
  • Circuit breakers handle sustained failures (dependency is down)
  • Together: retry 2-3 times, then if the circuit opens, stop retrying immediately
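A minimal sketch of the combination, assuming a toy breaker rather than a real library (`Breaker` and `call_with_retry` are illustrative names):

```python
import random
import time

class Breaker:
    """Toy breaker: opens after `threshold` consecutive failures."""
    def __init__(self, threshold=5):
        self.failures = 0
        self.threshold = threshold

    @property
    def open(self):
        return self.failures >= self.threshold

def call_with_retry(func, breaker, max_retries=3, base_delay=0.1):
    for attempt in range(max_retries + 1):
        if breaker.open:
            # Circuit open: stop retrying immediately, fail fast.
            raise RuntimeError("circuit open: failing fast")
        try:
            result = func()
            breaker.failures = 0   # success closes the loop
            return result
        except ConnectionError:
            breaker.failures += 1
            if attempt == max_retries:
                raise
            # Full-jitter backoff between retries.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Note how the breaker check sits *inside* the retry loop: once the dependency is declared down, remaining retries are skipped instead of hammering it.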

Interview tip: Always mention both together. Retries without a circuit breaker will bombard a failing service. A circuit breaker without retries gives up too easily on transient errors.

Health Checks

In container orchestration (Kubernetes), three types of probes ensure traffic only reaches healthy pods:

| Probe | Question | Action on Failure |
| --- | --- | --- |
| Liveness | Is the process alive? | Kill and restart the container |
| Readiness | Can it serve traffic? | Remove from load balancer, stop sending requests |
| Startup | Has it finished initializing? | Do not run liveness/readiness until startup succeeds |

# Kubernetes health check configuration
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 15
  failureThreshold: 3    # restart after 3 consecutive failures

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 2    # remove from service after 2 failures

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # allow up to 5 min for startup (30 * 10s)
  periodSeconds: 10

Key design decisions:

  • Liveness: Check only process health (can respond to HTTP, not stuck in deadlock). Do not check dependencies — if the database is down, restarting the app will not fix it.
  • Readiness: Check dependencies (database connection, cache warmth, config loaded). When not ready, the pod stays running but stops receiving traffic.
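The split can be expressed as two tiny handlers (illustrative only; the endpoint paths mirror the YAML above, and the dependency flags are hypothetical):

```python
# Liveness: only asks "can the process respond?" -- never check dependencies here.
def healthz() -> tuple[int, str]:
    return 200, "ok"

# Readiness: gates traffic on the dependencies the pod needs to serve requests.
def ready(db_connected: bool, config_loaded: bool) -> tuple[int, str]:
    if db_connected and config_loaded:
        return 200, "ready"
    return 503, "not ready"
```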

Distributed Tracing (OpenTelemetry)

A trace follows a single request across multiple services:

  Trace ID: abc-123

  Service A ─────────────────────────────────────────
  │ Span: HTTP GET /orders (200ms total)             │
  │                                                   │
  │  Service B ──────────────────────────             │
  │  │ Span: DB Query (50ms)            │             │
  │  └──────────────────────────────────┘             │
  │                                                   │
  │  Service C ────────────────────────────────       │
  │  │ Span: Cache Lookup (5ms)                │      │
  │  │ Span: External API call (120ms)         │      │
  │  └────────────────────────────────────────┘      │
  └───────────────────────────────────────────────────┘

Core concepts:

  • Trace: The entire journey of a request (one trace ID)
  • Span: A single unit of work within a trace (has span_id, parent_span_id, duration, status)
  • SpanContext: Metadata propagated across services (trace_id, span_id, trace_flags)

Context propagation: SpanContext is injected into HTTP headers (e.g., traceparent: 00-abc123-def456-01) so downstream services can continue the trace.
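A hand-rolled sketch of that propagation, for illustration only (real services use an OpenTelemetry SDK propagator rather than string-formatting headers themselves):

```python
# W3C traceparent header layout: version-trace_id-parent_span_id-flags
def inject(headers: dict, trace_id: str, span_id: str, sampled: bool = True) -> dict:
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"
    return headers

def extract(headers: dict) -> tuple[str, str, bool]:
    _version, trace_id, span_id, flags = headers["traceparent"].split("-")
    return trace_id, span_id, flags == "01"
```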

Visualization tools: Jaeger, Zipkin, Grafana Tempo

Structured Logging

Unstructured logs ("Error processing order 123") are almost impossible to query at scale. Structured logging emits JSON that can be indexed and searched.

{
  "timestamp": "2026-02-12T10:30:00Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc-123",
  "span_id": "def-456",
  "correlation_id": "req-789",
  "message": "Failed to process order",
  "order_id": "ORD-123",
  "error": "payment gateway timeout",
  "duration_ms": 5023
}

Key fields: trace_id (links to distributed trace), correlation_id (links to business transaction), level (DEBUG, INFO, WARN, ERROR).
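A minimal emitter for records like the one above (field names are illustrative; production code would use a structured-logging library such as structlog):

```python
import json
from datetime import datetime, timezone

def log_event(level: str, message: str, **fields) -> str:
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "message": message,
        **fields,   # trace_id, correlation_id, order_id, duration_ms, ...
    }
    return json.dumps(record)
```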

RED Method Metrics

The RED method gives you three metrics that cover 90% of service monitoring:

| Metric | What It Measures | Alert Threshold (Example) |
| --- | --- | --- |
| Rate | Requests per second | Drop > 30% from baseline |
| Errors | Error rate (4xx + 5xx) | > 1% of total requests |
| Duration | Latency distribution | p99 > 500ms |

  Rate:     ████████████████████ 2,400 req/s
  Errors:   ██                   0.3% (OK)
  Duration: p50=12ms  p95=89ms  p99=210ms  (OK)

Alerting principle: Alert on symptoms (high latency, elevated error rate), not causes (high CPU). High CPU is fine if latency is normal. Low CPU is bad if the service is deadlocked.
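Computing the three numbers from a window of request records can be sketched as below (real systems use histograms in a metrics backend like Prometheus; the naive index-based p99 here is for illustration):

```python
def red_metrics(requests: list[tuple[int, float]], window_seconds: float):
    """requests: (status_code, duration_ms) pairs observed in the window."""
    rate = len(requests) / window_seconds
    error_rate = sum(1 for status, _ in requests if status >= 400) / len(requests)
    durations = sorted(d for _, d in requests)
    p99 = durations[min(len(durations) - 1, int(0.99 * len(durations)))]
    return rate, error_rate, p99
```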

Chaos Engineering

Chaos engineering proactively injects failures to find weaknesses before they cause outages.

Principles

  1. Define steady state: Measure normal behavior (throughput, error rate, latency)
  2. Hypothesize: "If we kill one database replica, the system will failover within 30 seconds with zero errors"
  3. Inject failure: Kill the replica in production (or staging)
  4. Observe: Did the system behave as hypothesized?
  5. Minimize blast radius: Start small (one instance), expand gradually
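The loop above can be condensed into a tiny harness sketch (`measure` and `inject` are hypothetical callables you supply per experiment):

```python
def run_experiment(measure, inject, tolerance: float) -> bool:
    baseline = measure()      # 1. define steady state
    inject()                  # 3. inject the failure
    observed = measure()      # 4. observe the result
    return abs(observed - baseline) <= tolerance   # did the hypothesis hold?
```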

Common Experiments

| Experiment | What It Tests |
| --- | --- |
| Kill a random pod | Auto-restart, load balancing |
| Add 500ms network latency | Timeout settings, circuit breakers |
| Fill disk to 95% | Log rotation, alerting |
| Block access to a dependency | Fallback behavior, graceful degradation |
| Inject clock skew | Time-dependent logic, certificate validation |

Tools: Netflix Chaos Monkey (random instance termination), Gremlin (commercial chaos platform), Litmus (Kubernetes-native chaos), tc (Linux traffic control for network experiments).

Interview framework: When discussing reliability, walk through: "We set SLOs, measure with SLIs, use error budgets to balance velocity and reliability, protect services with circuit breakers and retries, observe with traces and metrics, and validate our assumptions with chaos engineering."

You have completed all three lessons in this module. Take the quiz to test your knowledge of distributed systems, concurrency, and reliability engineering.
