Distributed Systems & Reliability

Consistency, Availability & Consensus

Distributed systems are the backbone of every large-scale backend. This lesson covers the fundamental theorems, consistency models, and consensus algorithms that interviewers at L5+ expect you to reason about fluently.

CAP Theorem Deep Dive

The CAP theorem (Brewer, 2000) states that a distributed data store can only guarantee two of three properties simultaneously:

  • Consistency (C): Every read receives the most recent write or an error
  • Availability (A): Every request receives a non-error response (no guarantee it contains the most recent write)
  • Partition Tolerance (P): The system continues operating despite network partitions between nodes

Partition Tolerance Is Not Optional

In any distributed system, network partitions will happen. Cables get cut, switches fail, cloud availability zones lose connectivity. Since you cannot avoid partitions, the real choice is between CP and AP when a partition occurs.

                    CAP Theorem

        C ─────────────────────── A
        │         You can         │
        │         pick two        │
        │                         │
        │     ┌──────────┐        │
        │     │    CA    │        │
        │     │ (single  │        │
        │     │  node    │        │
        │     │  only)   │        │
        │     └──────────┘        │
        │                         │
        └─────────── P ───────────┘
              (always present
               in distributed
               systems)

CP Systems — Choose Consistency Over Availability

When a partition occurs, CP systems reject requests rather than return stale data.

  • Google Spanner: TrueTime API (GPS + atomic clocks) provides external consistency. Commits wait out the clock-uncertainty interval, guaranteeing global ordering.
  • HBase: Each region is served by a single RegionServer at a time. During a partition, regions on the wrong side become unavailable.
  • ZooKeeper: Leader-based consensus (ZAB protocol). Writes go through the leader; a partitioned minority cannot serve writes.
  • etcd: Raft consensus. Writes require a majority quorum; the minority partition becomes read-only.
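
What these CP systems have in common is a majority-quorum rule: a write is acknowledged only if more than half of the nodes accept it, so a partitioned minority cannot make progress. A minimal sketch of that rule, assuming a hypothetical send_append(node, key, value) transport that returns True on acknowledgement and raises TimeoutError when a node is unreachable:

  # CP sketch: a write commits only with a majority quorum; otherwise it is
  # rejected rather than applied to a minority (which could go stale).
  class QuorumWriteError(Exception):
      pass

  def quorum_write(replicas, key, value, send_append):
      acks = 1  # the leader counts itself
      for node in replicas:
          try:
              if send_append(node, key, value):
                  acks += 1
          except TimeoutError:
              continue  # node is on the other side of the partition
      cluster_size = len(replicas) + 1          # followers plus the leader
      if acks > cluster_size // 2:
          return True                           # committed on a majority
      raise QuorumWriteError(f"only {acks}/{cluster_size} nodes reachable")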

AP Systems — Choose Availability Over Consistency

When a partition occurs, AP systems continue serving requests but may return stale or conflicting data.

  • DynamoDB: Sloppy quorum + hinted handoff. Writes succeed even if some replicas are unreachable; conflicts are resolved via last-writer-wins or application-level reconciliation.
  • Cassandra: Tunable consistency (ONE, QUORUM, ALL). At consistency level ONE it is AP: any single replica can serve reads and writes during a partition.
  • CouchDB: Multi-master replication. Accepts writes on any node during a partition, then detects and exposes conflicts for application resolution.
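
The AP side does the opposite: every replica accepts writes even when its peers are unreachable, and replicas reconcile after the partition heals. A minimal last-writer-wins sketch (the Replica class and wall-clock timestamps are illustrative, not any particular database's behavior):

  import time

  # AP sketch: always accept writes locally, reconcile later with
  # last-writer-wins (LWW). Wall-clock LWW is vulnerable to clock skew and
  # silently drops one of two conflicting writes.
  class Replica:
      def __init__(self):
          self.store = {}  # key -> (timestamp, value)

      def write(self, key, value):
          # Availability over consistency: never refuse a write.
          self.store[key] = (time.time(), value)

      def read(self, key):
          # May return stale data if this replica missed recent writes.
          return self.store.get(key, (None, None))[1]

      def merge(self, other):
          # Anti-entropy after the partition heals: newest timestamp wins.
          for key, (ts, value) in other.store.items():
              if key not in self.store or ts > self.store[key][0]:
                  self.store[key] = (ts, value)

Because last-writer-wins discards one of the conflicting writes, systems such as CouchDB instead keep both versions and hand the conflict to the application.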

CA Is a Myth in Distributed Systems

A CA system would require that no partitions ever occur. This is only possible on a single node (e.g., a single PostgreSQL instance). The moment you add a second node, you must handle partitions.

The Spectrum Reality

Most production systems are not binary CP or AP. They sit on a spectrum with tunable consistency:

  • Cassandra with QUORUM reads and writes behaves as CP for those operations (the read and write quorums overlap because R + W > N)
  • DynamoDB with strongly consistent reads enabled behaves as CP for reads
  • CockroachDB is CP by default, but stale follower reads trade consistency for lower latency and availability
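
The rule behind these tunable settings is quorum overlap: with N replicas, a read from R replicas is guaranteed to intersect a write to W replicas whenever R + W > N. A tiny illustrative helper:

  # Quorum-overlap rule behind Dynamo/Cassandra-style tunable consistency:
  # with N replicas, reads of size R see the latest write iff R + W > N.
  def is_strongly_consistent(n_replicas, read_quorum, write_quorum):
      return read_quorum + write_quorum > n_replicas

  print(is_strongly_consistent(3, 2, 2))  # QUORUM/QUORUM on N=3 -> True  (CP-like)
  print(is_strongly_consistent(3, 1, 1))  # ONE/ONE on N=3       -> False (AP)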

PACELC Theorem

CAP only describes behavior during a partition. The PACELC theorem (Abadi, 2012) extends it: when there is no partition (E), you must still choose between Latency and Consistency.

  If Partition (P):             Else (E):
    Choose A or C                 Choose L or C

  ┌──────────────────┐          ┌──────────────────┐
  │  PA: Available   │          │  EL: Low Latency │
  │  PC: Consistent  │          │  EC: Consistent  │
  └──────────────────┘          └──────────────────┘

  System        During a Partition    No Partition (Else)                     PACELC
  DynamoDB      PA (available)        EL (low latency)                        PA/EL
  Cassandra     PA (available)        EL (low latency)                        PA/EL
  Spanner       PC (consistent)       EC (consistent)                         PC/EC
  MongoDB       PA (available)        EC (consistent)                         PA/EC
  CockroachDB   PC (consistent)       EL (low latency with follower reads)    PC/EL
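
The else branch of PACELC comes down to when a write is acknowledged: after replication (consistency, higher latency) or before it (low latency, possible staleness). A minimal sketch, assuming a hypothetical blocking replicate(node, key, value) call:

  import threading

  def write_ec(key, value, replicas, replicate):
      # EC: wait for every replica before acknowledging (consistent, slower).
      for node in replicas:
          replicate(node, key, value)
      return "acked"

  def write_el(key, value, replicas, replicate):
      # EL: acknowledge immediately, replicate in the background
      # (fast, but a read from a replica may briefly return stale data).
      for node in replicas:
          threading.Thread(target=replicate, args=(node, key, value), daemon=True).start()
      return "acked"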

Interview tip: When asked about CAP, mention PACELC to show deeper understanding. Most real systems optimize for the "no partition" case since partitions are rare.

Consistency Models Spectrum

From strongest to weakest:

  • Linearizability: Real-time ordering. Once a write completes, all subsequent reads (by any client) see it. Examples: Spanner, etcd.
  • Sequential consistency: All operations appear in some total order consistent with each client's program order. Example: ZooKeeper (per-session).
  • Causal consistency: Causally related operations are ordered; concurrent operations may appear in any order. Example: MongoDB (causal sessions).
  • Read-your-writes: A client always sees its own writes; other clients may see stale data. Example: DynamoDB (strong reads).
  • Eventual consistency: All replicas converge to the same state eventually; no ordering guarantees during convergence. Examples: DynamoDB (default), Cassandra (ONE), DNS.

Interview distinction: Linearizability is about real-time ordering (wall clock). Sequential consistency is about a logical total order consistent with program order. Many candidates conflate these two.
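
Read-your-writes, one of the weaker models above, is commonly implemented with a session token: the client remembers the version of its last write and refuses to read from a replica that has not caught up. A sketch under that assumption (the Replica and Session classes are illustrative, not a real API):

  # Read-your-writes via a client-side session token: the client tracks the
  # version of its last write and rejects replies from replicas that are older.
  class Replica:
      def __init__(self):
          self.version = 0
          self.data = {}

      def apply(self, version, key, value):
          self.version = version
          self.data[key] = value

  class Session:
      def __init__(self):
          self.last_written_version = 0

      def write(self, primary, key, value):
          new_version = primary.version + 1
          primary.apply(new_version, key, value)
          self.last_written_version = new_version   # remember what we wrote

      def read(self, replica, key):
          if replica.version < self.last_written_version:
              raise RuntimeError("replica too stale for this session; retry elsewhere")
          return replica.data.get(key)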

Raft Consensus Algorithm

Raft is the consensus algorithm most interviewers expect you to know. It is equivalent to Paxos in safety but designed to be understandable.

Three Roles

  • Leader: Handles all client requests, replicates log entries to followers
  • Follower: Passive — responds to leader's RPCs and votes in elections
  • Candidate: A follower that has started an election to become leader
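
The roles form a small state machine; each node also tracks the current term and whom it voted for in that term. A minimal illustrative sketch (field names follow the Raft paper; the class itself is not any real implementation):

  from enum import Enum

  class Role(Enum):
      FOLLOWER = "follower"
      CANDIDATE = "candidate"
      LEADER = "leader"

  class RaftNode:
      def __init__(self, node_id):
          self.node_id = node_id
          self.role = Role.FOLLOWER     # every node starts as a follower
          self.current_term = 0         # latest term this node has seen
          self.voted_for = None         # candidate voted for in current_term
          self.log = []                 # list of (term, command) entries

      def on_election_timeout(self):
          # No heartbeat from a leader: start an election.
          self.role = Role.CANDIDATE
          self.current_term += 1
          self.voted_for = self.node_id  # vote for self

      def on_higher_term(self, term):
          # Seeing a higher term in any RPC always demotes a node to follower.
          self.current_term = term
          self.voted_for = None
          self.role = Role.FOLLOWER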

Leader Election

  Term 1             Term 2               Term 3
  ┌────────┐         ┌────────┐           ┌────────┐
  │Leader A│──X──────│Election│───────────│Leader C│
  │        │ (crash) │  B, C  │           │        │
  └────────┘         │  vote  │           └────────┘
                     └────────┘

  Step-by-step:
  1. Leader A sends heartbeats every 150ms
  2. Leader A crashes — followers stop receiving heartbeats
  3. Follower C's election timeout fires (randomized: 150-300ms)
  4. C increments term to 2, transitions to Candidate, votes for itself
  5. C sends RequestVote RPCs to all other nodes
  6. Each node votes for at most ONE candidate per term
  7. C receives votes from majority (including itself) → becomes Leader
  8. C sends heartbeats to establish authority, preventing new elections

Key rules:

  • Election timeout is randomized (150-300ms) to avoid split votes
  • A node votes for the first candidate it hears from in a given term
  • A candidate must have a log at least as up-to-date as the voter's log (election restriction)
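
Those rules translate directly into the RequestVote handler: grant at most one vote per term, and only to a candidate whose log is at least as up-to-date (compare the last log term first, then the last log index). A sketch continuing the illustrative RaftNode above:

  # RequestVote receiver rules (illustrative): one vote per term, and only for
  # a candidate whose log is at least as up-to-date as ours.
  def last_log_info(log):
      if not log:
          return (0, 0)                 # (last term, last index) of an empty log
      return (log[-1][0], len(log))

  def handle_request_vote(node, candidate_id, candidate_term,
                          candidate_last_term, candidate_last_index):
      if candidate_term < node.current_term:
          return False                  # stale candidate: reject
      if candidate_term > node.current_term:
          node.on_higher_term(candidate_term)

      my_last_term, my_last_index = last_log_info(node.log)
      log_ok = (candidate_last_term, candidate_last_index) >= (my_last_term, my_last_index)

      if log_ok and node.voted_for in (None, candidate_id):
          node.voted_for = candidate_id  # at most one vote per term
          return True
      return False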

Log Replication

  Client         Leader          Follower 1      Follower 2
    │               │                │                │
    │──Write X=5───>│                │                │
    │               │──AppendEntries─>│                │
    │               │──AppendEntries──────────────────>│
    │               │                │                │
    │               │<──Success──────│                │
    │               │<──Success─────────────────────── │
    │               │                │                │
    │               │  Majority confirmed (2/3)       │
    │               │  Commit entry, advance commitIndex
    │<──OK──────────│                │                │
    │               │──Heartbeat (commitIndex)──────> │
    │               │                │  Apply entry   │

  1. Client sends write to Leader
  2. Leader appends entry to its log with current term number
  3. Leader sends AppendEntries RPCs to all followers
  4. When a majority acknowledges, the entry is committed
  5. Leader responds to client after commit
  6. Followers apply committed entries on the next heartbeat
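
The commit decision in step 4 is a counting rule over the indexes the followers have acknowledged, with the extra Raft restriction that a leader only commits entries from its own term. An illustrative sketch:

  # Leader-side commit rule (step 4 above): an index is committed once a
  # majority of the cluster stores it and the entry is from the current term.
  # match_index maps follower id -> highest log index known to be replicated there.
  def advance_commit_index(leader_log, current_term, commit_index,
                           match_index, cluster_size):
      for n in range(len(leader_log), commit_index, -1):    # try highest index first
          replicas = 1 + sum(1 for idx in match_index.values() if idx >= n)
          if replicas > cluster_size // 2 and leader_log[n - 1][0] == current_term:
              return n            # safe to apply and reply to the client
      return commit_index         # nothing new committed yet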

Safety Guarantees

  • Election restriction: Only a candidate with all committed entries can win an election
  • Log matching: If two logs contain an entry with the same index and term, all preceding entries are identical
  • Leader completeness: A committed entry is present in the log of every future leader

Paxos vs. Raft

  • Understandability: Raft is explicitly designed for clarity; Paxos is notoriously difficult to understand and implement.
  • Leader: Raft requires a single leader; basic Paxos can operate leaderless, though Multi-Paxos uses a stable leader for performance.
  • Phases: Raft decomposes the problem into leader election plus log replication; Paxos runs a Prepare/Promise phase and an Accept/Accepted phase per decision.
  • Practical use: Raft powers etcd, Consul, and CockroachDB; Paxos underlies Chubby (Google) and Spanner (Paxos groups).

Interview tip: "Raft is equivalent to Multi-Paxos in terms of safety and liveness, but it decomposes the problem into leader election, log replication, and safety — making it easier to reason about."

Vector Clocks and Lamport Timestamps

Lamport Timestamps

A logical clock that establishes happens-before ordering:

  Node A:  [1] ──────── [2]
                         │
                         │  send(msg, ts=2)
                         ▼
  Node B:  [1] ──────── [3] ──────── [4]
                                      │  send(msg, ts=4)
                                      ▼
  Node C:  [1] ──────── [2] ──────── [5]

  Rules:
  1. Before any event, increment local counter
  2. When sending: attach current timestamp
  3. When receiving: local = max(local, received) + 1

Limitation: If timestamp(A) < timestamp(B), you cannot conclude A happened before B. They might be concurrent. Lamport timestamps only guarantee: if A happened-before B, then timestamp(A) < timestamp(B) (the converse is not true).
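
The three rules above fit in a few lines; a minimal illustrative sketch:

  # Minimal Lamport clock implementing the three rules above.
  class LamportClock:
      def __init__(self):
          self.time = 0

      def local_event(self):
          self.time += 1            # rule 1: increment before any event
          return self.time

      def send(self):
          self.time += 1            # sending is itself an event
          return self.time          # rule 2: attach this timestamp to the message

      def receive(self, msg_timestamp):
          self.time = max(self.time, msg_timestamp) + 1   # rule 3
          return self.time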

Vector Clocks

Vector clocks track the logical time per node, which allows detecting concurrent events:

  Node A:  [1,0,0] ─── [2,0,0] ──────────── [3,2,0]
                                    ▲              │
                          send      │    receive   │
                                    │              ▼
  Node B:  [0,1,0] ─── [0,2,0] ───────────── [3,3,0]
                              ▼  send
  Node C:  [0,0,1] ─── [0,2,2]

  Comparison:
  - [2,0,0] vs [0,2,0] → CONCURRENT (neither dominates)
  - [2,0,0] vs [3,2,0] → [2,0,0] happened-before [3,2,0]
  - VC1 < VC2 iff every element of VC1 <= VC2 and at least one is strictly less
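
That comparison rule is short enough to write out in an interview; a minimal sketch using fixed-size lists with one slot per node (illustrative only):

  # Vector clock comparison: one integer per node, compared element-wise.
  def happened_before(a, b):
      # a -> b iff every component of a <= b and at least one is strictly less
      return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

  def concurrent(a, b):
      return not happened_before(a, b) and not happened_before(b, a)

  def merge(a, b):
      # On receive: element-wise max (the receiver then increments its own slot).
      return [max(x, y) for x, y in zip(a, b)]

  print(concurrent([2, 0, 0], [0, 2, 0]))        # True  -> conflict
  print(happened_before([2, 0, 0], [3, 2, 0]))   # True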

Amazon's Dynamo paper uses vector clocks for conflict detection: when two writes carry concurrent vector clocks, the system knows there is a conflict and can present both versions to the application for reconciliation.

Interview takeaway: Lamport timestamps give you a total ordering (ties broken by node ID) but cannot detect concurrency. Vector clocks detect concurrency but do not give a total ordering. Choose based on what you need.

Next: Concurrency & Multithreading — race conditions, deadlocks, and language-specific patterns that dominate coding rounds.
