Distributed Systems & Reliability
Consistency, Availability & Consensus
Distributed systems are the backbone of every large-scale backend. This lesson covers the fundamental theorems, consistency models, and consensus algorithms that interviewers at L5+ expect you to reason about fluently.
CAP Theorem Deep Dive
The CAP theorem (Brewer, 2000) states that a distributed data store can only guarantee two of three properties simultaneously:
- Consistency (C): Every read receives the most recent write or an error
- Availability (A): Every request receives a non-error response (no guarantee it contains the most recent write)
- Partition Tolerance (P): The system continues operating despite network partitions between nodes
Partition Tolerance Is Not Optional
In any distributed system, network partitions will happen. Cables get cut, switches fail, cloud availability zones lose connectivity. Since you cannot avoid partitions, the real choice is between CP and AP when a partition occurs.
          CAP Theorem

C ──────────────────────────── A
│                              │
│           You can            │
│           pick two           │
│                              │
│         ┌──────────┐         │
│         │    CA    │         │
│         │ (single  │         │
│         │   node   │         │
│         │   only)  │         │
│         └──────────┘         │
│                              │
└────────────── P ─────────────┘
         (always present
          in distributed
             systems)
CP Systems — Choose Consistency Over Availability
When a partition occurs, CP systems reject requests rather than return stale data.
| System | How It Achieves CP |
|---|---|
| Google Spanner | TrueTime API (GPS + atomic clocks) provides external consistency. Commits wait for uncertainty interval to pass, guaranteeing global ordering. |
| HBase | Single master for each region. During partition, regions on the wrong side become unavailable. |
| ZooKeeper | Leader-based consensus (ZAB protocol). Writes go through leader; if leader is partitioned, the minority partition cannot serve writes. |
| etcd | Raft consensus. Requires majority quorum for writes. Minority partition becomes read-only. |
AP Systems — Choose Availability Over Consistency
When a partition occurs, AP systems continue serving requests but may return stale or conflicting data.
| System | How It Achieves AP |
|---|---|
| DynamoDB | Sloppy quorum + hinted handoff. Writes succeed even if some replicas are unreachable. Conflicts resolved via last-writer-wins or application-level reconciliation. |
| Cassandra | Tunable consistency (ONE, QUORUM, ALL). At consistency level ONE, it is AP — any single replica can serve reads/writes during partition. |
| CouchDB | Multi-master replication. Accepts writes on any node during partition. Detects and exposes conflicts for application resolution. |
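To make tunable consistency concrete, here is a minimal sketch using the DataStax Python driver (`cassandra-driver`); the contact points, keyspace, and `users` table are placeholders, and the same query is issued at two different consistency levels.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Placeholder contact points and keyspace; assumes a reachable 3-node cluster.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("app_keyspace")

# AP-leaning: a single replica can acknowledge, so the query stays available
# during a partition but may return stale data.
fast_read = SimpleStatement(
    "SELECT * FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

# CP-leaning: a majority of replicas must respond; the query fails if a quorum
# is unreachable, but it cannot return data older than the last quorum write.
safe_read = SimpleStatement(
    "SELECT * FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

row = session.execute(safe_read, ("user-123",)).one()
```

The statement text is identical in both cases; only the per-request consistency level moves the operation along the availability/consistency spectrum.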
CA Is a Myth in Distributed Systems
A CA system would require that no partitions ever occur. This is only possible on a single node (e.g., a single PostgreSQL instance). The moment you add a second node, you must handle partitions.
The Spectrum Reality
Most production systems are not binary CP or AP. They sit on a spectrum with tunable consistency:
- Cassandra with QUORUM reads/writes behaves as CP for those operations (see the quorum check after this list)
- DynamoDB with strongly consistent reads enabled behaves as CP for reads
- CockroachDB is CP by default but can sacrifice consistency for availability with follower reads
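The reason quorum reads and writes behave as CP is the overlap condition R + W > N: every read quorum must intersect every write quorum, so a read always touches at least one replica that saw the latest acknowledged write. A tiny sketch of that check (function and variable names are illustrative):

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum must intersect every write quorum (R + W > N)."""
    return r + w > n

# Replication factor 3 with QUORUM (2) reads and writes: quorums overlap, so a
# read always reaches at least one replica holding the latest acknowledged write.
assert quorums_overlap(n=3, r=2, w=2) is True

# Replication factor 3 at consistency level ONE for both reads and writes:
# no overlap guarantee, so stale reads are possible (AP-style behavior).
assert quorums_overlap(n=3, r=1, w=1) is False
```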
PACELC Theorem
CAP only describes behavior during a partition. The PACELC theorem (Abadi, 2012) extends it: when there is no partition (E), you must still choose between Latency and Consistency.
If Partition (P):            Else (E):
Choose A or C                Choose L or C
┌─────────────────┐          ┌─────────────────┐
│ PA: Available   │          │ EL: Low Latency │
│ PC: Consistent  │          │ EC: Consistent  │
└─────────────────┘          └─────────────────┘
| System | Partition Behavior | Normal Behavior | PACELC |
|---|---|---|---|
| DynamoDB | PA (available) | EL (low latency) | PA/EL |
| Cassandra | PA (available) | EL (low latency) | PA/EL |
| Spanner | PC (consistent) | EC (consistent) | PC/EC |
| MongoDB | PA (available) | EC (consistent) | PA/EC |
| CockroachDB | PC (consistent) | EL (low latency with follower reads) | PC/EL |
Interview tip: When asked about CAP, mention PACELC to show deeper understanding. Most real systems optimize for the "no partition" case since partitions are rare.
Consistency Models Spectrum
From strongest to weakest:
| Model | Guarantee | Example System |
|---|---|---|
| Linearizability | Real-time ordering. Once a write completes, all subsequent reads (by any client) see it. | Spanner, etcd |
| Sequential Consistency | All operations appear in some total order consistent with each client's program order. | ZooKeeper (per-session) |
| Causal Consistency | Causally related operations are ordered. Concurrent operations may appear in any order. | MongoDB (causal sessions) |
| Read-your-writes | A client always sees its own writes. Other clients may see stale data. | DynamoDB (strong reads) |
| Eventual Consistency | All replicas converge to the same state eventually. No ordering guarantees during convergence. | DynamoDB (default), Cassandra (ONE), DNS |
Interview distinction: Linearizability is about real-time ordering (wall clock). Sequential consistency is about a logical total order consistent with program order. Many candidates conflate these two.
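As a concrete example of opting into stronger guarantees per request, DynamoDB exposes a per-read flag. A hedged sketch with boto3; the `orders` table and key are placeholders, and it assumes credentials and an existing table:

```python
import boto3

# Placeholder table and key; requires AWS credentials and an existing table.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")

# Default: eventually consistent read. Cheaper and faster, but it may not
# reflect a write that completed moments earlier.
eventual = table.get_item(Key={"order_id": "o-123"})

# Strongly consistent read: reflects every write acknowledged before the read,
# at roughly double the read-capacity cost and somewhat higher latency.
strong = table.get_item(Key={"order_id": "o-123"}, ConsistentRead=True)
```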
Raft Consensus Algorithm
Raft is the consensus algorithm most interviewers expect you to know. It is equivalent to Paxos in safety but designed to be understandable.
Three Roles
- Leader: Handles all client requests, replicates log entries to followers
- Follower: Passive — responds to leader's RPCs and votes in elections
- Candidate: A follower that has started an election to become leader
Leader Election
Term 1                   Term 2                  Term 3
┌──────────┐          ┌──────────┐            ┌──────────┐
│ Leader A │────X─────│ Election │────────────│ Leader C │
│          │ (crash)  │ B, C     │            │          │
└──────────┘          │ vote     │            └──────────┘
                      └──────────┘
Step-by-step:
1. Leader A sends periodic heartbeats at an interval well below the election timeout (see the timer sketch after this list)
2. Leader A crashes — followers stop receiving heartbeats
3. Follower C's election timeout fires (randomized: 150-300ms)
4. C increments term to 2, transitions to Candidate, votes for itself
5. C sends RequestVote RPCs to all other nodes
6. Each node votes for at most ONE candidate per term
7. C receives votes from majority (including itself) → becomes Leader
8. C sends heartbeats to establish authority, preventing new elections
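A minimal sketch of the follower-side timer behind steps 1-3; the millisecond values mirror the numbers above, and the names are illustrative:

```python
import random

# Illustrative heartbeat period: it must be much shorter than the election
# timeout, or healthy leaders would constantly be deposed by impatient followers.
HEARTBEAT_INTERVAL_MS = 50

def election_timeout_ms() -> float:
    """Randomized 150-300ms window; randomization makes split votes unlikely."""
    return random.uniform(150, 300)

def should_start_election(ms_since_last_heartbeat: float, timeout_ms: float) -> bool:
    # A follower that has heard nothing from a leader (and granted no vote)
    # within its timeout becomes a candidate and starts an election.
    return ms_since_last_heartbeat >= timeout_ms

# Sanity check: heartbeats arrive several times within even the shortest timeout.
assert HEARTBEAT_INTERVAL_MS * 3 <= 150
# 400ms of silence exceeds even the longest possible timeout, so an election starts.
assert should_start_election(ms_since_last_heartbeat=400, timeout_ms=election_timeout_ms())
```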
Key rules (a vote-handling sketch follows this list):
- Election timeout is randomized (150-300ms) to avoid split votes
- A node votes for the first candidate it hears from in a given term
- A candidate must have a log at least as up-to-date as the voter's log (election restriction)
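A compact sketch of how a node might apply these rules when handling a RequestVote RPC; the field names follow the Raft paper, while persistence, RPC plumbing, and error handling are omitted:

```python
from dataclasses import dataclass, field

@dataclass
class NodeState:
    current_term: int = 0
    voted_for: str | None = None               # candidate voted for in current_term
    log: list[tuple[int, str]] = field(default_factory=list)  # (term, command), 1-indexed

def handle_request_vote(state: NodeState, candidate_id: str, candidate_term: int,
                        last_log_index: int, last_log_term: int) -> bool:
    """Grant or deny a vote: one vote per term, plus the election restriction."""
    if candidate_term < state.current_term:
        return False                            # stale candidate, reject
    if candidate_term > state.current_term:
        state.current_term = candidate_term     # newer term observed
        state.voted_for = None                  # vote not yet cast in this term

    # Election restriction: candidate's log must be at least as up-to-date as ours
    # (compare last log term first, then last log index).
    our_last_term = state.log[-1][0] if state.log else 0
    our_last_index = len(state.log)
    up_to_date = (last_log_term, last_log_index) >= (our_last_term, our_last_index)

    if up_to_date and state.voted_for in (None, candidate_id):
        state.voted_for = candidate_id          # at most one vote per term
        return True
    return False
```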
Log Replication
Client            Leader           Follower 1        Follower 2
  │                 │                   │                 │
  │──Write X=5─────>│                   │                 │
  │                 │──AppendEntries───>│                 │
  │                 │──AppendEntries─────────────────────>│
  │                 │                   │                 │
  │                 │<──Success─────────│                 │
  │                 │<──Success───────────────────────────│
  │                 │                   │                 │
  │                 │  Majority confirmed (2/3)           │
  │                 │  Commit entry, advance commitIndex  │
  │<──OK────────────│                   │                 │
  │                 │──Heartbeat (commitIndex)───────────>│
  │                 │                   │ Apply entry     │ Apply entry
- Client sends write to Leader
- Leader appends entry to its log with current term number
- Leader sends AppendEntries RPCs to all followers
- When a majority acknowledges, the entry is committed (see the commit sketch after this list)
- Leader responds to client after commit
- Followers apply committed entries on the next heartbeat
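A small sketch of the leader's commit decision from the majority-acknowledgement step, computed from each follower's acknowledged matchIndex. It is simplified: real Raft also requires the entry at that index to carry the leader's current term before committing it.

```python
def majority_commit_index(match_index: dict[str, int], leader_last_index: int,
                          cluster_size: int) -> int:
    """Highest log index known to be replicated on a majority of the cluster."""
    # Indexes held by each node: the leader's own log plus every follower's matchIndex.
    replicated = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    majority = cluster_size // 2 + 1
    # The majority-th highest value is stored on at least `majority` nodes.
    return replicated[majority - 1]

# 3-node cluster: leader holds entries through index 5, followers acknowledge 5 and 3.
# Index 5 is on 2 of 3 nodes, so the leader can advance commitIndex to 5.
assert majority_commit_index({"f1": 5, "f2": 3}, leader_last_index=5, cluster_size=3) == 5
```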
Safety Guarantees
- Election restriction: Only a candidate with all committed entries can win an election
- Log matching: If two logs contain an entry with the same index and term, all preceding entries are identical
- Leader completeness: A committed entry is present in the log of every future leader
Paxos vs. Raft
| Aspect | Raft | Paxos |
|---|---|---|
| Understandability | Designed for clarity | Notoriously difficult |
| Leader | Single leader required | Can operate leaderless (Multi-Paxos uses leader for performance) |
| Phases | Decomposed into leader election, then log replication | Two phases per decision: prepare/promise, then accept/accepted |
| Practical use | etcd, Consul, CockroachDB | Chubby (Google), Spanner (uses Paxos groups) |
Interview tip: "Raft is equivalent to Multi-Paxos in terms of safety and liveness, but it decomposes the problem into leader election, log replication, and safety — making it easier to reason about."
Vector Clocks and Lamport Timestamps
Lamport Timestamps
A logical clock that establishes happens-before ordering:
Node A: [1] ── [2] ──────────── [3]
                │
           send │
                ▼
Node B: [1] ── [3] ── [4]
                       │
                  send │
                       ▼
Node C: [1] ───────── [5] ── [6]
Rules:
1. Before any event, increment local counter
2. When sending: attach current timestamp
3. When receiving: local = max(local, received) + 1
Limitation: If timestamp(A) < timestamp(B), you cannot conclude A happened before B. They might be concurrent. Lamport timestamps only guarantee: if A happened-before B, then timestamp(A) < timestamp(B) (the converse is not true).
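A minimal Lamport clock sketch implementing the three rules above; class and method names are illustrative:

```python
class LamportClock:
    """One integer per node, maintained by the three rules above."""

    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        self.time += 1                      # rule 1: increment before any event
        return self.time

    def send(self) -> int:
        self.time += 1                      # a send is an event too
        return self.time                    # rule 2: attach this timestamp to the message

    def receive(self, msg_timestamp: int) -> int:
        # rule 3: jump past the sender's timestamp, then count the receive event
        self.time = max(self.time, msg_timestamp) + 1
        return self.time

# Mirrors the diagram: A sends at 2, B (currently at 1) receives at max(1, 2) + 1 = 3.
a, b = LamportClock(), LamportClock()
a.local_event()                             # A: 1
msg = a.send()                              # A: 2
b.local_event()                             # B: 1
assert b.receive(msg) == 3
```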
Vector Clocks
Vector clocks track the logical time per node, which allows detecting concurrent events:
Node A: [1,0,0] ── [2,0,0] ─────────────── [3,2,0] ──────────┐
                                              ▲              │
                                         send │      receive │
                                              │              ▼
Node B: [0,1,0] ── [0,2,0] ───────────────────┴────────── [3,3,0]
                      │
                 send │
                      ▼
Node C: [0,0,1] ── [0,2,2]
Comparison:
- [2,0,0] vs [0,2,0] → CONCURRENT (neither dominates)
- [2,0,0] vs [3,2,0] → [2,0,0] happened-before [3,2,0]
- VC1 < VC2 iff every element of VC1 <= VC2 and at least one is strictly less
Used by Amazon's Dynamo paper for conflict detection: when two writes have concurrent vector clocks, the system knows there is a conflict and can present both versions for reconciliation.
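A small vector clock sketch showing the merge-on-receive rule and the comparison used for conflict detection; node ids and vector sizing are illustrative:

```python
class VectorClock:
    """Per-node logical clock that can detect concurrency, unlike Lamport timestamps."""

    def __init__(self, node_id: int, num_nodes: int) -> None:
        self.node_id = node_id
        self.clock = [0] * num_nodes

    def tick(self) -> list[int]:
        self.clock[self.node_id] += 1       # local event or send: bump own slot
        return list(self.clock)             # snapshot to attach to an outgoing message

    def receive(self, other: list[int]) -> list[int]:
        # element-wise max with the sender's clock, then count the receive event itself
        self.clock = [max(a, b) for a, b in zip(self.clock, other)]
        self.clock[self.node_id] += 1
        return list(self.clock)

def happened_before(vc1: list[int], vc2: list[int]) -> bool:
    """vc1 < vc2 iff every element of vc1 <= vc2 and at least one is strictly less."""
    return all(a <= b for a, b in zip(vc1, vc2)) and vc1 != vc2

def concurrent(vc1: list[int], vc2: list[int]) -> bool:
    return not happened_before(vc1, vc2) and not happened_before(vc2, vc1)

# Matches the comparison examples above.
assert concurrent([2, 0, 0], [0, 2, 0])
assert happened_before([2, 0, 0], [3, 2, 0])
```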
Interview takeaway: Lamport timestamps give you total ordering (but cannot detect concurrency). Vector clocks detect concurrency (but do not give total ordering). Choose based on what you need.
Next: Concurrency & Multithreading — race conditions, deadlocks, and language-specific patterns that dominate coding rounds.