Distributed Systems & Reliability
Consistency, Availability & Consensus
Distributed systems are the backbone of every large-scale backend. This lesson covers the fundamental theorems, consistency models, and consensus algorithms that interviewers at L5+ expect you to reason about fluently.
CAP Theorem Deep Dive
The CAP theorem (Brewer, 2000) states that a distributed data store can only guarantee two of three properties simultaneously:
- Consistency (C): Every read receives the most recent write or an error
- Availability (A): Every request receives a non-error response (no guarantee it contains the most recent write)
- Partition Tolerance (P): The system continues operating despite network partitions between nodes
Partition Tolerance Is Not Optional
In any distributed system, network partitions will happen. Cables get cut, switches fail, cloud availability zones lose connectivity. Since you cannot avoid partitions, the real choice is between CP and AP when a partition occurs.
          CAP Theorem

C ──────────────────────────── A
│                              │
│           You can            │
│           pick two           │
│                              │
│         ┌──────────┐         │
│         │    CA    │         │
│         │ (single  │         │
│         │   node   │         │
│         │   only)  │         │
│         └──────────┘         │
│                              │
└────────────── P ─────────────┘
         (always present
          in distributed
             systems)
CP Systems — Choose Consistency Over Availability
When a partition occurs, CP systems reject requests rather than return stale data.
| System | How It Achieves CP |
|---|---|
| Google Spanner | TrueTime API (GPS + atomic clocks) provides external consistency. Commits wait for uncertainty interval to pass, guaranteeing global ordering. |
| HBase | Single master for each region. During partition, regions on the wrong side become unavailable. |
| ZooKeeper | Leader-based consensus (ZAB protocol). Writes go through leader; if leader is partitioned, the minority partition cannot serve writes. |
| etcd | Raft consensus. Requires majority quorum for writes. Minority partition becomes read-only. |
AP Systems — Choose Availability Over Consistency
When a partition occurs, AP systems continue serving requests but may return stale or conflicting data.
| System | How It Achieves AP |
|---|---|
| DynamoDB | Sloppy quorum + hinted handoff. Writes succeed even if some replicas are unreachable. Conflicts resolved via last-writer-wins or application-level reconciliation. |
| Cassandra | Tunable consistency (ONE, QUORUM, ALL). At consistency level ONE, it is AP — any single replica can serve reads/writes during partition. |
| CouchDB | Multi-master replication. Accepts writes on any node during partition. Detects and exposes conflicts for application resolution. |
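To make tunable consistency concrete, here is a minimal sketch using the DataStax Python driver (`cassandra-driver`); the contact points, keyspace, and `users` table are placeholders, and the same query is issued at two different consistency levels.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Placeholder contact points and keyspace; assumes a reachable 3-node cluster.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect("app_keyspace")

# AP-leaning: a single replica can acknowledge, so the query stays available
# during a partition but may return stale data.
fast_read = SimpleStatement(
    "SELECT * FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.ONE,
)

# CP-leaning: a majority of replicas must respond; the query fails if a quorum
# is unreachable, but it cannot return data older than the last quorum write.
safe_read = SimpleStatement(
    "SELECT * FROM users WHERE id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)

row = session.execute(safe_read, ("user-123",)).one()
```

The statement text is identical in both cases; only the per-request consistency level moves the operation along the availability/consistency spectrum.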
CA Is a Myth in Distributed Systems
A CA system would require that no partitions ever occur. This is only possible on a single node (e.g., a single PostgreSQL instance). The moment you add a second node, you must handle partitions.
The Spectrum Reality
Most production systems are not binary CP or AP. They sit on a spectrum with tunable consistency:
- Cassandra with QUORUM reads/writes behaves as CP for those operations (see the quorum check after this list)
- DynamoDB with strongly consistent reads enabled behaves as CP for reads
- CockroachDB is CP by default but can sacrifice consistency for availability with follower reads
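The reason quorum reads and writes behave as CP is the overlap condition R + W > N: every read quorum must intersect every write quorum, so a read always touches at least one replica that saw the latest acknowledged write. A tiny sketch of that check (function and variable names are illustrative):

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True if every read quorum must intersect every write quorum (R + W > N)."""
    return r + w > n

# Replication factor 3 with QUORUM (2) reads and writes: quorums overlap, so a
# read always reaches at least one replica holding the latest acknowledged write.
assert quorums_overlap(n=3, r=2, w=2) is True

# Replication factor 3 at consistency level ONE for both reads and writes:
# no overlap guarantee, so stale reads are possible (AP-style behavior).
assert quorums_overlap(n=3, r=1, w=1) is False
```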
PACELC Theorem
CAP only describes behavior during a partition. The PACELC theorem (Abadi, 2012) extends it: when there is no partition (E), you must still choose between Latency and Consistency.
If Partition (P):            Else (E):
Choose A or C                Choose L or C
┌─────────────────┐          ┌─────────────────┐
│ PA: Available   │          │ EL: Low Latency │
│ PC: Consistent  │          │ EC: Consistent  │
└─────────────────┘          └─────────────────┘
| System | Partition Behavior | Normal Behavior | PACELC |
|---|---|---|---|
| DynamoDB | PA (available) | EL (low latency) | PA/EL |
| Cassandra | PA (available) | EL (low latency) | PA/EL |
| Spanner | PC (consistent) | EC (consistent) | PC/EC |
| MongoDB | PA (available) | EC (consistent) | PA/EC |
| CockroachDB | PC (consistent) | EL (low latency with follower reads) | PC/EL |
Interview tip: When asked about CAP, mention PACELC to show deeper understanding. Most real systems optimize for the "no partition" case since partitions are rare.
Consistency Models Spectrum
From strongest to weakest:
| Model | Guarantee | Example System |
|---|---|---|
| Linearizability | Real-time ordering. Once a write completes, all subsequent reads (by any client) see it. | Spanner, etcd |
| Sequential Consistency | All operations appear in some total order consistent with each client's program order. | ZooKeeper (per-session) |
| Causal Consistency | Causally related operations are ordered. Concurrent operations may appear in any order. | MongoDB (causal sessions) |
| Read-your-writes | A client always sees its own writes. Other clients may see stale data. | DynamoDB (strong reads) |
| Eventual Consistency | All replicas converge to the same state eventually. No ordering guarantees during convergence. | DynamoDB (default), Cassandra (ONE), DNS |
Interview distinction: Linearizability is about real-time ordering (wall clock). Sequential consistency is about a logical total order consistent with program order. Many candidates conflate these two.
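As a concrete example of opting into stronger guarantees per request, DynamoDB exposes a per-read flag. A hedged sketch with boto3; the `orders` table and key are placeholders, and it assumes credentials and an existing table:

```python
import boto3

# Placeholder table and key; requires AWS credentials and an existing table.
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("orders")

# Default: eventually consistent read. Cheaper and faster, but it may not
# reflect a write that completed moments earlier.
eventual = table.get_item(Key={"order_id": "o-123"})

# Strongly consistent read: reflects every write acknowledged before the read,
# at roughly double the read-capacity cost and somewhat higher latency.
strong = table.get_item(Key={"order_id": "o-123"}, ConsistentRead=True)
```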
Raft Consensus Algorithm
Raft is the consensus algorithm most interviewers expect you to know. It is equivalent to Paxos in safety but designed to be understandable.
Three Roles
- Leader: Handles all client requests, replicates log entries to followers
- Follower: Passive — responds to leader's RPCs and votes in elections
- Candidate: A follower that has started an election to become leader
Leader Election
Term 1                   Term 2                  Term 3
┌──────────┐          ┌──────────┐            ┌──────────┐
│ Leader A │────X─────│ Election │────────────│ Leader C │
│          │ (crash)  │ B, C     │            │          │
└──────────┘          │ vote     │            └──────────┘
                      └──────────┘
Step-by-step:
1. Leader A sends periodic heartbeats at an interval well below the election timeout (see the timer sketch after this list)
2. Leader A crashes — followers stop receiving heartbeats
3. Follower C's election timeout fires (randomized: 150-300ms)
4. C increments term to 2, transitions to Candidate, votes for itself
5. C sends RequestVote RPCs to all other nodes
6. Each node votes for at most ONE candidate per term
7. C receives votes from majority (including itself) → becomes Leader
8. C sends heartbeats to establish authority, preventing new elections
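A minimal sketch of the follower-side timer behind steps 1-3; the millisecond values mirror the numbers above, and the names are illustrative:

```python
import random

# Illustrative heartbeat period: it must be much shorter than the election
# timeout, or healthy leaders would constantly be deposed by impatient followers.
HEARTBEAT_INTERVAL_MS = 50

def election_timeout_ms() -> float:
    """Randomized 150-300ms window; randomization makes split votes unlikely."""
    return random.uniform(150, 300)

def should_start_election(ms_since_last_heartbeat: float, timeout_ms: float) -> bool:
    # A follower that has heard nothing from a leader (and granted no vote)
    # within its timeout becomes a candidate and starts an election.
    return ms_since_last_heartbeat >= timeout_ms

# Sanity check: heartbeats arrive several times within even the shortest timeout.
assert HEARTBEAT_INTERVAL_MS * 3 <= 150
# 400ms of silence exceeds even the longest possible timeout, so an election starts.
assert should_start_election(ms_since_last_heartbeat=400, timeout_ms=election_timeout_ms())
```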
Key rules (a vote-handling sketch follows this list):
- Election timeout is randomized (150-300ms) to avoid split votes
- A node votes for the first candidate it hears from in a given term
- A candidate must have a log at least as up-to-date as the voter's log (election restriction)
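A compact sketch of how a node might apply these rules when handling a RequestVote RPC; the field names follow the Raft paper, while persistence, RPC plumbing, and error handling are omitted:

```python
from dataclasses import dataclass, field

@dataclass
class NodeState:
    current_term: int = 0
    voted_for: str | None = None               # candidate voted for in current_term
    log: list[tuple[int, str]] = field(default_factory=list)  # (term, command), 1-indexed

def handle_request_vote(state: NodeState, candidate_id: str, candidate_term: int,
                        last_log_index: int, last_log_term: int) -> bool:
    """Grant or deny a vote: one vote per term, plus the election restriction."""
    if candidate_term < state.current_term:
        return False                            # stale candidate, reject
    if candidate_term > state.current_term:
        state.current_term = candidate_term     # newer term observed
        state.voted_for = None                  # vote not yet cast in this term

    # Election restriction: candidate's log must be at least as up-to-date as ours
    # (compare last log term first, then last log index).
    our_last_term = state.log[-1][0] if state.log else 0
    our_last_index = len(state.log)
    up_to_date = (last_log_term, last_log_index) >= (our_last_term, our_last_index)

    if up_to_date and state.voted_for in (None, candidate_id):
        state.voted_for = candidate_id          # at most one vote per term
        return True
    return False
```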
Log Replication
Client            Leader           Follower 1        Follower 2
  │                 │                   │                 │
  │──Write X=5─────>│                   │                 │
  │                 │──AppendEntries───>│                 │
  │                 │──AppendEntries─────────────────────>│
  │                 │                   │                 │
  │                 │<──Success─────────│                 │
  │                 │<──Success───────────────────────────│
  │                 │                   │                 │
  │                 │  Majority confirmed (2/3)           │
  │                 │  Commit entry, advance commitIndex  │
  │<──OK────────────│                   │                 │
  │                 │──Heartbeat (commitIndex)───────────>│
  │                 │                   │ Apply entry     │ Apply entry
- Client sends write to Leader
- Leader appends entry to its log with current term number
- Leader sends AppendEntries RPCs to all followers
- When a majority acknowledges, the entry is committed (see the commit sketch after this list)
- Leader responds to client after commit
- Followers apply committed entries on the next heartbeat
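A small sketch of the leader's commit decision from the majority-acknowledgement step, computed from each follower's acknowledged matchIndex. It is simplified: real Raft also requires the entry at that index to carry the leader's current term before committing it.

```python
def majority_commit_index(match_index: dict[str, int], leader_last_index: int,
                          cluster_size: int) -> int:
    """Highest log index known to be replicated on a majority of the cluster."""
    # Indexes held by each node: the leader's own log plus every follower's matchIndex.
    replicated = sorted(list(match_index.values()) + [leader_last_index], reverse=True)
    majority = cluster_size // 2 + 1
    # The majority-th highest value is stored on at least `majority` nodes.
    return replicated[majority - 1]

# 3-node cluster: leader holds entries through index 5, followers acknowledge 5 and 3.
# Index 5 is on 2 of 3 nodes, so the leader can advance commitIndex to 5.
assert majority_commit_index({"f1": 5, "f2": 3}, leader_last_index=5, cluster_size=3) == 5
```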
Safety Guarantees
- Election restriction: Only a candidate with all committed entries can win an election
- Log matching: If two logs contain an entry with the same index and term, all preceding entries are identical
- Leader completeness: A committed entry is present in the log of every future leader
Paxos vs. Raft
| Aspect | Raft | Paxos |
|---|---|---|
| Understandability | Designed for clarity | Notoriously difficult |
| Leader | Single leader required | Can operate leaderless (Multi-Paxos uses leader for performance) |
| Phases | Decomposed into leader election, then log replication | Two phases per decision: prepare/promise, then accept/accepted |
| Practical use | etcd, Consul, CockroachDB | Chubby (Google), Spanner (uses Paxos groups) |
Interview tip: "Raft is equivalent to Multi-Paxos in terms of safety and liveness, but it decomposes the problem into leader election, log replication, and safety — making it easier to reason about."
Vector Clocks and Lamport Timestamps
Lamport Timestamps
A logical clock that establishes happens-before ordering:
Node A: [1] ── [2] ──────────── [3]
                │
           send │
                ▼
Node B: [1] ── [3] ── [4]
                       │
                  send │
                       ▼
Node C: [1] ───────── [5] ── [6]
Rules:
1. Before any event, increment local counter
2. When sending: attach current timestamp
3. When receiving: local = max(local, received) + 1
Limitation: If timestamp(A) < timestamp(B), you cannot conclude A happened before B. They might be concurrent. Lamport timestamps only guarantee: if A happened-before B, then timestamp(A) < timestamp(B) (the converse is not true).
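A minimal Lamport clock sketch implementing the three rules above; class and method names are illustrative:

```python
class LamportClock:
    """One integer per node, maintained by the three rules above."""

    def __init__(self) -> None:
        self.time = 0

    def local_event(self) -> int:
        self.time += 1                      # rule 1: increment before any event
        return self.time

    def send(self) -> int:
        self.time += 1                      # a send is an event too
        return self.time                    # rule 2: attach this timestamp to the message

    def receive(self, msg_timestamp: int) -> int:
        # rule 3: jump past the sender's timestamp, then count the receive event
        self.time = max(self.time, msg_timestamp) + 1
        return self.time

# Mirrors the diagram: A sends at 2, B (currently at 1) receives at max(1, 2) + 1 = 3.
a, b = LamportClock(), LamportClock()
a.local_event()                             # A: 1
msg = a.send()                              # A: 2
b.local_event()                             # B: 1
assert b.receive(msg) == 3
```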
Vector Clocks
Vector clocks track the logical time per node, which allows detecting concurrent events:
Node A: [1,0,0] ── [2,0,0] ─────────────── [3,2,0] ──────────┐
                                              ▲              │
                                         send │      receive │
                                              │              ▼
Node B: [0,1,0] ── [0,2,0] ───────────────────┴────────── [3,3,0]
                      │
                 send │
                      ▼
Node C: [0,0,1] ── [0,2,2]
Comparison:
- [2,0,0] vs [0,2,0] → CONCURRENT (neither dominates)
- [2,0,0] vs [3,2,0] → [2,0,0] happened-before [3,2,0]
- VC1 < VC2 iff every element of VC1 <= VC2 and at least one is strictly less
Used by Amazon's Dynamo paper for conflict detection: when two writes have concurrent vector clocks, the system knows there is a conflict and can present both versions for reconciliation.
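A small vector clock sketch showing the merge-on-receive rule and the comparison used for conflict detection; node ids and vector sizing are illustrative:

```python
class VectorClock:
    """Per-node logical clock that can detect concurrency, unlike Lamport timestamps."""

    def __init__(self, node_id: int, num_nodes: int) -> None:
        self.node_id = node_id
        self.clock = [0] * num_nodes

    def tick(self) -> list[int]:
        self.clock[self.node_id] += 1       # local event or send: bump own slot
        return list(self.clock)             # snapshot to attach to an outgoing message

    def receive(self, other: list[int]) -> list[int]:
        # element-wise max with the sender's clock, then count the receive event itself
        self.clock = [max(a, b) for a, b in zip(self.clock, other)]
        self.clock[self.node_id] += 1
        return list(self.clock)

def happened_before(vc1: list[int], vc2: list[int]) -> bool:
    """vc1 < vc2 iff every element of vc1 <= vc2 and at least one is strictly less."""
    return all(a <= b for a, b in zip(vc1, vc2)) and vc1 != vc2

def concurrent(vc1: list[int], vc2: list[int]) -> bool:
    return not happened_before(vc1, vc2) and not happened_before(vc2, vc1)

# Matches the comparison examples above.
assert concurrent([2, 0, 0], [0, 2, 0])
assert happened_before([2, 0, 0], [3, 2, 0])
```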
Interview takeaway: Lamport timestamps give you total ordering (but cannot detect concurrency). Vector clocks detect concurrency (but do not give total ordering). Choose based on what you need.
Next: Concurrency & Multithreading — race conditions, deadlocks, and language-specific patterns that dominate coding rounds.