Architecture Patterns & System Design

Distributed Systems Fundamentals

Cloud architects must understand distributed systems theory to design resilient, scalable architectures. These concepts appear frequently in system design interviews.

CAP Theorem

The foundation of distributed systems decision-making.

Definition

A distributed system cannot guarantee all three of the following properties at once; because network partitions cannot be ruled out, the practical choice during a partition is between consistency and availability:

  • Consistency: All nodes see the same data at the same time
  • Availability: Every request receives a response (success or failure)
  • Partition Tolerance: System continues operating despite network partitions

CAP in Practice

| Database | CAP Choice | Trade-off |
| --- | --- | --- |
| Traditional RDBMS | CA | Cannot handle partitions well |
| MongoDB | CP | May reject writes during partition |
| Cassandra | AP | May return stale data |
| DynamoDB | Configurable | Choose per-operation |
| Spanner | CP (with high A) | Google's global network minimizes partitions |

Interview Question: CAP in Real Systems

Q: "How does AWS DynamoDB handle CAP theorem trade-offs?"

A: DynamoDB offers configurable consistency:

  • Eventually Consistent Reads (default): AP behavior, higher availability, lower latency
  • Strongly Consistent Reads: CP behavior, guaranteed latest data, higher latency
  • Transactions: Strong consistency with ACID guarantees, highest latency

Design Implication: Use eventually consistent reads for user-facing, read-heavy operations and strongly consistent reads for critical operations (payments, inventory); a minimal boto3 sketch follows.
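
As a concrete illustration, here is a minimal boto3 sketch, assuming a hypothetical Orders table keyed on order_id, that issues the same read with and without strong consistency:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Eventually consistent read (default): lower latency, may return stale data.
eventual = dynamodb.get_item(
    TableName="Orders",                    # hypothetical table name
    Key={"order_id": {"S": "order-123"}},  # hypothetical key
)

# Strongly consistent read: reflects all prior successful writes, at higher latency.
strong = dynamodb.get_item(
    TableName="Orders",
    Key={"order_id": {"S": "order-123"}},
    ConsistentRead=True,
)
```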

PACELC Theorem

An extension of CAP that also covers the trade-off during normal operation, when there is no partition.

Definition

If there is a Partition (P), the system must choose between Availability (A) and Consistency (C); Else (E), during normal operation, it must choose between Latency (L) and Consistency (C).

PACELC Classification

| System | Partition (A/C) | Else (L/C) | Behavior |
| --- | --- | --- | --- |
| DynamoDB | A | L | Fast, eventual consistency |
| Cassandra | A | L | Tunable consistency |
| MongoDB | C | C | Strong consistency |
| Spanner | C | C | Strong consistency, global |
| CockroachDB | C | L | Serializable, low latency |

Consistency Models

Eventual Consistency

  • Updates propagate asynchronously
  • Reads may return stale data temporarily
  • System converges to consistent state

Use Cases: Social media feeds, product catalogs, session data
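
To make "converges to a consistent state" concrete, here is a toy last-write-wins sketch, one simple convergence rule among many (real systems may use vector clocks or CRDTs instead):

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp: float  # wall-clock or logical timestamp of the last update

def merge(local: Versioned, remote: Versioned) -> Versioned:
    # Last-write-wins: keep whichever update carries the newer timestamp.
    return remote if remote.timestamp > local.timestamp else local

# Two replicas accepted different updates while out of sync...
replica_a = Versioned("profile-v1", timestamp=100.0)
replica_b = Versioned("profile-v2", timestamp=105.0)

# ...and converge to the same value once they exchange state (anti-entropy).
replica_a = merge(replica_a, replica_b)
replica_b = merge(replica_b, replica_a)
assert replica_a == replica_b  # both now hold "profile-v2"
```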

Strong Consistency

  • Reads always return the latest write
  • Higher latency, lower throughput
  • Required for financial transactions

Use Cases: Banking, inventory management, user authentication

Causal Consistency

  • Related operations maintain order
  • Independent operations may be out of order
  • Balance between eventual and strong

Use Cases: Collaborative editing, messaging systems

Distributed Consensus

Raft Algorithm

Used by: etcd, Consul, CockroachDB

How Raft Works (a simplified sketch of the election step follows this list):

  1. Leader Election: One node becomes leader
  2. Log Replication: Leader replicates entries to followers
  3. Safety: Only leader with complete log can be elected
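
The sketch below is heavily simplified: RPC, persistence, heartbeats, and the log up-to-dateness check from the real Raft protocol are omitted, leaving only the term/vote bookkeeping of an election.

```python
import random

class RaftNode:
    """Simplified sketch of Raft leader election (illustration only)."""

    def __init__(self, node_id: str, peers: list["RaftNode"]):
        self.node_id = node_id
        self.peers = peers
        self.current_term = 0
        self.voted_for = None
        self.state = "follower"
        # Randomized timeout (ms) so nodes rarely start elections simultaneously.
        self.election_timeout = random.uniform(150, 300)

    def on_election_timeout(self) -> None:
        # No heartbeat from a leader within the timeout: become a candidate.
        self.state = "candidate"
        self.current_term += 1
        self.voted_for = self.node_id
        votes = 1  # vote for self
        for peer in self.peers:
            if peer.request_vote(self.current_term, self.node_id):
                votes += 1
        # A strict majority of the whole cluster is required to become leader.
        if votes > (len(self.peers) + 1) // 2:
            self.state = "leader"

    def request_vote(self, term: int, candidate_id: str) -> bool:
        # Grant at most one vote per term (real Raft also rejects candidates
        # whose log is less up to date than the voter's).
        if term > self.current_term:
            self.current_term = term
            self.voted_for = None
            self.state = "follower"
        if term == self.current_term and self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False
```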

Paxos Algorithm

Used by: Google Spanner, Chubby

Paxos predates Raft and is harder to understand and implement correctly, but it is formally proven safe and has been battle-tested at massive scale.

Interview Question: Leader Election

Q: "How would you implement leader election for a distributed job scheduler?"

A: Options by complexity:

| Approach | Complexity | Use Case |
| --- | --- | --- |
| Cloud-native (DynamoDB locks) | Low | AWS-only, simple |
| etcd/Consul | Medium | Kubernetes-native |
| ZooKeeper | High | Legacy, Kafka |
| Custom Raft | Very High | Research/custom needs |

Recommended: Use managed services (DynamoDB, Consul) unless you have specific requirements.
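
For the cloud-native option, a common pattern is a lease item acquired with a DynamoDB conditional write. Below is a minimal boto3 sketch assuming a hypothetical leader_lock table with resource, owner, and expires_at attributes:

```python
import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")
LEASE_SECONDS = 30

def try_acquire_leadership(table: str, resource: str, owner: str) -> bool:
    """Acquire (or renew) the lease only if no unexpired lease is held by someone else."""
    now = int(time.time())
    try:
        dynamodb.put_item(
            TableName=table,
            Item={
                "resource": {"S": resource},
                "owner": {"S": owner},
                "expires_at": {"N": str(now + LEASE_SECONDS)},
            },
            # Succeed only if no lock exists, we already hold it, or it has expired.
            ConditionExpression="attribute_not_exists(#r) OR #o = :me OR #exp < :now",
            ExpressionAttributeNames={"#r": "resource", "#o": "owner", "#exp": "expires_at"},
            ExpressionAttributeValues={":me": {"S": owner}, ":now": {"N": str(now)}},
        )
        return True
    except ClientError as err:
        if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # another instance currently holds the lease
        raise

# Each scheduler instance calls this periodically; only the current leader runs jobs.
is_leader = try_acquire_leadership("leader_lock", "job-scheduler", "instance-42")
```

The leader renews its lease before it expires; if the leader dies, the lease lapses and another instance acquires it on its next attempt.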

Data Replication Patterns

Synchronous Replication

Client → Primary → Replica1 → Replica2 → ACK → Client
  • Strong consistency
  • Higher latency
  • Lower availability during failures

Asynchronous Replication

Client → Primary → ACK → Client
         ↓ (async)
     Replica1, Replica2
  • Lower latency
  • Eventual consistency
  • Potential data loss on primary failure

Semi-Synchronous Replication

Client → Primary → Replica1 → ACK → Client
         ↓ (async)
      Replica2
  • Balance of consistency and availability
  • At least one replica has latest data
  • Common in production databases
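
To illustrate the difference in acknowledgment behavior described above, here is a toy Python sketch with in-memory stand-ins for replica calls (not how a real database implements replication):

```python
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

pool = ThreadPoolExecutor(max_workers=4)

def send_to_replica(replica: str, record: dict) -> str:
    # Stand-in for a network call that applies the write on a replica.
    return f"{replica}: applied {record}"

def write_synchronous(record: dict, replicas: list[str]) -> None:
    # Client is ACKed only after *every* replica has applied the write.
    futures = [pool.submit(send_to_replica, r, record) for r in replicas]
    wait(futures)                               # block until all replicas respond
    print("ACK to client (all replicas durable)")

def write_semi_synchronous(record: dict, replicas: list[str]) -> None:
    # Client is ACKed once *at least one* replica has the write; others catch up later.
    futures = [pool.submit(send_to_replica, r, record) for r in replicas]
    wait(futures, return_when=FIRST_COMPLETED)  # block only for the first ACK
    print("ACK to client (one replica durable, rest catching up)")

write_synchronous({"id": 1, "amount": 100}, ["replica-1", "replica-2"])
write_semi_synchronous({"id": 2, "amount": 50}, ["replica-1", "replica-2"])
```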

Sharding Strategies

Range-Based Sharding

Users A-M → Shard 1
Users N-Z → Shard 2
  • Good for range queries
  • Risk of hot spots (popular letters)

Hash-Based Sharding

hash(user_id) % num_shards → Shard
  • Even distribution
  • Range queries require scatter-gather

Consistent Hashing

hash ring with virtual nodes
  • Minimal redistribution on scale
  • Used by DynamoDB, Cassandra
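
A minimal sketch of a consistent-hash ring with virtual nodes, using MD5 and an arbitrary vnode count purely for illustration:

```python
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100):
        self.vnodes = vnodes
        self.ring: list[int] = []          # sorted hash positions on the ring
        self.owners: dict[int, str] = {}   # hash position -> physical node
        for node in nodes:
            self.add_node(node)

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node: str) -> None:
        # Each physical node owns many points on the ring to even out load.
        for i in range(self.vnodes):
            pos = self._hash(f"{node}#{i}")
            bisect.insort(self.ring, pos)
            self.owners[pos] = node

    def get_node(self, key: str) -> str:
        # A key belongs to the first virtual node clockwise from its hash position.
        pos = self._hash(key)
        idx = bisect.bisect(self.ring, pos) % len(self.ring)
        return self.owners[self.ring[idx]]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.get_node("user-12345"))  # adding/removing a node remaps only ~1/N of keys
```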

Interview Question: Sharding Strategy

Q: "Design a sharding strategy for a social media platform's posts."

A: Consider access patterns:

  1. Primary access: Get posts by user (user timeline)
  2. Secondary: Get posts by time (global feed)

Strategy: Compound shard key (user_id, created_at)

  • Posts for a user are co-located
  • Time-based queries still efficient within user
  • Add secondary index for global time-based queries
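
One way to express that routing in code, assuming hash-based placement on user_id with created_at as the in-shard sort key (all names are illustrative):

```python
import hashlib
from datetime import datetime, timezone

NUM_SHARDS = 16

def shard_for_user(user_id: str) -> int:
    # All of a user's posts land on the same shard, keeping their timeline local.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def post_key(user_id: str, created_at: datetime) -> tuple[str, str]:
    # Compound key: partition on user_id, sort by created_at within the shard,
    # so "latest N posts for this user" is a single-shard range scan.
    return (user_id, created_at.isoformat())

shard = shard_for_user("user-42")
key = post_key("user-42", datetime.now(timezone.utc))
print(f"route write to shard {shard} with key {key}")
```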

Key Insight: Distributed systems design is about trade-offs. Always articulate what you're optimizing for and what you're sacrificing.

Next, we'll explore microservices architecture patterns.
