Distributed Systems & Reliability
Concurrency & Multithreading
Concurrency bugs are among the hardest to reproduce and debug. Interviewers love them because they reveal whether you truly understand shared state, coordination, and the guarantees your language runtime provides. This lesson covers the three categories of concurrency problems, locking strategies, and language-specific patterns for Go, Java, and Python.
Three Categories of Concurrency Problems
| Category | Problem | Example |
|---|---|---|
| Correctness | Shared state corruption | Two threads incrementing a counter — final value is wrong |
| Coordination | Hand-off and waiting | Producer creates work faster than consumer processes it |
| Scarcity | Resource limits | Database connection pool exhausted, new requests time out |
Every concurrency interview question falls into one (or more) of these categories. Identifying the category immediately narrows your solution space.
Mutexes and Locks
A mutex (mutual exclusion) ensures only one thread accesses a critical section at a time.
```go
// Go: sync.Mutex
var mu sync.Mutex
var balance int

func Deposit(amount int) {
    mu.Lock()
    defer mu.Unlock() // always unlock, even if panic occurs
    balance += amount
}

func Balance() int {
    mu.Lock()
    defer mu.Unlock()
    return balance
}
```
```java
// Java: ReentrantLock (preferred over synchronized for tryLock, fairness)
private final ReentrantLock lock = new ReentrantLock();
private int balance = 0;

public void deposit(int amount) {
    lock.lock();
    try {
        balance += amount;
    } finally {
        lock.unlock(); // always in a finally block
    }
}
```
```python
# Python: threading.Lock
import threading

lock = threading.Lock()
balance = 0

def deposit(amount: int) -> None:
    global balance
    with lock:  # context manager handles acquire/release
        balance += amount
```
Read-Write Locks
When reads vastly outnumber writes, a read-write lock allows concurrent reads but exclusive writes:
```go
// Go: sync.RWMutex
var rwmu sync.RWMutex
var config map[string]string

func GetConfig(key string) string {
    rwmu.RLock() // multiple goroutines can hold RLock
    defer rwmu.RUnlock()
    return config[key]
}

func SetConfig(key, value string) {
    rwmu.Lock() // exclusive — blocks all readers and writers
    defer rwmu.Unlock()
    config[key] = value
}
```
When to use: Read-write locks help when the read-to-write ratio is 10:1 or higher. Below that, the overhead of managing two lock types is not worth it — use a simple mutex.
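Python's standard library has no read-write lock. A minimal sketch of the classic first-reader/last-reader pattern (the `RWLock` class is ours, not stdlib):

```python
import threading

class RWLock:
    """Minimal read-write lock sketch (illustrative; not fair to writers)."""
    def __init__(self):
        self._readers = 0                      # number of active readers
        self._readers_lock = threading.Lock()  # guards the reader count
        self._writer_lock = threading.Lock()   # held while a writer or any reader is active

    def acquire_read(self):
        with self._readers_lock:
            self._readers += 1
            if self._readers == 1:         # first reader blocks writers
                self._writer_lock.acquire()

    def release_read(self):
        with self._readers_lock:
            self._readers -= 1
            if self._readers == 0:         # last reader lets writers in
                self._writer_lock.release()

    def acquire_write(self):
        self._writer_lock.acquire()        # exclusive: waits for all readers

    def release_write(self):
        self._writer_lock.release()

rw = RWLock()
rw.acquire_read()
rw.acquire_read()   # a second concurrent reader is allowed
rw.release_read()
rw.release_read()
rw.acquire_write()  # exclusive once all readers are gone
rw.release_write()
```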
Race Condition Patterns
Check-Then-Act (TOCTOU)
The Time-Of-Check-To-Time-Of-Use bug: you check a condition, then act on it, but the condition changes between check and act.
```python
# BROKEN: race condition between check and act
def withdraw(amount: int) -> bool:
    global balance
    if balance >= amount:     # Thread A checks: balance=100, amount=80
        # Thread B runs withdraw(50) here — balance becomes 50
        balance -= amount     # Thread A subtracts 80 from 50 = -30!
        return True
    return False

# FIX: hold the lock across both check and act
def withdraw(amount: int) -> bool:
    global balance
    with lock:
        if balance >= amount:
            balance -= amount
            return True
    return False
```
Read-Modify-Write
Multiple threads read a value, modify it, then write it back — lost updates.
```java
// BROKEN: counter++ is actually three operations (read, increment, write)
private int counter = 0;

public void increment() {
    counter++; // NOT atomic! Thread A reads 5, Thread B reads 5,
               // both write 6 — one increment lost
}

// FIX: use AtomicInteger with CAS (Compare-And-Swap)
private final AtomicInteger counter = new AtomicInteger(0);

public void increment() {
    counter.incrementAndGet(); // atomic CAS operation in a retry loop
}
```
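The same lost-update fix can be demonstrated end to end in Python, with a plain lock playing the role `AtomicInteger` plays above (a sketch; without the lock, increments can be lost to interleaving):

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n: int) -> None:
    global counter
    for _ in range(n):
        with lock:       # makes the read-modify-write sequence atomic
            counter += 1

threads = [threading.Thread(target=increment, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: no lost updates
```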
Compare-And-Swap (CAS)
CAS is the hardware primitive behind lock-free programming:
```
CAS(memory_location, expected_value, new_value):
    atomically:
        if *memory_location == expected_value:
            *memory_location = new_value
            return true
        else:
            return false  // another thread modified it — retry
```
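The retry loop that `incrementAndGet()` runs on top of CAS can be sketched in Python. Real CAS is a single hardware instruction; here a small internal lock merely stands in for that atomicity, and the class and method names are illustrative:

```python
import threading

class AtomicInt:
    """Simulated atomic cell; the internal lock stands in for hardware CAS."""
    def __init__(self, value: int = 0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, expected: int, new: int) -> bool:
        with self._lock:                 # a single instruction on real hardware
            if self._value == expected:
                self._value = new
                return True
            return False                 # another thread modified it: caller retries

    def get(self) -> int:
        return self._value

def increment(cell: AtomicInt) -> None:
    """Lock-free-style retry loop, like AtomicInteger.incrementAndGet()."""
    while True:
        current = cell.get()
        if cell.compare_and_swap(current, current + 1):
            return

cell = AtomicInt(0)
increment(cell)
print(cell.get())  # 1
```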
ABA Problem: Thread reads value A, another thread changes it to B then back to A, and the first thread's CAS succeeds even though the value was modified. Solution: tagged pointers (append a version counter to the value).
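The versioned-CAS fix can be sketched in Python (single-threaded, for illustration; `VersionedCell` is a made-up name):

```python
class VersionedCell:
    """Value paired with a version counter to defeat ABA."""
    def __init__(self, value):
        self.value, self.version = value, 0

    def compare_and_swap(self, expected_value, expected_version, new_value) -> bool:
        # Single-threaded sketch; a real implementation does this atomically.
        if (self.value, self.version) == (expected_value, expected_version):
            self.value, self.version = new_value, self.version + 1
            return True
        return False

cell = VersionedCell("A")
snap_value, snap_version = cell.value, cell.version   # thread 1 reads A (version 0)
cell.compare_and_swap("A", 0, "B")                    # thread 2: A -> B
cell.compare_and_swap("B", 1, "A")                    # thread 2: B -> A (version now 2)
# Comparing the value alone would succeed here, but the version exposes the change:
print(cell.compare_and_swap(snap_value, snap_version, "C"))  # False: ABA detected
```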
Deadlock
A deadlock occurs when two or more threads are waiting for each other to release locks, and none can proceed.
Four Necessary Conditions (Coffman, 1971)
All four must hold simultaneously for deadlock to occur:
- Mutual exclusion: Resources cannot be shared
- Hold and wait: A thread holds one lock while waiting for another
- No preemption: Locks cannot be forcibly taken from a thread
- Circular wait: Thread A waits for B, B waits for A
The Dining Philosophers Problem:
Philosopher 0 Philosopher 1
│ │
▼ ▼
[Fork 0] ◄──────── [Fork 1]
│ │
│ Philosopher 2 │
│ │ │
└────► [Fork 2] ◄───┘
Each philosopher picks up left fork, then tries right fork.
If all pick up left fork simultaneously → deadlock.
Nobody can get their right fork because their neighbor holds it.
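The standard fix is the lock-ordering strategy covered in the next section: number the forks and always pick up the lower-numbered one first, which breaks the circular wait. A runnable Python sketch:

```python
import threading

N = 3
forks = [threading.Lock() for _ in range(N)]
meals = [0] * N

def philosopher(i: int, rounds: int) -> None:
    left, right = i, (i + 1) % N
    # Global order: always acquire the lower-numbered fork first.
    first, second = min(left, right), max(left, right)
    for _ in range(rounds):
        with forks[first]:
            with forks[second]:
                meals[i] += 1  # eating

threads = [threading.Thread(target=philosopher, args=(i, 100)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(meals)  # [100, 100, 100]: everyone eats, no deadlock
```

Without the `min`/`max` ordering (each philosopher grabbing their left fork first), this program can hang forever.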
Prevention Strategies
| Strategy | Break Which Condition | Implementation |
|---|---|---|
| Lock ordering | Circular wait | Always acquire locks in a consistent global order (e.g., by resource ID) |
| Timeout | Hold and wait | Use tryLock(timeout) — release all locks if timeout expires |
| Try-lock | Hold and wait | tryLock() returns false if lock unavailable — back off and retry |
| Single lock | Hold and wait | Use one coarse-grained lock instead of multiple fine-grained locks |
```go
// Go: lock ordering to prevent deadlock
func Transfer(from, to *Account, amount int) {
    // Always lock the account with the lower ID first
    first, second := from, to
    if from.ID > to.ID {
        first, second = to, from
    }
    first.mu.Lock()
    defer first.mu.Unlock()
    second.mu.Lock()
    defer second.mu.Unlock()

    from.Balance -= amount
    to.Balance += amount
}
```
Deadlock Detection
Build a wait-for graph: nodes are threads, edges point from a waiting thread to the thread holding the desired lock. A cycle in this graph means deadlock.
Wait-for graph:
Thread A ──waits-for──► Thread B
▲ │
│ │
└────waits-for──────────┘
Cycle detected → deadlock!
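Detection then reduces to finding a cycle in that graph. A minimal Python sketch, assuming each thread waits on at most one lock (so every node has at most one outgoing edge):

```python
def has_deadlock(wait_for: dict[str, str]) -> bool:
    """wait_for maps a blocked thread to the thread holding the lock it wants.
    With out-degree one, follow each chain and check for a repeated node."""
    for start in wait_for:
        seen = set()
        node = start
        while node in wait_for:
            if node in seen:
                return True        # revisited a node: cycle, so deadlock
            seen.add(node)
            node = wait_for[node]
    return False

print(has_deadlock({"A": "B", "B": "A"}))  # True
print(has_deadlock({"A": "B", "B": "C"}))  # False: the chain ends at C
```

Databases run essentially this check on their internal lock tables and abort one transaction in the cycle as the deadlock victim.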
Optimistic vs. Pessimistic Locking
Pessimistic Locking
Lock the resource before reading. No other transaction can modify it until you are done.
```sql
-- Pessimistic: lock the row before reading
BEGIN;
SELECT balance FROM accounts WHERE id = 42 FOR UPDATE;
-- Row is now locked — other transactions block here
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
COMMIT;
```
Optimistic Locking
Read without locking, check for conflicts on write using a version column.
```sql
-- Optimistic: read the version, check it on write
SELECT balance, version FROM accounts WHERE id = 42;
-- balance=500, version=3

-- Later, attempt the update:
UPDATE accounts
SET balance = 400, version = 4
WHERE id = 42 AND version = 3;
-- If 0 rows affected → version changed → conflict! Retry.
```
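In application code that UPDATE sits inside a retry loop. An in-memory Python sketch (the `row` dict and helper names stand in for the database):

```python
row = {"id": 42, "balance": 500, "version": 3}  # stands in for the DB row

def update_if_version(row: dict, expected_version: int, new_balance: int) -> bool:
    """Mimics: UPDATE ... SET balance=?, version=version+1 WHERE version=?"""
    if row["version"] != expected_version:
        return False                             # 0 rows affected: conflict
    row["balance"], row["version"] = new_balance, expected_version + 1
    return True

def withdraw(row: dict, amount: int, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        balance, version = row["balance"], row["version"]  # optimistic read
        if balance < amount:
            return False
        if update_if_version(row, version, balance - amount):
            return True                          # our write won the race
        # Conflict: someone else committed first. Re-read and retry.
    return False

print(withdraw(row, 100), row["balance"], row["version"])  # True 400 4
```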
When to Use Each
| Scenario | Best Choice | Why |
|---|---|---|
| High contention (many writers on same row) | Pessimistic | Retries are expensive when conflicts are frequent |
| Low contention (rare conflicts) | Optimistic | No lock overhead; conflicts are rare so retries are cheap |
| Long transactions | Pessimistic | Holding a version for minutes increases conflict probability |
| Short, fast transactions | Optimistic | Minimal window for conflicts |
| Read-heavy workload | Optimistic | No locks on reads — maximum throughput |
Lock-Free Data Structures
Lock-free algorithms guarantee that at least one thread makes progress in a finite number of steps (even if other threads are stalled). They use CAS operations instead of locks.
Common examples:
- Atomic counters: `AtomicInteger.incrementAndGet()` in Java, `atomic.AddInt64()` in Go
- Lock-free queues: the Michael-Scott queue (used in Java's `ConcurrentLinkedQueue`)
- Lock-free hash maps: split into segments with independent CAS operations (`ConcurrentHashMap` in Java)
Language-Specific Concurrency
Go: Goroutines and Channels
Go's concurrency model is based on CSP (Communicating Sequential Processes): share memory by communicating, not by shared state.
```go
func FanOut(urls []string) []Result {
    ch := make(chan Result, len(urls)) // buffered channel
    var wg sync.WaitGroup
    for _, url := range urls {
        wg.Add(1)
        go func(u string) { // goroutine: lightweight (~2KB initial stack)
            defer wg.Done()
            ch <- fetch(u)
        }(url)
    }
    // Close the channel when all goroutines complete
    go func() {
        wg.Wait()
        close(ch)
    }()
    var results []Result
    for r := range ch { // range over the channel until closed
        results = append(results, r)
    }
    return results
}
```
Key primitives:
- `go func()`: spawns a goroutine (~2KB initial stack, grows as needed; millions possible)
- `chan T`: typed channel for communication between goroutines
- `select`: multiplexes across multiple channels (like epoll for channels)
- `sync.WaitGroup`: waits for a collection of goroutines to finish
- `errgroup.Group`: like WaitGroup but propagates the first error
Java: Virtual Threads (Java 21+)
Java 21 introduced virtual threads — lightweight threads managed by the JVM, not the OS.
```java
// Virtual threads: millions possible (unlike platform threads, ~1MB each)
try (var executor = Executors.newVirtualThreadPerTaskExecutor()) {
    List<Future<String>> futures = urls.stream()
        .map(url -> executor.submit(() -> fetch(url)))
        .toList();
    List<String> results = new ArrayList<>();
    for (var future : futures) {
        results.add(future.get()); // blocks the virtual thread, not an OS thread
    }
}

// CompletableFuture: composable async operations
CompletableFuture<String> future = CompletableFuture
    .supplyAsync(() -> fetchUser(userId))
    .thenApply(user -> enrichProfile(user))
    .thenApply(profile -> serialize(profile))
    .exceptionally(ex -> fallbackResponse(ex));
```
Key primitives:
- Virtual threads: `Thread.ofVirtual().start(runnable)` or `Executors.newVirtualThreadPerTaskExecutor()`
- `CompletableFuture`: composable async with `thenApply`, `thenCompose`, `allOf`
- `ConcurrentHashMap`: thread-safe hash map (segment locking before Java 8; per-bin CAS and locking since)
- `synchronized` vs `ReentrantLock`: `ReentrantLock` adds `tryLock()`, fairness, and `Condition` variables
Python: The GIL and Workarounds
CPython's Global Interpreter Lock (GIL) allows only one thread to execute Python bytecode at a time, so threads do not provide parallelism for CPU-bound work.
```python
import asyncio
import concurrent.futures

import aiohttp  # third-party

# I/O-bound: use asyncio (cooperative multitasking, single thread)
async def fetch_all(urls: list[str]) -> list[str]:
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        return await asyncio.gather(*tasks)  # concurrent I/O

# CPU-bound: use multiprocessing (separate processes, no GIL)
def process_all(items: list[dict]) -> list[dict]:
    with concurrent.futures.ProcessPoolExecutor() as executor:
        return list(executor.map(heavy_computation, items))

# I/O-bound: threading also works (the GIL is released during I/O waits)
def download_all(urls: list[str]) -> list[bytes]:
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        return list(executor.map(download, urls))
```
| Workload | Best Approach | Why |
|---|---|---|
| I/O-bound (HTTP, DB) | `asyncio` or `threading` | GIL released during I/O waits |
| CPU-bound (computation) | `multiprocessing` | Separate processes bypass the GIL |
| Mixed | `asyncio` + `ProcessPoolExecutor` | Async I/O with a process pool for CPU work |
Connection Pooling
Opening a database connection is expensive (TCP handshake, TLS, authentication). Connection pools maintain a set of reusable connections.
Sizing Formula
```
connections = (CPU_cores * 2) + effective_spindle_count
```

- For SSDs, the effective spindle count is typically 1
- A 4-core server: (4 * 2) + 1 = 9 connections
- More connections than this causes context-switching overhead and contention on the database
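The arithmetic as a small helper (a sketch; `pool_size` is our name, not any pooler's API):

```python
import os

def pool_size(cpu_cores: int, effective_spindles: int = 1) -> int:
    """Pool sizing heuristic: (cores * 2) + effective spindle count."""
    return cpu_cores * 2 + effective_spindles

print(pool_size(4))                     # 9: the 4-core example above
print(pool_size(os.cpu_count() or 1))   # a starting point for this machine
```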
Common mistake: setting the pool size to 100+ "for safety." This usually degrades performance: the database spends more time context-switching between connections than executing queries. The PostgreSQL community commonly recommends a small server-side pool fronted by an external pooler such as PgBouncer.
Pool Configuration
```yaml
# Typical HikariCP (Java) pool configuration
pool:
  minimum_idle: 5          # keep 5 idle connections warm
  maximum_pool_size: 10    # hard cap
  connection_timeout: 30s  # max wait for a free connection
  idle_timeout: 600s       # close idle connections after 10 min
  max_lifetime: 1800s      # replace connections every 30 min
```
Key decisions:
- min idle: Set to your baseline load to avoid cold-start latency
- max pool: Set using the formula above — resist the urge to increase
- connection timeout: Fail fast (30s) rather than queue indefinitely
- max lifetime: Rotate connections to prevent stale state and respect database-side timeouts
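The acquire/release cycle with a fail-fast timeout can be sketched with a bounded queue (a toy pool, not a real driver's API; the `connect` factory is a placeholder):

```python
import queue

class ConnectionPool:
    """Minimal pool sketch: pre-creates max_size connections, blocks on acquire."""
    def __init__(self, max_size: int, connect, timeout: float = 30.0):
        self._timeout = timeout
        self._idle = queue.Queue(maxsize=max_size)
        for _ in range(max_size):
            self._idle.put(connect())  # real pools create lazily up to min idle

    def acquire(self):
        # Fail fast instead of queueing indefinitely (connection_timeout above).
        return self._idle.get(timeout=self._timeout)

    def release(self, conn) -> None:
        self._idle.put(conn)

pool = ConnectionPool(max_size=2, connect=lambda: object(), timeout=0.1)
c1, c2 = pool.acquire(), pool.acquire()
try:
    pool.acquire()                     # pool exhausted
except queue.Empty:
    print("timed out waiting for a connection")
pool.release(c1)
```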
Next: Reliability & Observability — SLOs, circuit breakers, distributed tracing, and chaos engineering.