LLM Inference Fundamentals
KV Cache: The Memory Optimization Foundation
The Key-Value (KV) cache is the most important data structure in LLM inference. Understanding it unlocks all advanced optimization techniques.
Why KV Cache Exists
Without KV caching, each token generation would recompute attention for the entire sequence:
Without KV Cache (Inefficient):
─────────────────────────────
Token 1: Compute attention for [1] → 1 computation
Token 2: Compute attention for [1,2] → 2 computations
Token 3: Compute attention for [1,2,3] → 3 computations
Token 4: Compute attention for [1,2,3,4] → 4 computations
...
Token N: Compute attention for [1,2,...,N] → N computations
Total: 1+2+3+...+N = O(N²) computations
With KV Cache (Efficient):
─────────────────────────
Token 1: Compute K,V for [1], store in cache → 1 computation
Token 2: Use cached K,V, compute only for [2] → 1 computation
Token 3: Use cached K,V, compute only for [3] → 1 computation
Token 4: Use cached K,V, compute only for [4] → 1 computation
...
Token N: Use cached K,V, compute only for [N] → 1 computation
Total: N computations = O(N)
Impact: KV caching reduces K/V computation from O(N²) to O(N) for sequence length N. Each decode step still attends over all cached tokens, but each token's K and V are computed exactly once instead of once per step.
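To make the arithmetic concrete, here is a quick back-of-the-envelope sketch in plain Python (the variable names and the 4096-token length are chosen to match the example used later in this section):
# Total K/V computations over a full generation of N tokens
N = 4096
without_cache = N * (N + 1) // 2   # 1 + 2 + ... + N
with_cache = N                     # one K/V computation per token
print(without_cache // with_cache) # ~2048x fewer K/V computations at N=4096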
How KV Cache Works
In transformer attention, each token produces Key (K) and Value (V) vectors:
# Attention mechanism
# Q = Query (from current token)
# K = Key (from all tokens)
# V = Value (from all tokens)
# Attention(Q, K, V) = softmax(Q @ K.T / sqrt(d_k)) @ V
# During decode, instead of recomputing K,V for all tokens:
# 1. Retrieve cached K,V from previous tokens
# 2. Compute K,V only for new token
# 3. Concatenate: K_full = [K_cached, K_current] (and likewise V_full)
# 4. Compute attention with the full K_full, V_full
# 5. Store K_current, V_current in cache
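The steps above map directly onto a few lines of tensor code. Below is a minimal single-layer sketch in PyTorch; the function and weight names (decode_step, W_q, W_k, W_v) are illustrative assumptions, not any particular framework's API:
import torch
import torch.nn.functional as F

def decode_step(x_new, W_q, W_k, W_v, cache_k, cache_v):
    """One decode step with a KV cache. x_new: [batch, 1, d_model]."""
    q = x_new @ W_q                                 # query for the new token only
    k = x_new @ W_k                                 # step 2: K for the new token only
    v = x_new @ W_v
    cache_k = torch.cat([cache_k, k], dim=1)        # steps 3/5: grow the cache
    cache_v = torch.cat([cache_v, v], dim=1)
    d_k = q.shape[-1]
    scores = q @ cache_k.transpose(-2, -1) / d_k ** 0.5   # step 4: attend over full K
    out = F.softmax(scores, dim=-1) @ cache_v
    return out, cache_k, cache_v

# Usage with hypothetical sizes: the cache starts empty with shape [batch, 0, d_model]
d_model = 64
W_q = W_k = W_v = torch.eye(d_model)
cache_k = cache_v = torch.zeros(1, 0, d_model)
x = torch.randn(1, 1, d_model)
out, cache_k, cache_v = decode_step(x, W_q, W_k, W_v, cache_k, cache_v)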
KV Cache Memory Requirements
KV cache memory grows with:
- Sequence length
- Batch size
- Number of layers
- Number of attention heads
- Head dimension
# KV Cache Memory Formula
# memory = 2 (K and V) × batch × layers × kv_heads × seq_len × head_dim × bytes_per_value
# Example: Llama 3.3 70B with 80 layers, 64 attention heads, 128 head_dim,
# counting all 64 heads as KV heads (Llama 3.3 actually uses GQA with 8 KV heads; see below)
# For batch=1, seq_len=4096, FP16 (2 bytes):
memory_bytes = 2 * 1 * 80 * 64 * 4096 * 128 * 2
memory_gb = memory_bytes / 1e9
# ≈ 10.7 GB just for KV cache!
# For batch=32:
# ≈ 343 GB - exceeds most GPU memory
Key insight: KV cache often consumes more memory than model weights during inference.
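The formula is easy to wrap in a small helper so you can plug in your own model configuration. The function below is a hypothetical utility written for this section, not part of any serving framework:
def kv_cache_bytes(batch, layers, kv_heads, seq_len, head_dim, bytes_per_value=2):
    # The leading factor of 2 accounts for storing both K and V
    return 2 * batch * layers * kv_heads * seq_len * head_dim * bytes_per_value

# The 70B example above (no GQA, FP16):
print(kv_cache_bytes(1, 80, 64, 4096, 128) / 1e9)   # ~10.7 GB
# The same model with 8 KV heads (GQA, covered below):
print(kv_cache_bytes(1, 80, 8, 4096, 128) / 1e9)    # ~1.3 GB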
Memory Fragmentation Problem
Traditional KV cache allocation suffers from fragmentation:
┌──────────────────────────────────────────────────────────┐
│ GPU MEMORY (Fragmented) │
├──────────────────────────────────────────────────────────┤
│ │
│ [Request A: 512 tokens] [EMPTY] [Request B: 256] │
│ ████████████████████████ ░░░░░ ████████████ │
│ │
│ [EMPTY] [Request C: 128] [EMPTY] [Request D: 64] │
│ ░░░░░░ ████████ ░░░░░░ ████ │
│ │
│ Problem: 30% memory wasted in gaps │
│ New 300-token request won't fit despite space │
│ │
└──────────────────────────────────────────────────────────┘
This fragmentation limits batch sizes and throughput.
PagedAttention Solution
PagedAttention (introduced by vLLM) treats KV cache like OS virtual memory:
┌──────────────────────────────────────────────────────────┐
│ PagedAttention MEMORY MANAGEMENT │
├──────────────────────────────────────────────────────────┤
│ │
│ Physical Blocks (Fixed Size, e.g., 16 tokens each): │
│ ┌────┬────┬────┬────┬────┬────┬────┬────┐ │
│ │ B0 │ B1 │ B2 │ B3 │ B4 │ B5 │ B6 │ B7 │ │
│ └────┴────┴────┴────┴────┴────┴────┴────┘ │
│ │
│ Request A (512 tokens) → Blocks [0,1,2,3,...] │
│ Request B (256 tokens) → Blocks [32,33,34,...] │
│ Request C (128 tokens) → Blocks [48,49,...] │
│ │
│ Block Table (Virtual → Physical mapping): │
│ Request A: [0,1,2,...,31] │
│ Request B: [32,33,34,...,47] │
│ Request C: [48,49,...,55] │
│ │
│ Benefits: │
│ • Near-zero memory waste │
│ • Blocks allocated on-demand │
│ • Easy memory sharing for beam search │
│ • 2-4x more requests in same memory │
│ │
└──────────────────────────────────────────────────────────┘
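To make the block-table idea concrete, here is a toy allocator in Python. It only tracks bookkeeping (which physical blocks back which request); it is a sketch in the spirit of PagedAttention, not vLLM's actual allocator:
BLOCK_SIZE = 16  # tokens per physical block

class BlockAllocator:
    """Toy block-table bookkeeping, illustrative only."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                     # request_id -> list of physical block ids

    def allocate(self, request_id, num_tokens):
        needed = -(-num_tokens // BLOCK_SIZE)      # ceil division: blocks for the prompt
        if needed > len(self.free_blocks):
            raise MemoryError("not enough free blocks")
        self.block_tables[request_id] = [self.free_blocks.pop() for _ in range(needed)]

    def append_token(self, request_id, token_index):
        # A new block is needed only when the previous one is exactly full (on-demand growth)
        if token_index % BLOCK_SIZE == 0:
            self.block_tables[request_id].append(self.free_blocks.pop())

    def free(self, request_id):
        # Returning blocks to the free list leaves no fragmentation gaps
        self.free_blocks.extend(self.block_tables.pop(request_id))

allocator = BlockAllocator(num_blocks=64)
allocator.allocate("request_A", num_tokens=512)    # 32 blocks
allocator.allocate("request_B", num_tokens=256)    # 16 blocks
print(len(allocator.free_blocks))                  # 16 blocks remain for new requests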
KV Cache Optimization Techniques
1. Grouped-Query Attention (GQA)
Reduces KV cache by sharing K,V across query heads:
# Multi-Head Attention (MHA): 64 heads → 64 K,V pairs
# Grouped-Query Attention (GQA): 64 heads → 8 K,V pairs (8 groups)
# Memory reduction: 8x smaller KV cache
# Llama 3 uses GQA with 8 KV heads
# 70B model: 10.7 GB → 1.3 GB KV cache for same sequence
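The memory saving comes from caching only the 8 KV heads; at attention time each cached head is broadcast to its group of query heads. A minimal shape sketch in PyTorch, with small dummy sizes chosen only for illustration:
import torch

n_q_heads, n_kv_heads, head_dim, seq_len = 64, 8, 128, 16   # Llama-3-style head ratio, tiny seq_len
group = n_q_heads // n_kv_heads                              # 8 query heads share each KV head

q = torch.randn(1, n_q_heads, seq_len, head_dim)
k = torch.randn(1, n_kv_heads, seq_len, head_dim)            # only 8 heads are cached
v = torch.randn(1, n_kv_heads, seq_len, head_dim)

# Broadcast each cached KV head across its query-head group at compute time
k_expanded = k.repeat_interleave(group, dim=1)               # [1, 64, 16, 128]
scores = q @ k_expanded.transpose(-2, -1) / head_dim ** 0.5
print(q.numel() // k.numel())                                # 8: MHA would cache 8x more K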
2. Multi-Query Attention (MQA)
Extreme case: all query heads share single K,V:
# Multi-Query Attention: 64 heads → 1 K,V pair
# Memory reduction: 64x smaller KV cache
# Trade-off: Some quality degradation
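In terms of the hypothetical kv_cache_bytes helper sketched earlier, MQA is simply the kv_heads=1 case:
# MQA = GQA with a single KV head (reusing the kv_cache_bytes sketch from above)
print(kv_cache_bytes(1, 80, 1, 4096, 128) / 1e9)   # ~0.17 GB, vs ~10.7 GB with full MHA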
3. KV Cache Quantization
Reduce precision of cached values:
# FP16 KV cache: 2 bytes per value
# INT8 KV cache: 1 byte per value → 2x compression
# INT4 KV cache: 0.5 bytes per value → 4x compression
# Example: vLLM supports FP8 KV cache
# 70B model, 4096 context: 10.7 GB → 5.35 GB
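A simple way to see the 2x saving is symmetric INT8 quantization of the cached tensors. The snippet below is an illustrative sketch (per-tensor scaling, hypothetical function names), not vLLM's FP8 implementation:
import torch

def quantize_kv_int8(x):
    """Symmetric per-tensor INT8 quantization of a K or V cache tensor."""
    scale = x.abs().amax().float() / 127.0
    q = (x.float() / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv_int8(q, scale):
    return (q.float() * scale).to(torch.float16)

k_cache = torch.randn(1, 8, 4096, 128, dtype=torch.float16)
k_q, k_scale = quantize_kv_int8(k_cache)
print(k_cache.element_size(), "->", k_q.element_size())   # 2 bytes -> 1 byte per value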
4. KV Cache Compression
Dynamic compression based on attention patterns:
# Entropy-guided approach (2025 research):
# - Identify "important" tokens via attention entropy
# - Keep full precision for important tokens
# - Compress or evict less important tokens
# - Results: 4.18% memory reduction, preserved quality
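As a rough illustration of the keep-important / drop-the-rest idea, here is a simplified heuristic that uses accumulated attention mass as the importance score; it is not the entropy-based criterion from the cited work, and the function name and shapes are assumptions:
import torch

def evict_low_importance(k_cache, v_cache, attn_weights, keep_fraction=0.25):
    """k_cache, v_cache: [seq_len, head_dim]; attn_weights: [num_queries, seq_len]."""
    importance = attn_weights.sum(dim=0)                   # attention mass each token received
    keep = max(1, int(keep_fraction * k_cache.shape[0]))
    kept = importance.topk(keep).indices.sort().values     # keep the most-attended tokens, in order
    return k_cache[kept], v_cache[kept], kept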
Measuring KV Cache Efficiency
Key metrics:
# KV Cache Hit Rate (for prefix caching)
hit_rate = cached_tokens / total_tokens
# Memory Utilization
utilization = used_kv_memory / allocated_kv_memory
# Batch efficiency (with PagedAttention): fraction of the theoretical max batch achieved
batch_efficiency = actual_batch_size / theoretical_max_batch
# Targets: >90% memory utilization, batch efficiency close to 1.0
Understanding the KV cache is foundational: every optimization technique that follows builds on it.
Next, we'll explore batching strategies that maximize GPU utilization.