LLM Inference Fundamentals
KV Cache: The Memory Optimization Foundation
The Key-Value (KV) cache is the most important data structure in LLM inference. Understanding it unlocks all advanced optimization techniques.
Why KV Cache Exists
Without KV caching, each token generation would recompute attention for the entire sequence:
Without KV Cache (Inefficient):
─────────────────────────────
Token 1: Compute attention for [1] → 1 computation
Token 2: Compute attention for [1,2] → 2 computations
Token 3: Compute attention for [1,2,3] → 3 computations
Token 4: Compute attention for [1,2,3,4] → 4 computations
...
Token N: Compute attention for [1,2,...,N] → N computations
Total: 1+2+3+...+N = O(N²) computations
With KV Cache (Efficient):
─────────────────────────
Token 1: Compute K,V for [1], store in cache → 1 computation
Token 2: Use cached K,V, compute only for [2] → 1 computation
Token 3: Use cached K,V, compute only for [3] → 1 computation
Token 4: Use cached K,V, compute only for [4] → 1 computation
...
Token N: Use cached K,V, compute only for [N] → 1 computation
Total: N computations = O(N)
Impact: KV caching reduces K/V computation from O(N²) to O(N) for sequence length N. Each decode step still attends over all cached tokens, but each token's K and V are computed exactly once instead of once per step.
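To make the arithmetic concrete, here is a quick back-of-the-envelope sketch in plain Python (the variable names and the 4096-token length are chosen to match the example used later in this section):
# Total K/V computations over a full generation of N tokens
N = 4096
without_cache = N * (N + 1) // 2   # 1 + 2 + ... + N
with_cache = N                     # one K/V computation per token
print(without_cache // with_cache) # ~2048x fewer K/V computations at N=4096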
How KV Cache Works
In transformer attention, each token produces Key (K) and Value (V) vectors:
# Attention mechanism
# Q = Query (from current token)
# K = Key (from all tokens)
# V = Value (from all tokens)
# Attention(Q, K, V) = softmax(Q @ K.T / sqrt(d_k)) @ V
# During decode, instead of recomputing K,V for all tokens:
# 1. Retrieve cached K,V from previous tokens
# 2. Compute K,V only for new token
# 3. Concatenate: K_full = [K_cached, K_current] (and likewise V_full)
# 4. Compute attention with the full K_full, V_full
# 5. Store K_current, V_current in cache
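The steps above map directly onto a few lines of tensor code. Below is a minimal single-layer sketch in PyTorch; the function and weight names (decode_step, W_q, W_k, W_v) are illustrative assumptions, not any particular framework's API:
import torch
import torch.nn.functional as F

def decode_step(x_new, W_q, W_k, W_v, cache_k, cache_v):
    """One decode step with a KV cache. x_new: [batch, 1, d_model]."""
    q = x_new @ W_q                                 # query for the new token only
    k = x_new @ W_k                                 # step 2: K for the new token only
    v = x_new @ W_v
    cache_k = torch.cat([cache_k, k], dim=1)        # steps 3/5: grow the cache
    cache_v = torch.cat([cache_v, v], dim=1)
    d_k = q.shape[-1]
    scores = q @ cache_k.transpose(-2, -1) / d_k ** 0.5   # step 4: attend over full K
    out = F.softmax(scores, dim=-1) @ cache_v
    return out, cache_k, cache_v

# Usage with hypothetical sizes: the cache starts empty with shape [batch, 0, d_model]
d_model = 64
W_q = W_k = W_v = torch.eye(d_model)
cache_k = cache_v = torch.zeros(1, 0, d_model)
x = torch.randn(1, 1, d_model)
out, cache_k, cache_v = decode_step(x, W_q, W_k, W_v, cache_k, cache_v)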
KV Cache Memory Requirements
KV cache memory grows with:
- Sequence length
- Batch size
- Number of layers
- Number of attention heads
- Head dimension
# KV Cache Memory Formula
# memory = 2 (K and V) × batch × layers × kv_heads × seq_len × head_dim × bytes_per_value
# Example: Llama 3.3 70B with 80 layers, 64 attention heads, 128 head_dim,
# counting all 64 heads as KV heads (Llama 3.3 actually uses GQA with 8 KV heads; see below)
# For batch=1, seq_len=4096, FP16 (2 bytes):
memory_bytes = 2 * 1 * 80 * 64 * 4096 * 128 * 2
memory_gb = memory_bytes / 1e9
# ≈ 10.7 GB just for KV cache!
# For batch=32:
# ≈ 343 GB - exceeds most GPU memory
Key insight: KV cache often consumes more memory than model weights during inference.
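The formula is easy to wrap in a small helper so you can plug in your own model configuration. The function below is a hypothetical utility written for this section, not part of any serving framework:
def kv_cache_bytes(batch, layers, kv_heads, seq_len, head_dim, bytes_per_value=2):
    # The leading factor of 2 accounts for storing both K and V
    return 2 * batch * layers * kv_heads * seq_len * head_dim * bytes_per_value

# The 70B example above (no GQA, FP16):
print(kv_cache_bytes(1, 80, 64, 4096, 128) / 1e9)   # ~10.7 GB
# The same model with 8 KV heads (GQA, covered below):
print(kv_cache_bytes(1, 80, 8, 4096, 128) / 1e9)    # ~1.3 GB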
Memory Fragmentation Problem
Traditional KV cache allocation suffers from fragmentation:
┌──────────────────────────────────────────────────────────┐
│ GPU MEMORY (Fragmented) │
├──────────────────────────────────────────────────────────┤
│ │
│ [Request A: 512 tokens] [EMPTY] [Request B: 256] │
│ ████████████████████████ ░░░░░ ████████████ │
│ │
│ [EMPTY] [Request C: 128] [EMPTY] [Request D: 64] │
│ ░░░░░░ ████████ ░░░░░░ ████ │
│ │
│ Problem: 30% memory wasted in gaps │
│ New 300-token request won't fit despite space │
│ │
└──────────────────────────────────────────────────────────┘
This fragmentation limits batch sizes and throughput.
PagedAttention Solution
PagedAttention (introduced by vLLM) treats KV cache like OS virtual memory:
┌──────────────────────────────────────────────────────────┐
│ PagedAttention MEMORY MANAGEMENT │
├──────────────────────────────────────────────────────────┤
│ │
│ Physical Blocks (Fixed Size, e.g., 16 tokens each): │
│ ┌────┬────┬────┬────┬────┬────┬────┬────┐ │
│ │ B0 │ B1 │ B2 │ B3 │ B4 │ B5 │ B6 │ B7 │ │
│ └────┴────┴────┴────┴────┴────┴────┴────┘ │
│ │
│ Request A (512 tokens) → Blocks [0,1,2,3,...] │
│ Request B (256 tokens) → Blocks [32,33,34,...] │
│ Request C (128 tokens) → Blocks [48,49,...] │
│ │
│ Block Table (Virtual → Physical mapping): │
│ Request A: [0,1,2,...,31] │
│ Request B: [32,33,34,...,47] │
│ Request C: [48,49,...,55] │
│ │
│ Benefits: │
│ • Near-zero memory waste │
│ • Blocks allocated on-demand │
│ • Easy memory sharing for beam search │
│ • 2-4x more requests in same memory │
│ │
└──────────────────────────────────────────────────────────┘
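To make the block-table idea concrete, here is a toy allocator in Python. It only tracks bookkeeping (which physical blocks back which request); it is a sketch in the spirit of PagedAttention, not vLLM's actual allocator:
BLOCK_SIZE = 16  # tokens per physical block

class BlockAllocator:
    """Toy block-table bookkeeping, illustrative only."""
    def __init__(self, num_blocks):
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                     # request_id -> list of physical block ids

    def allocate(self, request_id, num_tokens):
        needed = -(-num_tokens // BLOCK_SIZE)      # ceil division: blocks for the prompt
        if needed > len(self.free_blocks):
            raise MemoryError("not enough free blocks")
        self.block_tables[request_id] = [self.free_blocks.pop() for _ in range(needed)]

    def append_token(self, request_id, token_index):
        # A new block is needed only when the previous one is exactly full (on-demand growth)
        if token_index % BLOCK_SIZE == 0:
            self.block_tables[request_id].append(self.free_blocks.pop())

    def free(self, request_id):
        # Returning blocks to the free list leaves no fragmentation gaps
        self.free_blocks.extend(self.block_tables.pop(request_id))

allocator = BlockAllocator(num_blocks=64)
allocator.allocate("request_A", num_tokens=512)    # 32 blocks
allocator.allocate("request_B", num_tokens=256)    # 16 blocks
print(len(allocator.free_blocks))                  # 16 blocks remain for new requests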
KV Cache Optimization Techniques
1. Grouped-Query Attention (GQA)
Reduces KV cache by sharing K,V across query heads:
# Multi-Head Attention (MHA): 64 heads → 64 K,V pairs
# Grouped-Query Attention (GQA): 64 heads → 8 K,V pairs (8 groups)
# Memory reduction: 8x smaller KV cache
# Llama 3 uses GQA with 8 KV heads
# 70B model: 10.7 GB → 1.3 GB KV cache for same sequence
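The memory saving comes from caching only the 8 KV heads; at attention time each cached head is broadcast to its group of query heads. A minimal shape sketch in PyTorch, with small dummy sizes chosen only for illustration:
import torch

n_q_heads, n_kv_heads, head_dim, seq_len = 64, 8, 128, 16   # Llama-3-style head ratio, tiny seq_len
group = n_q_heads // n_kv_heads                              # 8 query heads share each KV head

q = torch.randn(1, n_q_heads, seq_len, head_dim)
k = torch.randn(1, n_kv_heads, seq_len, head_dim)            # only 8 heads are cached
v = torch.randn(1, n_kv_heads, seq_len, head_dim)

# Broadcast each cached KV head across its query-head group at compute time
k_expanded = k.repeat_interleave(group, dim=1)               # [1, 64, 16, 128]
scores = q @ k_expanded.transpose(-2, -1) / head_dim ** 0.5
print(q.numel() // k.numel())                                # 8: MHA would cache 8x more K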
2. Multi-Query Attention (MQA)
Extreme case: all query heads share single K,V:
# Multi-Query Attention: 64 heads → 1 K,V pair
# Memory reduction: 64x smaller KV cache
# Trade-off: Some quality degradation
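In terms of the hypothetical kv_cache_bytes helper sketched earlier, MQA is simply the kv_heads=1 case:
# MQA = GQA with a single KV head (reusing the kv_cache_bytes sketch from above)
print(kv_cache_bytes(1, 80, 1, 4096, 128) / 1e9)   # ~0.17 GB, vs ~10.7 GB with full MHA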
3. KV Cache Quantization
Reduce precision of cached values:
# FP16 KV cache: 2 bytes per value
# INT8 KV cache: 1 byte per value → 2x compression
# INT4 KV cache: 0.5 bytes per value → 4x compression
# Example: vLLM supports FP8 KV cache
# 70B model, 4096 context: 10.7 GB → 5.35 GB
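A simple way to see the 2x saving is symmetric INT8 quantization of the cached tensors. The snippet below is an illustrative sketch (per-tensor scaling, hypothetical function names), not vLLM's FP8 implementation:
import torch

def quantize_kv_int8(x):
    """Symmetric per-tensor INT8 quantization of a K or V cache tensor."""
    scale = x.abs().amax().float() / 127.0
    q = (x.float() / scale).round().clamp(-127, 127).to(torch.int8)
    return q, scale

def dequantize_kv_int8(q, scale):
    return (q.float() * scale).to(torch.float16)

k_cache = torch.randn(1, 8, 4096, 128, dtype=torch.float16)
k_q, k_scale = quantize_kv_int8(k_cache)
print(k_cache.element_size(), "->", k_q.element_size())   # 2 bytes -> 1 byte per value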
4. KV Cache Compression
Dynamic compression based on attention patterns:
# Entropy-guided approach (2025 research):
# - Identify "important" tokens via attention entropy
# - Keep full precision for important tokens
# - Compress or evict less important tokens
# - Results: 4.18% memory reduction, preserved quality
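As a rough illustration of the keep-important / drop-the-rest idea, here is a simplified heuristic that uses accumulated attention mass as the importance score; it is not the entropy-based criterion from the cited work, and the function name and shapes are assumptions:
import torch

def evict_low_importance(k_cache, v_cache, attn_weights, keep_fraction=0.25):
    """k_cache, v_cache: [seq_len, head_dim]; attn_weights: [num_queries, seq_len]."""
    importance = attn_weights.sum(dim=0)                   # attention mass each token received
    keep = max(1, int(keep_fraction * k_cache.shape[0]))
    kept = importance.topk(keep).indices.sort().values     # keep the most-attended tokens, in order
    return k_cache[kept], v_cache[kept], kept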
Measuring KV Cache Efficiency
Key metrics:
# KV Cache Hit Rate (for prefix caching)
hit_rate = cached_tokens / total_tokens
# Memory Utilization
utilization = used_kv_memory / allocated_kv_memory
# Batch efficiency (with PagedAttention): fraction of the theoretical max batch achieved
batch_efficiency = actual_batch_size / theoretical_max_batch
# Targets: >90% memory utilization, batch efficiency close to 1.0
Understanding the KV cache is foundational: every optimization technique that follows builds on it.
Next, we'll explore batching strategies that maximize GPU utilization.