LLM Inference Fundamentals
The LLM Inference Pipeline
Understanding how LLMs generate text is essential for optimizing production systems. Let's explore the inference pipeline that transforms a prompt into a response.
Two-Phase Generation
LLM inference consists of two distinct phases:
                      LLM INFERENCE PIPELINE

  ┌─────────────────┐          ┌─────────────────────────────────┐
  │     PREFILL     │          │     DECODE (Autoregressive)     │
  │ (Compute Bound) │   ───►   │          (Memory Bound)         │
  └─────────────────┘          └─────────────────────────────────┘

  • Process entire prompt      • Generate tokens one-by-one
    at once                    • Reuse KV cache from prefill
  • Build KV cache             • Sequential, memory-limited
  • Parallel compute           • Dominates total latency
Prefill Phase
The prefill phase processes your entire input prompt in parallel:
# Prefill: Process all input tokens simultaneously
# For prompt "What is machine learning?"
# Tokens: [What, is, machine, learning, ?]
# All 5 tokens processed in single forward pass
# Generates Key-Value (KV) cache for each layer
# Compute-bound: GPU utilization is high
Characteristics:
- Time to First Token (TTFT): Measures prefill latency
- Scales with prompt length (longer prompts = longer TTFT)
- GPU compute is the bottleneck
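To make this concrete, here is a minimal sketch of the prefill step using the Hugging Face `transformers` library; the model name (`gpt2`) and the specific API usage are illustrative assumptions, not requirements of the pipeline:

```python
# Sketch: prefill with Hugging Face transformers (illustrative; assumes a small
# causal LM such as "gpt2" that returns past_key_values when use_cache=True)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("What is machine learning?", return_tensors="pt")

with torch.no_grad():
    # One forward pass over all prompt tokens: compute-bound,
    # and it builds the per-layer Key-Value (KV) cache in one shot
    out = model(**inputs, use_cache=True)

kv_cache = out.past_key_values           # one (K, V) pair per layer
next_token_logits = out.logits[:, -1]    # distribution for the first output token
```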
Decode Phase
After prefill, tokens are generated one at a time:
# Decode: Generate tokens sequentially
# Step 1: Generate "Machine"
# Step 2: Generate "learning"
# Step 3: Generate "is"
# ... continues until EOS token
# Each step:
# 1. Compute attention using cached KV values
# 2. Generate next token probability distribution
# 3. Sample next token
# 4. Update KV cache with new token's K,V
Characteristics:
- Time Per Output Token (TPOT): Measures decode speed
- Memory-bound: Each step is dominated by reading model weights and the KV cache from GPU memory
- Sequential: Each token depends on all previous tokens, so decode steps cannot be parallelized across time (though multiple requests can be batched)
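Continuing the prefill sketch above, a hedged illustration of the decode loop: only the newest token is fed back in, while attention reads everything else from the cached K,V. Greedy decoding is used here purely for brevity; production systems usually sample.

```python
# Sketch: autoregressive decode reusing the KV cache from the prefill sketch above
# (API details and greedy decoding are illustrative simplifications)
generated = []
next_token = next_token_logits.argmax(dim=-1, keepdim=True)  # greedy for brevity

with torch.no_grad():
    while len(generated) < 64 and next_token.item() != tokenizer.eos_token_id:
        generated.append(next_token.item())
        # Feed only the newest token; attention reads earlier positions from the cache
        out = model(input_ids=next_token, past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values   # the cache grows by one position per step
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

print(tokenizer.decode(generated))
```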
Autoregressive Generation
LLMs generate text autoregressively—each new token depends on all previous tokens:
Input: "The capital of France is"
│
▼
┌───────────────┐
│ LLM Forward │
│ Pass │
└───────────────┘
│
┌───────────┴───────────┐
▼ ▼
P("Paris") = 0.95 P("Lyon") = 0.03
│
▼ (Sample)
"Paris" selected
│
▼
Input: "The capital of France is Paris"
│
▼
┌───────────────┐
│ LLM Forward │
│ Pass │
└───────────────┘
│
▼
Next token generated...
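The "(Sample)" step is simply a draw from the softmax of the model's output logits. A self-contained sketch with a toy four-token vocabulary and hand-picked logits, chosen so the probabilities roughly match the diagram:

```python
# Sketch: sampling the next token from a softmax over logits
# (toy vocabulary and made-up logits, for illustration only)
import torch

vocab = ["Paris", "Lyon", "Nice", "Rome"]
logits = torch.tensor([[5.2, 1.8, 0.9, 0.5]])       # hypothetical model output
probs = torch.softmax(logits, dim=-1)                # ~[0.95, 0.03, 0.01, 0.01]
token_id = torch.multinomial(probs, num_samples=1)   # stochastic draw
print(vocab[token_id.item()])                        # most often "Paris"
```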
Why This Architecture Matters
The two-phase design creates optimization opportunities:
| Phase | Bottleneck | Optimization Strategy |
|---|---|---|
| Prefill | Compute | Tensor parallelism, FlashAttention |
| Decode | Memory bandwidth | KV cache optimization, batching |
Production insight: most end-to-end latency comes from the decode phase. A 1000-token response requires 1000 sequential forward passes; no amount of raw GPU speed removes the requirement that those passes run one after another.
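A rough worked example with hypothetical numbers (300 ms of prefill, 20 ms per decoded token) shows why decode dominates:

```python
# Hypothetical numbers, chosen only to illustrate the point
ttft = 0.300                # 300 ms of prefill for the prompt
tpot = 0.020                # 20 ms per generated token
num_output_tokens = 1000

decode_time = tpot * num_output_tokens   # 20.0 s of sequential decode steps
e2e_latency = ttft + decode_time         # 20.3 s total, dominated by decode
```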
Latency Metrics
Key metrics for production LLM systems:
# Time to First Token (TTFT)
# Time from request receipt to first token generated
ttft = prefill_time + scheduling_overhead
# Time Per Output Token (TPOT)
# Average time to generate each subsequent token
tpot = total_decode_time / num_output_tokens
# End-to-End Latency
e2e_latency = ttft + (tpot * num_output_tokens)
# Throughput
# Tokens generated per second across all requests
throughput = total_tokens_generated / total_time
Real-world targets (as of January 2026):
- TTFT: < 500ms for interactive applications
- TPOT: < 50ms for smooth streaming
- Throughput: 1000+ tokens/second per GPU for batch processing
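As a sketch of how these metrics can be measured in practice, the helper below times one streamed response; `stream_tokens` is a hypothetical callable standing in for whatever streaming client your serving stack provides:

```python
# Sketch: measuring TTFT, TPOT, and end-to-end latency around a streaming client.
# `stream_tokens` is a hypothetical callable that yields tokens as they arrive.
import time

def measure_latency(stream_tokens, prompt):
    start = time.perf_counter()
    first_token_time = None
    num_tokens = 0

    for _ in stream_tokens(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter()   # first token arrives => TTFT
        num_tokens += 1
    end = time.perf_counter()

    if first_token_time is None:
        raise RuntimeError("no tokens were streamed")

    ttft = first_token_time - start
    tpot = (end - first_token_time) / max(num_tokens - 1, 1)
    return {"ttft": ttft, "tpot": tpot, "e2e_latency": end - start}
```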
Understanding these fundamentals prepares you for the optimization techniques ahead.
Next, we'll explore the KV cache—the critical data structure that makes efficient generation possible.