LLM Inference Fundamentals
The LLM Inference Pipeline
Understanding how LLMs generate text is essential for optimizing production systems. Let's explore the inference pipeline that transforms a prompt into a response.
Two-Phase Generation
LLM inference consists of two distinct phases:
                      LLM INFERENCE PIPELINE

  ┌─────────────────┐          ┌─────────────────────────────────┐
  │     PREFILL     │          │     DECODE (Autoregressive)     │
  │ (Compute Bound) │   ───►   │          (Memory Bound)         │
  └─────────────────┘          └─────────────────────────────────┘

  • Process entire prompt      • Generate tokens one-by-one
    at once                    • Reuse KV cache from prefill
  • Build KV cache             • Sequential, memory-limited
  • Parallel compute           • Dominates total latency
Prefill Phase
The prefill phase processes your entire input prompt in parallel:
# Prefill: Process all input tokens simultaneously
# For prompt "What is machine learning?"
# Tokens: [What, is, machine, learning, ?]
# All 5 tokens processed in single forward pass
# Generates Key-Value (KV) cache for each layer
# Compute-bound: GPU utilization is high
Characteristics:
- Time to First Token (TTFT): Measures prefill latency
- Scales with prompt length (longer prompts = longer TTFT)
- GPU compute is the bottleneck
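To make this concrete, here is a minimal sketch of the prefill step using the Hugging Face `transformers` library; the model name (`gpt2`) and the specific API usage are illustrative assumptions, not requirements of the pipeline:

```python
# Sketch: prefill with Hugging Face transformers (illustrative; assumes a small
# causal LM such as "gpt2" that returns past_key_values when use_cache=True)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

inputs = tokenizer("What is machine learning?", return_tensors="pt")

with torch.no_grad():
    # One forward pass over all prompt tokens: compute-bound,
    # and it builds the per-layer Key-Value (KV) cache in one shot
    out = model(**inputs, use_cache=True)

kv_cache = out.past_key_values           # one (K, V) pair per layer
next_token_logits = out.logits[:, -1]    # distribution for the first output token
```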
Decode Phase
After prefill, tokens are generated one at a time:
# Decode: Generate tokens sequentially
# Step 1: Generate "Machine"
# Step 2: Generate "learning"
# Step 3: Generate "is"
# ... continues until EOS token
# Each step:
# 1. Compute attention using cached KV values
# 2. Generate next token probability distribution
# 3. Sample next token
# 4. Update KV cache with new token's K,V
Characteristics:
- Time Per Output Token (TPOT): Measures decode speed
- Memory-bound: Each step is dominated by reading model weights and the KV cache from GPU memory
- Sequential: Each token depends on all previous tokens, so decode steps cannot be parallelized across time (though multiple requests can be batched)
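Continuing the prefill sketch above, a hedged illustration of the decode loop: only the newest token is fed back in, while attention reads everything else from the cached K,V. Greedy decoding is used here purely for brevity; production systems usually sample.

```python
# Sketch: autoregressive decode reusing the KV cache from the prefill sketch above
# (API details and greedy decoding are illustrative simplifications)
generated = []
next_token = next_token_logits.argmax(dim=-1, keepdim=True)  # greedy for brevity

with torch.no_grad():
    while len(generated) < 64 and next_token.item() != tokenizer.eos_token_id:
        generated.append(next_token.item())
        # Feed only the newest token; attention reads earlier positions from the cache
        out = model(input_ids=next_token, past_key_values=kv_cache, use_cache=True)
        kv_cache = out.past_key_values   # the cache grows by one position per step
        next_token = out.logits[:, -1].argmax(dim=-1, keepdim=True)

print(tokenizer.decode(generated))
```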
Autoregressive Generation
LLMs generate text autoregressively—each new token depends on all previous tokens:
Input: "The capital of France is"
│
▼
┌───────────────┐
│ LLM Forward │
│ Pass │
└───────────────┘
│
┌───────────┴───────────┐
▼ ▼
P("Paris") = 0.95 P("Lyon") = 0.03
│
▼ (Sample)
"Paris" selected
│
▼
Input: "The capital of France is Paris"
│
▼
┌───────────────┐
│ LLM Forward │
│ Pass │
└───────────────┘
│
▼
Next token generated...
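The "(Sample)" step is simply a draw from the softmax of the model's output logits. A self-contained sketch with a toy four-token vocabulary and hand-picked logits, chosen so the probabilities roughly match the diagram:

```python
# Sketch: sampling the next token from a softmax over logits
# (toy vocabulary and made-up logits, for illustration only)
import torch

vocab = ["Paris", "Lyon", "Nice", "Rome"]
logits = torch.tensor([[5.2, 1.8, 0.9, 0.5]])       # hypothetical model output
probs = torch.softmax(logits, dim=-1)                # ~[0.95, 0.03, 0.01, 0.01]
token_id = torch.multinomial(probs, num_samples=1)   # stochastic draw
print(vocab[token_id.item()])                        # most often "Paris"
```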
Why This Architecture Matters
The two-phase design creates optimization opportunities:
| Phase | Bottleneck | Optimization Strategy |
|---|---|---|
| Prefill | Compute | Tensor parallelism, FlashAttention |
| Decode | Memory bandwidth | KV cache optimization, batching |
Production insight: most end-to-end latency comes from the decode phase. A 1000-token response requires 1000 sequential forward passes; no amount of raw GPU speed removes the requirement that those passes run one after another.
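A rough worked example with hypothetical numbers (300 ms of prefill, 20 ms per decoded token) shows why decode dominates:

```python
# Hypothetical numbers, chosen only to illustrate the point
ttft = 0.300                # 300 ms of prefill for the prompt
tpot = 0.020                # 20 ms per generated token
num_output_tokens = 1000

decode_time = tpot * num_output_tokens   # 20.0 s of sequential decode steps
e2e_latency = ttft + decode_time         # 20.3 s total, dominated by decode
```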
Latency Metrics
Key metrics for production LLM systems:
# Time to First Token (TTFT)
# Time from request receipt to first token generated
ttft = prefill_time + scheduling_overhead
# Time Per Output Token (TPOT)
# Average time to generate each subsequent token
tpot = total_decode_time / num_output_tokens
# End-to-End Latency
e2e_latency = ttft + (tpot * num_output_tokens)
# Throughput
# Tokens generated per second across all requests
throughput = total_tokens_generated / total_time
Real-world targets (as of January 2026):
- TTFT: < 500ms for interactive applications
- TPOT: < 50ms for smooth streaming
- Throughput: 1000+ tokens/second per GPU for batch processing
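As a sketch of how these metrics can be measured in practice, the helper below times one streamed response; `stream_tokens` is a hypothetical callable standing in for whatever streaming client your serving stack provides:

```python
# Sketch: measuring TTFT, TPOT, and end-to-end latency around a streaming client.
# `stream_tokens` is a hypothetical callable that yields tokens as they arrive.
import time

def measure_latency(stream_tokens, prompt):
    start = time.perf_counter()
    first_token_time = None
    num_tokens = 0

    for _ in stream_tokens(prompt):
        if first_token_time is None:
            first_token_time = time.perf_counter()   # first token arrives => TTFT
        num_tokens += 1
    end = time.perf_counter()

    if first_token_time is None:
        raise RuntimeError("no tokens were streamed")

    ttft = first_token_time - start
    tpot = (end - first_token_time) / max(num_tokens - 1, 1)
    return {"ttft": ttft, "tpot": tpot, "e2e_latency": end - start}
```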
Understanding these fundamentals prepares you for the optimization techniques ahead.
Next, we'll explore the KV cache—the critical data structure that makes efficient generation possible.