LLM Inference Fundamentals
Speculative Decoding for Faster Generation
Speculative decoding breaks the sequential bottleneck of autoregressive generation, achieving 2-3x speedups without quality loss.
The Sequential Problem
Recall that token generation is inherently sequential—each token depends on the previous:
Standard Autoregressive:
Token 1 → Token 2 → Token 3 → Token 4 → Token 5
   ↓         ↓         ↓         ↓         ↓
  GPU       GPU       GPU       GPU       GPU
  Pass      Pass      Pass      Pass      Pass
   ↓         ↓         ↓         ↓         ↓
Total: 5 sequential GPU forward passes
This limits generation speed regardless of GPU compute power.
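In pseudocode, the baseline loop looks like the sketch below; model.forward and sample are placeholder interfaces used only for illustration, not a specific library API.
# Baseline autoregressive loop: one full forward pass of the large model per token
def generate(model, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)   # expensive GPU pass, strictly sequential
        tokens.append(sample(logits))    # the next step depends on this token
    return tokens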
Speculative Decoding Concept
Use a small, fast "draft" model to predict multiple tokens, then verify with the large model:
Speculative Decoding:
┌──────────────────────────────────────────────────────────┐
│ Step 1: Draft Model generates k tokens quickly           │
│   ┌────────────────────────────────────────────────┐     │
│   │ Draft (68M params):  T1 → T2 → T3 → T4 → T5    │     │
│   │ Time: ~5ms total (very fast)                   │     │
│   └────────────────────────────────────────────────┘     │
│                                                          │
│ Step 2: Target Model verifies all tokens in parallel     │
│   ┌────────────────────────────────────────────────┐     │
│   │ Target (70B params): [T1, T2, T3, T4, T5]      │     │
│   │ Time: ~100ms (single forward pass)             │     │
│   │ Result: T1✓ T2✓ T3✓ T4✗ T5-                    │     │
│   └────────────────────────────────────────────────┘     │
│                                                          │
│ Step 3: Accept verified tokens, resample from T4         │
│ Output: T1, T2, T3, T4' (corrected)                      │
│ Generated 4 tokens in the time of ~1.5 forward passes    │
│                                                          │
└──────────────────────────────────────────────────────────┘
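Schematically, one speculative step combines the two models as in the sketch below. Here draft_model and target_model are placeholder interfaces for illustration; the verify_tokens logic is shown in the next section.
# One speculative decoding step (schematic sketch, not a real library API)
def speculative_step(tokens, draft_model, target_model, k=5):
    # 1. Draft k tokens autoregressively with the small model (cheap)
    draft_tokens, draft_probs = draft_model.generate(tokens, num_tokens=k)
    # 2. Score the prompt plus draft in ONE forward pass of the large model
    target_probs = target_model.forward(tokens + draft_tokens)
    # 3. Keep the verified prefix; resample at the first mismatch (see below)
    return verify_tokens(draft_tokens, draft_probs, target_probs)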
How Verification Works
The target model verifies draft tokens using rejection sampling:
# Simplified verification algorithm (rejection sampling)
import numpy as np

def verify_tokens(draft_tokens, draft_probs, target_probs):
    # draft_probs / target_probs: arrays of shape [k, vocab_size] holding the
    # per-position distributions from the draft and target models
    accepted = []
    for i, token in enumerate(draft_tokens):
        # Probability each model assigned to the drafted token
        p_target = target_probs[i][token]
        p_draft = draft_probs[i][token]
        # Accept with probability min(1, p_target / p_draft)
        if np.random.random() < min(1.0, p_target / p_draft):
            accepted.append(token)
        else:
            # Reject: resample from the adjusted distribution
            # max(0, target - draft), renormalized over the full vocabulary
            adjusted = np.maximum(0.0, target_probs[i] - draft_probs[i])
            adjusted /= adjusted.sum()
            accepted.append(int(np.random.choice(len(adjusted), p=adjusted)))
            break  # Stop at the first rejection
    return accepted
Key insight: This maintains the exact same output distribution as standard sampling—no quality loss.
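A toy run over a 3-token vocabulary (the numbers are made up) shows the mechanics of acceptance and resampling:
# Two drafted positions over a vocabulary of {0, 1, 2}
import numpy as np

draft_tokens = [2, 0]
draft_probs  = np.array([[0.1, 0.1, 0.8],    # draft put 0.8 on token 2
                         [0.7, 0.2, 0.1]])   # then 0.7 on token 0
target_probs = np.array([[0.1, 0.2, 0.7],    # target mostly agrees at step 1
                         [0.2, 0.3, 0.5]])   # but prefers token 2 at step 2
print(verify_tokens(draft_tokens, draft_probs, target_probs))
# e.g. [2, 2]: token 2 accepted (0.7/0.8 ≈ 0.88); token 0 is often rejected
# and replaced by a sample from the adjusted distribution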
Speculative Decoding Variants
1. Model-Based Speculation
Use a smaller model from the same family:
# vLLM with a separate draft model
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",  # smaller draft sharing the Llama 3 tokenizer
    num_speculative_tokens=5,  # Draft k=5 tokens per step
)
# Speedup: ~2.5x on typical workloads
2. N-gram Speculation
Use n-gram matching against the prompt and generated output (near-zero compute overhead):
# N-gram speculation configuration
# Looks for repeated patterns in the input/output
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    speculative_model="[ngram]",       # N-gram based, no separate draft model
    ngram_prompt_lookup_max=4,         # Max n-gram length to match
    num_speculative_tokens=5,
)
# Best for: code generation, structured outputs
# Speedup: 1.5-2x with near-zero overhead
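Conceptually, n-gram (prompt lookup) drafting is just pattern matching over token ids. The toy sketch below illustrates the idea and is not vLLM's implementation; the function name and signature are hypothetical.
# Toy prompt-lookup drafting over a list of token ids
def ngram_draft(context, max_ngram=4, k=5):
    """Find an earlier occurrence of the trailing n-gram and propose
    the k tokens that followed it as draft tokens."""
    for n in range(max_ngram, 0, -1):               # try the longest n-gram first
        suffix = context[-n:]
        # scan earlier positions (excluding the trailing occurrence itself)
        for start in range(len(context) - n - 1, -1, -1):
            if context[start:start + n] == suffix:
                continuation = context[start + n:start + n + k]
                if continuation:
                    return continuation
    return []                                       # no match: fall back to normal decoding

print(ngram_draft([5, 8, 1, 2, 9, 1, 2], max_ngram=2, k=3))  # -> [9, 1, 2]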
3. Medusa (Multi-Head Speculation)
Add speculation heads to the target model:
┌─────────────────────────────────────────────┐
│             MEDUSA ARCHITECTURE             │
├─────────────────────────────────────────────┤
│                                             │
│         Target Model Hidden States          │
│                     │                       │
│      ┌─────────┬────┴────┬─────────┐        │
│      ▼         ▼         ▼         ▼        │
│   Head 0    Head 1    Head 2    Head 3      │
│    (t+1)     (t+2)     (t+3)     (t+4)      │
│                                             │
│ Each head predicts a different future token │
│ Tree-structured verification                │
│ No separate draft model needed              │
│                                             │
└─────────────────────────────────────────────┘
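A minimal NumPy sketch of the idea follows; the shapes and names are hypothetical and this is not the actual Medusa implementation. Each extra head is a small projection from the target model's last hidden state that guesses a token further in the future.
# Sketch: Medusa-style heads proposing future tokens from one hidden state
import numpy as np

def medusa_propose(hidden, head_weights):
    # hidden: [d_model] last hidden state of the target model at position t
    # head_weights: one [d_model, vocab_size] matrix per speculation head
    proposals = []
    for k, W in enumerate(head_weights):   # head k guesses the token at t+1+k
        logits = hidden @ W
        proposals.append(int(np.argmax(logits)))
    return proposals  # candidates, verified by the target model in one pass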
4. Eagle (2025 State-of-the-Art)
Advanced speculation with a learned predictor:
# Eagle3 achieves the highest acceptance rates,
# but requires tuning the k value carefully.
# Research finding (Nov 2025):
# - The wrong k can INCREASE costs by 175%
# - Optimal k=1 gives 20-54% cost reduction
# - Combined with FP8: best overall results
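To see why the choice of k matters so much, a back-of-the-envelope cost model helps. The sketch below assumes each draft token is accepted independently with probability a (a standard simplification); the function names and the numbers in the example are illustrative, not taken from the finding above.
# Toy cost model for choosing k under an i.i.d. acceptance assumption
def expected_tokens_per_step(a: float, k: int) -> float:
    # 1 + a + a^2 + ... + a^k: accepted prefix plus the corrected/bonus token
    if a >= 1.0:
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)

def cost_per_token(a: float, k: int, target_pass_cost: float, draft_token_cost: float) -> float:
    step_cost = target_pass_cost + k * draft_token_cost
    return step_cost / expected_tokens_per_step(a, k)

# Example: with a=0.6 and a relatively expensive draft, k=1 beats k=5,
# and k=5 is worse than the non-speculative baseline cost of 1.0 per token
for k in (1, 3, 5):
    print(k, round(cost_per_token(a=0.6, k=k, target_pass_cost=1.0, draft_token_cost=0.3), 3))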
Production Configuration
Optimal settings depend on workload:
| Workload | Recommended Method | k Value | Expected Speedup |
|---|---|---|---|
| General chat | Model-based | 4-5 | 2-2.5x |
| Code generation | N-gram | 3-5 | 1.5-2x |
| Structured JSON | N-gram | 5-7 | 2-3x |
| Long-form | Model-based | 3-4 | 1.8-2.2x |
# vLLM speculative decoding (January 2026)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    # Draft model configuration
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=4,
    # Performance tuning
    speculative_draft_tensor_parallel_size=1,  # TP degree for the draft model
    use_v2_block_manager=True,                 # Required for speculation
)
# Sampling params affect speculation efficiency
params = SamplingParams(
    temperature=0.7,  # Lower temp = higher acceptance rate
    top_p=0.9,
    max_tokens=1024,
)
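Generation itself is unchanged by speculation; a brief usage sketch (the prompt text is illustrative):
# Run generation as usual; speculation is transparent to the caller
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)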
Measuring Speculation Efficiency
# Key metrics
acceptance_rate = accepted_tokens / drafted_tokens
# Target: >70% for good speedup
# Effective speedup
speedup = (tokens_generated) / (target_forward_passes)
# Example: 100 tokens in 40 passes = 2.5x speedup
# Cost efficiency (crucial metric)
cost_per_token = (draft_compute + target_compute) / accepted_tokens
# Must be lower than baseline for net benefit
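A minimal sketch of turning raw counters into these metrics is shown below; the counter names and the reporting function are illustrative, not a vLLM API.
# Hypothetical counters collected while serving with speculation enabled
def speculation_report(drafted_tokens, accepted_tokens, target_forward_passes,
                       draft_passes, draft_cost_per_pass, target_cost_per_pass):
    acceptance_rate = accepted_tokens / drafted_tokens
    effective_speedup = accepted_tokens / target_forward_passes
    cost_per_token = (draft_passes * draft_cost_per_pass
                      + target_forward_passes * target_cost_per_pass) / accepted_tokens
    return {
        "acceptance_rate": acceptance_rate,      # aim for > 0.7
        "effective_speedup": effective_speedup,  # tokens generated per target pass
        "cost_per_token": cost_per_token,        # must beat the non-speculative baseline
    }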
Speculative decoding is now standard in production—combine with quantization and batching for maximum performance.
Next module: Deep dive into vLLM, the leading open-source inference engine.