LLM Inference Fundamentals
Speculative Decoding for Faster Generation
Speculative decoding breaks the sequential bottleneck of autoregressive generation, achieving 2-3x speedups without quality loss.
The Sequential Problem
Recall that token generation is inherently sequential—each token depends on the previous:
Standard Autoregressive:
Token 1 → Token 2 → Token 3 → Token 4 → Token 5
   ↓         ↓         ↓         ↓         ↓
  GPU       GPU       GPU       GPU       GPU
  Pass      Pass      Pass      Pass      Pass
   ↓         ↓         ↓         ↓         ↓
Total: 5 sequential GPU forward passes
This limits generation speed regardless of GPU compute power.
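In pseudocode, the baseline loop looks like the sketch below; model.forward and sample are placeholder interfaces used only for illustration, not a specific library API.
# Baseline autoregressive loop: one full forward pass of the large model per token
def generate(model, prompt_tokens, max_new_tokens):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        logits = model.forward(tokens)   # expensive GPU pass, strictly sequential
        tokens.append(sample(logits))    # the next step depends on this token
    return tokens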
Speculative Decoding Concept
Use a small, fast "draft" model to predict multiple tokens, then verify with the large model:
Speculative Decoding:
┌──────────────────────────────────────────────────────────┐
│ Step 1: Draft Model generates k tokens quickly           │
│   ┌────────────────────────────────────────────────┐     │
│   │ Draft (68M params):  T1 → T2 → T3 → T4 → T5    │     │
│   │ Time: ~5ms total (very fast)                   │     │
│   └────────────────────────────────────────────────┘     │
│                                                          │
│ Step 2: Target Model verifies all tokens in parallel     │
│   ┌────────────────────────────────────────────────┐     │
│   │ Target (70B params): [T1, T2, T3, T4, T5]      │     │
│   │ Time: ~100ms (single forward pass)             │     │
│   │ Result: T1✓ T2✓ T3✓ T4✗ T5-                    │     │
│   └────────────────────────────────────────────────┘     │
│                                                          │
│ Step 3: Accept verified tokens, resample from T4         │
│ Output: T1, T2, T3, T4' (corrected)                      │
│ Generated 4 tokens in the time of ~1.5 forward passes    │
│                                                          │
└──────────────────────────────────────────────────────────┘
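Schematically, one speculative step combines the two models as in the sketch below. Here draft_model and target_model are placeholder interfaces for illustration; the verify_tokens logic is shown in the next section.
# One speculative decoding step (schematic sketch, not a real library API)
def speculative_step(tokens, draft_model, target_model, k=5):
    # 1. Draft k tokens autoregressively with the small model (cheap)
    draft_tokens, draft_probs = draft_model.generate(tokens, num_tokens=k)
    # 2. Score the prompt plus draft in ONE forward pass of the large model
    target_probs = target_model.forward(tokens + draft_tokens)
    # 3. Keep the verified prefix; resample at the first mismatch (see below)
    return verify_tokens(draft_tokens, draft_probs, target_probs)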
How Verification Works
The target model verifies draft tokens using rejection sampling:
# Simplified verification algorithm (rejection sampling)
import numpy as np

def verify_tokens(draft_tokens, draft_probs, target_probs):
    # draft_probs / target_probs: arrays of shape [k, vocab_size] holding the
    # per-position distributions from the draft and target models
    accepted = []
    for i, token in enumerate(draft_tokens):
        # Probability each model assigned to the drafted token
        p_target = target_probs[i][token]
        p_draft = draft_probs[i][token]
        # Accept with probability min(1, p_target / p_draft)
        if np.random.random() < min(1.0, p_target / p_draft):
            accepted.append(token)
        else:
            # Reject: resample from the adjusted distribution
            # max(0, target - draft), renormalized over the full vocabulary
            adjusted = np.maximum(0.0, target_probs[i] - draft_probs[i])
            adjusted /= adjusted.sum()
            accepted.append(int(np.random.choice(len(adjusted), p=adjusted)))
            break  # Stop at the first rejection
    return accepted
Key insight: This maintains the exact same output distribution as standard sampling—no quality loss.
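A toy run over a 3-token vocabulary (the numbers are made up) shows the mechanics of acceptance and resampling:
# Two drafted positions over a vocabulary of {0, 1, 2}
import numpy as np

draft_tokens = [2, 0]
draft_probs  = np.array([[0.1, 0.1, 0.8],    # draft put 0.8 on token 2
                         [0.7, 0.2, 0.1]])   # then 0.7 on token 0
target_probs = np.array([[0.1, 0.2, 0.7],    # target mostly agrees at step 1
                         [0.2, 0.3, 0.5]])   # but prefers token 2 at step 2
print(verify_tokens(draft_tokens, draft_probs, target_probs))
# e.g. [2, 2]: token 2 accepted (0.7/0.8 ≈ 0.88); token 0 is often rejected
# and replaced by a sample from the adjusted distribution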
Speculative Decoding Variants
1. Model-Based Speculation
Use a smaller model from the same family:
# vLLM with a separate draft model
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",  # smaller draft sharing the Llama 3 tokenizer
    num_speculative_tokens=5,  # Draft k=5 tokens per step
)
# Speedup: ~2.5x on typical workloads
2. N-gram Speculation
Use n-gram matching against the prompt and generated output (near-zero compute overhead):
# N-gram speculation configuration
# Looks for repeated patterns in the input/output
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    speculative_model="[ngram]",       # N-gram based, no separate draft model
    ngram_prompt_lookup_max=4,         # Max n-gram length to match
    num_speculative_tokens=5,
)
# Best for: code generation, structured outputs
# Speedup: 1.5-2x with near-zero overhead
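Conceptually, n-gram (prompt lookup) drafting is just pattern matching over token ids. The toy sketch below illustrates the idea and is not vLLM's implementation; the function name and signature are hypothetical.
# Toy prompt-lookup drafting over a list of token ids
def ngram_draft(context, max_ngram=4, k=5):
    """Find an earlier occurrence of the trailing n-gram and propose
    the k tokens that followed it as draft tokens."""
    for n in range(max_ngram, 0, -1):               # try the longest n-gram first
        suffix = context[-n:]
        # scan earlier positions (excluding the trailing occurrence itself)
        for start in range(len(context) - n - 1, -1, -1):
            if context[start:start + n] == suffix:
                continuation = context[start + n:start + n + k]
                if continuation:
                    return continuation
    return []                                       # no match: fall back to normal decoding

print(ngram_draft([5, 8, 1, 2, 9, 1, 2], max_ngram=2, k=3))  # -> [9, 1, 2]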
3. Medusa (Multi-Head Speculation)
Add speculation heads to the target model:
┌─────────────────────────────────────────────┐
│             MEDUSA ARCHITECTURE             │
├─────────────────────────────────────────────┤
│                                             │
│         Target Model Hidden States          │
│                     │                       │
│      ┌─────────┬────┴────┬─────────┐        │
│      ▼         ▼         ▼         ▼        │
│   Head 0    Head 1    Head 2    Head 3      │
│    (t+1)     (t+2)     (t+3)     (t+4)      │
│                                             │
│ Each head predicts a different future token │
│ Tree-structured verification                │
│ No separate draft model needed              │
│                                             │
└─────────────────────────────────────────────┘
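A minimal NumPy sketch of the idea follows; the shapes and names are hypothetical and this is not the actual Medusa implementation. Each extra head is a small projection from the target model's last hidden state that guesses a token further in the future.
# Sketch: Medusa-style heads proposing future tokens from one hidden state
import numpy as np

def medusa_propose(hidden, head_weights):
    # hidden: [d_model] last hidden state of the target model at position t
    # head_weights: one [d_model, vocab_size] matrix per speculation head
    proposals = []
    for k, W in enumerate(head_weights):   # head k guesses the token at t+1+k
        logits = hidden @ W
        proposals.append(int(np.argmax(logits)))
    return proposals  # candidates, verified by the target model in one pass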
4. Eagle (2025 State-of-the-Art)
Advanced speculation with a learned predictor:
# Eagle3 achieves the highest acceptance rates,
# but requires tuning the k value carefully.
# Research finding (Nov 2025):
# - The wrong k can INCREASE costs by 175%
# - Optimal k=1 gives 20-54% cost reduction
# - Combined with FP8: best overall results
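To see why the choice of k matters so much, a back-of-the-envelope cost model helps. The sketch below assumes each draft token is accepted independently with probability a (a standard simplification); the function names and the numbers in the example are illustrative, not taken from the finding above.
# Toy cost model for choosing k under an i.i.d. acceptance assumption
def expected_tokens_per_step(a: float, k: int) -> float:
    # 1 + a + a^2 + ... + a^k: accepted prefix plus the corrected/bonus token
    if a >= 1.0:
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)

def cost_per_token(a: float, k: int, target_pass_cost: float, draft_token_cost: float) -> float:
    step_cost = target_pass_cost + k * draft_token_cost
    return step_cost / expected_tokens_per_step(a, k)

# Example: with a=0.6 and a relatively expensive draft, k=1 beats k=5,
# and k=5 is worse than the non-speculative baseline cost of 1.0 per token
for k in (1, 3, 5):
    print(k, round(cost_per_token(a=0.6, k=k, target_pass_cost=1.0, draft_token_cost=0.3), 3))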
Production Configuration
Optimal settings depend on workload:
| Workload | Recommended Method | k Value | Expected Speedup |
|---|---|---|---|
| General chat | Model-based | 4-5 | 2-2.5x |
| Code generation | N-gram | 3-5 | 1.5-2x |
| Structured JSON | N-gram | 5-7 | 2-3x |
| Long-form | Model-based | 3-4 | 1.8-2.2x |
# vLLM speculative decoding (January 2026)
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    # Draft model configuration
    speculative_model="meta-llama/Llama-3.1-8B-Instruct",
    num_speculative_tokens=4,
    # Performance tuning
    speculative_draft_tensor_parallel_size=1,  # TP degree for the draft model
    use_v2_block_manager=True,                 # Required for speculation
)
# Sampling params affect speculation efficiency
params = SamplingParams(
    temperature=0.7,  # Lower temp = higher acceptance rate
    top_p=0.9,
    max_tokens=1024,
)
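Generation itself is unchanged by speculation; a brief usage sketch (the prompt text is illustrative):
# Run generation as usual; speculation is transparent to the caller
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)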
Measuring Speculation Efficiency
# Key metrics
acceptance_rate = accepted_tokens / drafted_tokens
# Target: >70% for good speedup
# Effective speedup
speedup = (tokens_generated) / (target_forward_passes)
# Example: 100 tokens in 40 passes = 2.5x speedup
# Cost efficiency (crucial metric)
cost_per_token = (draft_compute + target_compute) / accepted_tokens
# Must be lower than baseline for net benefit
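A minimal sketch of turning raw counters into these metrics is shown below; the counter names and the reporting function are illustrative, not a vLLM API.
# Hypothetical counters collected while serving with speculation enabled
def speculation_report(drafted_tokens, accepted_tokens, target_forward_passes,
                       draft_passes, draft_cost_per_pass, target_cost_per_pass):
    acceptance_rate = accepted_tokens / drafted_tokens
    effective_speedup = accepted_tokens / target_forward_passes
    cost_per_token = (draft_passes * draft_cost_per_pass
                      + target_forward_passes * target_cost_per_pass) / accepted_tokens
    return {
        "acceptance_rate": acceptance_rate,      # aim for > 0.7
        "effective_speedup": effective_speedup,  # tokens generated per target pass
        "cost_per_token": cost_per_token,        # must beat the non-speculative baseline
    }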
Speculative decoding is now standard in production—combine with quantization and batching for maximum performance.
Next module: Deep dive into vLLM, the leading open-source inference engine.