vLLM & Open-Source Inference Engines

Prefix Caching & Advanced Optimization

3 min read

Prefix caching dramatically reduces latency for applications with repeated system prompts or context. Let's explore this critical optimization.

Why Prefix Caching Matters

Many LLM applications share common prefixes:

Request 1: [System prompt] + "What is Python?"
Request 2: [System prompt] + "Explain JavaScript"
Request 3: [System prompt] + "How does Go work?"

Without prefix caching:
  - Each request: full prefill of [System prompt] + question
  - (1000-token system prompt + ~50-token question) × 3 requests ≈ 3150 tokens prefilled

With prefix caching:
  - Request 1: Prefill [System prompt] + question, cache the prompt's KV blocks
  - Requests 2-3: Reuse the cached KV, only prefill the ~50-token questions
  - Total: 1000 + 50 + 50 + 50 = 1150 tokens prefilled

  Result: ~2.7x less prefill compute overall, and TTFT for requests 2-3 drops to
  roughly the cost of prefilling 50 tokens instead of 1050 (see the sketch below)
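
The same back-of-the-envelope arithmetic works for any workload; here is a quick sketch (the token counts are illustrative assumptions, not measurements):

# Rough prefill savings from prefix caching (illustrative numbers)
system_tokens = 1000     # shared system prompt
question_tokens = 50     # assumed per-question length
num_requests = 3

without_cache = num_requests * (system_tokens + question_tokens)
with_cache = system_tokens + num_requests * question_tokens

print(f"Prefilled without caching: {without_cache} tokens")  # 3150
print(f"Prefilled with caching:    {with_cache} tokens")     # 1150
print(f"Prefill compute saved:     {1 - with_cache / without_cache:.0%}")  # ~63%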

vLLM Prefix Caching (V1)

vLLM V1 introduced zero-overhead prefix caching:

from vllm import LLM, SamplingParams

# Enable prefix caching
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    enable_prefix_caching=True,  # Zero overhead in V1, where it is also on by default
)

# Shared system prompt
system_prompt = """You are a helpful AI assistant specialized in programming.
Follow best practices. Provide clear explanations with code examples.
Always consider edge cases and error handling."""

# Multiple requests with same prefix
questions = [
    "How do I sort a list in Python?",
    "What's the difference between async and sync?",
    "Explain decorators in Python",
]

sampling = SamplingParams(max_tokens=500, temperature=0.7)

for q in questions:
    prompt = f"{system_prompt}\n\nUser: {q}\nAssistant:"
    output = llm.generate([prompt], sampling)
    # Second+ requests reuse cached prefix

Automatic Prefix Detection

vLLM automatically detects and caches common prefixes:

┌─────────────────────────────────────────────────────────┐
│              PREFIX CACHE MECHANISM                     │
├─────────────────────────────────────────────────────────┤
│                                                         │
│  Hash-based prefix matching:                            │
│                                                         │
│  1. Compute hash of prompt prefix blocks                │
│  2. Check if hash exists in cache                       │
│  3. If hit: Reuse KV cache blocks                       │
│  4. If miss: Compute and cache for future               │
│                                                         │
│  Block-level granularity (e.g., 16 tokens/block):       │
│                                                         │
│  Prompt: "System: You are helpful. User: Hi"            │
│          [Block 0: hash_a] [Block 1: hash_b]            │
│                                                         │
│  New prompt: "System: You are helpful. User: Hello"     │
│              [Block 0: hash_a] [Block 1: hash_c]        │
│              └── Cache hit!    └── Cache miss           │
│                                                         │
│  Only Block 1 needs computation                         │
│                                                         │
└─────────────────────────────────────────────────────────┘
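
To make the diagram concrete, here is a simplified, hypothetical sketch of block-level hash matching. It is not vLLM's actual implementation (vLLM hashes token-ID blocks internally and manages real KV memory), but it shows why block hashes are chained: a block is only reusable if the entire prefix before it matches too.

from hashlib import sha256

BLOCK_SIZE = 16  # tokens per block, matching the example above

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chained hashes: each block's hash covers its tokens plus all prior blocks."""
    hashes, prev = [], ""
    full_len = len(token_ids) - len(token_ids) % BLOCK_SIZE  # only full blocks are cached
    for i in range(0, full_len, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = sha256((prev + ",".join(map(str, block))).encode()).hexdigest()
        hashes.append(prev)
    return hashes

kv_cache: dict[str, object] = {}  # block hash -> stored KV block (placeholder objects here)

def prefill_with_reuse(token_ids: list[int]) -> int:
    """Return the number of full blocks that actually need computing."""
    computed = 0
    for h in block_hashes(token_ids):
        if h in kv_cache:
            continue              # cache hit: reuse the stored KV block
        kv_cache[h] = object()    # cache miss: compute it and cache for future requests
        computed += 1
    return computed

prompt_a = list(range(32))              # two full blocks of pretend token IDs
prompt_b = list(range(16)) + [99] * 16  # same block 0, different block 1
print(prefill_with_reuse(prompt_a))  # 2 -> both blocks computed and cached
print(prefill_with_reuse(prompt_b))  # 1 -> block 0 reused, only block 1 computed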

Chat Application Pattern

For multi-turn conversations:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    enable_prefix_caching=True,
)

class ChatSession:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.history = []

    def build_prompt(self, user_message: str) -> str:
        messages = [{"role": "system", "content": self.system_prompt}]
        messages.extend(self.history)
        messages.append({"role": "user", "content": user_message})

        # Render with the model's chat template; each turn extends the prefix,
        # so the KV blocks of earlier turns are reused from the cache
        tokenizer = llm.get_tokenizer()
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

    def chat(self, user_message: str) -> str:
        prompt = self.build_prompt(user_message)
        output = llm.generate([prompt], SamplingParams(max_tokens=500))
        response = output[0].outputs[0].text

        self.history.append({"role": "user", "content": user_message})
        self.history.append({"role": "assistant", "content": response})

        return response

# Multi-turn conversation
session = ChatSession("You are a Python expert.")
session.chat("What are decorators?")    # Full prefill
session.chat("Show me an example")       # Reuses turn 1 cache
session.chat("How about class decorators?")  # Reuses turns 1-2 cache

Advanced: Prompt Caching Strategies

Strategy 1: Template-Based Caching

# Pre-warm cache with common templates
templates = {
    "code_review": "You are a senior engineer reviewing code...",
    "translation": "You are a professional translator...",
    "summarization": "You are an expert at summarizing documents...",
}

# Warm the cache on startup (cached blocks can still be evicted later if
# KV-cache memory runs low, so warming helps most on lightly loaded servers)
def warm_cache(llm, templates):
    for template in templates.values():
        # Generate a single token so the template's prefix blocks get cached
        llm.generate([template + "\nTest"], SamplingParams(max_tokens=1))
    print(f"Warmed {len(templates)} template caches")

warm_cache(llm, templates)

Strategy 2: RAG Context Caching

# For RAG applications with repeated document chunks
class RAGCache:
    def __init__(self, llm):
        self.llm = llm
        # No app-level chunk cache needed: identical chunk prefixes are
        # reused automatically by vLLM's prefix cache

    def query_with_context(self, query: str, chunks: list[str]) -> str:
        # Build context from chunks
        context = "\n\n".join(chunks)

        # Context naturally cached via prefix caching
        prompt = f"""Context:
{context}

Question: {query}
Answer based on the context above:"""

        output = self.llm.generate([prompt], SamplingParams(max_tokens=500))
        return output[0].outputs[0].text

# Same chunks in different queries reuse cache
rag = RAGCache(llm)
rag.query_with_context("What is X?", [chunk1, chunk2])  # Cache miss
rag.query_with_context("How does X work?", [chunk1, chunk2])  # Cache hit!

Performance Tuning

# Optimize prefix caching for your workload
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",

    # Prefix caching
    enable_prefix_caching=True,

    # Block size affects cache granularity
    # Smaller = better matching, more overhead
    block_size=16,  # Default, good for most cases

    # Fraction of GPU memory for the engine; the prefix cache lives in the
    # same KV-cache pool, so more headroom means more cacheable blocks
    gpu_memory_utilization=0.9,

    # Scheduling policy: "fcfs" (the default) or "priority"
    scheduling_policy="fcfs",
)

# Monitor cache efficiency via the server's Prometheus /metrics endpoint.
# Exact metric names vary by vLLM version; look for the prefix-cache
# hit/query metrics (e.g., vllm:gpu_prefix_cache_hit_rate in some releases)
# and the preemption counters.

Cache Hit Rate Monitoring

import re

import httpx

def parse_metric(metrics_text: str, name: str):
    # Grab the first sample value for a metric from Prometheus text format
    pattern = rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+([0-9eE.+-]+)"
    match = re.search(pattern, metrics_text, re.MULTILINE)
    return float(match.group(1)) if match else None

async def monitor_cache_metrics(base_url: str):
    # Point this at a running `vllm serve` (OpenAI-compatible) server and
    # substitute the metric names your vLLM version actually exports
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{base_url}/metrics")
        metrics = response.text

    cache_hit_rate = parse_metric(metrics, "vllm_cache_hit_rate")
    prefix_hit_rate = parse_metric(metrics, "vllm_prefix_cache_hit_rate")

    return {
        "cache_hit_rate": cache_hit_rate,
        "prefix_hit_rate": prefix_hit_rate,
    }

# Target: >70% prefix cache hit rate for chat applications

Prefix caching is automatic in modern vLLM—the key is structuring prompts to maximize cache reuse.
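
One practical consequence of the hash-chained blocks: keep request-specific data (timestamps, user IDs, per-request metadata) out of the shared prefix, or every prompt will differ from its very first block and nothing gets reused. A hypothetical illustration:

from datetime import datetime, timezone

system_prompt = "You are a helpful AI assistant specialized in programming."
question = "How do I sort a list in Python?"
now = datetime.now(timezone.utc).isoformat()

# Bad: the timestamp changes every request, so no two prompts share a prefix
bad_prompt = f"[{now}] {system_prompt}\n\nUser: {question}\nAssistant:"

# Good: the static system prompt leads, so its KV blocks are reused;
# request-specific details come after the shared prefix
good_prompt = f"{system_prompt}\n\nCurrent time: {now}\nUser: {question}\nAssistant:"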

Next module: TensorRT-LLM for maximum GPU optimization.
