vLLM & Open-Source Inference Engines
Prefix Caching & Advanced Optimization
3 min read
Prefix caching dramatically reduces latency for applications with repeated system prompts or context. Let's explore this critical optimization.
Why Prefix Caching Matters
Many LLM applications share common prefixes:
Request 1: [System prompt] + "What is Python?"
Request 2: [System prompt] + "Explain JavaScript"
Request 3: [System prompt] + "How does Go work?"
Without prefix caching:
- Each request prefills the full [System prompt] + question
- With a 1000-token system prompt and ~50-token questions: 3 × 1050 = 3150 tokens prefilled
With prefix caching:
- Request 1: Prefill [System prompt] + question, cache the KV blocks
- Requests 2-3: Reuse the cached KV, prefill only the ~50-token questions
- Total: 1050 + 50 + 50 = 1150 tokens prefilled
Speedup: roughly 3× less prefill work overall, and far lower TTFT for every request after the first (see the sketch below)
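To make the arithmetic explicit, here is a tiny sketch; the token counts are illustrative assumptions, not measurements:
# Back-of-the-envelope prefill accounting (token counts are assumptions)
SYSTEM_TOKENS = 1000   # shared system prompt
QUESTION_TOKENS = 50   # per-request question
NUM_REQUESTS = 3
without_cache = NUM_REQUESTS * (SYSTEM_TOKENS + QUESTION_TOKENS)
with_cache = (SYSTEM_TOKENS + QUESTION_TOKENS) + (NUM_REQUESTS - 1) * QUESTION_TOKENS
print(f"Prefill without caching: {without_cache} tokens")   # 3150
print(f"Prefill with caching:    {with_cache} tokens")      # 1150
print(f"Prefill work saved:      {1 - with_cache / without_cache:.0%}")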
vLLM Prefix Caching (V1)
vLLM V1 introduced zero-overhead prefix caching:
from vllm import LLM, SamplingParams

# Enable prefix caching
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    enable_prefix_caching=True,  # Zero overhead in V1
)

# Shared system prompt
system_prompt = """You are a helpful AI assistant specialized in programming.
Follow best practices. Provide clear explanations with code examples.
Always consider edge cases and error handling."""

# Multiple requests with the same prefix
questions = [
    "How do I sort a list in Python?",
    "What's the difference between async and sync?",
    "Explain decorators in Python",
]

sampling = SamplingParams(max_tokens=500, temperature=0.7)
for q in questions:
    prompt = f"{system_prompt}\n\nUser: {q}\nAssistant:"
    output = llm.generate([prompt], sampling)
    # Second and later requests reuse the cached system-prompt blocks
Automatic Prefix Detection
vLLM automatically detects and caches common prefixes:
┌─────────────────────────────────────────────────────────┐
│ PREFIX CACHE MECHANISM │
├─────────────────────────────────────────────────────────┤
│ │
│ Hash-based prefix matching: │
│ │
│ 1. Compute hash of prompt prefix blocks │
│ 2. Check if hash exists in cache │
│ 3. If hit: Reuse KV cache blocks │
│ 4. If miss: Compute and cache for future │
│ │
│ Block-level granularity (e.g., 16 tokens/block): │
│ │
│ Prompt: "System: You are helpful. User: Hi" │
│ [Block 0: hash_a] [Block 1: hash_b] │
│ │
│ New prompt: "System: You are helpful. User: Hello" │
│ [Block 0: hash_a] [Block 1: hash_c] │
│ └── Cache hit! └── Cache miss │
│ │
│ Only Block 1 needs computation │
│ │
└─────────────────────────────────────────────────────────┘
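To make the diagram concrete, here is a simplified sketch of block-level hash matching (illustrative only, not vLLM's internal code; BLOCK_SIZE, kv_block_cache, and the helper functions are hypothetical names). Like vLLM, it chains each block's hash with the previous block's hash, so a block can only be reused when everything before it also matches:
import hashlib

BLOCK_SIZE = 16  # tokens per KV-cache block

# Toy cache: block hash -> "KV data" (token ids stand in for real KV tensors)
kv_block_cache: dict[str, list[int]] = {}

def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block, chaining in the previous block's hash."""
    hashes, prev = [], ""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        prev = hashlib.sha256((prev + str(block)).encode()).hexdigest()
        hashes.append(prev)
    return hashes

def prefill_with_cache(token_ids: list[int]) -> int:
    """Return how many tokens still need prefilling after cache reuse."""
    reused = 0
    for h, start in zip(block_hashes(token_ids), range(0, len(token_ids), BLOCK_SIZE)):
        if h in kv_block_cache:
            reused += BLOCK_SIZE  # hit: reuse this block's cached KV
        else:
            kv_block_cache[h] = token_ids[start:start + BLOCK_SIZE]  # miss: compute and cache
    return len(token_ids) - reused

prompt_a = list(range(40))              # 40 "tokens": 2 full blocks + 8 leftover
prompt_b = list(range(32)) + [999] * 8  # same first 2 blocks, different tail
print(prefill_with_cache(prompt_a))     # 40 (cold cache)
print(prefill_with_cache(prompt_b))     # 8 (2 blocks reused)
Because the hashes are chained, only the leading run of matching blocks can hit; changing anything early in the prompt invalidates every block after it, which is why shared content belongs at the front.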
Chat Application Pattern
For multi-turn conversations:
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    enable_prefix_caching=True,
)
tokenizer = llm.get_tokenizer()

class ChatSession:
    def __init__(self, system_prompt: str):
        self.system_prompt = system_prompt
        self.history = []

    def build_prompt(self, user_message: str) -> str:
        messages = [{"role": "system", "content": self.system_prompt}]
        messages.extend(self.history)
        messages.append({"role": "user", "content": user_message})
        # Render with the model's chat template. Each turn only appends to
        # the prompt, so earlier turns stay byte-identical and their cached
        # KV blocks are reused.
        return tokenizer.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )

    def chat(self, user_message: str) -> str:
        prompt = self.build_prompt(user_message)
        output = llm.generate([prompt], SamplingParams(max_tokens=500))
        response = output[0].outputs[0].text
        self.history.append({"role": "user", "content": user_message})
        self.history.append({"role": "assistant", "content": response})
        return response

# Multi-turn conversation
session = ChatSession("You are a Python expert.")
session.chat("What are decorators?")            # Full prefill
session.chat("Show me an example")              # Reuses turn 1 cache
session.chat("How about class decorators?")     # Reuses turns 1-2 cache
Advanced: Prompt Caching Strategies
Strategy 1: Template-Based Caching
# Pre-warm the cache with common templates
templates = {
    "code_review": "You are a senior engineer reviewing code...",
    "translation": "You are a professional translator...",
    "summarization": "You are an expert at summarizing documents...",
}

# Warm the cache on startup
def warm_cache(llm, templates):
    for name, template in templates.items():
        # Generate a minimal output just to populate the KV cache
        llm.generate([template + "\nTest"], SamplingParams(max_tokens=1))
    print(f"Warmed {len(templates)} template caches")

warm_cache(llm, templates)
Strategy 2: RAG Context Caching
# For RAG applications with repeated document chunks
class RAGCache:
    def __init__(self, llm):
        self.llm = llm

    def query_with_context(self, query: str, chunks: list[str]) -> str:
        # Build the context from chunks; keeping the context before the
        # question makes the repeated part the cacheable prefix
        context = "\n\n".join(chunks)
        prompt = f"""Context:
{context}
Question: {query}
Answer based on the context above:"""
        output = self.llm.generate([prompt], SamplingParams(max_tokens=500))
        return output[0].outputs[0].text

# The same chunks, in the same order, reuse the cache; prefix caching
# only matches from the start of the prompt
rag = RAGCache(llm)
# chunk1 and chunk2 are previously retrieved document chunks (placeholders)
rag.query_with_context("What is X?", [chunk1, chunk2])        # Cache miss
rag.query_with_context("How does X work?", [chunk1, chunk2])  # Cache hit!
Performance Tuning
# Optimize prefix caching for your workload
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    # Prefix caching
    enable_prefix_caching=True,
    # Block size sets cache granularity:
    # smaller blocks match more often but add bookkeeping overhead
    block_size=16,  # default; good for most cases
    # Fraction of GPU memory vLLM may use; more free KV blocks means
    # more room to keep cached prefixes resident
    gpu_memory_utilization=0.9,
    # Scheduling policy: "fcfs" (default) or "priority"
    scheduling_policy="fcfs",
)

# Monitor cache efficiency: the server's /metrics endpoint exposes
# prefix-cache and preemption metrics (exact names vary by version), e.g.:
# - vllm:gpu_prefix_cache_hit_rate
# - vllm:num_preemptions_total
Cache Hit Rate Monitoring
import re
import httpx

def parse_metric(metrics_text: str, name: str) -> float | None:
    # Prometheus text format: "<name>{labels} <value>" or "<name> <value>"
    pattern = rf"^{re.escape(name)}(?:\{{[^}}]*\}})?\s+(\S+)"
    match = re.search(pattern, metrics_text, re.MULTILINE)
    return float(match.group(1)) if match else None

async def monitor_cache_metrics(base_url: str) -> dict:
    async with httpx.AsyncClient() as client:
        response = await client.get(f"{base_url}/metrics")
    metrics = response.text
    # Adjust the metric names to what your vLLM version exposes
    return {
        "prefix_cache_hit_rate": parse_metric(metrics, "vllm:gpu_prefix_cache_hit_rate"),
        "num_preemptions": parse_metric(metrics, "vllm:num_preemptions_total"),
    }

# Usage (with a vLLM server running locally):
# asyncio.run(monitor_cache_metrics("http://localhost:8000"))
# Target: >70% prefix cache hit rate for chat applications
Prefix caching is automatic in modern vLLM—the key is structuring prompts to maximize cache reuse.
Next module: TensorRT-LLM for maximum GPU optimization.