vLLM & Open-Source Inference Engines
vLLM Architecture & Core Concepts
4 min read
vLLM has become the de facto standard for open-source LLM inference. Let's explore its architecture and why it dominates the ecosystem.
vLLM Evolution
Key milestones:
- June 2023: Initial release with PagedAttention
- January 2025: vLLM V1 alpha—major architectural upgrade with 1.7x speedup
- May 2025: Joined PyTorch Foundation as hosted project
- December 2025: Release 25.11 with CUDA 13.0.2, PyTorch 2.9.0
As of January 2026, vLLM is the most widely deployed open-source LLM serving solution.
Core Architecture
┌─────────────────────────────────────────────────────────────────┐
│                        vLLM ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────┐       ┌─────────────┐       ┌─────────────┐   │
│   │     API     │       │  Scheduler  │       │   Worker    │   │
│   │   Server    │──────►│   (Core)    │──────►│    (GPU)    │   │
│   │  (FastAPI)  │       │             │       │             │   │
│   └─────────────┘       └─────────────┘       └─────────────┘   │
│          │                     │                     │          │
│          │                     │                     │          │
│          ▼                     ▼                     ▼          │
│   ┌─────────────┐       ┌─────────────┐       ┌─────────────┐   │
│   │   Request   │       │    Block    │       │    Model    │   │
│   │    Queue    │       │   Manager   │       │   Runner    │   │
│   │             │       │ (PagedAttn) │       │             │   │
│   └─────────────┘       └─────────────┘       └─────────────┘   │
│                                                                 │
│   Key Innovations:                                              │
│   • PagedAttention for memory efficiency                        │
│   • Continuous batching for throughput                          │
│   • Tensor/Pipeline parallelism for large models                │
│   • Speculative decoding for latency                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
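To make the first of those innovations concrete, here is a toy sketch of the block-table idea behind PagedAttention: a sequence's KV cache is split into fixed-size blocks that can live anywhere in GPU memory, and a per-sequence table maps logical positions to physical blocks. This is an illustration only; the names are hypothetical, and vLLM's real block manager and attention kernels are far more involved.
# Toy illustration of PagedAttention's block table (not vLLM's actual code)
BLOCK_SIZE = 16  # tokens stored per KV-cache block
class ToyBlockManager:
    """Maps a sequence's logical token positions to non-contiguous physical blocks."""
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq id -> physical block ids
    def slot_for(self, seq_id: int, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) for a token, allocating blocks on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block = position // BLOCK_SIZE
        while logical_block >= len(table):
            table.append(self.free_blocks.pop())  # any free block will do
        return table[logical_block], position % BLOCK_SIZE
manager = ToyBlockManager(num_physical_blocks=64)
for pos in range(40):  # a 40-token sequence needs ceil(40/16) = 3 blocks
    manager.slot_for(seq_id=0, position=pos)
print(manager.block_tables[0])  # the three physical block ids assigned to sequence 0
Because blocks are fixed-size and addressed through the table, fragmentation stays low, and blocks holding a shared prefix can be referenced by multiple sequences, which is what makes prefix caching cheap.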
vLLM V1 Architecture (2025)
The V1 release brought significant improvements:
# vLLM V1 Key Changes
# 1. Clean, optimized execution loop
# 2. Zero-overhead prefix caching
# 3. Enhanced multimodal support
# 4. PyTorch 2.x native integration
# V1 is now the default in vLLM
from vllm import LLM
# Automatic V1 features enabled
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    # V1 optimizations applied automatically
)
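One of those V1 features, zero-overhead prefix caching, lets requests that share a prompt prefix (a system prompt, few-shot examples) reuse the KV cache computed for that prefix. A minimal sketch follows; the model name is only an example, and prefix caching is already on by default in V1, so the flag is shown purely for clarity.
from vllm import LLM, SamplingParams
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # example model small enough for one GPU
    enable_prefix_caching=True,        # explicit here; default in V1
)
system_prefix = "You are a concise assistant. Answer in one sentence.\n\n"
prompts = [
    system_prefix + "What is PagedAttention?",
    system_prefix + "What is continuous batching?",  # KV cache for the prefix is reused
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))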
Quick Start
Installation
# Recommended by the vLLM docs: use uv for faster installation
pip install uv
uv pip install vllm
# Or standard pip (Python 3.12+ recommended)
pip install vllm
# For NVIDIA containers (production)
docker pull vllm/vllm-openai:latest
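A quick way to sanity-check the install before pulling a large model (the small model below is just an example):
# Confirm the package imports and print its version
python -c "import vllm; print(vllm.__version__)"
# Smoke-test the OpenAI-compatible server with a small model, then list served models
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
curl http://localhost:8000/v1/models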
Basic Usage
from vllm import LLM, SamplingParams
# Initialize model
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,  # Use 4 GPUs
)
# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)
# Generate completions
prompts = [
    "Explain quantum computing in simple terms:",
    "Write a Python function to sort a list:",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
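Each element of `outputs` is a `RequestOutput` that also carries the original prompt and per-completion metadata; the attribute names below match recent vLLM releases, but check the docs for your version.
first = outputs[0]
completion = first.outputs[0]  # first (and here only) completion for this prompt
print("prompt:       ", first.prompt)
print("finish reason:", completion.finish_reason)  # e.g. "stop" or "length"
print("tokens out:   ", len(completion.token_ids))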
OpenAI-Compatible Server
# Start vLLM server with OpenAI API compatibility
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000 \
    --api-key your-api-key
# Use with OpenAI client
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)
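Because the server implements the OpenAI API, streaming works exactly as it does against the hosted API; a short sketch reusing the `client` defined above:
# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)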
Hardware Support
vLLM supports diverse hardware (as of January 2026):
| Platform | Support Level | Notes |
|---|---|---|
| NVIDIA GPUs | Full | Primary target, best performance |
| AMD GPUs | Production | ROCm support |
| Intel Gaudi | Plugin | Via habana_frameworks |
| Google TPU | Production | tpu-inference 0.12.0 |
| Intel CPUs | Beta | For development/testing |
| Apple Silicon | Community | Limited support |
# GPU selection
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    # NVIDIA GPU configuration
    tensor_parallel_size=4,
    dtype="float16",  # Or "bfloat16" on Ampere and newer GPUs
    # For AMD GPUs, install the ROCm build of vLLM; the same configuration applies
    # For CPU (testing only), install the CPU build
    # device="cpu"
)
Model Support
vLLM supports most popular model architectures:
# Supported model families (non-exhaustive)
supported_models = [
    # Meta Llama family
    "meta-llama/Llama-3.3-70B-Instruct",
    "meta-llama/Llama-3.1-405B-Instruct",
    # Mistral/Mixtral
    "mistralai/Mistral-7B-Instruct-v0.3",
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
    # Qwen
    "Qwen/Qwen2.5-72B-Instruct",
    # DeepSeek
    "deepseek-ai/DeepSeek-V3",  # Record performance on Blackwell
    # Multimodal
    "llava-hf/llava-v1.6-mistral-7b-hf",
    "Qwen/Qwen2-VL-72B-Instruct",
]
# Check model support
from vllm import LLM
llm = LLM(model="your-model") # Raises error if unsupported
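To check compatibility without loading any weights, one option is to read the architectures field from the model's Hugging Face config and compare it against the supported-models list in the vLLM docs. A small sketch using transformers (already a vLLM dependency):
from transformers import AutoConfig
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-72B-Instruct")
print(config.architectures)  # e.g. ["Qwen2ForCausalLM"]; look this name up in vLLM's docs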
Key Configuration Options
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    # Parallelism
    tensor_parallel_size=4,        # GPUs for tensor parallelism
    pipeline_parallel_size=1,      # Stages for pipeline parallelism
    # Memory
    gpu_memory_utilization=0.9,    # Fraction of GPU memory vLLM may use
    max_model_len=32768,           # Max context length in tokens
    # Quantization
    quantization="fp8",            # Or "awq", "gptq", "squeezellm"
    # Caching
    enable_prefix_caching=True,    # Share KV cache across common prefixes
    # Performance
    enforce_eager=False,           # False = use CUDA graphs (faster)
    enable_chunked_prefill=True,   # Better scheduling of long prefills
)
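The same engine arguments are exposed as `vllm serve` flags when running the OpenAI-compatible server; flag names mirror the Python argument names, but run `vllm serve --help` to confirm for your version.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --quantization fp8 \
    --enable-prefix-caching \
    --enable-chunked-prefill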
Understanding vLLM's architecture prepares you for advanced configuration and optimization.
Next, we'll explore production deployment patterns for vLLM.