vLLM & Open-Source Inference Engines
vLLM Architecture & Core Concepts
4 min read
vLLM has become the de facto standard for open-source LLM inference. Let's explore its architecture and why it dominates the ecosystem.
vLLM Evolution
Key milestones:
- June 2023: Initial release with PagedAttention
- January 2025: vLLM V1 alpha—major architectural upgrade with 1.7x speedup
- May 2025: Joined PyTorch Foundation as hosted project
- December 2025: Release 25.11 with CUDA 13.0.2, PyTorch 2.9.0
As of January 2026, vLLM is the most widely deployed open-source LLM serving solution.
Core Architecture
┌─────────────────────────────────────────────────────────────────┐
│                        vLLM ARCHITECTURE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│   ┌─────────────┐       ┌─────────────┐       ┌─────────────┐   │
│   │     API     │       │  Scheduler  │       │   Worker    │   │
│   │   Server    │──────►│   (Core)    │──────►│    (GPU)    │   │
│   │  (FastAPI)  │       │             │       │             │   │
│   └─────────────┘       └─────────────┘       └─────────────┘   │
│          │                     │                     │          │
│          │                     │                     │          │
│          ▼                     ▼                     ▼          │
│   ┌─────────────┐       ┌─────────────┐       ┌─────────────┐   │
│   │   Request   │       │    Block    │       │    Model    │   │
│   │    Queue    │       │   Manager   │       │   Runner    │   │
│   │             │       │ (PagedAttn) │       │             │   │
│   └─────────────┘       └─────────────┘       └─────────────┘   │
│                                                                 │
│   Key Innovations:                                              │
│   • PagedAttention for memory efficiency                        │
│   • Continuous batching for throughput                          │
│   • Tensor/Pipeline parallelism for large models                │
│   • Speculative decoding for latency                            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
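To make the first of those innovations concrete, here is a toy sketch of the block-table idea behind PagedAttention: a sequence's KV cache is split into fixed-size blocks that can live anywhere in GPU memory, and a per-sequence table maps logical positions to physical blocks. This is an illustration only; the names are hypothetical, and vLLM's real block manager and attention kernels are far more involved.
# Toy illustration of PagedAttention's block table (not vLLM's actual code)
BLOCK_SIZE = 16  # tokens stored per KV-cache block
class ToyBlockManager:
    """Maps a sequence's logical token positions to non-contiguous physical blocks."""
    def __init__(self, num_physical_blocks: int):
        self.free_blocks = list(range(num_physical_blocks))
        self.block_tables: dict[int, list[int]] = {}  # seq id -> physical block ids
    def slot_for(self, seq_id: int, position: int) -> tuple[int, int]:
        """Return (physical_block, offset) for a token, allocating blocks on demand."""
        table = self.block_tables.setdefault(seq_id, [])
        logical_block = position // BLOCK_SIZE
        while logical_block >= len(table):
            table.append(self.free_blocks.pop())  # any free block will do
        return table[logical_block], position % BLOCK_SIZE
manager = ToyBlockManager(num_physical_blocks=64)
for pos in range(40):  # a 40-token sequence needs ceil(40/16) = 3 blocks
    manager.slot_for(seq_id=0, position=pos)
print(manager.block_tables[0])  # the three physical block ids assigned to sequence 0
Because blocks are fixed-size and addressed through the table, fragmentation stays low, and blocks holding a shared prefix can be referenced by multiple sequences, which is what makes prefix caching cheap.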
vLLM V1 Architecture (2025)
The V1 release brought significant improvements:
# vLLM V1 Key Changes
# 1. Clean, optimized execution loop
# 2. Zero-overhead prefix caching
# 3. Enhanced multimodal support
# 4. PyTorch 2.x native integration
# V1 is now the default in vLLM
from vllm import LLM
# Automatic V1 features enabled
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    # V1 optimizations applied automatically
)
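One of those V1 features, zero-overhead prefix caching, lets requests that share a prompt prefix (a system prompt, few-shot examples) reuse the KV cache computed for that prefix. A minimal sketch follows; the model name is only an example, and prefix caching is already on by default in V1, so the flag is shown purely for clarity.
from vllm import LLM, SamplingParams
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # example model small enough for one GPU
    enable_prefix_caching=True,        # explicit here; default in V1
)
system_prefix = "You are a concise assistant. Answer in one sentence.\n\n"
prompts = [
    system_prefix + "What is PagedAttention?",
    system_prefix + "What is continuous batching?",  # KV cache for the prefix is reused
]
outputs = llm.generate(prompts, SamplingParams(max_tokens=64))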
Quick Start
Installation
# Recommended by the vLLM docs: use uv for faster installation
pip install uv
uv pip install vllm
# Or standard pip (Python 3.12+ recommended)
pip install vllm
# For NVIDIA containers (production)
docker pull vllm/vllm-openai:latest
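A quick way to sanity-check the install before pulling a large model (the small model below is just an example):
# Confirm the package imports and print its version
python -c "import vllm; print(vllm.__version__)"
# Smoke-test the OpenAI-compatible server with a small model, then list served models
vllm serve Qwen/Qwen2.5-0.5B-Instruct --port 8000
curl http://localhost:8000/v1/models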
Basic Usage
from vllm import LLM, SamplingParams
# Initialize model
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=4,  # Use 4 GPUs
)
# Define sampling parameters
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=1024,
)
# Generate completions
prompts = [
    "Explain quantum computing in simple terms:",
    "Write a Python function to sort a list:",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
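Each element of `outputs` is a `RequestOutput` that also carries the original prompt and per-completion metadata; the attribute names below match recent vLLM releases, but check the docs for your version.
first = outputs[0]
completion = first.outputs[0]  # first (and here only) completion for this prompt
print("prompt:       ", first.prompt)
print("finish reason:", completion.finish_reason)  # e.g. "stop" or "length"
print("tokens out:   ", len(completion.token_ids))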
OpenAI-Compatible Server
# Start vLLM server with OpenAI API compatibility
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --port 8000 \
    --api-key your-api-key
# Use with OpenAI client
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="your-api-key",
)
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "Hello!"}
    ],
)
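Because the server implements the OpenAI API, streaming works exactly as it does against the hosted API; a short sketch reusing the `client` defined above:
# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in two sentences."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)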
Hardware Support
vLLM supports diverse hardware (as of January 2026):
| Platform | Support Level | Notes |
|---|---|---|
| NVIDIA GPUs | Full | Primary target, best performance |
| AMD GPUs | Production | ROCm support |
| Intel Gaudi | Plugin | Via habana_frameworks |
| Google TPU | Production | tpu-inference 0.12.0 |
| Intel CPUs | Beta | For development/testing |
| Apple Silicon | Community | Limited support |
# GPU selection
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    # NVIDIA GPU configuration
    tensor_parallel_size=4,
    dtype="float16",  # Or "bfloat16" on Ampere and newer GPUs
    # For AMD GPUs, install the ROCm build of vLLM; the same configuration applies
    # For CPU (testing only), install the CPU build
    # device="cpu"
)
Model Support
vLLM supports most popular model architectures:
# Supported model families (non-exhaustive)
supported_models = [
    # Meta Llama family
    "meta-llama/Llama-3.3-70B-Instruct",
    "meta-llama/Llama-3.1-405B-Instruct",
    # Mistral/Mixtral
    "mistralai/Mistral-7B-Instruct-v0.3",
    "mistralai/Mixtral-8x22B-Instruct-v0.1",
    # Qwen
    "Qwen/Qwen2.5-72B-Instruct",
    # DeepSeek
    "deepseek-ai/DeepSeek-V3",  # Record performance on Blackwell
    # Multimodal
    "llava-hf/llava-v1.6-mistral-7b-hf",
    "Qwen/Qwen2-VL-72B-Instruct",
]
# Check model support
from vllm import LLM
llm = LLM(model="your-model") # Raises error if unsupported
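To check compatibility without loading any weights, one option is to read the architectures field from the model's Hugging Face config and compare it against the supported-models list in the vLLM docs. A small sketch using transformers (already a vLLM dependency):
from transformers import AutoConfig
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-72B-Instruct")
print(config.architectures)  # e.g. ["Qwen2ForCausalLM"]; look this name up in vLLM's docs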
Key Configuration Options
from vllm import LLM
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    # Parallelism
    tensor_parallel_size=4,        # GPUs for tensor parallelism
    pipeline_parallel_size=1,      # Stages for pipeline parallelism
    # Memory
    gpu_memory_utilization=0.9,    # Fraction of GPU memory vLLM may use
    max_model_len=32768,           # Max context length in tokens
    # Quantization
    quantization="fp8",            # Or "awq", "gptq", "squeezellm"
    # Caching
    enable_prefix_caching=True,    # Share KV cache across common prefixes
    # Performance
    enforce_eager=False,           # False = use CUDA graphs (faster)
    enable_chunked_prefill=True,   # Better scheduling of long prefills
)
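The same engine arguments are exposed as `vllm serve` flags when running the OpenAI-compatible server; flag names mirror the Python argument names, but run `vllm serve --help` to confirm for your version.
vllm serve meta-llama/Llama-3.3-70B-Instruct \
    --tensor-parallel-size 4 \
    --pipeline-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 32768 \
    --quantization fp8 \
    --enable-prefix-caching \
    --enable-chunked-prefill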
Understanding vLLM's architecture prepares you for advanced configuration and optimization.
Next, we'll explore production deployment patterns for vLLM.