Performance & Optimization
vLLM for Production
3 min read
When you need maximum throughput and many concurrent users, vLLM is the production-grade inference server. Learn when it's worth upgrading from Ollama.
Ollama vs vLLM
| Aspect | Ollama | vLLM |
|---|---|---|
| Setup | One command | Requires configuration |
| Throughput | Good | Excellent (2-4x higher) |
| Concurrent users | Limited | Designed for scale |
| Batching | Basic | Continuous batching |
| Memory efficiency | Good | PagedAttention (better) |
| Model support | GGUF only | Many formats |
| Best for | Development | Production servers |
When to Consider vLLM
Stay with Ollama if:
- Single user or small team
- Development and testing
- Simplicity is priority
- Memory-constrained environment
Move to vLLM if:
- Multiple concurrent users (>5-10)
- High throughput requirements
- Production deployment
- Need maximum tokens/second
Installing vLLM
# Install vLLM (requires NVIDIA GPU with CUDA)
pip install vllm
# Or with specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
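After installing, a quick sanity check confirms that vLLM imports and that a CUDA device is visible. This is a minimal sketch; it assumes you run it in the same Python environment you installed into:
import torch
import vllm

# Verify the install and GPU visibility
print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())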
Running vLLM Server
# Start vLLM with a model from HuggingFace
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--port 8000
# With quantization (AWQ)
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-7B-AWQ \
--quantization awq \
--port 8000
# With GPU memory limit
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--gpu-memory-utilization 0.9 \
--port 8000
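Once the server is up, you can confirm it is serving your model by hitting the OpenAI-compatible /v1/models endpoint. The sketch below assumes the default port 8000 used above:
import requests

# List the models the running vLLM server exposes
resp = requests.get("http://localhost:8000/v1/models")
print(resp.json())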
vLLM Configuration Options
# vllm_config.py
from vllm import LLM, SamplingParams

# Initialize the vLLM engine
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    tensor_parallel_size=1,       # number of GPUs to shard across
    gpu_memory_utilization=0.9,   # fraction of GPU memory to use (0-1)
    max_model_len=4096,           # max context length
    dtype="auto",                 # precision (auto, float16, bfloat16)
)

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Generate
outputs = llm.generate(["What is machine learning?"], sampling_params)
print(outputs[0].outputs[0].text)
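Because the engine schedules requests together, passing several prompts to a single generate() call is the simplest way to see the throughput benefit. This sketch reuses the llm and sampling_params objects from above:
# Batch several prompts in one call; vLLM schedules them together
prompts = [
    "Summarize the theory of relativity in one sentence.",
    "List three everyday uses of machine learning.",
    "What is a vector database?",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text.strip())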
OpenAI-Compatible API
vLLM provides an OpenAI-compatible endpoint.
from openai import OpenAI

# Point at the vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require an API key by default
)

# Use it exactly like the OpenAI API
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
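Streaming works the same way as with the hosted OpenAI API. The sketch below reuses the client from the previous example:
# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()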
Benchmarking vLLM vs Ollama
import time
import requests
import concurrent.futures

def benchmark_endpoint(url: str, payload: dict, num_requests: int = 50):
    """Benchmark an inference endpoint with concurrent requests."""
    def make_request():
        start = time.time()
        response = requests.post(url, json=payload)
        return time.time() - start, response.status_code

    start_total = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(lambda _: make_request(), range(num_requests)))
    total_time = time.time() - start_total

    latencies = [r[0] for r in results]
    success_count = sum(1 for r in results if r[1] == 200)

    return {
        "total_time": total_time,
        "requests_per_second": num_requests / total_time,
        "avg_latency": sum(latencies) / len(latencies),
        "success_rate": success_count / num_requests,
    }

# Benchmark Ollama
ollama_result = benchmark_endpoint(
    "http://localhost:11434/api/generate",
    {"model": "llama3.2", "prompt": "Hello", "stream": False},
)

# Benchmark vLLM
vllm_result = benchmark_endpoint(
    "http://localhost:8000/v1/completions",
    {"model": "meta-llama/Llama-3.2-3B-Instruct", "prompt": "Hello", "max_tokens": 50},
)

print(f"Ollama: {ollama_result['requests_per_second']:.1f} req/s")
print(f"vLLM:   {vllm_result['requests_per_second']:.1f} req/s")
PagedAttention Advantage
vLLM's key innovation is PagedAttention, which allocates the KV cache in small, on-demand blocks instead of reserving a full context window per request.
PagedAttention Memory Management

Traditional pre-allocation (Ollama):
  Request 1: [██████████░░░░░░░░░░] 50% used, rest pre-allocated and wasted
  Request 2: [████████████░░░░░░░░] 60% used, rest pre-allocated and wasted

PagedAttention (vLLM):
  KV cache:  [R1][R2][R1][R2][R1][R2]...
  Memory is paged dynamically and shared efficiently, fitting roughly
  2-4x more concurrent users in the same GPU memory.
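A few engine arguments interact directly with how PagedAttention manages the KV cache. Exact names and defaults vary by vLLM version, so treat this as a sketch and check the engine arguments for your release (for example via python -m vllm.entrypoints.openai.api_server --help):
from vllm import LLM

# Knobs that shape KV-cache paging; names may differ across vLLM versions
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    gpu_memory_utilization=0.9,  # fraction of VRAM for weights + KV-cache blocks
    max_num_seqs=64,             # upper bound on requests batched concurrently
    swap_space=4,                # GiB of CPU RAM for KV blocks swapped off the GPU
)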
Docker Deployment with vLLM
# docker-compose.yml
version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model meta-llama/Llama-3.2-3B-Instruct
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
# Start vLLM with Docker
docker compose up -d
# Check logs
docker compose logs -f vllm
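Once the container reports ready in the logs, a quick check from the host confirms the API is reachable. This sketch assumes the server exposes a /health route alongside the OpenAI endpoints, as recent vllm/vllm-openai images do:
import requests

# Smoke-test the containerized server from the host
print("health:", requests.get("http://localhost:8000/health").status_code)  # expect 200
print("models:", [m["id"] for m in requests.get("http://localhost:8000/v1/models").json()["data"]])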
Hybrid Architecture
Use both Ollama and vLLM for different needs.
import os

class HybridInference:
    """Route requests between Ollama (dev) and vLLM (prod)."""

    def __init__(self):
        # Backend is selected by the ENVIRONMENT variable (defaults to development)
        self.env = os.getenv("ENVIRONMENT", "development")
        self.ollama_url = "http://localhost:11434"
        self.vllm_url = "http://localhost:8000"

    def generate(self, prompt: str, **kwargs):
        if self.env == "production":
            return self._vllm_generate(prompt, **kwargs)
        return self._ollama_generate(prompt, **kwargs)

    def _ollama_generate(self, prompt: str, **kwargs):
        import ollama
        # Ollama expects generation parameters (e.g. num_predict) under `options`
        response = ollama.generate(model="llama3.2", prompt=prompt, options=kwargs)
        return response["response"]

    def _vllm_generate(self, prompt: str, **kwargs):
        from openai import OpenAI
        client = OpenAI(base_url=f"{self.vllm_url}/v1", api_key="none")
        response = client.completions.create(
            model="meta-llama/Llama-3.2-3B-Instruct",
            prompt=prompt,
            max_tokens=kwargs.get("num_predict", 256),
        )
        return response.choices[0].text

# Usage
inference = HybridInference()
result = inference.generate("Explain transformers in ML")
Decision Matrix
| Requirement | Ollama | vLLM |
|---|---|---|
| Quick setup | Best choice | More complex |
| Development | Best choice | Overkill |
| Production API | Possible | Best choice |
| High concurrency | Limited | Best choice |
| Memory efficiency | Good | Best choice |
| CPU inference | Supported | GPU-focused |
vLLM unlocks production-grade performance. In the next module, we'll deploy local LLMs with Docker and look at scaling strategies.