Performance & Optimization
vLLM for Production
3 min read
When you need maximum throughput and many concurrent users, vLLM is the production-grade inference server. Learn when it's worth upgrading from Ollama.
Ollama vs vLLM
| Aspect | Ollama | vLLM |
|---|---|---|
| Setup | One command | Requires configuration |
| Throughput | Good | Excellent (2-4x higher) |
| Concurrent users | Limited | Designed for scale |
| Batching | Basic | Continuous batching |
| Memory efficiency | Good | PagedAttention (better) |
| Model support | GGUF only | Many formats |
| Best for | Development | Production servers |
When to Consider vLLM
Stay with Ollama if:
- Single user or small team
- Development and testing
- Simplicity is priority
- Memory-constrained environment
Move to vLLM if:
- Multiple concurrent users (>5-10)
- High throughput requirements
- Production deployment
- Need maximum tokens/second
Installing vLLM
# Install vLLM (requires NVIDIA GPU with CUDA)
pip install vllm
# Or with specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
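After installing, a quick sanity check confirms that vLLM imports and that a CUDA device is visible. This is a minimal sketch; it assumes you run it in the same Python environment you installed into:
import torch
import vllm

# Verify the install and GPU visibility
print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())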
Running vLLM Server
# Start vLLM with a model from HuggingFace
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--port 8000
# With quantization (AWQ)
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-7B-AWQ \
--quantization awq \
--port 8000
# With GPU memory limit
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-3.2-3B-Instruct \
--gpu-memory-utilization 0.9 \
--port 8000
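Once the server is up, you can confirm it is serving your model by hitting the OpenAI-compatible /v1/models endpoint. The sketch below assumes the default port 8000 used above:
import requests

# List the models the running vLLM server exposes
resp = requests.get("http://localhost:8000/v1/models")
print(resp.json())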
vLLM Configuration Options
# vllm_config.py
from vllm import LLM, SamplingParams

# Initialize the vLLM engine
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    tensor_parallel_size=1,       # number of GPUs to shard across
    gpu_memory_utilization=0.9,   # fraction of GPU memory to use (0-1)
    max_model_len=4096,           # max context length
    dtype="auto",                 # precision (auto, float16, bfloat16)
)

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Generate
outputs = llm.generate(["What is machine learning?"], sampling_params)
print(outputs[0].outputs[0].text)
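Because the engine schedules requests together, passing several prompts to a single generate() call is the simplest way to see the throughput benefit. This sketch reuses the llm and sampling_params objects from above:
# Batch several prompts in one call; vLLM schedules them together
prompts = [
    "Summarize the theory of relativity in one sentence.",
    "List three everyday uses of machine learning.",
    "What is a vector database?",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text.strip())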
OpenAI-Compatible API
vLLM provides an OpenAI-compatible endpoint.
from openai import OpenAI

# Point at the vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM doesn't require an API key by default
)

# Use it exactly like the OpenAI API
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=256,
)

print(response.choices[0].message.content)
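Streaming works the same way as with the hosted OpenAI API. The sketch below reuses the client from the previous example:
# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()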
Benchmarking vLLM vs Ollama
import time
import requests
import concurrent.futures

def benchmark_endpoint(url: str, payload: dict, num_requests: int = 50):
    """Benchmark an inference endpoint with concurrent requests."""
    def make_request():
        start = time.time()
        response = requests.post(url, json=payload)
        return time.time() - start, response.status_code

    start_total = time.time()
    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(lambda _: make_request(), range(num_requests)))
    total_time = time.time() - start_total

    latencies = [r[0] for r in results]
    success_count = sum(1 for r in results if r[1] == 200)

    return {
        "total_time": total_time,
        "requests_per_second": num_requests / total_time,
        "avg_latency": sum(latencies) / len(latencies),
        "success_rate": success_count / num_requests,
    }

# Benchmark Ollama
ollama_result = benchmark_endpoint(
    "http://localhost:11434/api/generate",
    {"model": "llama3.2", "prompt": "Hello", "stream": False},
)

# Benchmark vLLM
vllm_result = benchmark_endpoint(
    "http://localhost:8000/v1/completions",
    {"model": "meta-llama/Llama-3.2-3B-Instruct", "prompt": "Hello", "max_tokens": 50},
)

print(f"Ollama: {ollama_result['requests_per_second']:.1f} req/s")
print(f"vLLM:   {vllm_result['requests_per_second']:.1f} req/s")
PagedAttention Advantage
vLLM's key innovation is PagedAttention, which allocates the KV cache in small, on-demand blocks instead of reserving a full context window per request.
PagedAttention Memory Management

Traditional pre-allocation (Ollama):
  Request 1: [██████████░░░░░░░░░░] 50% used, rest pre-allocated and wasted
  Request 2: [████████████░░░░░░░░] 60% used, rest pre-allocated and wasted

PagedAttention (vLLM):
  KV cache:  [R1][R2][R1][R2][R1][R2]...
  Memory is paged dynamically and shared efficiently, fitting roughly
  2-4x more concurrent users in the same GPU memory.
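A few engine arguments interact directly with how PagedAttention manages the KV cache. Exact names and defaults vary by vLLM version, so treat this as a sketch and check the engine arguments for your release (for example via python -m vllm.entrypoints.openai.api_server --help):
from vllm import LLM

# Knobs that shape KV-cache paging; names may differ across vLLM versions
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    gpu_memory_utilization=0.9,  # fraction of VRAM for weights + KV-cache blocks
    max_num_seqs=64,             # upper bound on requests batched concurrently
    swap_space=4,                # GiB of CPU RAM for KV blocks swapped off the GPU
)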
Docker Deployment with vLLM
# docker-compose.yml
version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model meta-llama/Llama-3.2-3B-Instruct
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
# Start vLLM with Docker
docker compose up -d
# Check logs
docker compose logs -f vllm
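Once the container reports ready in the logs, a quick check from the host confirms the API is reachable. This sketch assumes the server exposes a /health route alongside the OpenAI endpoints, as recent vllm/vllm-openai images do:
import requests

# Smoke-test the containerized server from the host
print("health:", requests.get("http://localhost:8000/health").status_code)  # expect 200
print("models:", [m["id"] for m in requests.get("http://localhost:8000/v1/models").json()["data"]])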
Hybrid Architecture
Use both Ollama and vLLM for different needs.
import os

class HybridInference:
    """Route requests between Ollama (dev) and vLLM (prod)."""

    def __init__(self):
        # Backend is selected by the ENVIRONMENT variable (defaults to development)
        self.env = os.getenv("ENVIRONMENT", "development")
        self.ollama_url = "http://localhost:11434"
        self.vllm_url = "http://localhost:8000"

    def generate(self, prompt: str, **kwargs):
        if self.env == "production":
            return self._vllm_generate(prompt, **kwargs)
        return self._ollama_generate(prompt, **kwargs)

    def _ollama_generate(self, prompt: str, **kwargs):
        import ollama
        # Ollama expects generation parameters (e.g. num_predict) under `options`
        response = ollama.generate(model="llama3.2", prompt=prompt, options=kwargs)
        return response["response"]

    def _vllm_generate(self, prompt: str, **kwargs):
        from openai import OpenAI
        client = OpenAI(base_url=f"{self.vllm_url}/v1", api_key="none")
        response = client.completions.create(
            model="meta-llama/Llama-3.2-3B-Instruct",
            prompt=prompt,
            max_tokens=kwargs.get("num_predict", 256),
        )
        return response.choices[0].text

# Usage
inference = HybridInference()
result = inference.generate("Explain transformers in ML")
Decision Matrix
| Requirement | Ollama | vLLM |
|---|---|---|
| Quick setup | Best choice | More complex |
| Development | Best choice | Overkill |
| Production API | Possible | Best choice |
| High concurrency | Limited | Best choice |
| Memory efficiency | Good | Best choice |
| CPU inference | Supported | GPU-focused |
vLLM unlocks production-grade performance. In the next module, we'll deploy local LLMs with Docker and look at scaling strategies.