Lesson 19 of 22

Performance & Optimization

vLLM for Production

3 min read

When you need maximum throughput and many concurrent users, vLLM is the production-grade inference server. This lesson covers when it makes sense to upgrade from Ollama.

Ollama vs vLLM

┌─────────────────────────────────────────────────────────────────┐
│                   Ollama vs vLLM Comparison                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Aspect            │ Ollama         │ vLLM                      │
│  ──────────────────│────────────────│─────────────────────────  │
│  Setup             │ One command    │ Requires configuration    │
│  Throughput        │ Good           │ Excellent (2-4x higher)   │
│  Concurrent users  │ Limited        │ Designed for scale        │
│  Batching          │ Basic          │ Continuous batching       │
│  Memory efficiency │ Good           │ PagedAttention (better)   │
│  Model support     │ GGUF only      │ Many formats              │
│  Best for          │ Development    │ Production servers        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

When to Consider vLLM

Stay with Ollama if:

  • Single user or small team
  • Development and testing
  • Simplicity is priority
  • Memory-constrained environment

Move to vLLM if:

  • Multiple concurrent users (>5-10)
  • High throughput requirements
  • Production deployment
  • Need maximum tokens/second

Installing vLLM

# Install vLLM (requires NVIDIA GPU with CUDA)
pip install vllm

# Or with specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
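
A quick sanity check after installing is to import vLLM and confirm PyTorch can see the GPU (a minimal sketch; assumes PyTorch was pulled in as a vLLM dependency):

# verify_install.py - sanity check after installation
import torch
import vllm

print(f"vLLM version: {vllm.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")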

Running vLLM Server

# Start vLLM with a model from HuggingFace
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --port 8000

# With quantization (AWQ)
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-AWQ \
    --quantization awq \
    --port 8000

# With GPU memory limit
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --gpu-memory-utilization 0.9 \
    --port 8000
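
Once the server is running, you can confirm it is up and serving the expected model by querying the OpenAI-compatible /v1/models endpoint (a small sketch; assumes the default port 8000 used above):

# check_server.py - list the models the vLLM server is serving
import requests

resp = requests.get("http://localhost:8000/v1/models")
resp.raise_for_status()
for model in resp.json()["data"]:
    print("Serving:", model["id"])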

vLLM Configuration Options

# vllm_config.py
from vllm import LLM, SamplingParams

# Initialize vLLM engine
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    tensor_parallel_size=1,      # GPUs to use
    gpu_memory_utilization=0.9,  # % of GPU memory
    max_model_len=4096,          # Max context length
    dtype="auto",                # Precision (auto, float16, bfloat16)
)

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Generate
outputs = llm.generate(["What is machine learning?"], sampling_params)
print(outputs[0].outputs[0].text)
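
vLLM's throughput advantage shows up when you hand it many prompts at once: the engine batches them internally instead of processing them one by one. A short sketch reusing the llm and sampling_params objects defined above:

# Batch generation - vLLM schedules all prompts together
prompts = [
    "What is machine learning?",
    "Explain overfitting in one sentence.",
    "Name three use cases for embeddings.",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt[:40], "->", output.outputs[0].text[:60])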

OpenAI-Compatible API

vLLM provides an OpenAI-compatible endpoint.

from openai import OpenAI

# Point to vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require API key by default
)

# Use exactly like OpenAI
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)
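
Because the API is OpenAI-compatible, streaming also works exactly as it does with the hosted API, which is what you want for chat UIs (sketch reuses the client created above):

# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)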

Benchmarking vLLM vs Ollama

import time
import requests
import concurrent.futures

def benchmark_endpoint(url: str, payload: dict, num_requests: int = 50):
    """Benchmark an inference endpoint with concurrent requests."""

    def make_request():
        start = time.time()
        response = requests.post(url, json=payload)
        return time.time() - start, response.status_code

    start_total = time.time()

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(lambda _: make_request(), range(num_requests)))

    total_time = time.time() - start_total
    latencies = [r[0] for r in results]
    success_count = sum(1 for r in results if r[1] == 200)

    return {
        "total_time": total_time,
        "requests_per_second": num_requests / total_time,
        "avg_latency": sum(latencies) / len(latencies),
        "success_rate": success_count / num_requests
    }

# Benchmark Ollama
ollama_result = benchmark_endpoint(
    "http://localhost:11434/api/generate",
    {"model": "llama3.2", "prompt": "Hello", "stream": False}
)

# Benchmark vLLM
vllm_result = benchmark_endpoint(
    "http://localhost:8000/v1/completions",
    {"model": "meta-llama/Llama-3.2-3B-Instruct", "prompt": "Hello", "max_tokens": 50}
)

print(f"Ollama: {ollama_result['requests_per_second']:.1f} req/s")
print(f"vLLM: {vllm_result['requests_per_second']:.1f} req/s")

PagedAttention Advantage

vLLM's key innovation is PagedAttention, which manages the KV cache in small blocks instead of one large pre-allocated buffer per request.

┌─────────────────────────────────────────────────────────────────┐
│                   PagedAttention Memory Management              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Traditional (Ollama):                                          │
│  ┌────────────────────────────────────────┐                    │
│  │ Request 1: [██████████░░░░░░░░░░] 50%  │ Pre-allocated      │
│  │ Request 2: [████████████░░░░░░░░] 60%  │ wasted space       │
│  └────────────────────────────────────────┘                    │
│                                                                 │
│  PagedAttention (vLLM):                                         │
│  ┌────────────────────────────────────────┐                    │
│  │ [R1][R2][R1][R2][R1][R2]...           │ Dynamic paging      │
│  │ Memory shared efficiently              │ ~2-4x more users   │
│  └────────────────────────────────────────┘                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
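
To see why this matters, estimate how much KV cache a single full-context request needs. The numbers below are illustrative assumptions, not the exact Llama 3.2 3B configuration:

# Back-of-the-envelope KV-cache size for one request:
# 2 (K and V) x layers x KV heads x head dim x context length x bytes per value
num_layers = 28
num_kv_heads = 8
head_dim = 128
context_len = 4096
bytes_per_value = 2  # float16

kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value
print(f"KV cache per full-context request: {kv_cache_bytes / 1e9:.2f} GB")
# Pre-allocating this for every request wastes memory when most requests use
# far less context; PagedAttention allocates it in small blocks on demand.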

Docker Deployment with vLLM

# docker-compose.yml
version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model meta-llama/Llama-3.2-3B-Instruct
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

# Start vLLM with Docker
docker compose up -d

# Check logs
docker compose logs -f vllm
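
The first start can take several minutes while the model downloads and loads, so it is worth waiting for the server to report ready before routing traffic to it (a minimal polling sketch; assumes vLLM's /health endpoint on port 8000):

# wait_ready.py - poll until the vLLM container reports healthy
import time
import requests

def wait_for_vllm(url="http://localhost:8000/health", timeout=600):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                print("vLLM is ready")
                return
        except requests.ConnectionError:
            pass
        time.sleep(5)
    raise TimeoutError("vLLM did not become ready in time")

wait_for_vllm()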

Hybrid Architecture

Use both: Ollama for local development and vLLM for production, behind a single interface.

import os

class HybridInference:
    """Route requests between Ollama (dev) and vLLM (prod)."""

    def __init__(self):
        self.env = os.getenv("ENVIRONMENT", "development")
        self.ollama_url = "http://localhost:11434"
        self.vllm_url = "http://localhost:8000"

    def generate(self, prompt: str, **kwargs):
        if self.env == "production":
            return self._vllm_generate(prompt, **kwargs)
        return self._ollama_generate(prompt, **kwargs)

    def _ollama_generate(self, prompt: str, **kwargs):
        import ollama
        # Ollama takes generation parameters (e.g. num_predict) via options
        response = ollama.generate(model="llama3.2", prompt=prompt, options=kwargs)
        return response["response"]

    def _vllm_generate(self, prompt: str, **kwargs):
        from openai import OpenAI
        client = OpenAI(base_url=f"{self.vllm_url}/v1", api_key="none")
        response = client.completions.create(
            model="meta-llama/Llama-3.2-3B-Instruct",
            prompt=prompt,
            max_tokens=kwargs.get("num_predict", 256)
        )
        return response.choices[0].text

# Usage
inference = HybridInference()
result = inference.generate("Explain transformers in ML")

Decision Matrix

Requirement         │ Ollama        │ vLLM
────────────────────│───────────────│──────────────
Quick setup         │ Best choice   │ More complex
Development         │ Best choice   │ Overkill
Production API      │ Possible      │ Best choice
High concurrency    │ Limited       │ Best choice
Memory efficiency   │ Good          │ Best choice
CPU inference       │ Supported     │ GPU only

vLLM unlocks production-grade performance. In the next module, we'll cover deploying local LLMs with Docker and broader scaling strategies.
