Lesson 19 of 22

Performance & Optimization

vLLM for Production

3 min read

When you need maximum throughput and many concurrent users, vLLM is the production-grade inference server. Learn when it makes sense to upgrade from Ollama.

Ollama vs vLLM

┌─────────────────────────────────────────────────────────────────┐
│                   Ollama vs vLLM Comparison                     │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Aspect            │ Ollama         │ vLLM                      │
│  ──────────────────│────────────────│─────────────────────────  │
│  Setup             │ One command    │ Requires configuration    │
│  Throughput        │ Good           │ Excellent (2-4x higher)   │
│  Concurrent users  │ Limited        │ Designed for scale        │
│  Batching          │ Basic          │ Continuous batching       │
│  Memory efficiency │ Good           │ PagedAttention (better)   │
│  Model support     │ GGUF only      │ Many formats              │
│  Best for          │ Development    │ Production servers        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

When to Consider vLLM

Stay with Ollama if:

  • Single user or small team
  • Development and testing
  • Simplicity is priority
  • Memory-constrained environment

Move to vLLM if:

  • Multiple concurrent users (>5-10)
  • High throughput requirements
  • Production deployment
  • Need maximum tokens/second
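
If you want this rule of thumb in code form, here is a minimal sketch; the function and the 5-user threshold are illustrative, taken from the criteria above:

def choose_backend(concurrent_users: int, production: bool) -> str:
    """Illustrative rule of thumb based on the criteria above."""
    if production or concurrent_users > 5:
        return "vllm"    # scale, throughput, production deployment
    return "ollama"      # single user, development, simplicity

print(choose_backend(concurrent_users=2, production=False))   # ollama
print(choose_backend(concurrent_users=20, production=True))   # vllm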

Installing vLLM

# Install vLLM (requires NVIDIA GPU with CUDA)
pip install vllm

# Or with specific CUDA version
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu121
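
A quick sanity check after installing; this assumes the install pulled in a CUDA-enabled PyTorch build, and the version string will differ on your machine:

import torch
import vllm

# Both imports should succeed after installation
print("vLLM version:", vllm.__version__)
print("CUDA available:", torch.cuda.is_available())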

Running vLLM Server

# Start vLLM with a model from HuggingFace
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --port 8000

# With quantization (AWQ)
python -m vllm.entrypoints.openai.api_server \
    --model TheBloke/Llama-2-7B-AWQ \
    --quantization awq \
    --port 8000

# With GPU memory limit
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --gpu-memory-utilization 0.9 \
    --port 8000
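
To confirm the server came up, you can list the models it exposes through the OpenAI-compatible API (this assumes the first command above is running on port 8000):

from openai import OpenAI

# Ask the running server which model(s) it is serving
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
for model in client.models.list().data:
    print(model.id)   # e.g. meta-llama/Llama-3.2-3B-Instruct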

vLLM Configuration Options

# vllm_config.py
from vllm import LLM, SamplingParams

# Initialize vLLM engine
llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    tensor_parallel_size=1,      # GPUs to use
    gpu_memory_utilization=0.9,  # % of GPU memory
    max_model_len=4096,          # Max context length
    dtype="auto",                # Precision (auto, float16, bfloat16)
)

# Configure sampling
sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

# Generate
outputs = llm.generate(["What is machine learning?"], sampling_params)
print(outputs[0].outputs[0].text)
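
Continuous batching is easiest to see with the offline engine: pass several prompts to one generate() call and vLLM schedules them together instead of running them one by one. A short sketch reusing llm and sampling_params from above (the prompts are just examples):

prompts = [
    "What is machine learning?",
    "Explain gradient descent in one sentence.",
    "Give three use cases for embeddings.",
]

# One call, many prompts: the engine batches them internally
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text.strip())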

OpenAI-Compatible API

vLLM exposes an OpenAI-compatible endpoint, so existing OpenAI client code works with nothing more than a changed base URL.

from openai import OpenAI

# Point to vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"  # vLLM doesn't require API key by default
)

# Use exactly like OpenAI
response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[
        {"role": "user", "content": "Explain quantum computing"}
    ],
    temperature=0.7,
    max_tokens=256
)

print(response.choices[0].message.content)
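
Streaming works through the same client, which is useful for showing tokens as they arrive rather than waiting for the full response:

# Stream tokens as they are generated (reuses the client above)
stream = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    temperature=0.7,
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)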

Benchmarking vLLM vs Ollama

import time
import requests
import concurrent.futures

def benchmark_endpoint(url: str, payload: dict, num_requests: int = 50):
    """Benchmark an inference endpoint with concurrent requests."""

    def make_request():
        start = time.time()
        response = requests.post(url, json=payload)
        return time.time() - start, response.status_code

    start_total = time.time()

    with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
        results = list(executor.map(lambda _: make_request(), range(num_requests)))

    total_time = time.time() - start_total
    latencies = [r[0] for r in results]
    success_count = sum(1 for r in results if r[1] == 200)

    return {
        "total_time": total_time,
        "requests_per_second": num_requests / total_time,
        "avg_latency": sum(latencies) / len(latencies),
        "success_rate": success_count / num_requests
    }

# Benchmark Ollama
ollama_result = benchmark_endpoint(
    "http://localhost:11434/api/generate",
    {"model": "llama3.2", "prompt": "Hello", "stream": False}
)

# Benchmark vLLM
vllm_result = benchmark_endpoint(
    "http://localhost:8000/v1/completions",
    {"model": "meta-llama/Llama-3.2-3B-Instruct", "prompt": "Hello", "max_tokens": 50}
)

print(f"Ollama: {ollama_result['requests_per_second']:.1f} req/s")
print(f"vLLM: {vllm_result['requests_per_second']:.1f} req/s")

PagedAttention Advantage

vLLM's key innovation is PagedAttention, which manages the KV cache in small blocks on demand instead of pre-allocating one large buffer per request.

┌─────────────────────────────────────────────────────────────────┐
│                   PagedAttention Memory Management              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Traditional (Ollama):                                          │
│  ┌────────────────────────────────────────┐                    │
│  │ Request 1: [██████████░░░░░░░░░░] 50%  │ Pre-allocated      │
│  │ Request 2: [████████████░░░░░░░░] 60%  │ wasted space       │
│  └────────────────────────────────────────┘                    │
│                                                                 │
│  PagedAttention (vLLM):                                         │
│  ┌────────────────────────────────────────┐                    │
│  │ [R1][R2][R1][R2][R1][R2]...           │ Dynamic paging      │
│  │ Memory shared efficiently              │ ~2-4x more users   │
│  └────────────────────────────────────────┘                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
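
To see why this matters, it helps to estimate how large the KV cache can get. A back-of-the-envelope sketch; the layer, head, and precision numbers are assumptions chosen for illustration, not an exact model config:

# Rough KV-cache size per token:
#   2 (K and V) x layers x kv_heads x head_dim x bytes per value
layers, kv_heads, head_dim, dtype_bytes = 28, 8, 128, 2   # assumed fp16 config
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * dtype_bytes

context_len = 4096
per_request_mb = kv_bytes_per_token * context_len / 1024**2
print(f"~{per_request_mb:.0f} MB if a full {context_len}-token context is pre-allocated")
# PagedAttention allocates this cache in small blocks as it is needed, so memory
# a request never touches can serve other requests instead.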

Docker Deployment with vLLM

# docker-compose.yml
version: '3.8'

services:
  vllm:
    image: vllm/vllm-openai:latest
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    command: >
      --model meta-llama/Llama-3.2-3B-Instruct
      --gpu-memory-utilization 0.9
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

# Start vLLM with Docker
docker compose up -d

# Check logs
docker compose logs -f vllm
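
Since the container can take a while to download and load the model, a small readiness check is handy in scripts that depend on it. A sketch that polls the server's health endpoint (assumes the default port mapping above):

import time
import requests

def wait_for_vllm(url: str = "http://localhost:8000/health", timeout: int = 300) -> bool:
    """Poll the server until it responds, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(url, timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass
        time.sleep(5)   # first start can be slow while the model downloads and loads
    return False

print("vLLM ready:", wait_for_vllm())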

Hybrid Architecture

Use both: Ollama for development and vLLM in production, behind a single interface.

import os

class HybridInference:
    """Route requests between Ollama (dev) and vLLM (prod)."""

    def __init__(self):
        self.env = os.getenv("ENVIRONMENT", "development")
        self.ollama_url = "http://localhost:11434"
        self.vllm_url = "http://localhost:8000"

    def generate(self, prompt: str, **kwargs):
        if self.env == "production":
            return self._vllm_generate(prompt, **kwargs)
        return self._ollama_generate(prompt, **kwargs)

    def _ollama_generate(self, prompt: str, **kwargs):
        import ollama
        # Ollama takes generation settings via its options dict
        options = {"num_predict": kwargs.get("num_predict", 256)}
        response = ollama.generate(model="llama3.2", prompt=prompt, options=options)
        return response["response"]

    def _vllm_generate(self, prompt: str, **kwargs):
        from openai import OpenAI
        client = OpenAI(base_url=f"{self.vllm_url}/v1", api_key="none")
        response = client.completions.create(
            model="meta-llama/Llama-3.2-3B-Instruct",
            prompt=prompt,
            max_tokens=kwargs.get("num_predict", 256)
        )
        return response.choices[0].text

# Usage
inference = HybridInference()
result = inference.generate("Explain transformers in ML")

Decision Matrix

  Requirement        │ Ollama        │ vLLM
  ───────────────────│───────────────│──────────────
  Quick setup        │ Best choice   │ More complex
  Development        │ Best choice   │ Overkill
  Production API     │ Possible      │ Best choice
  High concurrency   │ Limited       │ Best choice
  Memory efficiency  │ Good          │ Best choice
  CPU inference      │ Supported     │ GPU only

vLLM unlocks production-grade performance. In the next module, we'll deploy local LLMs with Docker and cover scaling strategies.
