Lesson 18 of 22

Performance & Optimization

Performance Tuning

3 min read

Optimize Ollama's runtime parameters to maximize inference speed for your specific hardware and use case.

Key Performance Parameters

┌─────────────────────────────────────────────────────────────────┐
│                   Ollama Performance Parameters                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Parameter         │ Effect              │ Trade-off            │
│  ──────────────────│─────────────────────│───────────────────   │
│  num_ctx           │ Context window size │ Speed vs memory      │
│  num_gpu           │ GPU layers          │ Speed vs VRAM        │
│  num_thread        │ CPU threads         │ CPU inference speed  │
│  num_batch         │ Batch size          │ Throughput vs latency│
│  num_predict       │ Max tokens          │ Response length      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
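
All of these are accepted as keys in the request options dict (the Python client passes them straight through to the API). A minimal sketch with illustrative values, not tuned recommendations:

import ollama

response = ollama.generate(
    model="llama3.2",
    prompt="Hello!",
    options={
        "num_ctx": 4096,     # context window size
        "num_gpu": 999,      # offload as many layers as VRAM allows
        "num_thread": 8,     # CPU threads for layers left on the CPU
        "num_batch": 512,    # prompt-processing batch size
        "num_predict": 128   # cap on generated tokens
    }
)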

Context Length Optimization

Context length significantly impacts memory usage and speed.

import ollama

# Default context (2048 tokens) - fastest
response = ollama.generate(
    model="llama3.2",
    prompt="Hello!",
    options={"num_ctx": 2048}
)

# Extended context (8192 tokens) - uses more VRAM
response = ollama.generate(
    model="llama3.2",
    prompt="Long document here...",
    options={"num_ctx": 8192}
)

# Maximum context (131072 for some models) - slowest, most memory
response = ollama.generate(
    model="llama3.2",
    prompt="Very long document...",
    options={"num_ctx": 131072}
)

Approximate memory scaling (fp16 KV cache):

VRAM_needed ≈ model_size + KV_cache
KV_cache ≈ 2 × num_layers × num_ctx × hidden_size × 2 bytes

The leading 2 covers keys and values; the trailing 2 bytes is one fp16 element. Models using grouped-query attention need proportionally less cache.
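
As a rough sense of scale, here is a small calculator built on that estimate. It assumes an fp16 cache with no grouped-query attention, and the 32-layer / 4096-hidden-size figures are illustrative rather than taken from any specific model:

def kv_cache_bytes(num_ctx: int, num_layers: int, hidden_size: int,
                   bytes_per_element: int = 2) -> int:
    """Approximate KV-cache size: keys + values for every layer and position."""
    return 2 * num_layers * num_ctx * hidden_size * bytes_per_element

# Illustrative figures roughly in the range of an 8B model
for ctx in (2048, 8192, 131072):
    gib = kv_cache_bytes(ctx, num_layers=32, hidden_size=4096) / 2**30
    print(f"num_ctx={ctx:>6}: ~{gib:.0f} GiB of KV cache")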

GPU Layer Configuration

Control how many layers run on GPU vs CPU.

# Check current GPU usage
ollama ps

# Run with all layers on GPU (if VRAM allows)
OLLAMA_NUM_GPU=999 ollama run llama3.2

# Run with partial GPU offload (split between GPU and CPU)
OLLAMA_NUM_GPU=20 ollama run llama3.2:70b

# Force CPU-only
OLLAMA_NUM_GPU=0 ollama run llama3.2
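
If you'd rather not restart the server, num_gpu can also be passed per request through the options dict; the layer count below is illustrative:

import ollama

# Offload roughly 20 layers to the GPU for this request only
response = ollama.generate(
    model="llama3.2",
    prompt="Hello!",
    options={"num_gpu": 20}
)

The helper below uses the same per-request option to compare throughput across several layer counts.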

import ollama
import time

def test_gpu_layers(model: str, layer_counts: list[int]):
    """Test inference speed with different GPU layer counts."""
    results = []
    prompt = "Write a short poem about coding."

    for layers in layer_counts:
        start = time.time()
        response = ollama.generate(
            model=model,
            prompt=prompt,
            options={
                "num_gpu": layers,   # GPU layers for this request
                "num_predict": 50
            }
        )
        elapsed = time.time() - start

        # Note: changing num_gpu typically forces a model reload, so the first
        # call at each setting includes load time in its elapsed measurement.
        tokens = response.get("eval_count", 50)
        results.append({
            "gpu_layers": layers,
            "tokens_per_sec": tokens / elapsed
        })

    return results
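
A quick way to exercise the helper; the layer counts are arbitrary and results depend entirely on your hardware:

results = test_gpu_layers("llama3.2", [0, 8, 16, 999])
for r in results:
    print(f"{r['gpu_layers']} GPU layers: {r['tokens_per_sec']:.1f} tok/s")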

Batch Size Tuning

num_batch sets how many prompt tokens are processed per step; larger values usually speed up prompt evaluation but use more memory.

import ollama
import time

def benchmark_batch_sizes(model: str, batch_sizes: list[int]):
    """Compare throughput with different batch sizes."""
    prompt = "Explain machine learning."
    results = []

    for batch in batch_sizes:
        start = time.time()
        response = ollama.generate(
            model=model,
            prompt=prompt,
            options={
                "num_batch": batch,
                "num_predict": 100
            }
        )
        elapsed = time.time() - start

        results.append({
            "batch_size": batch,
            "time": elapsed,
            "tokens_per_sec": response.get("eval_count", 100) / elapsed
        })

    return results

# Test different batch sizes
results = benchmark_batch_sizes("llama3.2", [128, 256, 512, 1024])
for r in results:
    print(f"Batch {r['batch_size']}: {r['tokens_per_sec']:.1f} tok/s")

Thread Configuration

# Set CPU threads (default: auto-detected)
OLLAMA_NUM_THREAD=8 ollama serve

# Optimal: usually num_physical_cores (not hyperthreads)
# On an 8-core CPU with hyperthreading, use 8, not 16

import multiprocessing

def get_optimal_threads():
    """Get recommended thread count for inference."""
    # Physical cores usually work better than logical
    logical = multiprocessing.cpu_count()

    # Estimate physical cores (rough heuristic)
    # Most consumer CPUs have 2 threads per core
    physical = logical // 2

    return {
        "logical_cores": logical,
        "physical_cores": physical,
        "recommended": physical
    }

print(get_optimal_threads())
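
num_thread is also accepted as a per-request option, so the recommendation above can be applied without restarting the server; a minimal sketch:

import ollama

threads = get_optimal_threads()["recommended"]

response = ollama.generate(
    model="llama3.2",
    prompt="Hello!",
    options={"num_thread": threads}
)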

Model Keep-Alive

Keep models loaded to avoid reload latency.

import ollama

# Keep model loaded indefinitely
response = ollama.generate(
    model="llama3.2",
    prompt="Quick question",
    keep_alive=-1  # Never unload
)

# Unload after 5 minutes of inactivity (default)
response = ollama.generate(
    model="llama3.2",
    prompt="Quick question",
    keep_alive="5m"
)

# Unload immediately after response
response = ollama.generate(
    model="llama3.2",
    prompt="One-off query",
    keep_alive=0  # Unload immediately
)

# Set the server-wide default keep-alive time
OLLAMA_KEEP_ALIVE=1h ollama serve
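
A related trick: a generate request with an empty prompt loads the model without producing output, which is a convenient way to warm it up before the first real request. A minimal sketch:

import ollama

# Preload the model and keep it resident for an hour
ollama.generate(model="llama3.2", prompt="", keep_alive="1h")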

Comprehensive Benchmark Script

import ollama
import time
import statistics

def full_benchmark(model: str, iterations: int = 5):
    """Run comprehensive performance benchmark."""
    prompts = [
        "What is 2+2?",  # Short
        "Explain quantum computing in detail.",  # Medium
        "Write a story about a robot learning to paint.",  # Long generation
    ]

    results = {
        "model": model,
        "prompt_processing": [],  # Time to first token
        "generation_speed": [],   # Tokens per second
        "total_time": []
    }

    for prompt in prompts:
        for _ in range(iterations):
            start = time.time()

            response = ollama.generate(
                model=model,
                prompt=prompt,
                options={"num_predict": 100}
            )

            elapsed = time.time() - start

            # Calculate metrics
            eval_count = response.get("eval_count", 100)
            prompt_eval_duration = response.get("prompt_eval_duration", 0) / 1e9
            eval_duration = response.get("eval_duration", elapsed * 1e9) / 1e9

            results["prompt_processing"].append(prompt_eval_duration)
            results["generation_speed"].append(eval_count / eval_duration if eval_duration > 0 else 0)
            results["total_time"].append(elapsed)

    # Aggregate results
    return {
        "model": model,
        "avg_prompt_processing_ms": statistics.mean(results["prompt_processing"]) * 1000,
        "avg_tokens_per_sec": statistics.mean(results["generation_speed"]),
        "avg_total_time_sec": statistics.mean(results["total_time"]),
        "iterations": iterations * len(prompts)
    }

# Run benchmark
benchmark = full_benchmark("llama3.2")
print(f"Model: {benchmark['model']}")
print(f"Prompt Processing: {benchmark['avg_prompt_processing_ms']:.1f}ms")
print(f"Generation Speed: {benchmark['avg_tokens_per_sec']:.1f} tokens/sec")
print(f"Avg Total Time: {benchmark['avg_total_time_sec']:.2f}s")

Environment Variables Summary

# Create optimized Ollama configuration
export OLLAMA_NUM_GPU=999        # Use all GPU layers
export OLLAMA_NUM_THREAD=8       # Match physical CPU cores
export OLLAMA_KEEP_ALIVE=1h      # Keep models loaded
export OLLAMA_MAX_LOADED_MODELS=2 # Allow 2 models in memory
export OLLAMA_FLASH_ATTENTION=1  # Enable flash attention (faster)

# Start Ollama with optimized settings
ollama serve

Performance Tips Summary

┌───────────────────────────────────────────────────────────────────┐
│  Goal              │ Setting                                       │
│  ──────────────────│────────────────────────────────────────────  │
│  Fastest inference │ Lower num_ctx, all GPU layers                 │
│  Longer context    │ Increase num_ctx, may need fewer GPU layers   │
│  Multiple users    │ Increase num_batch, keep models loaded        │
│  Memory savings    │ Reduce num_ctx, use Q4 quantization           │
│  Quick responses   │ Set num_predict limit                         │
└───────────────────────────────────────────────────────────────────┘

Performance tuning is iterative. Next, we'll explore when to upgrade from Ollama to vLLM for production workloads.
