Performance & Optimization
Performance Tuning
Optimize Ollama's runtime parameters to maximize inference speed for your specific hardware and use case.
Key Performance Parameters
Ollama Performance Parameters

| Parameter | Effect | Trade-off |
|---|---|---|
| num_ctx | Context window size | Speed vs memory |
| num_gpu | GPU layers | Speed vs VRAM |
| num_thread | CPU threads | CPU inference speed |
| num_batch | Batch size | Throughput vs latency |
| num_predict | Max tokens | Response length |
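All of these are passed through the `options` field of a request (or pinned in a Modelfile). Below is a minimal sketch combining several of them in one call; the values are illustrative placeholders, not tuned recommendations.

import ollama

# Combine several runtime parameters in a single request.
# The values below are examples, not recommendations for any specific GPU.
response = ollama.generate(
    model="llama3.2",
    prompt="Summarize the benefits of local inference.",
    options={
        "num_ctx": 4096,      # context window in tokens
        "num_gpu": 999,       # offload as many layers as VRAM allows
        "num_thread": 8,      # CPU threads for any non-offloaded work
        "num_batch": 512,     # prompt-processing batch size
        "num_predict": 200,   # cap the response length
    },
)
print(response["response"])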
Context Length Optimization
Context length significantly impacts memory usage and speed.
import ollama
# Default context (2048 tokens) - fastest
response = ollama.generate(
model="llama3.2",
prompt="Hello!",
options={"num_ctx": 2048}
)
# Extended context (8192 tokens) - uses more VRAM
response = ollama.generate(
model="llama3.2",
prompt="Long document here...",
options={"num_ctx": 8192}
)
# Maximum context (131072 for some models) - slowest, most memory
response = ollama.generate(
model="llama3.2",
prompt="Very long document...",
options={"num_ctx": 131072}
)
Memory scaling rule of thumb (the KV cache grows linearly with context length):
VRAM_needed ≈ model_size + KV_cache
KV_cache ≈ 2 (keys and values) × num_layers × num_ctx × kv_hidden_size × bytes_per_element
With fp16 caches, bytes_per_element is 2; models that use grouped-query attention have a kv_hidden_size smaller than the full hidden size, which shrinks the cache accordingly.
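As a rough illustration, here is a small estimator built from the formula above. The default layer count and KV width are placeholders for a Llama-style model, not values read from any particular checkpoint.

def estimate_kv_cache_gb(num_ctx: int,
                         num_layers: int = 32,       # assumed layer count
                         kv_hidden_size: int = 1024, # assumed KV width (heads x head dim)
                         bytes_per_element: int = 2  # fp16 cache
                         ) -> float:
    """Rough KV-cache estimate: 2 (keys + values) x layers x tokens x KV width x bytes."""
    return 2 * num_layers * num_ctx * kv_hidden_size * bytes_per_element / 1024**3

# KV cache alone, for the assumed dimensions above
for ctx in (2048, 8192, 131072):
    print(f"num_ctx={ctx:>6}: ~{estimate_kv_cache_gb(ctx):.2f} GB of KV cache")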
GPU Layer Configuration
Control how many layers run on GPU vs CPU.
# Check which models are loaded and how much of each runs on GPU vs CPU
ollama ps

# GPU offload is controlled with the num_gpu option (the number of layers sent to the GPU):
#   num_gpu 999  -> offload all layers (if VRAM allows)
#   num_gpu 20   -> partial offload, e.g. splitting a 70B model between GPU and CPU
#   num_gpu 0    -> force CPU-only inference
# Pass it per request via the API options (shown below), or pin it with a
# PARAMETER entry in a Modelfile.
import ollama
import time

def test_gpu_layers(model: str, layer_counts: list[int]):
    """Test inference speed with different GPU layer counts."""
    results = []
    prompt = "Write a short poem about coding."
    for layers in layer_counts:
        # Changing num_gpu forces the model to be reloaded, so the first
        # request for each setting also includes load time.
        start = time.time()
        response = ollama.generate(
            model=model,
            prompt=prompt,
            options={
                "num_gpu": layers,     # layers offloaded to the GPU
                "num_predict": 50
            }
        )
        elapsed = time.time() - start
        tokens = response.get("eval_count", 50)
        results.append({
            "gpu_layers": layers,
            "tokens_per_sec": tokens / elapsed
        })
    return results
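A usage sketch follows; the layer counts here are arbitrary examples, since the right range depends on the model's total layer count and your VRAM.

# Compare a few offload settings; 999 means "as many layers as possible"
for result in test_gpu_layers("llama3.2", [0, 16, 32, 999]):
    print(f"num_gpu={result['gpu_layers']:>3}: {result['tokens_per_sec']:.1f} tok/s")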
Batch Size Tuning
num_batch sets how many prompt tokens are evaluated at once; larger batches usually speed up prompt processing but use more memory.
import ollama
import time
def benchmark_batch_sizes(model: str, batch_sizes: list[int]):
"""Compare throughput with different batch sizes."""
prompt = "Explain machine learning."
results = []
for batch in batch_sizes:
start = time.time()
response = ollama.generate(
model=model,
prompt=prompt,
options={
"num_batch": batch,
"num_predict": 100
}
)
elapsed = time.time() - start
results.append({
"batch_size": batch,
"time": elapsed,
"tokens_per_sec": response.get("eval_count", 100) / elapsed
})
return results
# Test different batch sizes
results = benchmark_batch_sizes("llama3.2", [128, 256, 512, 1024])
for r in results:
print(f"Batch {r['batch_size']}: {r['tokens_per_sec']:.1f} tok/s")
Thread Configuration
# CPU threads are controlled with the num_thread request option (default: auto-detected)
# Optimal: usually num_physical_cores (not hyperthreads)
# On an 8-core CPU with hyperthreading, use 8, not 16
import multiprocessing

def get_optimal_threads():
    """Get a recommended thread count for CPU inference."""
    logical = multiprocessing.cpu_count()
    try:
        # psutil reports physical cores directly, if it is installed
        import psutil
        physical = psutil.cpu_count(logical=False) or logical
    except ImportError:
        # Rough heuristic: most consumer CPUs expose 2 threads per core
        physical = max(1, logical // 2)
    return {
        "logical_cores": logical,
        "physical_cores": physical,
        "recommended": physical
    }

print(get_optimal_threads())
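The recommendation can then be passed per request through the num_thread option. A short sketch, mostly relevant when some or all layers run on the CPU:

import ollama

threads = get_optimal_threads()["recommended"]
response = ollama.generate(
    model="llama3.2",
    prompt="Hello!",
    options={"num_thread": threads}
)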
Model Keep-Alive
Keep models loaded to avoid reload latency.
import ollama
# Keep model loaded indefinitely
response = ollama.generate(
model="llama3.2",
prompt="Quick question",
keep_alive=-1 # Never unload
)
# Unload after 5 minutes of inactivity (default)
response = ollama.generate(
model="llama3.2",
prompt="Quick question",
keep_alive="5m"
)
# Unload immediately after response
response = ollama.generate(
model="llama3.2",
prompt="One-off query",
keep_alive=0 # Unload immediately
)
# Set default keep-alive time
OLLAMA_KEEP_ALIVE=1h ollama serve
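A related pattern is to warm a model up before the first real request so users never pay the load latency. One common approach is to send a request with an empty prompt at startup, roughly like this:

import ollama

def preload(model: str, keep_alive: str = "1h") -> None:
    """Load a model into memory ahead of time by sending an empty prompt."""
    ollama.generate(model=model, prompt="", keep_alive=keep_alive)

preload("llama3.2")  # subsequent requests skip the model load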
Comprehensive Benchmark Script
import ollama
import time
import statistics
def full_benchmark(model: str, iterations: int = 5):
"""Run comprehensive performance benchmark."""
prompts = [
"What is 2+2?", # Short
"Explain quantum computing in detail.", # Medium
"Write a story about a robot learning to paint.", # Long generation
]
results = {
"model": model,
"prompt_processing": [], # Time to first token
"generation_speed": [], # Tokens per second
"total_time": []
}
for prompt in prompts:
for _ in range(iterations):
start = time.time()
response = ollama.generate(
model=model,
prompt=prompt,
options={"num_predict": 100}
)
elapsed = time.time() - start
# Calculate metrics
eval_count = response.get("eval_count", 100)
prompt_eval_duration = response.get("prompt_eval_duration", 0) / 1e9
eval_duration = response.get("eval_duration", elapsed * 1e9) / 1e9
results["prompt_processing"].append(prompt_eval_duration)
results["generation_speed"].append(eval_count / eval_duration if eval_duration > 0 else 0)
results["total_time"].append(elapsed)
# Aggregate results
return {
"model": model,
"avg_prompt_processing_ms": statistics.mean(results["prompt_processing"]) * 1000,
"avg_tokens_per_sec": statistics.mean(results["generation_speed"]),
"avg_total_time_sec": statistics.mean(results["total_time"]),
"iterations": iterations * len(prompts)
}
# Run benchmark
benchmark = full_benchmark("llama3.2")
print(f"Model: {benchmark['model']}")
print(f"Prompt Processing: {benchmark['avg_prompt_processing_ms']:.1f}ms")
print(f"Generation Speed: {benchmark['avg_tokens_per_sec']:.1f} tokens/sec")
print(f"Avg Total Time: {benchmark['avg_total_time_sec']:.2f}s")
Environment Variables Summary
# Configure the Ollama server before starting it
export OLLAMA_KEEP_ALIVE=1h          # Keep models loaded for an hour of inactivity
export OLLAMA_MAX_LOADED_MODELS=2    # Allow 2 models in memory at once
export OLLAMA_NUM_PARALLEL=2         # Serve 2 concurrent requests per model
export OLLAMA_FLASH_ATTENTION=1      # Enable flash attention (faster)

# num_gpu, num_thread, num_ctx, and num_batch are per-request options
# (or Modelfile PARAMETER entries) rather than server environment variables.
# Start Ollama with optimized settings
ollama serve
Performance Tips Summary
| Goal | Setting |
|---|---|
| Fastest inference | Lower num_ctx, all GPU layers |
| Longer context | Increase num_ctx, may need fewer GPU layers |
| Multiple users | Increase num_batch, keep models loaded |
| Memory savings | Reduce num_ctx, use Q4 quantization |
| Quick responses | Set num_predict limit |
Performance tuning is iterative. Next, we'll explore when to upgrade from Ollama to vLLM for production workloads.