Performance & Optimization
Quantization Explained
Quantization reduces model precision to shrink memory usage while maintaining quality. It's how you run a 70B model on consumer hardware.
What is Quantization?
┌─────────────────────────────────────────────────────────────────┐
│ Quantization Precision Levels │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Original (FP32) ████████████████████████████████ 32 bits │
│ Half (FP16) ████████████████ 16 bits │
│ 8-bit (Q8) ████████ 8 bits │
│ 5-bit (Q5) █████ 5 bits │
│ 4-bit (Q4) ████ 4 bits │
│ │
│ Lower bits = Smaller size, faster inference, some quality loss │
└─────────────────────────────────────────────────────────────────┘
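The memory savings follow directly from the bit width. Here is a minimal back-of-the-envelope sketch (weights only; it ignores the KV cache and runtime overhead, and the helper name is illustrative):
def approx_weight_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in gigabytes."""
    # params (in billions) * bits / 8 bits-per-byte gives gigabytes directly
    return params_billions * bits_per_weight / 8

# Why a 70B model becomes feasible on consumer hardware
for label, bits in [("FP32", 32), ("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"{label:>4}: ~{approx_weight_size_gb(70, bits):.0f} GB")   # 280, 140, 70, 35 GB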
GGUF Format
Ollama stores models in GGUF (GPT-Generated Unified Format), the llama.cpp file format that has become the de facto standard for quantized models.
# Model tags indicate the quantization level (exact tags vary by model;
# check the model's page in the Ollama library)
ollama pull llama3.1:8b                    # Default (usually Q4_K_M)
ollama pull llama3.1:8b-instruct-q4_0      # Basic 4-bit
ollama pull llama3.1:8b-instruct-q4_K_M    # K-quant medium (recommended)
ollama pull llama3.1:8b-instruct-q5_K_M    # Higher quality
ollama pull llama3.1:8b-instruct-q8_0      # Near-original quality
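After pulling, you can confirm which quantization you actually received. A small sketch using the Python client's show() call; the details block mirrors Ollama's /api/show response, and the tag is just an example:
import ollama

# Inspect an installed model (use a tag you have pulled)
info = ollama.show("llama3.1:8b")

# The show response carries a "details" block with the format and quantization level
# (recent client versions return a typed object that still supports key access)
print(info["details"]["format"])               # e.g. gguf
print(info["details"]["quantization_level"])   # e.g. Q4_K_M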
Quantization Types Comparison
┌─────────────────────────────────────────────────────────────────┐
│ Quantization Quality vs Size │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Type │ Bits │ Size (7B) │ Quality │ Use Case │
│ ─────────│──────│───────────│─────────│──────────────────── │
│ Q4_0 │ 4 │ ~3.5 GB │ Good │ Maximum compression │
│ Q4_K_M │ 4 │ ~4.0 GB │ Better │ Best balance ★ │
│ Q4_K_S │ 4 │ ~3.8 GB │ Good │ Smaller Q4 │
│ Q5_K_M │ 5 │ ~4.8 GB │ Great │ Quality priority │
│ Q5_K_S │ 5 │ ~4.5 GB │ Good │ Balanced Q5 │
│ Q6_K │ 6 │ ~5.5 GB │ Excellent│ Near-lossless │
│ Q8_0 │ 8 │ ~7.0 GB │ Excellent│ Minimal loss │
│ F16 │ 16 │ ~14 GB │ Original│ No compression │
│ │
└─────────────────────────────────────────────────────────────────┘
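The sizes above run slightly higher than a plain params × bits ÷ 8 estimate because every block of quantized weights also stores its scale factor (and K-quants keep a few tensors at higher precision). Q4_0 is the simplest case and shows the overhead; the block layout below follows llama.cpp, and the figures in the table are rounded approximations:
# Q4_0 block layout: 32 weights stored as 4-bit values (16 bytes) plus one
# FP16 scale (2 bytes), i.e. 18 bytes per 32 weights
BLOCK_WEIGHTS = 32
BLOCK_BYTES = 16 + 2

bits_per_weight = BLOCK_BYTES * 8 / BLOCK_WEIGHTS
print(bits_per_weight)   # 4.5 -- a "4-bit" file is really ~4.5 bits per weight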
K-Quants Explained
K-quants (the K_S, K_M, and K_L suffixes) use mixed precision: the most quality-sensitive layers get more bits.
┌─────────────────────────────────────────────────────────────────┐
│ K-Quant Layer Strategy │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Layer Type Q4_K_S Q4_K_M Q4_K_L │
│ ───────────────── ────── ────── ────── │
│ Attention (Q,K,V) 4-bit 5-bit 6-bit │
│ Feed-forward 4-bit 4-bit 5-bit │
│ Output layers 5-bit 6-bit 6-bit │
│ │
│ _S = Small (faster), _M = Medium (balanced), _L = Large │
└─────────────────────────────────────────────────────────────────┘
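The practical effect of mixing precisions is an average bit width a bit above the nominal 4 bits. A purely illustrative sketch: the per-layer bit assignments come from the table above, while the parameter fractions are rough assumptions for a Llama-style transformer:
# Rough share of parameters by layer type (assumed, Llama-style model)
PARAM_FRACTION = {"attention": 0.33, "feed_forward": 0.62, "output": 0.05}

# Q4_K_M bit assignments from the table above
Q4_K_M_BITS = {"attention": 5, "feed_forward": 4, "output": 6}

avg_bits = sum(PARAM_FRACTION[layer] * Q4_K_M_BITS[layer] for layer in PARAM_FRACTION)
print(f"Q4_K_M averages roughly {avg_bits:.1f} bits per weight")   # ~4.4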
Practical Quantization Selection
import ollama

def get_available_quantizations(model_name: str) -> list[str]:
    """List quantization variants of a model that are already installed locally."""
    # Common quantization suffixes (suffix naming varies by model family; adjust as needed)
    quants = ["q4_0", "q4_K_S", "q4_K_M", "q5_0", "q5_K_S", "q5_K_M", "q8_0"]
    available = []
    for q in quants:
        tag = f"{model_name}:{q}"
        try:
            ollama.show(tag)  # raises ResponseError if the tag is not installed
            available.append(tag)
        except ollama.ResponseError:
            pass
    return available
def recommend_quantization(vram_gb: float, model_params_b: float) -> str:
    """Recommend the highest-quality quantization that fits in available VRAM."""
    # Rough VRAM estimate per billion parameters (weights only)
    vram_per_b = {
        "q4_K_M": 0.5,    # ~0.5 GB per billion params
        "q5_K_M": 0.625,
        "q8_0": 1.0,
        "f16": 2.0,
    }
    fits = []
    for quant, ratio in vram_per_b.items():
        needed = model_params_b * ratio * 1.2  # +20% for KV cache and overhead
        if needed <= vram_gb:
            fits.append((quant, needed))
    if fits:
        # Largest footprint that still fits = highest quality
        return max(fits, key=lambda item: item[1])[0]
    return "Model too large for available VRAM"

# Examples
print(recommend_quantization(8, 7))     # 8 GB VRAM, 7B model   -> q5_K_M
print(recommend_quantization(24, 34))   # 24 GB VRAM, 34B model -> q4_K_M
Quality Impact Testing
import ollama
import time

def compare_quantizations(prompt: str, model_base: str, quants: list[str]):
    """Compare response quality across quantization levels."""
    results = []
    for quant in quants:
        model = f"{model_base}:{quant}" if quant else model_base
        try:
            start = time.time()
            response = ollama.generate(
                model=model,
                prompt=prompt,
                options={"num_predict": 100}
            )
            elapsed = time.time() - start
            results.append({
                "quantization": quant or "default",
                "response": response["response"][:200],
                "tokens_per_second": response.get("eval_count", 0) / elapsed,
                "time": elapsed
            })
        except Exception as e:
            results.append({
                "quantization": quant,
                "error": str(e)
            })
    return results
# Compare Q4 vs Q8 quality (pull both tags first; exact tags vary by model)
prompt = "Explain quantum computing in simple terms."
results = compare_quantizations(prompt, "llama3.1", ["8b-instruct-q4_K_M", "8b-instruct-q8_0"])
for r in results:
    print(f"\n{r['quantization']}:")
    print(f"  Speed: {r.get('tokens_per_second', 0):.1f} tokens/sec")
    print(f"  Sample: {r.get('response', r.get('error', ''))[:100]}...")
When to Use Each Quantization
| Quantization | Use Case |
|---|---|
| Q4_K_M | Default choice, best size/quality balance |
| Q4_K_S | Memory constrained, need smallest 4-bit |
| Q5_K_M | Quality matters more than speed |
| Q8_0 | Maximum quality, have plenty of VRAM |
| F16 | Benchmarking, comparing to original |
Creating Custom Quantized Models
# Modelfile: builds on an already-quantized base model
FROM llama3.2
# System prompt for specialized use
SYSTEM You are a coding assistant.
# These parameters work well with Q4 quantization
PARAMETER temperature 0.7
PARAMETER top_p 0.9
# Create the custom model
ollama create coding-assistant -f Modelfile
# The resulting model inherits the base model's quantization
ollama show coding-assistant               # model details include the quantization level
ollama show coding-assistant --modelfile   # inspect the generated Modelfile
Key Takeaways
- Q4_K_M is the sweet spot for most use cases
- K-quants preserve quality in important layers
- Higher bits = better quality but more VRAM needed
- Test your specific use case - quality impact varies by task
Quantization makes large models accessible. Next, we'll explore runtime performance tuning to maximize speed.