Lesson 17 of 22

Performance & Optimization

Quantization Explained


Quantization reduces the numerical precision of a model's weights to shrink memory usage with only a small loss in quality. It's how you run a 70B model on consumer hardware.

What is Quantization?

┌─────────────────────────────────────────────────────────────────┐
│                   Quantization Precision Levels                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Original (FP32)     ████████████████████████████████  32 bits  │
│  Half (FP16)         ████████████████                  16 bits  │
│  8-bit (Q8)          ████████                          8 bits   │
│  5-bit (Q5)          █████                             5 bits   │
│  4-bit (Q4)          ████                              4 bits   │
│                                                                 │
│  Lower bits = Smaller size, faster inference, some quality loss │
└─────────────────────────────────────────────────────────────────┘
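
A quick back-of-envelope check makes the diagram concrete: weight memory is roughly parameter count times bytes per weight. The sketch below ignores the KV cache and runtime overhead, and the 7B parameter count is just an example:

def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone (no KV cache or overhead)."""
    return params_billions * bits_per_weight / 8  # GB, since 1B params at 1 byte each is ~1 GB

for bits in (32, 16, 8, 5, 4):
    print(f"7B model at {bits:>2}-bit: ~{weight_memory_gb(7, bits):.1f} GB")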

GGUF Format

Ollama uses GGUF (GPT-Generated Unified Format), the llama.cpp model format that has become the standard for quantized local models.

# Model tags indicate quantization level
# (exact tag names vary by model; check the tag list on ollama.com before pulling)
ollama pull llama3.1:8b        # Default (usually Q4_K_M)
ollama pull llama3.1:8b-q4_0   # Basic 4-bit
ollama pull llama3.1:8b-q4_K_M # K-quant medium (recommended)
ollama pull llama3.1:8b-q5_K_M # Higher quality
ollama pull llama3.1:8b-q8_0   # Near-original quality
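
After pulling a model you can confirm which quantization you actually got. A minimal sketch using the Python client; it assumes llama3.1:8b is already pulled locally, and the field names follow Ollama's /api/show response:

import ollama

info = ollama.show("llama3.1:8b")      # metadata for a locally pulled model
details = info["details"]
print(details["quantization_level"])   # e.g. Q4_K_M
print(details["parameter_size"])       # e.g. 8.0B
print(details["format"])               # gguf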

Quantization Types Comparison

┌─────────────────────────────────────────────────────────────────┐
│                   Quantization Quality vs Size                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Type     │ Bits │ Size (7B) │ Quality │ Use Case              │
│  ─────────│──────│───────────│─────────│────────────────────   │
│  Q4_0     │ 4    │ ~3.5 GB   │ Good    │ Maximum compression   │
│  Q4_K_M   │ 4    │ ~4.0 GB   │ Better  │ Best balance ★        │
│  Q4_K_S   │ 4    │ ~3.8 GB   │ Good    │ Smaller Q4            │
│  Q5_K_M   │ 5    │ ~4.8 GB   │ Great   │ Quality priority      │
│  Q5_K_S   │ 5    │ ~4.5 GB   │ Good    │ Balanced Q5           │
│  Q6_K     │ 6    │ ~5.5 GB   │ Excellent│ Near-lossless        │
│  Q8_0     │ 8    │ ~7.0 GB   │ Excellent│ Minimal loss         │
│  F16      │ 16   │ ~14 GB    │ Original│ No compression        │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
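
The Bits column is nominal. Dividing the file sizes in the table by the parameter count gives the effective bits per weight, which shows why a K-quant file is slightly larger than its plain counterpart (sizes copied from the table above, 7B parameters assumed):

# Effective bits per weight = file size in bits / number of parameters
sizes_gb = {"Q4_0": 3.5, "Q4_K_M": 4.0, "Q5_K_M": 4.8, "Q8_0": 7.0, "F16": 14.0}
params = 7e9

for name, gb in sizes_gb.items():
    print(f"{name:7s} ~{gb * 1e9 * 8 / params:.1f} bits/weight")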

K-Quants Explained

K-quants (K_S, K_M, K_L) use mixed precision—important layers get more bits.

┌─────────────────────────────────────────────────────────────────┐
│                   K-Quant Layer Strategy                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Layer Type           Q4_K_S    Q4_K_M    Q4_K_L               │
│  ─────────────────    ──────    ──────    ──────               │
│  Attention (Q,K,V)    4-bit     5-bit     6-bit                │
│  Feed-forward         4-bit     4-bit     5-bit                │
│  Output layers        5-bit     6-bit     6-bit                │
│                                                                 │
│  _S = Small (faster), _M = Medium (balanced), _L = Large       │
└─────────────────────────────────────────────────────────────────┘
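
Because the higher-precision tensors make up a minority of the parameters, the blended cost stays close to the nominal bit width. A rough illustration using the Q4_K_M column above; the layer-size fractions are made-up round numbers, not measurements from a real model:

# Illustrative parameter split (hypothetical fractions, not measured)
fractions = {"attention": 0.30, "feed_forward": 0.65, "output": 0.05}

# Bit widths per layer group, from the Q4_K_M column in the table above
bits_q4_k_m = {"attention": 5, "feed_forward": 4, "output": 6}

avg_bits = sum(fractions[k] * bits_q4_k_m[k] for k in fractions)
print(f"Blended average: ~{avg_bits:.1f} bits/weight")  # ~4.4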

Practical Quantization Selection

import ollama

def get_available_quantizations(model_name: str) -> list[str]:
    """List available quantization variants for a model."""
    # Common quantization suffixes
    quants = ["q4_0", "q4_K_S", "q4_K_M", "q5_0", "q5_K_S", "q5_K_M", "q8_0"]

    available = []
    for q in quants:
        tag = f"{model_name}:{q}"
        try:
            ollama.show(tag)  # raises ResponseError if this tag isn't pulled locally
            available.append(tag)
        except ollama.ResponseError:
            pass

    return available

def recommend_quantization(vram_gb: float, model_params_b: float) -> str:
    """Recommend quantization based on available VRAM."""
    # Rough weights-only VRAM per billion parameters (GB), ordered
    # from most compressed to least compressed
    vram_per_b = {
        "q4_K_M": 0.5,  # ~0.5 GB per billion params
        "q5_K_M": 0.625,
        "q8_0": 1.0,
        "f16": 2.0,
    }

    recommendations = []
    for quant, ratio in vram_per_b.items():
        needed = model_params_b * ratio * 1.2  # 20% overhead for KV cache and buffers
        if needed <= vram_gb:
            recommendations.append((quant, needed))

    if recommendations:
        # Appended from most to least compressed, so the last fit is the highest quality
        return recommendations[-1][0]
    return "Model too large for available VRAM"

# Examples
print(recommend_quantization(8, 7))   # 8GB VRAM, 7B model  -> q5_K_M
print(recommend_quantization(24, 34)) # 24GB VRAM, 34B model -> q4_K_M

Quality Impact Testing

import ollama
import time

def compare_quantizations(prompt: str, model_base: str, quants: list[str]):
    """Compare response quality across quantization levels."""
    results = []

    for quant in quants:
        model = f"{model_base}:{quant}" if quant else model_base

        try:
            start = time.time()
            response = ollama.generate(
                model=model,
                prompt=prompt,
                options={"num_predict": 100}
            )
            elapsed = time.time() - start

            # Prefer Ollama's own generation timing (eval_duration is in
            # nanoseconds); fall back to wall clock, which includes load time
            eval_s = response.get("eval_duration", 0) / 1e9 or elapsed

            results.append({
                "quantization": quant or "default",
                "response": response["response"][:200],
                "tokens_per_second": response.get("eval_count", 0) / eval_s,
                "time": elapsed
            })
        except Exception as e:
            results.append({
                "quantization": quant,
                "error": str(e)
            })

    return results

# Compare Q4 vs Q8 quality (both tags must already be pulled locally)
prompt = "Explain quantum computing in simple terms."
results = compare_quantizations(prompt, "llama3.1", ["8b-q4_K_M", "8b-q8_0"])

for r in results:
    print(f"\n{r['quantization']}:")
    print(f"  Speed: {r.get('tokens_per_second', 0):.1f} tokens/sec")
    print(f"  Sample: {r.get('response', r.get('error', ''))[:100]}...")

When to Use Each Quantization

Quantization   Use Case
Q4_K_M         Default choice, best size/quality balance
Q4_K_S         Memory constrained, need smallest 4-bit
Q5_K_M         Quality matters more than speed
Q8_0           Maximum quality, have plenty of VRAM
F16            Benchmarking, comparing to original

Creating Custom Quantized Models

# Modelfile that builds on an already-quantized base model
FROM llama3.2

# System prompt for specialized use
SYSTEM You are a coding assistant.

# Sampling parameters for the assistant (independent of the quantization level)
PARAMETER temperature 0.7
PARAMETER top_p 0.9

# Create the custom model from the Modelfile
ollama create coding-assistant -f Modelfile

# The resulting model inherits the base model's quantization
ollama show coding-assistant --modelfile

Key Takeaways

  1. Q4_K_M is the sweet spot for most use cases
  2. K-quants preserve quality in important layers
  3. Higher bits = better quality but more VRAM needed
  4. Test your specific use case - quality impact varies by task

Quantization makes large models accessible. Next, we'll explore runtime performance tuning to maximize speed.
