Lesson 16 of 22

Performance & Optimization

Hardware Requirements

3 min read

Understanding hardware requirements helps you choose the right setup for running local LLMs efficiently. Different models have vastly different needs.

Key Hardware Components

┌─────────────────────────────────────────────────────────────────┐
│                   Hardware Impact on LLM Performance            │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Component      │ Impact           │ What to Look For           │
│  ───────────────│──────────────────│───────────────────────     │
│  GPU VRAM       │ Model size limit │ 8GB+ for 7B, 24GB+ for 34B │
│  System RAM     │ Fallback memory  │ 16GB minimum, 32GB+ ideal  │
│  CPU            │ Prompt processing│ Modern multi-core          │
│  Storage        │ Model loading    │ SSD strongly recommended   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

VRAM Requirements by Model Size

┌─────────────────────────────────────────────────────────────────┐
│                   VRAM Requirements (Q4 and Q8)                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Model Size     │ Q4 VRAM  │ Q8 VRAM  │ Example Models          │
│  ───────────────│──────────│──────────│────────────────────     │
│  1B parameters  │ ~1 GB    │ ~2 GB    │ llama3.2:1b, tinyllama  │
│  3B parameters  │ ~2 GB    │ ~4 GB    │ llama3.2:3b             │
│  7-8B params    │ ~4-5 GB  │ ~8 GB    │ llama3.1:8b, mistral    │
│  13B parameters │ ~8 GB    │ ~14 GB   │ codellama:13b           │
│  34B parameters │ ~20 GB   │ ~36 GB   │ codellama:34b           │
│  70B parameters │ ~40 GB   │ ~70 GB   │ llama3.1:70b            │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
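
These figures follow a simple rule of thumb: parameter count times bytes per parameter for the chosen quantization, plus roughly 20% for the KV cache and runtime overhead. A quick back-of-the-envelope check for the 7-8B row (approximate figures; actual Ollama downloads vary slightly):

# Rough Q4 estimate for an 8B-parameter model (~0.5 bytes per parameter)
params = 8e9
weights_gb = params * 0.5 / (1024**3)   # ~3.7 GB of weights
total_gb = weights_gb * 1.2             # +20% for KV cache and runtime overhead
print(f"weights ~{weights_gb:.1f} GB, recommended ~{total_gb:.1f} GB")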

Platform-Specific Guidance

Apple Silicon (M1/M2/M3/M4)

Apple Silicon is excellent for local LLMs because its unified memory is shared between the CPU and GPU, so the models you can run are limited by total system memory rather than by dedicated VRAM.

# Check your Mac's unified memory
system_profiler SPHardwareDataType | grep "Memory"

# Ollama automatically uses Metal (GPU) on Apple Silicon
ollama run llama3.2  # Uses GPU by default
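
To confirm a model is actually using the GPU, check the PROCESSOR column of ollama ps (for example "100% GPU"). A small sketch that surfaces CPU fallback, assuming the column wording used by recent Ollama releases:

import subprocess

# List currently loaded models; the PROCESSOR column reports how much of
# each model is placed on the GPU versus the CPU.
output = subprocess.run(["ollama", "ps"], capture_output=True, text=True).stdout
print(output)

if "CPU" in output:
    print("Warning: a loaded model is running partly or fully on the CPU.")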

Apple Silicon recommendations:

  • M1/M2 (8GB): Up to 7B models (Q4 quantization)
  • M1/M2 Pro (16GB): Up to 13B models comfortably
  • M1/M2 Max (32GB): Up to 34B models
  • M1/M2 Ultra (64GB+): 70B models possible

NVIDIA GPUs

# Check NVIDIA GPU and VRAM
nvidia-smi

# Example output shows available VRAM
# |   0  NVIDIA GeForce RTX 4090   |   00000000:01:00.0  On |   24576MiB |

NVIDIA recommendations:

  • RTX 3060 (12GB): 7B-13B models
  • RTX 3090/4090 (24GB): Up to 34B models (Q4)
  • A100 (40GB/80GB): Full 70B models

CPU-Only Mode

Running on CPU is possible but significantly slower.

# Force CPU-only mode (useful for testing). Set the variable for the
# process that runs the Ollama server; if Ollama runs as a background
# service, it must be set there rather than on the client.
CUDA_VISIBLE_DEVICES="" ollama run llama3.2:1b

# On a Mac there is no CUDA; to keep inference on the CPU, set the
# num_gpu option to 0 (no layers offloaded to the Metal GPU), as shown
# in the API sketch below.
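
If you would rather control this per request, Ollama's HTTP API accepts a num_gpu option (the number of layers to offload to the GPU); setting it to 0 keeps inference on the CPU. A minimal sketch using only the Python standard library, assuming the server is listening on its default port 11434 and llama3.2:1b has already been pulled:

import json
import urllib.request

# Ask the local Ollama server to answer this request entirely on the CPU
# by offloading zero layers to the GPU.
payload = {
    "model": "llama3.2:1b",
    "prompt": "Say hello in five words.",
    "stream": False,
    "options": {"num_gpu": 0},  # 0 GPU layers -> CPU-only inference
}

request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])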

Memory Management

import subprocess

def check_ollama_memory():
    """Check current Ollama memory usage."""
    result = subprocess.run(
        ["ollama", "ps"],
        capture_output=True,
        text=True
    )
    print(result.stdout)

def estimate_model_memory(params_billions: float, quantization: str = "Q4") -> dict:
    """Estimate memory requirements for a model."""
    # Approximate bytes per parameter by quantization
    bytes_per_param = {
        "Q4": 0.5,   # 4-bit: ~0.5 bytes/param
        "Q5": 0.625, # 5-bit: ~0.625 bytes/param
        "Q8": 1.0,   # 8-bit: ~1 byte/param
        "F16": 2.0,  # 16-bit: ~2 bytes/param
    }

    bpp = bytes_per_param.get(quantization, 0.5)
    model_size_gb = (params_billions * 1e9 * bpp) / (1024**3)

    # Add ~20% overhead for KV cache and runtime
    total_needed = model_size_gb * 1.2

    return {
        "model_size_gb": round(model_size_gb, 1),
        "recommended_vram_gb": round(total_needed, 1),
        "quantization": quantization
    }

# Example calculations
for size in [1, 3, 7, 13, 34, 70]:
    req = estimate_model_memory(size)
    print(f"{size}B model (Q4): ~{req['recommended_vram_gb']} GB VRAM needed")

Checking Your System

import platform
import subprocess

def get_system_info():
    """Get relevant system information for LLM inference."""
    info = {
        "platform": platform.system(),
        "processor": platform.processor(),
        "python_version": platform.python_version(),
    }

    # Check for Apple Silicon
    if platform.system() == "Darwin":
        try:
            result = subprocess.run(
                ["sysctl", "-n", "hw.memsize"],
                capture_output=True, text=True
            )
            ram_bytes = int(result.stdout.strip())
            info["ram_gb"] = round(ram_bytes / (1024**3), 1)
            info["apple_silicon"] = platform.processor() == "arm"
        except (ValueError, OSError):
            pass

    # Check for NVIDIA GPU
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            info["nvidia_gpu"] = result.stdout.strip()
        else:
            info["nvidia_gpu"] = None
    except FileNotFoundError:
        info["nvidia_gpu"] = None

    return info

# Check your system
system = get_system_info()
for key, value in system.items():
    print(f"{key}: {value}")

Recommendations Summary

┌─────────────────────────────────────────────────────────────────────────────────┐
│                       Hardware Recommendations by Budget                        │
├─────────────────────────────────────────────────────────────────────────────────┤
│                                                                                 │
│  Budget                 │ Hardware                     │ Best Models            │
│  ───────────────────────│──────────────────────────────│────────────────────────│
│  Entry ($0)             │ Existing laptop (8GB RAM)    │ llama3.2:1b, phi3:mini │
│  Budget ($500-1000)     │ Mac Mini M2 (16GB)           │ llama3.1:8b, mistral   │
│  Mid-range ($1500-2500) │ Mac Studio M2 Max (32GB)     │ codellama:34b, mixtral │
│  High-end ($3000+)      │ RTX 4090 or Mac Studio Ultra │ llama3.1:70b           │
│                                                                                 │
└─────────────────────────────────────────────────────────────────────────────────┘
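
The same table works as a small lookup. The thresholds below mirror the memory tiers above and assume Q4 quantization; treat them as rough guidance rather than hard limits:

def suggest_models(memory_gb: float) -> str:
    """Suggest starting models for a given amount of unified memory or VRAM."""
    if memory_gb > 32:
        return "llama3.1:70b"
    if memory_gb > 16:
        return "codellama:34b, mixtral"
    if memory_gb > 8:
        return "llama3.1:8b, mistral"
    return "llama3.2:1b, phi3:mini"

for gb in (8, 16, 32, 64):
    print(f"{gb} GB -> {suggest_models(gb)}")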

Hardware determines your model ceiling. In the next lesson, we'll explore how quantization lets you fit larger models in less memory.
