Performance & Optimization
Hardware Requirements
Understanding hardware requirements helps you choose the right setup for running local LLMs efficiently. Different models have vastly different needs.
Key Hardware Components
┌─────────────────────────────────────────────────────────────────┐
│ Hardware Impact on LLM Performance │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Component │ Impact │ What to Look For │
│ ───────────────│──────────────────│─────────────────────── │
│ GPU VRAM │ Model size limit │ 8GB+ for 7B, 24GB+ for 34B │
│ System RAM │ Fallback memory │ 16GB minimum, 32GB+ ideal │
│ CPU │ Prompt processing│ Modern multi-core │
│ Storage │ Model loading │ SSD strongly recommended │
│ │
└─────────────────────────────────────────────────────────────────┘
VRAM Requirements by Model Size
┌─────────────────────────────────────────────────────────────────┐
│ VRAM Requirements (Quantized Q4) │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Model Size │ Q4 VRAM │ Q8 VRAM │ Example Models │
│ ───────────────│──────────│──────────│──────────────────── │
│ 1B parameters │ ~1 GB │ ~2 GB │ llama3.2:1b, tinyllama │
│ 3B parameters │ ~2 GB │ ~4 GB │ llama3.2:3b │
│ 7-8B params │ ~4-5 GB │ ~8 GB │ llama3.1:8b, mistral │
│ 13B parameters │ ~8 GB │ ~14 GB │ codellama:13b │
│ 34B parameters │ ~20 GB │ ~36 GB │ codellama:34b │
│ 70B parameters │ ~40 GB │ ~70 GB │ llama3.1:70b │
│ │
└─────────────────────────────────────────────────────────────────┘
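These figures come from a simple back-of-envelope calculation: parameter count times bytes per parameter, plus roughly 20% headroom for the KV cache and runtime (a reusable Python version of this estimate appears later in this lesson). For example, for an 8B model at Q4:
# ~0.5 bytes per parameter at 4-bit quantization
weights_gb = 8e9 * 0.5 / 1024**3   # ≈ 3.7 GB of weights
print(round(weights_gb * 1.2, 1))  # ≈ 4.5 GB with ~20% runtime overhead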
Platform-Specific Guidance
Apple Silicon (M1/M2/M3/M4)
Apple Silicon is excellent for local LLMs due to unified memory architecture.
# Check your Mac's unified memory
system_profiler SPHardwareDataType | grep "Memory"
# Ollama automatically uses Metal (GPU) on Apple Silicon
ollama run llama3.2 # Uses GPU by default
Apple Silicon recommendations:
- M1/M2 (8GB): Up to 7B models (Q4 quantization)
- M1/M2 Pro (16GB): Up to 13B models comfortably
- M1/M2 Max (32GB): Up to 34B models
- M1/M2 Ultra (64GB+): 70B models possible
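As a rough rule of thumb, the tiers above can be turned into a lookup; this is just the bullet list encoded as Python (the thresholds are guidance, not hard limits, since macOS and other apps also need memory):
def max_apple_silicon_model(unified_memory_gb: float) -> str:
    """Rough Q4 model ceiling for a given amount of unified memory."""
    if unified_memory_gb >= 64:
        return "up to 70B"
    if unified_memory_gb >= 32:
        return "up to 34B"
    if unified_memory_gb >= 16:
        return "up to 13B"
    return "up to 7B"
print(max_apple_silicon_model(16))  # up to 13B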
NVIDIA GPUs
# Check NVIDIA GPU and VRAM
nvidia-smi
# Example output shows available VRAM
# | 0 NVIDIA GeForce RTX 4090 | 00000000:01:00.0 On | 24576MiB |
NVIDIA recommendations:
- RTX 3060 (12GB): 7B-13B models
- RTX 3090/4090 (24GB): Up to 34B models (Q4)
- A100 (40GB/80GB): 70B models (Q4 on 40GB, Q8 on 80GB)
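To apply these tiers programmatically, you can read total VRAM from nvidia-smi. A minimal sketch, assuming the nvidia-smi CLI is on your PATH (it reports memory in MiB):
import subprocess
def nvidia_vram_gb():
    """Total VRAM of the first NVIDIA GPU in GB, or None if nvidia-smi is unavailable."""
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return None
    # One line per GPU, value in MiB; take the first GPU
    return round(float(result.stdout.strip().splitlines()[0]) / 1024, 1)
print(nvidia_vram_gb())  # e.g. 24.0 for an RTX 4090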
CPU-Only Mode
Running on CPU is possible but significantly slower.
# Force CPU-only mode (useful for testing)
# Note: GPU selection happens in the Ollama server process, so if Ollama runs
# as a background service, set these variables where `ollama serve` is launched
CUDA_VISIBLE_DEVICES="" ollama run llama3.2:1b
# On Mac, disable Metal
OLLAMA_NO_METAL=1 ollama run llama3.2:1b
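If you drive Ollama through its HTTP API instead of the CLI, you can also keep a single request on the CPU. A minimal sketch, assuming a local server on the default port 11434 and that your Ollama version honors the num_gpu option (the number of layers offloaded to the GPU; 0 keeps everything on the CPU):
import json
import urllib.request
payload = {
    "model": "llama3.2:1b",
    "prompt": "Why is the sky blue?",
    "stream": False,
    "options": {"num_gpu": 0},  # assumption: 0 GPU layers forces CPU inference
}
request = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    print(json.loads(response.read())["response"])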
Memory Management
These helpers show what Ollama currently has loaded and estimate how much memory a given model needs.
import subprocess
def check_ollama_memory():
    """Check current Ollama memory usage."""
    # `ollama ps` lists loaded models and how much memory each one is using
    result = subprocess.run(
        ["ollama", "ps"],
        capture_output=True,
        text=True
    )
    print(result.stdout)
def estimate_model_memory(params_billions: float, quantization: str = "Q4") -> dict:
"""Estimate memory requirements for a model."""
# Approximate bytes per parameter by quantization
bytes_per_param = {
"Q4": 0.5, # 4-bit: ~0.5 bytes/param
"Q5": 0.625, # 5-bit: ~0.625 bytes/param
"Q8": 1.0, # 8-bit: ~1 byte/param
"F16": 2.0, # 16-bit: ~2 bytes/param
}
bpp = bytes_per_param.get(quantization, 0.5)
model_size_gb = (params_billions * 1e9 * bpp) / (1024**3)
# Add ~20% overhead for KV cache and runtime
total_needed = model_size_gb * 1.2
return {
"model_size_gb": round(model_size_gb, 1),
"recommended_vram_gb": round(total_needed, 1),
"quantization": quantization
}
# Example calculations
for size in [1, 3, 7, 13, 34, 70]:
req = estimate_model_memory(size)
print(f"{size}B model (Q4): ~{req['recommended_vram_gb']} GB VRAM needed")
Checking Your System
This snippet gathers the details that matter for local inference: the platform, installed memory (on macOS), and any visible NVIDIA GPU.
import platform
import subprocess
def get_system_info():
"""Get relevant system information for LLM inference."""
info = {
"platform": platform.system(),
"processor": platform.processor(),
"python_version": platform.python_version(),
}
# Check for Apple Silicon
if platform.system() == "Darwin":
try:
result = subprocess.run(
["sysctl", "-n", "hw.memsize"],
capture_output=True, text=True
)
ram_bytes = int(result.stdout.strip())
info["ram_gb"] = round(ram_bytes / (1024**3), 1)
info["apple_silicon"] = platform.processor() == "arm"
except:
pass
# Check for NVIDIA GPU
try:
result = subprocess.run(
["nvidia-smi", "--query-gpu=name,memory.total",
"--format=csv,noheader"],
capture_output=True, text=True
)
if result.returncode == 0:
info["nvidia_gpu"] = result.stdout.strip()
except FileNotFoundError:
info["nvidia_gpu"] = None
return info
# Check your system
system = get_system_info()
for key, value in system.items():
print(f"{key}: {value}")
Recommendations Summary
| Budget | Hardware | Best Models |
|---|---|---|
| Entry ($0) | Existing laptop (8GB RAM) | llama3.2:1b, phi3:mini |
| Budget ($500-1000) | Mac Mini M2 (16GB) | llama3.1:8b, mistral |
| Mid-range ($1500-2500) | Mac Studio M2 Max (32GB) | codellama:34b, mixtral |
| High-end ($3000+) | RTX 4090 (partial CPU offload) or Mac Studio Ultra | llama3.1:70b |
Hardware determines your model ceiling. In the next lesson, we'll explore how quantization lets you fit larger models in less memory.