Why Local LLMs?
Local vs API Tradeoffs
3 min read
Choosing between local and cloud LLMs isn't binary. Understanding the tradeoffs helps you make the right decision for each project.
The Decision Matrix
┌─────────────────────────────────────────────────────────────────┐
│                   Local vs API Decision Matrix                   │
├───────────────────┬──────────────────┬──────────────────────────┤
│ Factor            │ Favors Local     │ Favors Cloud API         │
├───────────────────┼──────────────────┼──────────────────────────┤
│ Data sensitivity  │ High/Regulated   │ Public/Non-sensitive     │
│ Volume            │ High (>10K/day)  │ Low (<1K/day)            │
│ Latency needs     │ Consistent <500ms│ Variable OK              │
│ Budget            │ CapEx available  │ Prefer OpEx              │
│ Team expertise    │ ML ops capable   │ Limited ML knowledge     │
│ Model requirements│ Standard tasks   │ Cutting-edge needed      │
│ Deployment        │ On-premise/Edge  │ Cloud-native             │
└───────────────────┴──────────────────┴──────────────────────────┘
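To make the matrix concrete, here is a rough scoring sketch. The factor names and the equal-weight tally are illustrative assumptions, not part of the matrix itself:

# Illustrative only: tally how many factors favor local deployment.
def recommend_deployment(factors: dict[str, bool]) -> str:
    """Each value is True when that factor favors local deployment."""
    local_votes = sum(factors.values())
    return "local" if local_votes >= len(factors) - local_votes else "cloud"

print(recommend_deployment({
    "sensitive_data": True,        # regulated data must stay in-house
    "high_volume": True,           # >10K queries/day
    "strict_latency_sla": True,    # needs consistent <500ms
    "capex_budget": False,         # prefer OpEx
    "mlops_team": False,           # limited ML expertise
    "standard_tasks": True,        # no cutting-edge model required
    "on_prem_deployment": True,    # edge/on-premise target
}))  # -> "local"

In practice you would weight data sensitivity far more heavily than the other factors, as the hybrid patterns later in this post show.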
Detailed Comparison
Model Capabilities
# Task: Complex multi-step reasoning
# Cloud APIs still lead here
import ollama
import openai

# GPT-4 Turbo - excellent at complex reasoning from a bare prompt
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": """Analyze this legal contract for:
        1. Potential liabilities
        2. Unusual clauses
        3. Missing standard provisions
        4. Recommendations for negotiation

        Contract: ..."""
    }]
)

# Llama 3.1 70B - very good, but benefits from more explicit guidance
response = ollama.chat(
    model="llama3.1:70b",
    messages=[{
        "role": "user",
        "content": """You are a contract analyst.
        Think step by step.

        Analyze this legal contract for:
        1. Potential liabilities
        2. Unusual clauses
        3. Missing standard provisions
        4. Recommendations for negotiation

        Contract: ..."""
    }]
)
Hardware Requirements
Model Size to Hardware Mapping:
───────────────────────────────────────────────────────
Model Size │ VRAM Needed │ Example Hardware
───────────────────────────────────────────────────────
1-3B       │ 2-4 GB      │ Any modern laptop
7-8B       │ 6-8 GB      │ RTX 3060, M1 Pro
13B        │ 10-12 GB    │ RTX 3080, M2 Pro
30-34B     │ 20-24 GB    │ RTX 4090, M2 Max
70B        │ 40-48 GB    │ A100 40GB, 2x RTX 4090
70B (Q4)   │ 35-40 GB    │ M3 Max 64GB, A6000
───────────────────────────────────────────────────────
For reference:
- MacBook Air M3: Up to 8B models comfortably
- MacBook Pro M3 Max: Up to 70B quantized
- Gaming PC (RTX 4090): Up to 30B full, 70B quantized
- Cloud GPU (A100): Any model size
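These numbers follow a simple rule of thumb: weight memory is roughly parameters times bytes per parameter, plus headroom for the KV cache and activations. A minimal sketch, assuming an illustrative 20% overhead factor:

# Rough VRAM estimate: weights plus ~20% overhead for KV cache and activations.
# The overhead factor is an illustrative assumption, not a measured value.
BYTES_PER_PARAM = {"fp16": 2.0, "q8": 1.0, "q4": 0.5}

def estimate_vram_gb(params_billions: float, quant: str = "q4",
                     overhead: float = 0.20) -> float:
    weights_gb = params_billions * BYTES_PER_PARAM[quant]
    return round(weights_gb * (1 + overhead), 1)

print(estimate_vram_gb(8, "q4"))   # ~4.8 GB -> fits an RTX 3060 or M1 Pro
print(estimate_vram_gb(70, "q4"))  # ~42 GB  -> A6000 / M3 Max 64GB territory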
Cost Analysis Over Time
5-Year Total Cost of Ownership:
────────────────────────────────────────────
Scenario A: 50,000 queries/day
Cloud API (GPT-4 Turbo):
├── Year 1: $90,000
├── Year 2: $90,000
├── Year 3: $90,000
├── Year 4: $90,000
├── Year 5: $90,000
└── Total: $450,000
Local LLM (4x A100 cluster):
├── Year 1: $60,000 (hardware) + $12,000 (ops)
├── Year 2: $12,000
├── Year 3: $12,000 + $30,000 (upgrade)
├── Year 4: $12,000
├── Year 5: $12,000
└── Total: $150,000
Savings: $300,000 (67% reduction)
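The arithmetic is easy to rerun for your own volume; here is a minimal sketch using the figures from the scenario above (every input is one of the scenario's assumptions):

# Reproduce the 5-year TCO comparison above; all numbers are scenario assumptions.
def api_tco(years: int = 5, annual_spend: int = 90_000) -> int:
    return annual_spend * years

def local_tco(years: int = 5, hardware: int = 60_000,
              ops_per_year: int = 12_000,
              upgrade: int = 30_000, upgrade_year: int = 3) -> int:
    total = hardware + ops_per_year * years
    if years >= upgrade_year:
        total += upgrade
    return total

print(api_tco())                # 450000
print(local_tco())              # 150000
print(api_tco() - local_tco())  # 300000 saved over five years

Break-even moves earlier as volume grows; below roughly 1K queries/day (per the decision matrix) the API column wins on both cost and simplicity.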
Hybrid Architecture Patterns
The best systems often combine both:
import ollama
import openai


class ModelRouter:
    """Route requests to local or cloud based on requirements."""

    def __init__(self):
        self.local_client = ollama
        self.cloud_client = openai.OpenAI()

    def route(self, task: str, data_sensitivity: str,
              complexity: str) -> str:
        """
        Determine the optimal model for the request.

        Args:
            task: Description of the task
            data_sensitivity: "public", "internal", "confidential"
            complexity: "simple", "moderate", "complex"
        """
        # Confidential data = always local
        if data_sensitivity == "confidential":
            return "local"
        # Complex tasks with non-sensitive data = cloud
        if complexity == "complex" and data_sensitivity == "public":
            return "cloud"
        # Everything else = local (cost savings)
        return "local"

    def query(self, prompt: str, **kwargs):
        route = self.route(**kwargs)
        if route == "local":
            return self.local_client.chat(
                model="llama3.1:8b",
                messages=[{"role": "user", "content": prompt}]
            )
        return self.cloud_client.chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": prompt}]
        )
Common Hybrid Patterns
Pattern 1: Sensitive Data Gateway
┌─────────────────────────────────────────┐
│              User Request               │
└───────────────────┬─────────────────────┘
                    ▼
┌─────────────────────────────────────────┐
│             Data Classifier             │
│      (PII detection, sensitivity)       │
└───────────────────┬─────────────────────┘
                    ▼
          ┌─────────┴─────────┐
          │                   │
    ┌─────▼─────┐       ┌─────▼─────┐
    │ Sensitive │       │  Public   │
    │   Data    │       │   Data    │
    └─────┬─────┘       └─────┬─────┘
          │                   │
    ┌─────▼─────┐       ┌─────▼─────┐
    │   Local   │       │   Cloud   │
    │   Llama   │       │   GPT-4   │
    └───────────┘       └───────────┘
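A minimal sketch of this gateway, assuming two illustrative regex checks stand in for a real PII classifier (a production system would use a proper detector such as Presidio):

import re

import ollama
import openai

# Illustrative PII patterns only; swap in a real classifier for production use.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),         # US SSN
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),   # email address
]

def contains_pii(text: str) -> bool:
    return any(p.search(text) for p in PII_PATTERNS)

def gateway(prompt: str) -> str:
    messages = [{"role": "user", "content": prompt}]
    if contains_pii(prompt):
        # Sensitive data stays on local hardware
        response = ollama.chat(model="llama3.1:8b", messages=messages)
        return response["message"]["content"]
    # Public data can use the stronger cloud model
    client = openai.OpenAI()
    resp = client.chat.completions.create(model="gpt-4-turbo", messages=messages)
    return resp.choices[0].message.content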
Pattern 2: Complexity Escalation
┌─────────────────────────────────────────┐
│             Initial Request             │
└───────────────────┬─────────────────────┘
                    ▼
┌─────────────────────────────────────────┐
│         Local LLM (Fast, Cheap)         │
│              Llama 3.1 8B               │
└───────────────────┬─────────────────────┘
                    ▼
┌─────────────────────────────────────────┐
│            Confidence Check             │
│    "Is the answer reliable enough?"     │
└───────────────────┬─────────────────────┘
          ┌─────────┴─────────┐
      High│                   │Low
          ▼                   ▼
┌─────────────────┐   ┌─────────────────┐
│  Return Answer  │   │  Escalate to    │
│                 │   │  GPT-4 / Claude │
└─────────────────┘   └─────────────────┘
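A sketch of the escalation loop, assuming confidence is approximated by asking the local model to rate its own answer (a crude stand-in; log-prob or verifier-based checks are more robust in practice):

import ollama
import openai

CONFIDENCE_THRESHOLD = 0.7  # illustrative cutoff; tune per workload

def local_answer_with_confidence(prompt: str) -> tuple[str, float]:
    answer = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "user", "content": prompt}],
    )["message"]["content"]
    # Crude self-rating: ask the local model to score its own answer from 0 to 1.
    rating = ollama.chat(
        model="llama3.1:8b",
        messages=[{"role": "user",
                   "content": f"Question: {prompt}\nAnswer: {answer}\n"
                              "Rate the answer's reliability from 0 to 1. "
                              "Reply with only the number."}],
    )["message"]["content"]
    try:
        confidence = float(rating.strip())
    except ValueError:
        confidence = 0.0
    return answer, confidence

def answer(prompt: str) -> str:
    local, confidence = local_answer_with_confidence(prompt)
    if confidence >= CONFIDENCE_THRESHOLD:
        return local  # high confidence: keep the cheap local answer
    # Low confidence: escalate to the cloud model
    client = openai.OpenAI()
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content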
Key Takeaways
| Decision Factor | Local Choice | Cloud Choice |
|---|---|---|
| Data leaves network | Never acceptable | Acceptable |
| >10K queries/day | Cost effective | Expensive |
| Need GPT-4+ quality | Use 70B+ models | Best option |
| Latency SLA <500ms | Achievable | Harder to guarantee |
| No ML team | Ollama makes it easy | Easier to start |
The trend is clear: local LLMs are handling a growing share of production workloads, with cloud APIs reserved for the tasks that genuinely need cutting-edge capability.