Lesson 2 of 22

Why Local LLMs?

Local vs API Tradeoffs

3 min read

Choosing between local and cloud LLMs isn't binary. Understanding the tradeoffs helps you make the right decision for each project.

The Decision Matrix

┌─────────────────────────────────────────────────────────────────┐
│                   Local vs API Decision Matrix                   │
├───────────────────┬──────────────────┬──────────────────────────┤
│ Factor            │ Favors Local     │ Favors Cloud API         │
├───────────────────┼──────────────────┼──────────────────────────┤
│ Data sensitivity  │ High/Regulated   │ Public/Non-sensitive     │
│ Volume            │ High (>10K/day)  │ Low (<1K/day)            │
│ Latency needs     │ Consistent <500ms│ Variable OK              │
│ Budget            │ CapEx available  │ Prefer OpEx              │
│ Team expertise    │ ML ops capable   │ Limited ML knowledge     │
│ Model requirements│ Standard tasks   │ Cutting-edge needed      │
│ Deployment        │ On-premise/Edge  │ Cloud-native             │
└───────────────────┴──────────────────┴──────────────────────────┘

Detailed Comparison

Model Capabilities

import ollama
import openai

# Task: Complex multi-step reasoning
# Cloud APIs still lead here

# GPT-4 - Excellent at complex reasoning
client = openai.OpenAI()
response = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{
        "role": "user",
        "content": """Analyze this legal contract for:
        1. Potential liabilities
        2. Unusual clauses
        3. Missing standard provisions
        4. Recommendations for negotiation"""
    }]
)

# Llama 3.1 70B - Very good, but may need more guidance
response = ollama.chat(
    model="llama3.1:70b",
    messages=[{
        "role": "user",
        "content": """You are a contract analyst.
        Think step by step.

        Analyze this legal contract for:
        1. Potential liabilities
        2. Unusual clauses
        3. Missing standard provisions
        4. Recommendations for negotiation

        Contract: ..."""
    }]
)

Hardware Requirements

Model Size to Hardware Mapping:
───────────────────────────────────────────────────────
Model Size    │ VRAM Needed  │ Example Hardware
───────────────────────────────────────────────────────
1-3B          │ 2-4 GB       │ Any modern laptop
7-8B          │ 6-8 GB       │ RTX 3060, M1 Pro
13B           │ 10-12 GB     │ RTX 3080, M2 Pro
30-34B        │ 20-24 GB     │ RTX 4090, M2 Max
70B           │ 40-48 GB     │ A100 80GB, 2x RTX 4090
70B (Q4)      │ 35-40 GB     │ M3 Max 64GB, A6000
───────────────────────────────────────────────────────

For reference:
- MacBook Air M3: Up to 8B models comfortably
- MacBook Pro M3 Max: Up to 70B quantized
- Gaming PC (RTX 4090): Up to 30B full, 70B quantized
- Cloud GPU (A100): Any model size
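
As a rough rule of thumb, the weights need about (parameters × bits per parameter ÷ 8) gigabytes, plus roughly 20% overhead for the KV cache and activations. A minimal sketch of that back-of-the-envelope math (the 1.2 overhead factor is an assumption; real usage grows with context length):

def estimate_vram_gb(params_billion: float, bits_per_param: int = 16,
                     overhead: float = 1.2) -> float:
    """Rough estimate: weight memory plus ~20% for KV cache and activations."""
    weight_gb = params_billion * bits_per_param / 8
    return weight_gb * overhead

# 8B at FP16, 8B quantized to 4 bits, 70B quantized to 4 bits
print(f"8B  FP16: {estimate_vram_gb(8, 16):.0f} GB")   # ~19 GB
print(f"8B  Q4:   {estimate_vram_gb(8, 4):.0f} GB")    # ~5 GB
print(f"70B Q4:   {estimate_vram_gb(70, 4):.0f} GB")   # ~42 GB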

Cost Analysis Over Time

5-Year Total Cost of Ownership:
────────────────────────────────────────────

Scenario A: 50,000 queries/day

Cloud API (GPT-4 Turbo):
├── Year 1: $90,000
├── Year 2: $90,000
├── Year 3: $90,000
├── Year 4: $90,000
├── Year 5: $90,000
└── Total: $450,000

Local LLM (4x A100 cluster):
├── Year 1: $60,000 (hardware) + $12,000 (ops)
├── Year 2: $12,000
├── Year 3: $12,000 + $30,000 (upgrade)
├── Year 4: $12,000
├── Year 5: $12,000
└── Total: $150,000

Savings: $300,000 (67% reduction)
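
These totals are easy to re-derive for your own volumes. A quick sketch that roughly mirrors Scenario A (the $0.005 blended per-query API cost and the hardware figures are assumptions, not quotes):

def five_year_tco(queries_per_day: int = 50_000,
                  api_cost_per_query: float = 0.005,   # assumed blended GPT-4 Turbo cost
                  hardware: float = 60_000,            # 4x A100 cluster, year 1
                  upgrade: float = 30_000,             # mid-life refresh, year 3
                  ops_per_year: float = 12_000,
                  years: int = 5) -> tuple[float, float]:
    """Return (cloud_total, local_total) over the horizon."""
    cloud = queries_per_day * 365 * api_cost_per_query * years
    local = hardware + upgrade + ops_per_year * years
    return cloud, local

cloud, local = five_year_tco()
print(f"Cloud: ${cloud:,.0f}   Local: ${local:,.0f}   Savings: ${cloud - local:,.0f}")
# Cloud: $456,250   Local: $150,000   Savings: $306,250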

Hybrid Architecture Patterns

The best systems often combine both:

import ollama
import openai

class ModelRouter:
    """Route requests to local or cloud based on requirements."""

    def __init__(self):
        self.local_client = ollama
        self.cloud_client = openai.OpenAI()

    def route(self, task: str, data_sensitivity: str,
              complexity: str) -> str:
        """
        Determine optimal model for the request.

        Args:
            task: Description of the task
            data_sensitivity: "public", "internal", "confidential"
            complexity: "simple", "moderate", "complex"
        """
        # Confidential data = always local
        if data_sensitivity == "confidential":
            return "local"

        # Complex tasks with non-sensitive data = cloud
        if complexity == "complex" and data_sensitivity == "public":
            return "cloud"

        # Everything else = local (cost savings)
        return "local"

    def query(self, prompt: str, **kwargs):
        route = self.route(**kwargs)

        if route == "local":
            return self.local_client.chat(
                model="llama3.2:8b",
                messages=[{"role": "user", "content": prompt}]
            )
        else:
            return self.cloud_client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[{"role": "user", "content": prompt}]
            )
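
Used like this (assuming Ollama is running locally and an OpenAI API key is configured in the environment):

router = ModelRouter()

# Confidential data never leaves the network
summary = router.query(
    "Summarize this internal incident report: ...",
    task="summarization",
    data_sensitivity="confidential",
    complexity="simple",
)

# Public, complex analysis is worth the cloud call
analysis = router.query(
    "Compare three vector databases for a RAG pipeline.",
    task="analysis",
    data_sensitivity="public",
    complexity="complex",
)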

Common Hybrid Patterns

Pattern 1: Sensitive Data Gateway

┌─────────────────────────────────────────┐
│              User Request               │
└───────────────────┬─────────────────────┘
┌─────────────────────────────────────────┐
│           Data Classifier               │
│      (PII detection, sensitivity)       │
└───────────────────┬─────────────────────┘
          ┌─────────┴─────────┐
          │                   │
    ┌─────▼─────┐       ┌─────▼─────┐
    │ Sensitive │       │   Public  │
    │   Data    │       │   Data    │
    └─────┬─────┘       └─────┬─────┘
          │                   │
    ┌─────▼─────┐       ┌─────▼─────┐
    │  Local    │       │  Cloud    │
    │  Llama    │       │  GPT-4    │
    └───────────┘       └───────────┘
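
A minimal sketch of the classifier step, with a few regex patterns standing in for a real PII detector (the patterns, model tags, and the gateway function are illustrative assumptions, not a production-grade detector):

import re

import ollama
import openai

PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),       # card-like number
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),     # email address
]

def contains_pii(text: str) -> bool:
    """Crude PII check; real systems use a dedicated detector."""
    return any(p.search(text) for p in PII_PATTERNS)

def gateway(prompt: str) -> str:
    """Sensitive prompts stay on the local model; public ones go to the cloud."""
    if contains_pii(prompt):
        reply = ollama.chat(model="llama3.1:8b",
                            messages=[{"role": "user", "content": prompt}])
        return reply["message"]["content"]
    reply = openai.OpenAI().chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content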

Pattern 2: Complexity Escalation

┌─────────────────────────────────────────┐
│           Initial Request               │
└───────────────────┬─────────────────────┘
┌─────────────────────────────────────────┐
│        Local LLM (Fast, Cheap)          │
│              Llama 3.2 8B               │
└───────────────────┬─────────────────────┘
┌─────────────────────────────────────────┐
│         Confidence Check                │
│   "Is the answer reliable enough?"      │
└───────────────────┬─────────────────────┘
          ┌─────────┴─────────┐
      High│                   │Low
          ▼                   ▼
┌─────────────────┐  ┌─────────────────┐
│  Return Answer  │  │  Escalate to    │
│                 │  │  GPT-4 / Claude │
└─────────────────┘  └─────────────────┘
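
One simple way to implement the confidence check is to let the local model flag its own uncertainty and escalate when it does (the ESCALATE sentinel and the 20-character floor are arbitrary thresholds chosen for illustration):

import ollama
import openai

ESCALATE = "ESCALATE"

def answer_with_escalation(question: str) -> str:
    """Try the local model first; fall back to the cloud on low confidence."""
    local = ollama.chat(
        model="llama3.1:8b",
        messages=[{
            "role": "user",
            "content": (f"{question}\n\nIf you are not confident in your answer, "
                        f"reply with exactly the word {ESCALATE}."),
        }],
    )
    text = local["message"]["content"].strip()

    # Escalate on an explicit flag or a suspiciously short answer
    if ESCALATE in text or len(text) < 20:
        cloud = openai.OpenAI().chat.completions.create(
            model="gpt-4-turbo",
            messages=[{"role": "user", "content": question}],
        )
        return cloud.choices[0].message.content
    return text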

Key Takeaways

┌─────────────────────┬──────────────────────┬─────────────────────┐
│ Decision Factor     │ Local Choice         │ Cloud Choice        │
├─────────────────────┼──────────────────────┼─────────────────────┤
│ Data leaves network │ Never acceptable     │ Acceptable          │
│ >10K queries/day    │ Cost effective       │ Expensive           │
│ Need GPT-4+ quality │ Use 70B+ models      │ Best option         │
│ Latency SLA <500ms  │ Achievable           │ Harder to guarantee │
│ No ML team          │ Ollama makes it easy │ Easier to start     │
└─────────────────────┴──────────────────────┴─────────────────────┘

The trend is clear: local LLMs are becoming the default for routine production workloads, with cloud APIs reserved for tasks that genuinely need cutting-edge capabilities.
