# The open-weight landscape — what is actually shippable
So far this course has compared three frontier APIs: Claude, GPT, Gemini. They share a property — closed weights, hosted by the vendor, charged per token. The other half of the prompt-engineering landscape is open-weight models that you can download and run yourself, on your own infrastructure or on a third-party host that does not own the weights. Llama, Mistral, Qwen, DeepSeek, Phi. The list keeps growing.
You need to know this part of the landscape because for many of the tasks Hagar's CTO is paying Claude for, an open-weight model running on a small GPU could do the job for a fraction of the cost.
## Model landscape — frontier closed → on-device
### The four families that matter in 2026
| Family | Maker | Where it lands |
|---|---|---|
| Llama 4 (8B / 70B / 405B) | Meta | The "default" open-weight choice in 2026; widely supported |
| Mistral Large 3 + Mixtral | Mistral AI | Strong on European languages and instruction-following |
| Qwen 3 (small + reasoning variants) | Alibaba | Strong on Chinese, English, and code; permissive licensing |
| DeepSeek V3 / R1 | DeepSeek | Aggressive cost/quality ratio; the reasoning variant (R1) is the open-weight answer to o1 |
Phi-4 from Microsoft fills a smaller niche: extremely small models (3-4B parameters) that run on phones and edge devices. Use Phi when the latency budget is sub-100 ms or when you need to run on-device.
## What "shippable" actually means
A model is shippable for your task if it satisfies all four:
- Can it actually run on the hardware budget you have? A 70B model needs ~40 GB of GPU VRAM at int4 quantisation, ~140 GB at fp16. A 405B model is multi-GPU territory. An 8B model fits comfortably on a single consumer GPU. Pick the smallest model that does the job.
- Does the prompt that worked on Claude / GPT / Gemini still work here? Often, no. Open-weight models tend to be more sensitive to prompt phrasing and to follow few-shot examples more literally. The next lessons cover this.
- What is the licence? Llama's is permissive but restricts training-data extraction and use by very large companies. Mistral's open-source models are Apache 2.0, as is Qwen for most variants; DeepSeek's varies. Read the actual licence before you ship — your legal team will care.
- What is the inference cost end-to-end? A free model is not free if you pay AWS for the GPU. Compare it against the frontier API's per-token price at the same task volume: at low volumes the frontier API is cheaper, but at high volumes (millions of requests per day) open-weight on your own infrastructure wins.
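The VRAM figures in the first item are just parameter count × bytes per parameter. A minimal sketch (function name and the headroom note are illustrative, not from any vendor's sizing guide):

```python
def weight_vram_gb(params_billion: float, bytes_per_param: float) -> float:
    """VRAM for the weights alone. Real deployments also need room for
    the KV cache and activations -- budget another 10-20% on top."""
    return params_billion * bytes_per_param

# bytes per parameter: fp16 = 2.0, int8 = 1.0, int4 = 0.5
for size in (8, 70, 405):
    for fmt, b in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
        print(f"{size}B @ {fmt}: {weight_vram_gb(size, b):.0f} GB")
```

A 70B model at int4 comes out at 35 GB for the weights, which lands near the ~40 GB cited above once KV-cache headroom is added; at fp16 the same model needs 140 GB before any headroom.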
## The honest tradeoff
Open-weight models in 2026 have closed roughly half the quality gap to frontier APIs. They are within 10-15% on most benchmarks. They are not within 10-15% on instruction-following discipline, on long-context reasoning, on tool use, or on edge cases like the 4-rule prompt from Module 1. For tasks where the quality bar is "produces an acceptable answer", open-weight is great. For tasks where the bar is "follows complex instructions every time", you are still on frontier APIs.
Hagar's pragmatic plan would be: keep Claude for customer-facing copy, keep GPT-4o-mini for high-volume tasks, but route the bulk-classification job (5 million records per night) to a self-hosted Llama 4 70B. The dollar savings on classification alone might pay for the GPU.
Next: how the prompt budget changes — open-weight models behave differently when prompts get long.