Local vs frontier — the prompt budget
Prompt budget — how long instructions degrade
Frontier APIs handle long, complex prompts well. You can give Claude a 5000-word system prompt with detailed rules, formatting requirements, and few-shot examples, and Claude will track most of it through to the final answer. Open-weight models, even the larger ones, handle this differently. Long prompts degrade in predictable ways.
What "degradation" looks like
Three patterns recur across the open-weight family:
Pattern 1 — last-instruction bias. When the prompt has many instructions, smaller open-weight models tend to weight the most recently mentioned instruction more heavily than the earliest one. If you list 10 rules in order and rule 1 is "always reply in 3 sentences", but rule 10 is "include an explanation", the model often follows rule 10 and produces 6 sentences. Frontier APIs are less prone to this — they treat the rule list more equally.
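One way to work with the bias instead of fighting it: order the rules so the one you cannot afford to lose comes last. A minimal sketch in Python; RULES, build_prompt, and the example rules are all invented for illustration:

```python
# Minimal sketch: work with last-instruction bias by putting the hard
# constraint at the end of the rule list, where smaller models weight
# it most heavily. All names and rules here are illustrative.

RULES = [
    "Use plain language; avoid jargon.",   # nice-to-haves first
    "Include a one-line explanation.",
    "Reply in exactly 3 sentences.",       # hard constraint last
]

def build_prompt(task: str) -> str:
    numbered = "\n".join(f"{i + 1}. {rule}" for i, rule in enumerate(RULES))
    return f"{task}\n\nRules:\n{numbered}"

print(build_prompt("Summarize the attached incident report."))
```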
Pattern 2 — system-prompt fade. Open-weight models can lose track of system-prompt rules when the user message gets long. A user message packed with a few hundred words of input data can push the system prompt out of effective attention. Mitigation: repeat key rules at the end of the user message ("Remember: respond only with JSON, no markdown."), or move the rules into the user message entirely.
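A minimal sketch of that mitigation, assuming the usual chat-message format. KEY_RULES and wrap_user_message are illustrative names, not an established API:

```python
# Repeat the critical system-prompt rule at the end of the user message,
# where it is least likely to fall out of effective attention.

KEY_RULES = "Remember: respond only with JSON, no markdown."

def wrap_user_message(input_data: str) -> str:
    return f"{input_data}\n\n{KEY_RULES}"

long_document = "...a few hundred words of input data..."  # placeholder
messages = [
    {"role": "system", "content": "You are a data extractor. Respond only with JSON."},
    {"role": "user", "content": wrap_user_message(long_document)},
]
```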
Pattern 3 — format drift over conversation. In multi-turn chats, open-weight models drift away from the format you locked in turn 1. Frontier APIs hold the format better. Mitigation: re-state the format in your system prompt or in the latest user message every few turns.
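A sketch of the periodic re-statement. The every-3-turns cadence and the reminder text are assumptions to tune per model:

```python
# Re-append the format reminder every few user turns so multi-turn
# drift gets corrected before it compounds.

FORMAT_REMINDER = "Format reminder: answer as a bulleted list, max 5 bullets."
REINFORCE_EVERY = 3  # assumption: tune per model

def add_user_turn(messages: list, turn_index: int, text: str) -> None:
    if turn_index > 0 and turn_index % REINFORCE_EVERY == 0:
        text = f"{text}\n\n{FORMAT_REMINDER}"
    messages.append({"role": "user", "content": text})

history: list = []
for i, user_text in enumerate(["First question", "Follow-up", "Another", "One more"]):
    add_user_turn(history, i, user_text)  # the fourth turn (index 3) carries the reminder
```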
Token efficiency matters more
On frontier APIs, you pay per token, but you can afford to be verbose because the prompt-following is reliable. On open-weight models, every extra paragraph of instructions has a cost and a benefit, and the benefit falls off faster than it does on a frontier model.
A practical rule: for an 8B-parameter open-weight model, your prompt should be 1/3 to 1/2 the length of the equivalent prompt for Claude or GPT. Strip the polite framing. Strip the redundant restatements. Keep the format example, drop the format explanation. The model needs less prose; it needs tighter signal.
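To make the trimming concrete, here is the same invented extraction task phrased both ways. Neither prompt comes from a real system; the point is what survives the cut:

```python
# Frontier version: verbose framing, an explanation of the format, and an example.
FRONTIER_PROMPT = """You are a helpful assistant that extracts structured data.
Please read the user's message carefully and identify the product name, the
price, and the quantity. It is very important that you respond with valid
JSON and nothing else, because the output is parsed by a downstream system.
The JSON should use the keys "product", "price", and "quantity". For example,
given "two widgets at $4.50", you would respond:
{"product": "widget", "price": 4.50, "quantity": 2}"""

# 8B version: the format example stays; everything that explains it goes.
LOCAL_8B_PROMPT = """Extract product, price, quantity. Output JSON only.
Example: "two widgets at $4.50" -> {"product": "widget", "price": 4.50, "quantity": 2}"""
```

The tightened prompt keeps the worked example, which carries most of the signal, and lands well under half the frontier length.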
For 70B+ models, the gap closes. Llama 4 70B and Mistral Large 3 handle long prompts well enough that you can run something close to your frontier prompt verbatim. The 8B and smaller variants cannot.
Where Hagar's CTO question lands
If the team's per-prompt cost is the binding constraint, the right move is not "switch from Claude to GPT-4o-mini". The right move is "for the high-volume task, switch from Claude to a self-hosted Llama 4 70B with a tightened prompt".
The savings math at scale: assume 10 million requests per month, at 1,500 input tokens and 500 output tokens per request. That is 15 billion input tokens and 5 billion output tokens per month. On Claude Sonnet 4.5, at list prices of roughly $3 per million input tokens and $15 per million output tokens, the API bill lands around $120,000 per month. On a self-hosted Llama 4 70B running on a g5.12xlarge AWS instance (or equivalent), the bill is the GPU lease: on the order of $4,000 per month per on-demand instance, less with reserved pricing, though that volume may take more than one instance. The savings, even after engineering time to migrate, are real.
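A back-of-envelope version of that math. The per-token rates and the instance price are assumptions based on published list prices; check current pricing before acting on it:

```python
# Monthly cost comparison under the scenario above. Rates are assumed
# list prices and will drift; the structure of the math is the point.

REQUESTS_PER_MONTH = 10_000_000
INPUT_TOKENS, OUTPUT_TOKENS = 1_500, 500

INPUT_RATE, OUTPUT_RATE = 3.00, 15.00  # assumed USD per million tokens

api_bill = REQUESTS_PER_MONTH * (
    INPUT_TOKENS / 1e6 * INPUT_RATE + OUTPUT_TOKENS / 1e6 * OUTPUT_RATE
)

GPU_HOURLY = 5.67  # assumed on-demand g5.12xlarge rate, USD per hour
gpu_bill = GPU_HOURLY * 730  # ~730 hours in a month

print(f"Frontier API: ${api_bill:,.0f}/month")               # ~$120,000
print(f"Self-hosted:  ${gpu_bill:,.0f}/month per instance")  # ~$4,100
```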
The risk is that 5-10% of those requests will need the frontier model anyway. Build the fallback into your routing layer from day one. The next lesson covers few-shot examples, the technique that lets a smaller model handle a task it would otherwise fail.
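Before that, a sketch of what day-one fallback routing can look like. needs_frontier and validate are hypothetical stand-ins for your own task checks, and the model clients are passed in rather than assumed:

```python
# Route cheap requests to the local model, escalate the rest. The
# pre-check and post-check below are placeholder heuristics.

def needs_frontier(request: str) -> bool:
    # Hypothetical pre-check: very long or high-stakes inputs skip the local model.
    return len(request) > 8_000 or "legal review" in request.lower()

def validate(reply: str) -> bool:
    # Hypothetical post-check: did the local model hold the JSON-only format?
    stripped = reply.strip()
    return stripped.startswith("{") and stripped.endswith("}")

def route(request: str, call_local, call_frontier) -> str:
    if needs_frontier(request):
        return call_frontier(request)
    reply = call_local(request)
    if validate(reply):
        return reply
    return call_frontier(request)  # escalate the 5-10% the local model misses

# Example wiring with stub clients; replace with real API calls.
answer = route(
    "extract the fields from this receipt: ...",
    call_local=lambda r: '{"product": "widget"}',
    call_frontier=lambda r: '{"product": "widget", "source": "frontier"}',
)
```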
Next: when few-shot examples save you on a smaller model.