The fallback strategy — composing multiple models
A production application that uses one model has a single point of failure. The model can be down. The model can return junk on a specific input shape. The model can be priced out of a use case if the vendor adjusts their pricing. The mature pattern is to compose multiple models with explicit fallback logic.
The three patterns that work
Pattern 1 — primary + fallback. The simplest. Send the request to your preferred model. If the response is invalid (failed JSON parse, empty body, error response, response too short), retry on a second model. Optionally, after N failures on the primary, switch all traffic to the fallback for some cooldown period. This is the most common production setup.
```typescript
async function classifyWithFallback(input: string) {
  try {
    const r = await callClaude(input);
    if (isValid(r)) return r; // accept a valid primary response
  } catch {} // primary threw — fall through to the fallback
  return await callGPT(input); // fallback
}
```
Pattern 2 — cheap-first, escalate. For tasks where most inputs are easy and a few are hard, route everything to a small model first. If the small model is uncertain (low confidence, malformed output, refused to answer), escalate to a larger model. This minimises cost on the easy 90% of requests and only pays for the expensive model on the hard 10%.
```typescript
async function summariseEscalating(text: string) {
  const small = await callMistral7B(text);
  if (small.tokenCount < 50 || small.refusal) {
    return await callClaude(text); // escalate on weak or refused output
  }
  return small;
}
```
Pattern 3 — verifier-on-top. Two models on the same input. The first generates an answer. The second verifies the answer against the input. If the verifier rejects, regenerate. This pattern is overkill for tone rewrites but essential for code generation, financial extraction, or anything where wrong is expensive.
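A minimal sketch of the verifier loop. `callGenerator` and `callVerifier` are hypothetical stand-ins for real model calls, mocked here with deterministic bodies so the snippet is self-contained; in production each would be an API call to a different model.

```typescript
type Verdict = { ok: boolean; reason?: string };

// Hypothetical generator model (mocked for illustration).
async function callGenerator(input: string): Promise<string> {
  return `answer for: ${input}`;
}

// Hypothetical verifier model (mocked): checks the answer against the input.
async function callVerifier(input: string, answer: string): Promise<Verdict> {
  return { ok: answer.includes(input) };
}

async function generateWithVerifier(input: string, maxAttempts = 2): Promise<string> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const candidate = await callGenerator(input);          // first model generates
    const verdict = await callVerifier(input, candidate);  // second model verifies
    if (verdict.ok) return candidate;                      // accept on pass, else regenerate
  }
  throw new Error(`verifier rejected all ${maxAttempts} attempts`);
}
```

Whether you throw or return the last attempt after exhausting retries depends on how expensive a wrong answer is for the task.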
How to know which pattern fits
A short decision rule:
- High volume, small variance in input difficulty → Pattern 1 (primary + fallback) on a small frontier model.
- High volume, wide variance in input difficulty → Pattern 2 (cheap-first, escalate) with an open-weight model as the cheap option.
- Low volume, high cost of error → Pattern 3 (verifier-on-top) with a frontier model on both legs.
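The decision rule above can be sketched as a small router. The trait names, their coarse high/low buckets, and the branch order are illustrative assumptions, not a prescribed API:

```typescript
type TaskTraits = {
  volume: "high" | "low";
  difficultyVariance: "narrow" | "wide";
  errorCost: "low" | "high";
};

// Map task traits to one of the three patterns, checking the
// most specific (and most expensive) condition first.
function choosePattern(t: TaskTraits): 1 | 2 | 3 {
  if (t.errorCost === "high" && t.volume === "low") return 3;            // verifier-on-top
  if (t.volume === "high" && t.difficultyVariance === "wide") return 2;  // cheap-first, escalate
  return 1;                                                              // primary + fallback
}
```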
The patterns are stackable. A real production system might use Pattern 2 with Llama 4 70B as the cheap model and Claude Sonnet 4.5 as the escalation, plus Pattern 3 verifier on top of the Claude path for the highest-stakes 1% of requests.
What to log
Every multi-model setup needs three logging fields per request:
- Which model produced the final answer. Otherwise you cannot debug "why is this user complaining about a weird output?" — the user does not know which model answered.
- Why fallbacks fired (parse error, empty response, low confidence, manual override). This is your dataset for improving the routing logic over time.
- Cost per request. Per-token costs vary across models; per-request costs vary even more once you account for fallbacks. Log them so you can prove the savings to your CTO.
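One way to capture the three fields is a structured record per request. The record shape and the `logRequest` helper are hypothetical, not tied to any particular logging library:

```typescript
type ModelLogRecord = {
  requestId: string;
  finalModel: string;            // which model produced the final answer
  fallbackReason: string | null; // why the fallback fired, if it did
  costUsd: number;               // total cost, including failed attempts
};

// Serialise the record as structured JSON for the log pipeline.
function logRequest(rec: ModelLogRecord): string {
  return JSON.stringify(rec);
}
```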
This logging is the foundation of the comparison report you will ship in the capstone. Hagar's report will not just say "we should use these models for these tasks". It will show the actual cost-per-request distribution before and after the routing change, the fallback fire rate, and the user-facing quality scores. Without logging, none of that is provable.
Next module: the capstone — port one prompt across 8 models and ship a real comparison report.