The fallback strategy — composing multiple models

A production application that uses one model has a single point of failure. The model can be down. The model can return junk on a specific input shape. The model can be priced out of a use case if the vendor adjusts their pricing. The mature pattern is to compose multiple models with explicit fallback logic.

The three patterns that work

Pattern 1 — primary + fallback. The simplest. Send the request to your preferred model. If the response is invalid (failed JSON parse, empty body, error response, response too short), retry on a second model. Optionally, after N failures on the primary, switch all traffic to the fallback for some cooldown period. This is the most common production setup.

async function classifyWithFallback(input: string) {
  try {
    const r = await callClaude(input); // primary model
    if (isValid(r)) return r;          // accept only well-formed responses
  } catch {
    // primary threw (timeout, rate limit, network error): fall through
  }
  return await callGPT(input); // fallback
}
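
The cooldown switch mentioned above can be a small in-memory circuit breaker. A minimal sketch, reusing the same hypothetical callClaude, callGPT, and isValid helpers; the failure threshold and cooldown duration are illustrative, not recommendations:

let primaryFailures = 0;
let fallbackUntil = 0; // epoch ms until which all traffic goes to the fallback

const MAX_FAILURES = 5;
const COOLDOWN_MS = 60_000;

async function classifyWithCooldown(input: string) {
  if (Date.now() < fallbackUntil) return callGPT(input); // still in cooldown

  try {
    const r = await callClaude(input);
    if (isValid(r)) {
      primaryFailures = 0; // a healthy response resets the counter
      return r;
    }
  } catch {
    // a thrown error counts the same as an invalid response
  }

  if (++primaryFailures >= MAX_FAILURES) {
    fallbackUntil = Date.now() + COOLDOWN_MS; // route everything to the fallback for a while
    primaryFailures = 0;
  }
  return callGPT(input);
}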

Pattern 2 — cheap-first, escalate. For tasks where most inputs are easy and a few are hard, route everything to a small model first. If the small model is uncertain (low confidence, malformed output, refused to answer), escalate to a larger model. This minimises cost on the easy 90% of requests and only pays for the expensive model on the hard 10%.

async function summariseEscalating(text: string) {
  const small = await callMistral7B(text);      // cheap first pass
  if (small.tokenCount < 50 || small.refusal) { // too short or refused: treat as uncertain
    return await callClaude(text);              // escalate to the larger model
  }
  return small;
}

Pattern 3 — verifier-on-top. Two models on the same input. The first generates an answer. The second verifies the answer against the input. If the verifier rejects, regenerate. This pattern is overkill for tone rewrites but essential for code generation, financial extraction, or anything where wrong is expensive.
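
A minimal sketch of that loop, assuming a hypothetical verifyAgainstInput(input, answer) call that asks the second model for a pass/fail verdict, and an arbitrary retry limit:

async function generateWithVerifier(input: string, maxAttempts = 2) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const answer = await callClaude(input);                  // first model generates
    const verdict = await verifyAgainstInput(input, answer); // second model checks the answer against the input
    if (verdict.pass) return answer;
    // verifier rejected: loop and regenerate
  }
  throw new Error(`verifier rejected all ${maxAttempts} attempts`);
}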

How to know which pattern fits

A short decision rule:

  • High volume, small variance in input difficulty → Pattern 1 (primary + fallback) on a small frontier model.
  • High volume, wide variance in input difficulty → Pattern 2 (cheap-first, escalate) with an open-weight model as the cheap option.
  • Low volume, high cost of error → Pattern 3 (verifier-on-top) with a frontier model on both legs.

The patterns are stackable. A real production system might use Pattern 2 with Llama 4 70B as the cheap model and Claude Sonnet 4.5 as the escalation, plus Pattern 3 verifier on top of the Claude path for the highest-stakes 1% of requests.
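
One way to express that stacking, reusing the verifier sketch from Pattern 3; callLlama70B and isHighStakes are hypothetical helpers standing in for your own routing logic:

async function handleRequest(text: string) {
  const small = await callLlama70B(text);      // Pattern 2: cheap open-weight first pass
  if (!(small.tokenCount < 50 || small.refusal)) {
    return small;                              // easy request, answered cheaply
  }
  if (isHighStakes(text)) {
    return await generateWithVerifier(text);   // Pattern 3 on the escalation path
  }
  return await callClaude(text);               // plain escalation for everything else
}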

What to log

Every multi-model setup needs three logging fields per request (a minimal record type is sketched after the list):

  1. Which model produced the final answer. Otherwise you cannot debug "why is this user complaining about a weird output?" — the user does not know which model answered.
  2. Why fallbacks fired (parse error, empty response, low confidence, manual override). This is your dataset for improving the routing logic over time.
  3. Cost per request. Per-token costs vary across models; per-request costs vary even more once you account for fallbacks. Log them so you can prove the savings to your CTO.
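
A minimal shape for that per-request record; the field names are illustrative, not prescriptive:

interface ModelCallLog {
  requestId: string;
  finalModel: string;       // 1. which model produced the final answer
  fallbackReason?:          // 2. why the fallback fired, if it did
    | "parse_error"
    | "empty_response"
    | "low_confidence"
    | "manual_override";
  costUsd: number;          // 3. total cost of the request, including retries
  latencyMs: number;
}

function logModelCall(entry: ModelCallLog) {
  console.log(JSON.stringify(entry)); // swap for your structured logger
}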

This logging is the foundation of the comparison report you will ship in the capstone. Hagar's report will not just say "we should use these models for these tasks". It will show the actual cost-per-request distribution before and after the routing change, the fallback fire rate, and the user-facing quality scores. Without logging, none of that is provable.

Next module: the capstone — port one prompt across 8 models and ship a real comparison report.
