Local vs frontier — the prompt budget
When few-shot saves you on smaller models
Few-shot prompting (giving the model two or three worked examples before the actual task) has long been useful on frontier APIs and is essential on smaller open-weight models. The reason is mechanical: smaller models have weaker zero-shot generalisation. They reach the right answer when the pattern is shown to them, not when it is merely described.
Open-weight 7B model: zero-shot vs 3-shot
Zero-shot (no examples):
- Output shape varies request-to-request
- Misses domain jargon

3-shot (three worked examples):
- Format locks in
- Edge cases handled via a tricky example
- Prompt cache amortises cost
Two situations where few-shot is the difference between work and not-work
Situation 1 — format compliance on a small model. A 7B-parameter Llama or Mistral asked "extract the entity in JSON" will sometimes produce JSON, sometimes prose, sometimes a numbered list. The same prompt with three worked examples — input → exact JSON output — produces JSON every time. The frontier API equivalent works zero-shot; the small open-weight model usually does not.
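The fix for Situation 1 is mechanical: serialise the example outputs so the model sees valid, identically formatted JSON in every shot. A minimal sketch, with invented example texts and an invented `entity` schema:

```python
import json

# Hypothetical worked examples: input text -> exact JSON output.
# json.dumps guarantees every example output is valid, consistently
# formatted JSON, so the model has no prose variant to imitate.
EXAMPLES = [
    ("Invoice #4411 from Acme Corp", {"entity": "Acme Corp"}),
    ("Meeting notes, attendee: J. Rivera", {"entity": "J. Rivera"}),
    ("Shipment delayed by Globex Ltd", {"entity": "Globex Ltd"}),
]

def build_extraction_prompt(text: str) -> str:
    """Assemble a 3-shot prompt: instruction, worked examples, then the task."""
    parts = ["Extract the entity. Respond with JSON only.\n"]
    for example_input, example_output in EXAMPLES:
        parts.append(f"Input: {example_input}\nOutput: {json.dumps(example_output)}\n")
    parts.append(f"Input: {text}\nOutput:")
    return "\n".join(parts)
```

The final `Output:` with nothing after it leaves the model exactly one thing to do: continue the pattern it has just seen three times.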
Situation 2 — domain-specific terminology. Smaller models default to general-purpose phrasing. If your task is "classify this support ticket as billing / technical / sales / other", and your tickets contain industry-specific jargon (e.g., "the merchant terminal won't auth", which is a billing issue in payments terminology), the model needs examples that establish the mapping. Three or four examples covering the trickier cases substantially lift accuracy.
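For Situation 2, the shots do double duty: they fix the label set and they teach the jargon mapping. A sketch with invented tickets; note the third example is the jargon-heavy edge case from above:

```python
# Hypothetical ticket-classification examples: two straightforward cases
# plus one jargon-heavy edge case ("won't auth" -> billing) so the model
# learns the domain mapping, not just the surface categories.
SHOTS = [
    ("I was charged twice this month", "billing"),
    ("The dashboard shows a 500 error on login", "technical"),
    ("The merchant terminal won't auth", "billing"),  # payments jargon
]
CATEGORIES = ["billing", "technical", "sales", "other"]

def build_ticket_prompt(ticket: str) -> str:
    """Few-shot classification prompt: label set, worked shots, then the task."""
    header = "Classify the support ticket as one of: " + ", ".join(CATEGORIES) + ".\n\n"
    shots = "".join(f"Ticket: {text}\nCategory: {label}\n\n" for text, label in SHOTS)
    return header + shots + f"Ticket: {ticket}\nCategory:"
```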
How to construct effective few-shot examples
Four rules that hold across all open-weight models:
- Cover the common cases first, then the tricky cases. If you have room for three examples, two should be straightforward and one should hit a known edge case. The model picks up the central pattern from the easy ones and the boundary from the hard one.
- Match the input distribution. Examples should look like the real inputs you will receive: same length, same vocabulary, same level of polish or noise. A clean academic example does not transfer to messy customer support tickets.
- Show the exact output format. If the output should be `{"category": "billing"}`, your example output should be exactly that: same field name, same casing, same JSON structure. The model copies what it sees more literally than you expect.
- Use a separator the model recognises. A simple `---` or `###` between examples works on Llama and Mistral. Some models prefer `Input: ... Output: ...` framing. Test both on your task.
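The last rule is cheap to A/B test if the separator is a parameter rather than hard-coded. A small hypothetical helper (not from any library) that joins worked examples in either style:

```python
def join_examples(examples, style="hash"):
    """Join few-shot (input, output) pairs with a chosen separator style.

    'hash' puts ### between examples; 'io' uses Input:/Output: framing
    with blank lines. Which one a model prefers varies by family, so
    run your eval set against both.
    """
    if style == "hash":
        blocks = [f"{inp}\n{out}" for inp, out in examples]
        return "\n###\n".join(blocks)
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    return "\n\n".join(blocks)
```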
The cost of few-shot
Few-shot adds tokens. Three good examples can run 200-400 tokens each, so 600-1200 extra input tokens per request. On open-weight models running on your own GPU, those extra tokens cost compute, not API dollars. On a 70B-class model, the latency cost of three examples is roughly 50-100ms, which is usually acceptable.
If your application makes many similar requests, you can amortise the few-shot cost via prompt caching. Anthropic, OpenAI, and most open-weight inference servers (vLLM, TGI) support some form of prefix caching. If your few-shot block is the same across all requests for a given task, the model only pays the compute cost once and reuses the cached prefix. Latency drops, throughput rises.
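Prefix caching only fires if the prefix is byte-identical across requests, so prompt assembly order matters. A sketch (instruction and example text invented) of cache-friendly assembly: every static byte first, the per-request input last.

```python
# Static block: instruction + few-shot examples, identical on every request,
# so an inference server with prefix caching can reuse its KV cache.
STATIC_PREFIX = (
    "Extract the entity. Respond with JSON only.\n\n"
    'Input: Invoice #4411 from Acme Corp\nOutput: {"entity": "Acme Corp"}\n\n'
    'Input: Shipment delayed by Globex Ltd\nOutput: {"entity": "Globex Ltd"}\n\n'
)

def cacheable_prompt(text: str) -> str:
    """Only the per-request input varies, and it sits at the tail."""
    return STATIC_PREFIX + f"Input: {text}\nOutput:"

# Anti-pattern: any per-request detail (a timestamp, a request ID) placed
# before the examples changes the prefix and defeats the cache entirely.
```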
When few-shot does not help
If the model is failing because it lacks the underlying capability — multi-step reasoning, long-context tracking, niche language — few-shot examples will not unlock that capability. They make the model imitate the pattern; they do not teach the skill. For those cases, escalate to a larger model (or to thinking mode on a frontier API), not to more examples.
Next: the fallback strategy — how to compose multiple models so one failure does not break your application.