Local vs frontier — the prompt budget
When few-shot saves you on smaller models
Few-shot prompting (giving the model two or three worked examples before the actual task) has long been useful on frontier APIs and is essential on smaller open-weight models. The reason is mechanical: smaller models have weaker zero-shot generalisation. They reach the right answer when the pattern is shown to them, not when it is merely described.
Open-weight 7B model: zero-shot vs 3-shot
Zero-shot (no examples):
- Output shape varies request-to-request
- Misses domain jargon

3-shot (three worked examples):
- Format locks in
- Edge cases handled via a tricky example
- Prompt cache amortises cost
Two situations where few-shot is the difference between work and not-work
Situation 1 — format compliance on a small model. A 7B-parameter Llama or Mistral asked "extract the entity in JSON" will sometimes produce JSON, sometimes prose, sometimes a numbered list. The same prompt with three worked examples — input → exact JSON output — produces JSON every time. The frontier API equivalent works zero-shot; the small open-weight model usually does not.
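The fix for Situation 1 is mechanical: serialise the example outputs so the model sees valid, identically formatted JSON in every shot. A minimal sketch, with invented example texts and an invented `entity` schema:

```python
import json

# Hypothetical worked examples: input text -> exact JSON output.
# json.dumps guarantees every example output is valid, consistently
# formatted JSON, so the model has no prose variant to imitate.
EXAMPLES = [
    ("Invoice #4411 from Acme Corp", {"entity": "Acme Corp"}),
    ("Meeting notes, attendee: J. Rivera", {"entity": "J. Rivera"}),
    ("Shipment delayed by Globex Ltd", {"entity": "Globex Ltd"}),
]

def build_extraction_prompt(text: str) -> str:
    """Assemble a 3-shot prompt: instruction, worked examples, then the task."""
    parts = ["Extract the entity. Respond with JSON only.\n"]
    for example_input, example_output in EXAMPLES:
        parts.append(f"Input: {example_input}\nOutput: {json.dumps(example_output)}\n")
    parts.append(f"Input: {text}\nOutput:")
    return "\n".join(parts)
```

The final `Output:` with nothing after it leaves the model exactly one thing to do: continue the pattern it has just seen three times.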
Situation 2 — domain-specific terminology. Smaller models default to general-purpose phrasing. If your task is "classify this support ticket as billing / technical / sales / other", and your tickets contain industry-specific jargon (e.g., "the merchant terminal won't auth", which is a billing issue in payments terminology), the model needs examples that establish the mapping. Three or four examples covering the trickier cases substantially lift accuracy.
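For Situation 2, the shots do double duty: they fix the label set and they teach the jargon mapping. A sketch with invented tickets; note the third example is the jargon-heavy edge case from above:

```python
# Hypothetical ticket-classification examples: two straightforward cases
# plus one jargon-heavy edge case ("won't auth" -> billing) so the model
# learns the domain mapping, not just the surface categories.
SHOTS = [
    ("I was charged twice this month", "billing"),
    ("The dashboard shows a 500 error on login", "technical"),
    ("The merchant terminal won't auth", "billing"),  # payments jargon
]
CATEGORIES = ["billing", "technical", "sales", "other"]

def build_ticket_prompt(ticket: str) -> str:
    """Few-shot classification prompt: label set, worked shots, then the task."""
    header = "Classify the support ticket as one of: " + ", ".join(CATEGORIES) + ".\n\n"
    shots = "".join(f"Ticket: {text}\nCategory: {label}\n\n" for text, label in SHOTS)
    return header + shots + f"Ticket: {ticket}\nCategory:"
```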
How to construct effective few-shot examples
Four rules that hold across all open-weight models:
- Cover the common cases first, then the tricky cases. If you have room for three examples, two should be straightforward and one should hit a known edge case. The model picks up the central pattern from the easy ones and the boundary from the hard one.
- Match the input distribution. Examples should look like the real inputs you will receive: same length, same vocabulary, same level of polish or noise. A clean academic example does not transfer to messy customer support tickets.
- Show the exact output format. If the output should be `{"category": "billing"}`, your example output should be exactly that: same field name, same casing, same JSON structure. The model copies what it sees more literally than you expect.
- Use a separator the model recognises. A simple `---` or `###` between examples works on Llama and Mistral. Some models prefer `Input: ... Output: ...` framing. Test both on your task.
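The last rule is cheap to A/B test if the separator is a parameter rather than hard-coded. A small hypothetical helper (not from any library) that joins worked examples in either style:

```python
def join_examples(examples, style="hash"):
    """Join few-shot (input, output) pairs with a chosen separator style.

    'hash' puts ### between examples; 'io' uses Input:/Output: framing
    with blank lines. Which one a model prefers varies by family, so
    run your eval set against both.
    """
    if style == "hash":
        blocks = [f"{inp}\n{out}" for inp, out in examples]
        return "\n###\n".join(blocks)
    blocks = [f"Input: {inp}\nOutput: {out}" for inp, out in examples]
    return "\n\n".join(blocks)
```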
The cost of few-shot
Few-shot adds tokens. Three good examples can run 200-400 tokens each, so 600-1200 extra input tokens per request. On open-weight models running on your own GPU, those extra tokens cost compute, not API dollars. On a 70B-class model, the latency cost of three examples is roughly 50-100ms, which is usually acceptable.
If your application makes many similar requests, you can amortise the few-shot cost via prompt caching. Anthropic, OpenAI, and most open-weight inference servers (vLLM, TGI) support some form of prefix caching. If your few-shot block is the same across all requests for a given task, the model only pays the compute cost once and reuses the cached prefix. Latency drops, throughput rises.
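Prefix caching only fires if the prefix is byte-identical across requests, so prompt assembly order matters. A sketch (instruction and example text invented) of cache-friendly assembly: every static byte first, the per-request input last.

```python
# Static block: instruction + few-shot examples, identical on every request,
# so an inference server with prefix caching can reuse its KV cache.
STATIC_PREFIX = (
    "Extract the entity. Respond with JSON only.\n\n"
    'Input: Invoice #4411 from Acme Corp\nOutput: {"entity": "Acme Corp"}\n\n'
    'Input: Shipment delayed by Globex Ltd\nOutput: {"entity": "Globex Ltd"}\n\n'
)

def cacheable_prompt(text: str) -> str:
    """Only the per-request input varies, and it sits at the tail."""
    return STATIC_PREFIX + f"Input: {text}\nOutput:"

# Anti-pattern: any per-request detail (a timestamp, a request ID) placed
# before the examples changes the prefix and defeats the cache entirely.
```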
When few-shot does not help
If the model is failing because it lacks the underlying capability — multi-step reasoning, long-context tracking, niche language — few-shot examples will not unlock that capability. They make the model imitate the pattern; they do not teach the skill. For those cases, escalate to a larger model (or to thinking mode on a frontier API), not to more examples.
Next: the fallback strategy — how to compose multiple models so one failure does not break your application.