Tone & instruction-following across models

Choosing the right model per task

Two prompts in, you have already seen the dialect signature. Claude is disciplined and sometimes verbose. GPT-4o-mini is friendly and over-volunteers. Gemini 2.5 Flash is fast and sometimes truncates. None of those is a verdict — they are starting points for how you decide which model goes on which task.

A decision tree you can use today

For every prompt your application sends, ask three questions in order:

  1. How many hard constraints does the prompt have? If three or more (length cap, format lock, forbidden words, line counts), strongly prefer Claude or GPT-4o. Gemini Flash will fail one of the constraints often enough that you cannot rely on it without a verifier downstream.

  2. What is the worst-case cost of a bad output? Customer-facing replies, legal copy, code that runs in production — these are high-cost. The latency tax of running a more disciplined model is cheaper than the cost of fixing a bad output. Internal-only summarisation, draft suggestions, autocomplete — these are low-cost. Speed matters more.

  3. Is the output checked by another model or by code? If yes, you can lean cheap and fast. The verifier catches failures. If no, lean disciplined.
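The three questions above can be sketched as a small routing function. This is a minimal illustration, not a recommendation: the function name, the tier labels, and the threshold of three constraints are all taken from the text, while the interface itself is hypothetical.

```python
# Sketch of the three-question routing logic described above.
# "disciplined" stands in for Claude / GPT-4o; "fast" for Gemini Flash.

def route(hard_constraints: int, high_cost: bool, has_verifier: bool) -> str:
    """Pick a model tier for a single prompt."""
    # Q1: three or more hard constraints -> prefer a disciplined model.
    if hard_constraints >= 3:
        return "disciplined"
    # Q2/Q3: high-cost output with no verifier downstream -> disciplined.
    if high_cost and not has_verifier:
        return "disciplined"
    # Otherwise speed wins: lean cheap and fast.
    return "fast"

print(route(4, high_cost=False, has_verifier=False))  # → disciplined
print(route(1, high_cost=False, has_verifier=True))   # → fast
```

The order matters: constraint count is checked first because a format failure is a failure regardless of how cheap the output was to produce.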

The dialect signatures so far

| Behaviour | Claude Sonnet 4.5 | GPT-4o-mini | Gemini 2.5 Flash |
| --- | --- | --- | --- |
| Hard-rule following (4-rule prompt) | All 4 | All 4 | 1 of 4 |
| Tone rewrite under word cap | Followed cap, removed pressure | Followed cap, added subject line | Truncated |
| Tendency to add unrequested fields | Low | Medium-high | Low (sometimes it returns nothing at all) |
| Verbosity default | Medium-high | Medium-high | Low |
| Latency on the prompts above | 2.9–3.2 s | 2.4–3.3 s | 1.7–1.9 s |

Read the latency row again. Gemini is roughly half the wall time of the others on these tasks. That speed is real and useful — for the right task. The right task is the one where the constraint count is low and the cost of a failed output is also low.

What "the right model" actually means

Hagar's CTO framed his question as "Claude or GPT". That is the wrong frame. The right frame is: which model do we route which prompt to, and what is the fallback?

A real production setup might look like this. Tone rewrites for customer support → Claude. Bulk draft suggestions in an internal tool → GPT-4o-mini. Real-time autocomplete on a private dashboard → Gemini Flash, with a check that the output is non-empty and contains the required keywords. Each task gets its own model, and each task has a fallback when the primary fails.
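The Gemini Flash leg of that setup, a fast primary with a code check and a disciplined fallback, can be sketched like this. The model calls are stubbed lambdas and every name here (`passes_check`, `complete_with_fallback`) is illustrative, standing in for whatever API clients you actually use.

```python
# Sketch of per-task routing with a code verifier on the fast path.

def passes_check(output: str, required_keywords: list[str]) -> bool:
    """Cheap code-side verifier: output is non-empty and contains
    every required keyword."""
    return bool(output.strip()) and all(k in output for k in required_keywords)

def complete_with_fallback(prompt: str, required_keywords: list[str],
                           primary, fallback) -> str:
    """Try the fast primary model; when the verifier rejects its output,
    retry once with the disciplined fallback."""
    out = primary(prompt)
    if passes_check(out, required_keywords):
        return out
    return fallback(prompt)

# Stub "models": the fast one simulates an empty/truncated reply.
fast = lambda p: ""
disciplined = lambda p: "Total: 3 items"
print(complete_with_fallback("summarise", ["Total"], fast, disciplined))
# → Total: 3 items
```

The point of the verifier is that it makes the cheap model safe to use: the check is deterministic code, so a truncated or empty reply never reaches the user.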

That is the comparison report you will ship in the capstone.

Next module: where each model "listens hardest" — system prompt vs user message, and how loyally each holds the persona under pressure.
