Tone & instruction-following across models
Choosing the right model per task
Two prompts in, you have already seen the dialect signature. Claude is disciplined and sometimes verbose. GPT-4o-mini is friendly and over-volunteers. Gemini 2.5 Flash is fast and sometimes truncates. None of those is a verdict — they are starting points for how you decide which model goes on which task.
A decision tree you can use today
For every prompt your application sends, ask three questions in order:
1. How many hard constraints does the prompt have? If three or more (length cap, format lock, forbidden words, line counts), strongly prefer Claude or GPT-4o. Gemini Flash will fail one of the constraints often enough that you cannot rely on it without a verifier downstream.
2. What is the worst-case cost of a bad output? Customer-facing replies, legal copy, code that runs in production — these are high-cost. The latency tax of running a more disciplined model is cheaper than the cost of fixing a bad output. Internal-only summarisation, draft suggestions, autocomplete — these are low-cost. Speed matters more.
3. Is the output checked by another model or by code? If yes, you can lean cheap and fast. The verifier catches failures. If no, lean disciplined.
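The three questions can be sketched as a routing function. This is a minimal illustration, not a real client: the route names and the `PromptProfile` fields are hypothetical, and a production version would be driven by your own task metadata.

```python
from dataclasses import dataclass

# Hypothetical route names -- stand-ins, not real API model IDs.
DISCIPLINED = "claude-sonnet"
CHEAP_FAST = "gemini-flash"

@dataclass
class PromptProfile:
    hard_constraints: int    # length caps, format locks, forbidden words, line counts
    high_cost_failure: bool  # customer-facing, legal, production code
    has_verifier: bool       # output checked downstream by another model or by code

def route(profile: PromptProfile) -> str:
    # Q1: three or more hard constraints -> prefer the disciplined model.
    if profile.hard_constraints >= 3:
        return DISCIPLINED
    # Q2: expensive failures -> pay the latency tax up front.
    if profile.high_cost_failure:
        return DISCIPLINED
    # Q3: a verifier catches failures -> lean cheap and fast.
    if profile.has_verifier:
        return CHEAP_FAST
    # No verifier and nothing forcing speed: default to disciplined.
    return DISCIPLINED
```

Note the order matters: a prompt with four hard constraints routes to the disciplined model even if a verifier exists, because the verifier would be rejecting most outputs.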
The dialect signatures so far
| Behaviour | Claude Sonnet 4.5 | GPT-4o-mini | Gemini 2.5 Flash |
|---|---|---|---|
| Hard-rule following (4-rule prompt) | All 4 | All 4 | 1 of 4 |
| Tone rewrite under word cap | Followed cap, removed pressure | Followed cap, added subject line | Truncated |
| Tendency to add unrequested fields | Low | Medium-high | Low (it adds nothing — and sometimes nothing is what it returns) |
| Verbosity default | Medium-high | Medium-high | Low |
| Latency on the prompts above | 2.9–3.2s | 2.4–3.3s | 1.7–1.9s |
Read the latency row again. Gemini is roughly half the wall time of the others on these tasks. That speed is real and useful — for the right task. The right task is the one where the constraint count is low and the cost of a failed output is also low.
What "the right model" actually means
Hagar's CTO framed his question as "Claude or GPT". That is the wrong frame. The right frame is: which model do we route which prompt to, and what is the fallback?
A real production setup might look like this. Tone rewrites for customer support → Claude. Bulk draft suggestions in an internal tool → GPT-4o-mini. Real-time autocomplete on a private dashboard → Gemini Flash, with a check that the output is non-empty and contains the required keywords. Each task gets its own model, and each task has a fallback when the primary fails.
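That setup can be sketched as a per-task routing table with a primary model, a fallback, and an optional output check. Everything here is illustrative: the task names, route names, required keyword, and `call_model` callable are assumptions, not a real SDK.

```python
def non_empty_with_keywords(required):
    """Build a verifier: output must be non-empty and contain every required keyword."""
    def check(output: str) -> bool:
        return bool(output.strip()) and all(kw in output for kw in required)
    return check

# task -> (primary route, fallback route, verifier or None). All names hypothetical.
ROUTES = {
    "support_tone_rewrite":   ("claude-sonnet", "gpt-4o", None),
    "internal_draft":         ("gpt-4o-mini", "claude-sonnet", None),
    "dashboard_autocomplete": ("gemini-flash", "gpt-4o-mini",
                               non_empty_with_keywords(["status"])),
}

def run_task(task: str, call_model) -> str:
    """call_model is any callable mapping a route name to that model's output string."""
    primary, fallback, verify = ROUTES[task]
    output = call_model(primary)
    if verify is not None and not verify(output):
        # Primary failed the check (e.g. empty or missing keywords): retry on fallback.
        output = call_model(fallback)
    return output
```

The point of the table is that the fallback is declared next to the primary, so "what happens when the fast model truncates" is a routing decision, not an incident.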
That is the comparison report you will ship in the capstone.
Next module: where each model "listens hardest" — system prompt vs user message, and how loyally each holds the persona under pressure.