Tone & instruction-following across models
Choosing the right model per task
Two prompts in, you have already seen the dialect signature. Claude is disciplined and sometimes verbose. GPT-4o-mini is friendly and over-volunteers. Gemini 2.5 Flash is fast and sometimes truncates. None of those is a verdict — they are starting points for how you decide which model goes on which task.
A decision tree you can use today
For every prompt your application sends, ask three questions in order:
1. How many hard constraints does the prompt have? If three or more (length cap, format lock, forbidden words, line counts), strongly prefer Claude or GPT-4o. Gemini Flash will fail one of the constraints often enough that you cannot rely on it without a verifier downstream.
2. What is the worst-case cost of a bad output? Customer-facing replies, legal copy, code that runs in production — these are high-cost. The latency tax of running a more disciplined model is cheaper than the cost of fixing a bad output. Internal-only summarisation, draft suggestions, autocomplete — these are low-cost. Speed matters more.
3. Is the output checked by another model or by code? If yes, you can lean cheap and fast. The verifier catches failures. If no, lean disciplined.
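The three questions can be sketched as a routing function. This is a minimal illustration, not a real client: the route names and the `PromptProfile` fields are hypothetical, and a production version would be driven by your own task metadata.

```python
from dataclasses import dataclass

# Hypothetical route names -- stand-ins, not real API model IDs.
DISCIPLINED = "claude-sonnet"
CHEAP_FAST = "gemini-flash"

@dataclass
class PromptProfile:
    hard_constraints: int    # length caps, format locks, forbidden words, line counts
    high_cost_failure: bool  # customer-facing, legal, production code
    has_verifier: bool       # output checked downstream by another model or by code

def route(profile: PromptProfile) -> str:
    # Q1: three or more hard constraints -> prefer the disciplined model.
    if profile.hard_constraints >= 3:
        return DISCIPLINED
    # Q2: expensive failures -> pay the latency tax up front.
    if profile.high_cost_failure:
        return DISCIPLINED
    # Q3: a verifier catches failures -> lean cheap and fast.
    if profile.has_verifier:
        return CHEAP_FAST
    # No verifier and nothing forcing speed: default to disciplined.
    return DISCIPLINED
```

Note the order matters: a prompt with four hard constraints routes to the disciplined model even if a verifier exists, because the verifier would be rejecting most outputs.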
The dialect signatures so far
| Behaviour | Claude Sonnet 4.5 | GPT-4o-mini | Gemini 2.5 Flash |
|---|---|---|---|
| Hard-rule following (4-rule prompt) | All 4 | All 4 | 1 of 4 |
| Tone rewrite under word cap | Followed cap, removed pressure | Followed cap, added subject line | Truncated |
| Tendency to add unrequested fields | Low | Medium-high | Low (it adds nothing — and sometimes nothing is what it returns) |
| Verbosity default | Medium-high | Medium-high | Low |
| Latency on the prompts above | 2.9–3.2s | 2.4–3.3s | 1.7–1.9s |
Read the latency row again. Gemini is roughly half the wall time of the others on these tasks. That speed is real and useful — for the right task. The right task is the one where the constraint count is low and the cost of a failed output is also low.
What "the right model" actually means
Hagar's CTO framed his question as "Claude or GPT". That is the wrong frame. The right frame is: which model do we route which prompt to, and what is the fallback?
A real production setup might look like this. Tone rewrites for customer support → Claude. Bulk draft suggestions in an internal tool → GPT-4o-mini. Real-time autocomplete on a private dashboard → Gemini Flash, with a check that the output is non-empty and contains the required keywords. Each task gets its own model, and each task has a fallback when the primary fails.
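That setup can be sketched as a per-task routing table with a primary model, a fallback, and an optional output check. Everything here is illustrative: the task names, route names, required keyword, and `call_model` callable are assumptions, not a real SDK.

```python
def non_empty_with_keywords(required):
    """Build a verifier: output must be non-empty and contain every required keyword."""
    def check(output: str) -> bool:
        return bool(output.strip()) and all(kw in output for kw in required)
    return check

# task -> (primary route, fallback route, verifier or None). All names hypothetical.
ROUTES = {
    "support_tone_rewrite":   ("claude-sonnet", "gpt-4o", None),
    "internal_draft":         ("gpt-4o-mini", "claude-sonnet", None),
    "dashboard_autocomplete": ("gemini-flash", "gpt-4o-mini",
                               non_empty_with_keywords(["status"])),
}

def run_task(task: str, call_model) -> str:
    """call_model is any callable mapping a route name to that model's output string."""
    primary, fallback, verify = ROUTES[task]
    output = call_model(primary)
    if verify is not None and not verify(output):
        # Primary failed the check (e.g. empty or missing keywords): retry on fallback.
        output = call_model(fallback)
    return output
```

The point of the table is that the fallback is declared next to the primary, so "what happens when the fast model truncates" is a routing decision, not an incident.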
That is the comparison report you will ship in the capstone.
Next module: where each model "listens hardest" — system prompt vs user message, and how loyally each holds the persona under pressure.