Tone & instruction-following across models
Strict-rule following — three models, four rules
This is the first prompt where the dialects show. The user gives four hard constraints. Each model interprets "follow ALL of these rules" differently. Read the prompt, read all three outputs, then we'll dissect them.
Same prompt → 4 strict rules → three different responses
Claude Sonnet 4.5
4/4 rules — every constraint held.
GPT-4o-mini
4/4 rules — every constraint held, warmer phrasing.
Gemini 2.5 Flash
1/4 rules — produced one line, no coffee.
The prompt
Follow ALL of these rules:
- Reply in exactly 3 lines.
- Each line must start with a different vowel (A, E, I, O, or U), in alphabetical order across the lines.
- Mention the word "coffee" exactly once.
- Do not use any line longer than 60 characters.
Topic: a quiet morning routine.
Claude Sonnet 4.5
A gentle sunrise begins the peaceful day ahead.
Each sip of coffee warms me from the inside out.
I stretch slowly, grateful for this quiet moment.
GPT-4o-mini
Awakening slowly, I savor the calm of dawn.
Every sip of coffee warms my soul as I reflect.
In silence, I embrace the peace before the day begins.
Gemini 2.5 Flash
A quiet start begins the day
Side-by-side scorecard
Four-rule prompt: which model held the line?
Claude Sonnet 4.5
- Followed every constraint
- Compact phrasing
GPT-4o-mini
- Hit every rule
- Warmer voice by default
Gemini 2.5 Flash
- Stopped at one short line
- Failed three of four constraints
What this tells us
Count the rules each model passed. Claude hit all four: 3 lines, A/E/I starts in alphabetical order, "coffee" exactly once, and every line under 60 characters. GPT-4o-mini also hit all four — its second line ("Every sip of coffee warms my soul as I reflect.") is 47 characters and its third ("In silence, I embrace the peace before the day begins.") is 54, both under the limit. The two differ in voice, not in compliance.
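This rule-counting can be done mechanically rather than by eye. Here is a minimal sketch of a checker for the four constraints in the prompt above (the function and rule names are mine, not from any published harness):

```python
import re

MAX_LEN = 60  # rule 4: no line longer than 60 characters

def check_rules(reply: str) -> dict[str, bool]:
    """Score a reply against the four hard constraints from the prompt."""
    lines = [ln.strip() for ln in reply.strip().splitlines() if ln.strip()]
    starts = [ln[0].upper() for ln in lines]
    return {
        "exactly_3_lines": len(lines) == 3,
        # three distinct vowels, in alphabetical order across the lines
        "vowel_order": (len(lines) == 3
                        and all(c in "AEIOU" for c in starts)
                        and len(set(starts)) == 3
                        and starts == sorted(starts)),
        "coffee_once": len(re.findall(r"\bcoffee\b", reply, re.I)) == 1,
        "lines_under_60": all(len(ln) <= MAX_LEN for ln in lines),
    }

claude = ("A gentle sunrise begins the peaceful day ahead.\n"
          "Each sip of coffee warms me from the inside out.\n"
          "I stretch slowly, grateful for this quiet moment.")
print(sum(check_rules(claude).values()), "/ 4")  # prints: 4 / 4
```

Run against the three outputs above, this scores Claude and GPT-4o-mini at 4/4 and Gemini's single line at 1/4 (it passes only the length rule), matching the scorecard.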
Gemini did something different and worth naming. It returned a single 28-character line. It did not produce 3 lines, did not use three different vowel starts, and did not include the word "coffee". It failed three of the four hard rules and stopped early. This is not random — Gemini 2.5 Flash, the cheapest and fastest of the three, has a documented tendency to truncate when the prompt has many constraints. The model is optimised for latency, and constraint-counting eats into that budget.
The lesson is not that Gemini is broken. The lesson is that a prompt with N hard constraints has a different success rate on each model, and you have to know that before you ship. If your application sends a 4-rule prompt to Gemini Flash and expects all 4 followed, you will hit failures in production.
Captured 2026-04-27 from Claude Sonnet 4.5, GPT-4o-mini, and Gemini 2.5 Flash. Re-runs may differ slightly.
Next: same task, but the goal is warmth — a different axis on which the same three models diverge just as sharply.