Tone & instruction-following across models

Strict-rule following — three models, four rules


This is the first prompt where the dialects show. The user gives four hard constraints. Each model interprets "follow ALL of these rules" differently. Read the prompt, read all three outputs, then we will dissect.

Same prompt → 4 strict rules → three models scored

Claude Sonnet 4.5: 4/4 rules. Every constraint held.

GPT-4o-mini: 4/4 rules. Every constraint held.

Gemini 2.5 Flash: 1/4 rules. Produced one line, no coffee.

The prompt

Follow ALL of these rules:

  1. Reply in exactly 3 lines.
  2. Each line must start with a different vowel (A, E, I, O, or U), in alphabetical order across the lines.
  3. Mention the word "coffee" exactly once.
  4. Do not use any line longer than 60 characters.

Topic: a quiet morning routine.

Claude Sonnet 4.5

A gentle sunrise begins the peaceful day ahead.
Each sip of coffee warms me from the inside out.
I stretch slowly, grateful for this quiet moment.

GPT-4o-mini

Awakening slowly, I savor the calm of dawn.
Every sip of coffee warms my soul as I reflect.
In silence, I embrace the peace before the day begins.

Gemini 2.5 Flash

A quiet start begins the day

Side-by-side scorecard

Four-rule prompt: which model held the line?

Claude Sonnet 4.5 (4 / 4 rules)

Lines returned: 3
Vowel order A-E-I: Yes
'coffee' exactly once: Yes
All lines ≤ 60 chars: Yes
Truncated: No
Pros
  • Followed every constraint
  • Compact phrasing

GPT-4o-mini (4 / 4 rules)

Lines returned: 3
Vowel order A-E-I: Yes
'coffee' exactly once: Yes
All lines ≤ 60 chars: Yes
Truncated: No
Pros
  • Hit every rule
  • Warmer voice by default

Gemini 2.5 Flash (1 / 4 rules)

Lines returned: 1 (need 3)
Vowel order A-E-I: No
'coffee' exactly once: Missing
All lines ≤ 60 chars: Yes
Truncated: Yes
Cons
  • Stopped at one short line
  • Failed three of four constraints

What this tells us

Count the rules each model passed. Claude hit all four: 3 lines, A/E/I in order, "coffee" once, every line within 60 chars. GPT-4o-mini also hit all four: 3 lines, A/E/I in order, "coffee" once, and its two longest lines are well inside the limit. The second line ("Every sip of coffee warms my soul as I reflect.") is 47 chars, and the third ("In silence, I embrace the peace before the day begins.") is 54 chars. On this run the two models differ in voice, not in compliance: GPT simply uses friendlier phrasing.
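The rule count above can be reproduced mechanically. The following is a minimal sketch, not the scoring code used for this lesson: the function name `check_rules` is hypothetical, the line breaks in the captured replies are assumed to fall at sentence boundaries, blank lines are ignored, and a reply with the wrong number of lines is assumed to fail the vowel rule as well.

```python
import re

VOWELS = "AEIOU"
MAX_LINE_CHARS = 60


def check_rules(reply: str) -> dict:
    """Score one reply against the four rules in the prompt."""
    # Assumption: blank lines do not count as lines of the reply.
    lines = [line for line in reply.splitlines() if line.strip()]
    firsts = [line.lstrip()[0].upper() for line in lines]
    return {
        "exactly_3_lines": len(lines) == 3,
        # Strictly ascending vowels are also distinct vowels.
        # Assumption: the rule cannot pass without exactly 3 lines.
        "vowel_order": len(lines) == 3
        and all(c in VOWELS for c in firsts)
        and all(a < b for a, b in zip(firsts, firsts[1:])),
        "coffee_once": len(re.findall(r"\bcoffee\b", reply, re.IGNORECASE)) == 1,
        "max_60_chars": all(len(line) <= MAX_LINE_CHARS for line in lines),
    }


# Captured replies, with line breaks assumed at sentence boundaries.
REPLIES = {
    "Claude Sonnet 4.5": (
        "A gentle sunrise begins the peaceful day ahead.\n"
        "Each sip of coffee warms me from the inside out.\n"
        "I stretch slowly, grateful for this quiet moment."
    ),
    "GPT-4o-mini": (
        "Awakening slowly, I savor the calm of dawn.\n"
        "Every sip of coffee warms my soul as I reflect.\n"
        "In silence, I embrace the peace before the day begins."
    ),
    "Gemini 2.5 Flash": "A quiet start begins the day",
}

for model, reply in REPLIES.items():
    results = check_rules(reply)
    print(f"{model}: {sum(results.values())}/4", results)
```

Run against these three replies, the sketch scores Claude and GPT-4o-mini at 4/4 and Gemini at 1/4, with only the line-length rule passing for Gemini.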

Gemini did something different and worth naming. It returned a single 28-character line. It did not produce 3 lines. It did not use three different vowel starts. It did not include the word "coffee". It failed three of the four hard rules and stopped early. This is not random: Gemini 2.5 Flash, the cheapest and fastest of the three, has a documented tendency to truncate when the prompt has many constraints. The model is optimised for latency, and constraint-counting eats into that budget.

The lesson is not that Gemini is broken. The lesson is that a prompt with N hard constraints has a different success rate on each model, and you have to know that before you ship. If your application sends a 4-rule prompt to Gemini Flash and expects all 4 followed, you will hit failures in production.
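One way to learn a model's success rate before shipping is to run the same prompt many times and count compliant replies, then guard production calls with a retry. The sketch below assumes a hypothetical `generate` callable that wraps whatever model client you use; `pass_rate`, `first_compliant`, `stub_model`, and `has_three_lines` are illustrative names, the stub stands in for a real API call, and `GOOD` is a made-up reply rather than captured output.

```python
import itertools
from typing import Callable, Optional


def pass_rate(generate: Callable[[], str],
              passes: Callable[[str], bool],
              runs: int = 20) -> float:
    """Fraction of `runs` replies that satisfy the hard constraints."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    hits = sum(1 for _ in range(runs) if passes(generate()))
    return hits / runs


def first_compliant(generate: Callable[[], str],
                    passes: Callable[[str], bool],
                    max_attempts: int = 3) -> Optional[str]:
    """Return the first compliant reply, or None if every attempt fails."""
    for _ in range(max_attempts):
        reply = generate()
        if passes(reply):
            return reply
    return None


# Illustration only: GOOD is invented; BAD is the one-line reply shown earlier.
GOOD = "A calm dawn.\nEach cup of coffee helps.\nI breathe."
BAD = "A quiet start begins the day"


def has_three_lines(reply: str) -> bool:
    """Stand-in check; a real one would test all four rules."""
    return len(reply.splitlines()) == 3


def stub_model() -> Callable[[], str]:
    """Hypothetical model stand-in that alternates BAD and GOOD replies."""
    return itertools.cycle([BAD, GOOD]).__next__


print(pass_rate(stub_model(), has_three_lines, runs=10))   # 0.5
print(first_compliant(stub_model(), has_three_lines))       # the GOOD reply
```

With a real client in place of `stub_model`, the measured rate tells you how many attempts a retry guard needs, or whether the prompt should go to a different model.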

Captured 2026-04-27 from Claude Sonnet 4.5, GPT-4o-mini, and Gemini 2.5 Flash. Re-runs may differ slightly.

Next: same task, but the goal is warmth, a totally different axis the same models score very differently on.
