System-prompt behavior — where each model listens hardest
Faithful summarisation — count the embellishments
The instruction here is unusually demanding: "Do not add facts not in the source. Do not soften or strengthen claims." Three sentences exactly. The source is a small pilot study with explicit caveats from the author. This is a faithfulness test — does the model preserve the qualifiers, or does it polish them into something cleaner that the original did not say?
The prompt
Summarise this in exactly 3 sentences. Do not add facts not in the source. Do not soften or strengthen claims.
Source: "In a small pilot, 14 of 22 participants reported feeling more focused after switching from afternoon coffee to herbal tea. The pilot ran for 2 weeks. The author cautions the sample is small and not blinded; results may not generalise."
Claude Sonnet 4.5
In a small pilot study, 14 of 22 participants reported feeling more focused after switching from afternoon coffee to herbal tea. The pilot study lasted for 2 weeks. The author cautions that the sample is small and not blinded, and results may not generalise.
GPT-4o-mini
In a small pilot study, 14 of 22 participants reported increased focus after switching from afternoon coffee to herbal tea. The study lasted for 2 weeks. The author warns that the sample size is small and not blinded, so the results may not be generalizable.
Gemini 2.5 Flash
A small, two-week pilot study
Side-by-side scorecard
Faithful summary: did the qualifiers survive?
| Model | Qualifiers | Verdict |
| --- | --- | --- |
| Claude Sonnet 4.5 | Preserves the author's qualifiers | Safest for compliance digests |
| GPT-4o-mini | Two small strengthenings | Risky at scale for research digests |
| Gemini 2.5 Flash | Stopped at "A small, two-week pilot study" | Unusable as a summary |
What this tells us
Claude is the most literal of the three. Its summary is essentially a re-typesetting of the source: same numbers (14 of 22), same time (2 weeks), same author qualifiers ("small and not blinded", "may not generalise"). Three sentences exactly. No new claims.
GPT-4o-mini changes two phrases. "Reported feeling more focused" became "reported increased focus", a tiny strengthening: "feeling more focused" is a subjective report, while "increased focus" sounds slightly more like an objective measurement. The other change: "the author cautions" became "the author warns", which is semantically close but stronger. Both shifts are small, and in a single summary they look harmless. Across thousands of summaries in a research-digest pipeline, though, they accumulate into a systematic over-strengthening of claims.
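At pipeline scale, one cheap guard is a lexical check that flags summaries dropping the source's hedging vocabulary. A minimal sketch; the hedge list and the substring matching are illustrative assumptions, not a validated lexicon:

```python
# Crude lexical drift check: flag summaries that lose hedging terms
# present in the source. The hedge list is an illustrative assumption;
# extend and validate it for your own domain.
HEDGES = {"may", "might", "cautions", "pilot", "small",
          "not blinded", "preliminary", "suggests"}

def hedge_terms(text: str) -> set[str]:
    """Hedge terms present in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return {h for h in HEDGES if h in lowered}

def dropped_hedges(source: str, summary: str) -> set[str]:
    """Hedge terms the summary lost relative to the source."""
    return hedge_terms(source) - hedge_terms(summary)

source = ("In a small pilot, 14 of 22 participants reported feeling more "
          "focused after switching from afternoon coffee to herbal tea. "
          "The pilot ran for 2 weeks. The author cautions the sample is "
          "small and not blinded; results may not generalise.")
gpt_summary = ("In a small pilot study, 14 of 22 participants reported "
               "increased focus after switching from afternoon coffee to "
               "herbal tea. The study lasted for 2 weeks. The author warns "
               "that the sample size is small and not blinded, so the "
               "results may not be generalizable.")

print(dropped_hedges(source, gpt_summary))  # {'cautions'}
```

A check this crude cannot tell that "warns" replaced "cautions" with a stronger verb; it only notices that the softer word disappeared, which is often enough to route the summary to a human.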
Gemini truncated. "A small, two-week pilot study" is a five-word fragment that stops mid-sentence: no claim, no number, no caveat. It is useless as a summary, and worse than useless if downstream code assumes it has a real one.
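A cheap guard against exactly this failure is validating the output's shape before accepting it. A minimal sketch, assuming the three-sentence contract from the prompt above; the sentence counting is a deliberately crude heuristic:

```python
import re

def looks_like_valid_summary(text: str, expected_sentences: int = 3) -> bool:
    """Reject empty, truncated, or wrong-length summaries.

    Crude heuristics: the text must end in terminal punctuation and
    contain the expected number of sentence boundaries. A fragment
    like "A small, two-week pilot study" fails both checks.
    """
    text = text.strip()
    if not text or text[-1] not in ".!?":
        return False  # fragment: does not terminate like a sentence
    boundaries = re.findall(r"[.!?](?:\s|$)", text)
    return len(boundaries) == expected_sentences

print(looks_like_valid_summary("A small, two-week pilot study"))  # False
```

Abbreviations and decimals will confuse naive boundary counting, so a real pipeline would use a proper sentence tokenizer; even this version, though, catches the Gemini failure before it propagates.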
For a faithful-summarisation task, the kind that feeds research digests, regulatory filings, and medical content review, Claude's literal preservation is the safest default. GPT-4o-mini works if you can accept some semantic drift, which is fine for marketing summaries but dangerous for compliance work. Gemini Flash, on this task, is not a candidate at all.
Notice that all three got the numbers right whenever they stated them: 14 of 22, 2 weeks. The drift is not in the numbers. It is in the qualifiers, and the qualifiers are exactly what makes a summary faithful or unfaithful.
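That split suggests what can be automated: numeric consistency is mechanically checkable, qualifier fidelity mostly is not. A sketch of the mechanical half (regex-only, so it misses spelled-out figures like Gemini's "two-week", and an empty summary passes trivially; pair it with the shape check above):

```python
import re

def numbers_in(text: str) -> set[str]:
    """Digit tokens present in the text: '14', '22', '2', ..."""
    return set(re.findall(r"\d+(?:\.\d+)?", text))

def numbers_preserved(source: str, summary: str) -> bool:
    """True if the summary introduces no number the source lacks."""
    return numbers_in(summary) <= numbers_in(source)

source = "14 of 22 participants; the pilot ran for 2 weeks."
print(numbers_preserved(source, "14 of 22 felt more focused over 2 weeks."))  # True
print(numbers_preserved(source, "15 of 22 felt more focused."))               # False
```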
Captured 2026-04-27 from Claude Sonnet 4.5, GPT-4o-mini, and Gemini 2.5 Flash. Re-runs may differ slightly.
Next: a translation guide for moving prompts between vendors when one is failing.