System-prompt behavior — where each model listens hardest

Faithful summarisation — count the embellishments

5 min read

The instruction here is unusually demanding: "Do not add facts not in the source. Do not soften or strengthen claims." Three sentences exactly. The source is a small pilot study with explicit caveats from the author. This is a faithfulness test — does the model preserve the qualifiers, or does it polish them into something cleaner that the original did not say?

The prompt

Summarise this in exactly 3 sentences. Do not add facts not in the source. Do not soften or strengthen claims.

Source: "In a small pilot, 14 of 22 participants reported feeling more focused after switching from afternoon coffee to herbal tea. The pilot ran for 2 weeks. The author cautions the sample is small and not blinded; results may not generalise."

Claude Sonnet 4.5

In a small pilot study, 14 of 22 participants reported feeling more focused after switching from afternoon coffee to herbal tea. The pilot study lasted for 2 weeks. The author cautions that the sample is small and not blinded, and results may not generalise.

GPT-4o-mini

In a small pilot study, 14 of 22 participants reported increased focus after switching from afternoon coffee to herbal tea. The study lasted for 2 weeks. The author warns that the sample size is small and not blinded, so the results may not be generalizable.

Gemini 2.5 Flash

A small, two-week pilot study

Side-by-side scorecard

Faithful summary: did the qualifiers survive?

Claude Sonnet 4.5
Verdict: Literal

Sentence count: 3 (target)
Numbers preserved: Yes (14 of 22, 2 weeks)
Verb shift in caveat: None ('cautions' kept)
Claim strengthened: No
Truncated: No
Pros
  • Preserves author qualifiers
  • Safest for compliance digests

GPT-4o-mini
Verdict: Drifts subtly

Sentence count: 3 (target)
Numbers preserved: Yes (14 of 22, 2 weeks)
Verb shift in caveat: 'cautions' → 'warns'
Claim strengthened: Slight ('feeling more focused' → 'increased focus')
Truncated: No
Cons
  • Two small strengthenings
  • Risky at scale for research digests

Gemini 2.5 Flash
Verdict: Truncated

Sentence count: 1 fragment
Numbers preserved: Partial
Verb shift in caveat: Caveat lost
Claim strengthened: Cannot tell
Truncated: Yes
Cons
  • Stopped at 'A small, two-week pilot study'
  • Unusable as a summary

What this tells us

Claude is the most literal of the three. Its summary is essentially a re-typesetting of the source: same numbers (14 of 22), same time (2 weeks), same author qualifiers ("small and not blinded", "may not generalise"). Three sentences exactly. No new claims.

GPT-4o-mini changes two phrases. "Reported feeling more focused" became "reported increased focus" — a tiny strengthening: "feeling more focused" is a subjective report, "increased focus" sounds slightly more like an objective measurement. The other change: "the author cautions" became "the author warns" — semantically close, but "warns" is stronger than "cautions". Both shifts are small. In a single summary they look harmless. Across thousands of summaries in a research-digest pipeline, they accumulate into a systematic over-strengthening of claims.
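One lightweight way to surface this kind of drift at scale is a word-level diff between source and summary. The helper below is an illustrative sketch using Python's standard-library difflib, not part of any vendor's tooling; it flags substituted phrases so a human reviewer can judge whether a swap strengthens the claim.

```python
import difflib

def word_substitutions(source: str, summary: str) -> list:
    """Return (source phrase, summary phrase) pairs where words were swapped,
    so a reviewer can spot strengthened verbs like 'cautions' -> 'warns'."""
    src, out = source.split(), summary.split()
    matcher = difflib.SequenceMatcher(None, src, out)
    return [(" ".join(src[i1:i2]), " ".join(out[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag == "replace"]
```

Running it on the caveat sentence pair surfaces the single substitution: `word_substitutions("the author cautions that", "the author warns that")` returns `[("cautions", "warns")]`. The diff cannot tell you the swap is a strengthening; it only makes the swap visible instead of letting it hide in fluent prose.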

Gemini truncated. "A small, two-week pilot study" is a bare noun phrase that stops mid-sentence. There is no claim, no number, no caveat. It would be useless as a summary, and worse than useless if downstream code assumes it has a real summary.
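A truncated output like this is easy to catch before it reaches downstream code. A minimal sanity gate, sketched here as an illustration (the function name and thresholds are assumptions, not a real API), checks for sentence-final punctuation and the expected sentence count:

```python
import re

def looks_complete(summary: str, expected_sentences: int = 3) -> bool:
    """Reject outputs that are truncated or miss the sentence target,
    e.g. the fragment 'A small, two-week pilot study'."""
    text = summary.strip()
    # Fragments rarely end with sentence-final punctuation.
    if not text.endswith((".", "!", "?")):
        return False
    # Rough sentence split; good enough for a sanity gate.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]
    return len(sentences) == expected_sentences
```

The Gemini fragment fails immediately (no terminal period, one fragment instead of three sentences), while the Claude and GPT-4o-mini outputs pass. A gate like this does not measure faithfulness, but it stops a pipeline from treating a dropped connection or truncated generation as a real summary.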

For a faithful-summarisation task — the kind that goes into research digests, regulatory filings, medical content review — Claude's literal preservation is the safest default. GPT-4o-mini works if you accept some semantic drift, which is fine for marketing summaries but dangerous for compliance work. Gemini Flash on this task is not a candidate at all.

Notice that all three got the count right when they said anything: 14 of 22, 2 weeks. The drift is not in the numbers. It is in the qualifiers — and the qualifiers are exactly what makes a summary faithful or unfaithful.
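Both properties are cheap to check mechanically. As an illustrative sketch (the caveat term list is hand-picked for this specific source, not a general solution), one check asserts that every numeric token survives, and a second does a lexical scan for the qualifiers:

```python
import re

# Hand-picked for this source; a real pipeline would need per-document terms.
CAVEAT_TERMS = ("small", "not blinded", "may not generalise")

def numbers_preserved(source: str, summary: str) -> bool:
    """Every numeric token in the source must reappear in the summary."""
    return set(re.findall(r"\d+", source)) <= set(re.findall(r"\d+", summary))

def caveats_preserved(summary: str) -> bool:
    """Cheap lexical check that the author's qualifiers survived."""
    text = summary.lower()
    return all(term in text for term in CAVEAT_TERMS)
```

On this example the number check passes for Claude and GPT-4o-mini and fails for the Gemini fragment (whose only number, "two-week", is spelled out). The caveat check is the blunter instrument — it would flag GPT-4o-mini's "generalizable" for "generalise" — which is exactly the point: qualifier drift hides in paraphrase, where exact-match checks stop working and human review has to take over.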

Captured 2026-04-27 from Claude Sonnet 4.5, GPT-4o-mini, and Gemini 2.5 Flash. Re-runs may differ slightly.

Next: a translation guide for moving prompts between vendors when one is failing.
