Reasoning modes — when to think, when to skip
A logic puzzle across three base models
A small logic puzzle is the cleanest place to see reasoning differences: three constraints, three runners, work out the order. The output format is fixed as one paragraph of reasoning followed by a 1st/2nd/3rd assignment. None of the three models had thinking mode enabled for this run; this is base-model performance.
The prompt
Three friends — Ali, Bilal, and Cyrus — finished a race in some order.
- Ali did not finish last.
- Bilal finished after Cyrus.
- Cyrus did not finish first.
Who finished in each position? Show your one-paragraph reasoning, then give the final answer in this format: 1st: 2nd: 3rd:
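Before the transcripts, here is a minimal sketch of how a run like this could be issued against all three APIs with thinking disabled everywhere. This is an assumption about the setup, not a capture of the original harness: the model IDs and the `thinking_budget=0` knob for Flash are illustrative.

```python
# Hypothetical reproduction of the base-model run: one prompt, three
# providers, no extended thinking anywhere. Model IDs are assumptions.
import anthropic
from openai import OpenAI
from google import genai
from google.genai import types

PROMPT = """Three friends — Ali, Bilal, and Cyrus — finished a race in some order.
- Ali did not finish last.
- Bilal finished after Cyrus.
- Cyrus did not finish first.
Who finished in each position? Show your one-paragraph reasoning, then
give the final answer in this format: 1st: 2nd: 3rd:"""

# Claude: omitting the `thinking` parameter keeps it in base mode.
claude = anthropic.Anthropic().messages.create(
    model="claude-sonnet-4-5",  # assumed model ID
    max_tokens=500,
    messages=[{"role": "user", "content": PROMPT}],
)

# GPT-4o-mini: a non-reasoning model, so there is nothing to disable.
gpt = OpenAI().chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": PROMPT}],
)

# Gemini 2.5 Flash: thinking_budget=0 explicitly turns thinking off.
gemini = genai.Client().models.generate_content(
    model="gemini-2.5-flash",
    contents=PROMPT,
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(thinking_budget=0)
    ),
)

print(claude.content[0].text)
print(gpt.choices[0].message.content)
print(gemini.text)
```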
Claude Sonnet 4.5
Let me work through this step by step. Since Cyrus did not finish first, he finished either 2nd or 3rd. Since Bilal finished after Cyrus, if Cyrus finished 2nd, then Bilal finished 3rd; if Cyrus finished 3rd, then Bilal would need to finish after 3rd, which is impossible. Therefore, Cyrus finished 2nd and Bilal finished 3rd. This means Ali finished 1st, which is consistent with the constraint that Ali did not finish last.
1st: Ali 2nd: Cyrus 3rd: Bilal
GPT-4o-mini
To determine the finishing order of Ali, Bilal, and Cyrus, we analyze the provided clues. First, since Ali did not finish last, he must have finished either first or second. Second, since Bilal finished after Cyrus, he cannot be in first place. Third, since Cyrus did not finish first, he must have finished either second or third. The only possible scenario that satisfies all these conditions is if Cyrus finished second, Ali finished first, and Bilal finished last. Thus, the final positions are:
1st: Ali 2nd: Cyrus 3rd: Bilal
Gemini 2.5 Flash
Cyrus could not have finished first, and Bilal finished after Cyrus,
Side-by-side scorecard
Logic puzzle: did the base model solve it?
Claude Sonnet 4.5
- Solved it
- Picks most-constrained variable first
- Shows the impossible case
GPT-4o-mini
- Solved it
- Right answer, hand-wavier explanation
- Hides the actual deduction step
Gemini 2.5 Flash
- Did not solve it
- Stopped after restating two clues
- Not a candidate for reasoning tasks
What this tells us
Claude and GPT-4o-mini both arrived at the correct answer (Ali / Cyrus / Bilal) and both produced the answer in the requested format. They got there via slightly different reasoning paths.
Claude's reasoning is constraint-elimination: pick the most-constrained variable first (Cyrus, who has two explicit constraints), enumerate the cases, eliminate the impossible one. This is the textbook approach to small constraint-satisfaction problems and Claude executes it crisply.
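The puzzle is small enough to verify mechanically. As a sanity check on both models' answers, a brute-force sketch that enumerates all 3! = 6 orderings and keeps the ones satisfying every constraint:

```python
# Enumerate every possible finishing order and test the constraints.
from itertools import permutations

runners = ["Ali", "Bilal", "Cyrus"]

def satisfies(order):
    # order[0] is 1st place, order[-1] is last (3rd) place
    ali_not_last = order[-1] != "Ali"
    bilal_after_cyrus = order.index("Bilal") > order.index("Cyrus")
    cyrus_not_first = order[0] != "Cyrus"
    return ali_not_last and bilal_after_cyrus and cyrus_not_first

solutions = [order for order in permutations(runners) if satisfies(order)]
print(solutions)  # [('Ali', 'Cyrus', 'Bilal')] — unique, matching both models
```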
GPT-4o-mini's reasoning is constraint-listing: gather what each constraint implies, narrow the search space, identify the only consistent assignment. The chain is correct but indirect: "the only possible scenario that satisfies all these conditions" is doing more work than the explanation shows, because the response never rules out Cyrus finishing third, which would push Bilal into a nonexistent fourth place. GPT-4o-mini got the answer; the explanation hides the actual deduction.
Gemini Flash truncated. Again. Two clauses in, no answer, no format. This is a base-model run, not a thinking-mode run, but Flash has produced nothing usable for any reasoning task we have shown so far. Thinking mode is supposed to fix this; Gemini 2.5 Pro with thinking_config enabled is a different beast (sketched below). But Flash on its own does not have the headroom for even a small constrained reasoning problem.
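For reference, this is roughly what the thinking-mode upgrade looks like in the google-genai SDK. The model ID and budget value are illustrative assumptions; whether the extra latency pays off is the next lesson's question.

```python
# Sketch: the same prompt against Gemini 2.5 Pro with thinking enabled.
# thinking_budget and include_thoughts are real google-genai options,
# but the budget value here is arbitrary.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment
response = client.models.generate_content(
    model="gemini-2.5-pro",  # assumed model ID
    contents=PROMPT,  # the race-puzzle prompt from the earlier sketch
    config=types.GenerateContentConfig(
        thinking_config=types.ThinkingConfig(
            thinking_budget=1024,   # tokens reserved for internal reasoning
            include_thoughts=True,  # surface a thought summary in the response
        )
    ),
)
print(response.text)
```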
The takeaway for Hagar's CTO question: if your task involves any kind of multi-step deduction, even a tiny one like this, Claude and GPT-4o-mini are both viable on the base model. Gemini Flash is not. The next lesson asks: should you upgrade to thinking mode anyway? When do the latency and cost pay back?
Captured 2026-04-27 from Claude Sonnet 4.5, GPT-4o-mini, and Gemini 2.5 Flash. Re-runs may differ slightly.
Next: when extended thinking actually pays back the latency cost.