Reasoning modes — when to think, when to skip
When extended thinking actually pays back
The previous lesson showed that Claude and GPT-4o-mini both solved the small race-order puzzle with the base model alone in 3-4 seconds. Adding thinking mode to that prompt would have produced the same answer in 30+ seconds at a higher cost. So thinking mode is not always a win. The question is when it is.
Where thinking mode pays back vs where it taxes you
| Task | Verdict | Why |
| --- | --- | --- |
| Reasoning puzzle | Pays back | Multi-step deduction benefits most; correctness is verifiable |
| Code review for prod | Pays back | High cost of a wrong answer; latency not user-facing |
| Tone rewrite / summary | Taxes you | Pure forward-pass task gains nothing; cost without benefit |
| Structured extraction | Taxes you | Schema enforcement > thinking for this task |
The four cases where thinking pays back
1. The base model gets it wrong. This is the most obvious case. If you can verify the answer (you have a known-correct test set, or the problem is mathematical) and the base model is wrong, escalating to thinking mode often fixes it; the first sketch after this list shows a minimal verification harness. Internal benchmarks at the major labs show 10-30 percentage point lifts on competition math, multi-step coding, and constraint-satisfaction problems.
2. The task has a known reasoning failure mode. If your task is "estimate the answer to this Fermi problem", base models tend to skip the rigorous decomposition and write a plausible-sounding number. Thinking mode forces the decomposition. Same for "is this code correct?" review tasks where the base model can miss subtle bugs.
3. Latency is not user-facing. If the prompt runs in a background batch job at 3 AM, the 30-60 second latency penalty does not matter. Pay it. The cost difference per call is also typically irrelevant at low volumes — a research-summary digest job that runs once a day is fine to put on thinking mode if it improves quality.
4. The cost of a wrong answer dominates the cost of compute. If the model is making a recommendation that gets sent to a human reviewer or to a real-money decision, paying 5x more for the answer is cheap compared to the downstream cost of a bad recommendation. Code review for production systems, contract clause analysis, financial analysis — all places thinking mode pays back. The second sketch below shows what enabling thinking looks like in practice.
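For case 1, a minimal verification harness, assuming the `anthropic` Python SDK. The model ID, test items, and substring check are placeholders, not a real benchmark:

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical test set: (prompt, known-correct answer) pairs.
TEST_SET = [
    ("If Alice finishes before Bob, and Bob before Carol, who is last?", "Carol"),
    ("What is 17 * 24? Answer with the number only.", "408"),
]

def ask_base(prompt: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

misses = [(p, a) for p, a in TEST_SET if a not in ask_base(p)]
print(f"base model missed {len(misses)}/{len(TEST_SET)}")
# Only if this number is meaningfully above zero is escalation worth trying.
```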
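For cases 3 and 4, a minimal sketch of enabling extended thinking on a background or high-stakes call, again assuming the `anthropic` SDK. The model ID, token budget, and prompt are illustrative:

```python
from anthropic import Anthropic

client = Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=16000,           # must exceed the thinking budget below
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": "Review this diff for bugs before it ships: ...",  # diff elided
    }],
)

# With thinking enabled, the response mixes thinking blocks and text
# blocks; keep only the final answer text.
answer = "".join(b.text for b in resp.content if b.type == "text")
print(answer)
```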
When thinking is wasted spend
Most production prompts. Tone rewrites do not get better with thinking; the model already knew how to make the email warmer. Classifications do not get better with thinking; the answer is already in the embedding space. Structured extractions where the data is right there in the input do not benefit. JSON formatting does not benefit.
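For the extraction case, the cheaper fix is to enforce the schema rather than enable thinking. A minimal sketch using Anthropic's forced tool choice, with a hypothetical tool name and schema:

```python
from anthropic import Anthropic

client = Anthropic()

invoice_tool = {
    "name": "record_invoice",  # hypothetical tool and schema
    "description": "Record fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "due_date": {"type": "string"},
        },
        "required": ["vendor", "total", "due_date"],
    },
}

resp = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=512,
    tools=[invoice_tool],
    tool_choice={"type": "tool", "name": "record_invoice"},  # force schema-shaped output
    messages=[{"role": "user", "content": "Extract the fields from this invoice: ..."}],
)

extracted = next(b.input for b in resp.content if b.type == "tool_use")
print(extracted)  # a dict matching the schema, no thinking tokens spent
```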
A useful test before turning on thinking: is your existing failure rate driven by the model not "trying hard enough", or by the model not having information it needs? Thinking mode fixes the first case. It does not fix the second. If the right answer requires fresh data the model does not have, thinking will produce a more elaborate wrong answer, not a right one.
A pragmatic ladder
When you suspect a task needs more reasoning than the base model is giving:
1. Try prompt-engineered chain-of-thought first. Add "think step by step" or "show your reasoning before the answer". Free, fast, often enough.
2. Try few-shot. Two or three worked examples in the prompt teach the pattern. Also free, also fast.
3. Escalate to a larger base model. Claude Opus 4.5 instead of Sonnet 4.5. GPT-4o instead of GPT-4o-mini. Gemini 2.5 Pro instead of Flash. Higher per-call cost, but no thinking-mode latency tax.
4. Then try thinking mode. Only if (1)–(3) did not move the needle. By this point you should have a benchmark you can compare against; the sketch below walks the whole ladder.
The order matters. Skipping straight to thinking mode is the most expensive way to find out you needed few-shot.
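The ladder as code, for concreteness. The rung definitions, model IDs, test set, and quality bar are all hypothetical scaffolding; the point is the cost-ordered escalation with an early stop:

```python
from anthropic import Anthropic

client = Anthropic()
TARGET = 0.95  # placeholder quality bar

# Hypothetical benchmark: (prompt, expected substring) pairs.
TEST_SET = [
    ("If Alice finishes before Bob, and Bob before Carol, who is last?", "Carol"),
    ("What is 17 * 24? Answer with the number only.", "408"),
]
FEW_SHOT = "Q: ...\nA: ...\n\n"  # elided worked examples

def call(model: str, prompt: str, thinking: bool) -> str:
    kwargs = dict(model=model, max_tokens=16000,
                  messages=[{"role": "user", "content": prompt}])
    if thinking:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 8000}
    resp = client.messages.create(**kwargs)
    return "".join(b.text for b in resp.content if b.type == "text")

def accuracy(transform, model, thinking) -> float:
    hits = sum(ans in call(model, transform(p), thinking)
               for p, ans in TEST_SET)
    return hits / len(TEST_SET)

# Rungs in cost order: (name, prompt transform, model, thinking?).
LADDER = [
    ("chain-of-thought", lambda p: p + "\nThink step by step.", "claude-sonnet-4-5", False),
    ("few-shot",         lambda p: FEW_SHOT + p,                "claude-sonnet-4-5", False),
    ("larger model",     lambda p: p,                           "claude-opus-4-5",   False),
    ("thinking mode",    lambda p: p,                           "claude-sonnet-4-5", True),
]

for name, transform, model, thinking in LADDER:
    acc = accuracy(transform, model, thinking)
    print(f"{name}: {acc:.0%}")
    if acc >= TARGET:
        break  # stop at the cheapest rung that clears the bar
```

Each rung logs its score, so by the time you reach thinking mode you already have the benchmark step 4 asks for.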
Next: creative writing — where thinking mode is sometimes a tax with no benefit.