Reasoning modes — when to think, when to skip
When extended thinking actually pays back
The previous lesson showed that Claude and GPT-4o-mini both solved the small race-order puzzle with the base model alone in 3-4 seconds. Adding thinking mode to that prompt would have produced the same answer in 30+ seconds at a higher cost. So thinking mode is not always a win. The question is when it is.
Where thinking mode pays back vs where it taxes you
| Task | Verdict | Why |
| --- | --- | --- |
| Reasoning puzzle | Pays back | Multi-step deduction benefits most; correctness is verifiable |
| Code review for prod | Pays back | High cost of a wrong answer; latency not user-facing |
| Tone rewrite / summary | Taxes you | Pure forward-pass task gains nothing; cost without benefit |
| Structured extraction | Taxes you | Schema enforcement > thinking for this task |
The four cases where thinking pays back
1. The base model gets it wrong. This is the most obvious case. If you can verify the answer (you have a known-correct test set, or the problem is mathematical) and the base model is wrong, escalating to thinking mode often fixes it; the first sketch after this list shows a minimal verification harness. Internal benchmarks at the major labs show 10-30 percentage point lifts on competition math, multi-step coding, and constraint-satisfaction problems.
2. The task has a known reasoning failure mode. If your task is "estimate the answer to this Fermi problem", base models tend to skip the rigorous decomposition and write a plausible-sounding number. Thinking mode forces the decomposition. Same for "is this code correct?" review tasks where the base model can miss subtle bugs.
3. Latency is not user-facing. If the prompt runs in a background batch job at 3 AM, the 30-60 second latency penalty does not matter. Pay it. The cost difference per call is also typically irrelevant at low volumes — a research-summary digest job that runs once a day is fine to put on thinking mode if it improves quality.
4. The cost of a wrong answer dominates the cost of compute. If the model is making a recommendation that gets sent to a human reviewer or to a real-money decision, paying 5x more for the answer is cheap compared to the downstream cost of a bad recommendation. Code review for production systems, contract clause analysis, financial analysis — all places thinking mode pays back. The second sketch below shows what enabling thinking looks like in practice.
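For case 1, a minimal verification harness, assuming the `anthropic` Python SDK. The model ID, test items, and substring check are placeholders, not a real benchmark:

```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical test set: (prompt, known-correct answer) pairs.
TEST_SET = [
    ("If Alice finishes before Bob, and Bob before Carol, who is last?", "Carol"),
    ("What is 17 * 24? Answer with the number only.", "408"),
]

def ask_base(prompt: str) -> str:
    resp = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model ID
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

misses = [(p, a) for p, a in TEST_SET if a not in ask_base(p)]
print(f"base model missed {len(misses)}/{len(TEST_SET)}")
# Only if this number is meaningfully above zero is escalation worth trying.
```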
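For cases 3 and 4, a minimal sketch of enabling extended thinking on a background or high-stakes call, again assuming the `anthropic` SDK. The model ID, token budget, and prompt are illustrative:

```python
from anthropic import Anthropic

client = Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=16000,           # must exceed the thinking budget below
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": "Review this diff for bugs before it ships: ...",  # diff elided
    }],
)

# With thinking enabled, the response mixes thinking blocks and text
# blocks; keep only the final answer text.
answer = "".join(b.text for b in resp.content if b.type == "text")
print(answer)
```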
When thinking is wasted spend
Most production prompts. Tone rewrites do not get better with thinking; the model already knew how to make the email warmer. Classifications do not get better with thinking; the answer is already in the embedding space. Structured extractions where the data is right there in the input do not benefit. JSON formatting does not benefit.
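For the extraction case, the cheaper fix is to enforce the schema rather than enable thinking. A minimal sketch using Anthropic's forced tool choice, with a hypothetical tool name and schema:

```python
from anthropic import Anthropic

client = Anthropic()

invoice_tool = {
    "name": "record_invoice",  # hypothetical tool and schema
    "description": "Record fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "due_date": {"type": "string"},
        },
        "required": ["vendor", "total", "due_date"],
    },
}

resp = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model ID
    max_tokens=512,
    tools=[invoice_tool],
    tool_choice={"type": "tool", "name": "record_invoice"},  # force schema-shaped output
    messages=[{"role": "user", "content": "Extract the fields from this invoice: ..."}],
)

extracted = next(b.input for b in resp.content if b.type == "tool_use")
print(extracted)  # a dict matching the schema, no thinking tokens spent
```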
A useful test before turning on thinking: is your existing failure rate driven by the model not "trying hard enough", or by the model not having information it needs? Thinking mode fixes the first case. It does not fix the second. If the right answer requires fresh data the model does not have, thinking will produce a more elaborate wrong answer, not a right one.
A pragmatic ladder
When you suspect a task needs more reasoning than the base model is giving:
1. Try prompt-engineered chain-of-thought first. Add "think step by step" or "show your reasoning before the answer". Free, fast, often enough.
2. Try few-shot. Two or three worked examples in the prompt teach the pattern. Also free, also fast.
3. Escalate to a larger base model. Claude Opus 4.5 instead of Sonnet 4.5. GPT-4o instead of GPT-4o-mini. Gemini 2.5 Pro instead of Flash. Higher per-call cost, but no thinking-mode latency tax.
4. Then try thinking mode. Only if (1)–(3) did not move the needle. By this point you should have a benchmark you can compare against; the sketch below walks the whole ladder.
The order matters. Skipping straight to thinking mode is the most expensive way to find out you needed few-shot.
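The ladder as code, for concreteness. The rung definitions, model IDs, test set, and quality bar are all hypothetical scaffolding; the point is the cost-ordered escalation with an early stop:

```python
from anthropic import Anthropic

client = Anthropic()
TARGET = 0.95  # placeholder quality bar

# Hypothetical benchmark: (prompt, expected substring) pairs.
TEST_SET = [
    ("If Alice finishes before Bob, and Bob before Carol, who is last?", "Carol"),
    ("What is 17 * 24? Answer with the number only.", "408"),
]
FEW_SHOT = "Q: ...\nA: ...\n\n"  # elided worked examples

def call(model: str, prompt: str, thinking: bool) -> str:
    kwargs = dict(model=model, max_tokens=16000,
                  messages=[{"role": "user", "content": prompt}])
    if thinking:
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 8000}
    resp = client.messages.create(**kwargs)
    return "".join(b.text for b in resp.content if b.type == "text")

def accuracy(transform, model, thinking) -> float:
    hits = sum(ans in call(model, transform(p), thinking)
               for p, ans in TEST_SET)
    return hits / len(TEST_SET)

# Rungs in cost order: (name, prompt transform, model, thinking?).
LADDER = [
    ("chain-of-thought", lambda p: p + "\nThink step by step.", "claude-sonnet-4-5", False),
    ("few-shot",         lambda p: FEW_SHOT + p,                "claude-sonnet-4-5", False),
    ("larger model",     lambda p: p,                           "claude-opus-4-5",   False),
    ("thinking mode",    lambda p: p,                           "claude-sonnet-4-5", True),
]

for name, transform, model, thinking in LADDER:
    acc = accuracy(transform, model, thinking)
    print(f"{name}: {acc:.0%}")
    if acc >= TARGET:
        break  # stop at the cheapest rung that clears the bar
```

Each rung logs its score, so by the time you reach thinking mode you already have the benchmark step 4 asks for.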
Next: creative writing — where thinking mode is sometimes a tax with no benefit.