Reasoning modes — when to think, when to skip

When extended thinking actually pays back


The previous lesson showed that both Claude and GPT-4o-mini solved the small race-order puzzle on the base model in 3-4 seconds. Adding thinking mode to that prompt would have produced the same answer in 30+ seconds at higher cost. So thinking mode is not always a win; the question is when it is.

Where thinking mode pays back vs where it taxes you

| Task | Base-model accuracy | Latency penalty | Quality lift | Verdict | Why |
| --- | --- | --- | --- | --- | --- |
| Reasoning puzzle | Sometimes wrong | 30-60 sec | 10-30 percentage points | Worth it | Multi-step deduction benefits most; correctness is verifiable |
| Code review for prod | Misses subtle bugs | 30-90 sec | Catches state-tracking bugs | Worth it for production | High cost of a wrong answer; latency not user-facing |
| Tone rewrite / summary | Already good | 10-30 sec | ~zero | Skip thinking | Pure forward-pass tasks gain nothing; cost without benefit |
| Structured extraction | Already high w/ tools | 10-30 sec | ~zero | Use tool-use mode instead | Schema enforcement beats thinking for this task |
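The verdicts above can be condensed into a routing rule. Here is a minimal sketch of one, assuming three task traits you can answer up front; the function name and thresholds are illustrative, not a library API:

```python
# Hypothetical helper encoding the table above: decide whether extended
# thinking is likely to pay back for a task. The rules are illustrative.
def should_enable_thinking(base_model_wrong: bool,
                           latency_user_facing: bool,
                           wrong_answer_costly: bool) -> bool:
    """Return True when thinking mode is likely worth its latency/cost tax."""
    if not base_model_wrong and not wrong_answer_costly:
        return False  # tone rewrites, summaries, extraction: ~zero lift
    if latency_user_facing and not wrong_answer_costly:
        return False  # a 30-90 sec penalty in an interactive path needs a strong reason
    return True       # puzzles, prod code review, high-stakes analysis

print(should_enable_thinking(True, False, True))   # reasoning puzzle in a batch job → True
print(should_enable_thinking(False, True, False))  # tone rewrite in a chat UI → False
```

The point of writing it down as code is that the decision stops being a vibe and becomes something you can argue about per task.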

The four cases where thinking pays back

1. The base model gets it wrong. This is the most obvious case. If you can verify the answer (you have a known-correct test set, or the problem is mathematical), and the base model is wrong, escalating to thinking mode often fixes it. Internal benchmarks at the major labs show 10-30 percentage point lifts on competition math, multi-step coding, and constraint-satisfaction problems.

2. The task has a known reasoning failure mode. If your task is "estimate the answer to this Fermi problem", base models tend to skip the rigorous decomposition and write a plausible-sounding number. Thinking mode forces the decomposition. Same for "is this code correct?" review tasks where the base model can miss subtle bugs.

3. Latency is not user-facing. If the prompt runs in a background batch job at 3 AM, the 30-60 second latency penalty does not matter. Pay it. The cost difference per call is also typically irrelevant at low volumes — a research-summary digest job that runs once a day is fine to put on thinking mode if it improves quality.

4. The cost of a wrong answer dominates the cost of compute. If the model is making a recommendation that gets sent to a human reviewer or to a real-money decision, paying 5x more for the answer is cheap compared to the downstream cost of a bad recommendation. Code review for production systems, contract clause analysis, financial analysis — all places thinking mode pays back.
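Case 4 is easy to sanity-check with arithmetic. A back-of-envelope sketch, using entirely assumed numbers (per-call cost, error rates, and the cost of a bad recommendation are placeholders you would replace with your own):

```python
# Back-of-envelope for case 4: thinking mode costs ~5x per call here,
# but a wrong recommendation is assumed far pricier than any API call.
base_cost_per_call = 0.02     # USD per call, assumed
thinking_multiplier = 5       # thinking mode ~5x per call, assumed
calls_per_day = 200

error_rate_base = 0.08        # assumed failure rate without thinking
error_rate_thinking = 0.02    # assumed failure rate with thinking
cost_of_wrong_answer = 50.0   # USD of reviewer time / bad decision, assumed

def daily_cost(per_call, error_rate):
    """Total expected daily cost: compute plus expected downstream damage."""
    return calls_per_day * (per_call + error_rate * cost_of_wrong_answer)

print(round(daily_cost(base_cost_per_call, error_rate_base), 2))                          # 804.0
print(round(daily_cost(base_cost_per_call * thinking_multiplier, error_rate_thinking), 2))  # 220.0
```

With these numbers, the 5x compute premium is dwarfed by the drop in expected downstream cost, which is the whole argument of case 4.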

When thinking is wasted spend

Most production prompts. Tone rewrites do not get better with thinking; the model already knows how to make the email warmer. Classifications do not get better with thinking; the answer is already in the embedding space. Structured extractions where the data is right there in the input do not benefit. JSON formatting does not benefit.

A useful test before turning on thinking: is your existing failure rate driven by the model not "trying hard enough", or by the model not having information it needs? Thinking mode fixes the first case. It does not fix the second. If the right answer requires fresh data the model does not have, thinking will produce a more elaborate wrong answer, not a right one.

A pragmatic ladder

When you suspect a task needs more reasoning than the base model is giving:

  1. Try prompt-engineered chain-of-thought first. Add "think step by step" or "show your reasoning before the answer". Free, fast, often enough.
  2. Try few-shot. Two or three worked examples in the prompt teach the pattern. Also free, also fast.
  3. Escalate to a larger base model. Claude Opus 4.5 instead of Sonnet 4.5. GPT-4o instead of GPT-4o-mini. Gemini 2.5 Pro instead of Flash. Higher per-call cost, but no thinking-mode latency tax.
  4. Then try thinking mode. Only if (1)–(3) did not move the needle. By this point you should have a benchmark you can compare against.

The order matters. Skipping straight to thinking mode is the most expensive way to find out you needed few-shot.
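The ladder above can be sketched as code. This assumes you supply a `run_benchmark` callback (a hypothetical function of yours that scores one configuration against your test set); the rung configs and target accuracy are illustrative:

```python
# A sketch of the escalation ladder. `run_benchmark` is a hypothetical
# callback you write: it takes one rung's config and returns accuracy
# on your known-correct test set. Cheapest rungs come first.
LADDER = [
    {"name": "baseline",          "model": "small", "cot": False, "shots": 0, "thinking": False},
    {"name": "chain-of-thought",  "model": "small", "cot": True,  "shots": 0, "thinking": False},
    {"name": "few-shot",          "model": "small", "cot": True,  "shots": 3, "thinking": False},
    {"name": "larger base model", "model": "large", "cot": True,  "shots": 3, "thinking": False},
    {"name": "thinking mode",     "model": "large", "cot": True,  "shots": 3, "thinking": True},
]

def climb_ladder(run_benchmark, target_accuracy=0.95):
    """Try each rung in order; stop at the cheapest config that hits target."""
    for rung in LADDER:
        accuracy = run_benchmark(rung)
        if accuracy >= target_accuracy:
            return rung["name"], accuracy
    return LADDER[-1]["name"], accuracy  # best effort: last rung's score

# Usage with fake benchmark scores, purely for illustration:
fake_scores = {"baseline": 0.70, "chain-of-thought": 0.88, "few-shot": 0.96,
               "larger base model": 0.97, "thinking mode": 0.99}
print(climb_ladder(lambda rung: fake_scores[rung["name"]]))  # ('few-shot', 0.96)
```

Because the loop stops at the first rung that clears the bar, you never pay the thinking-mode tax to discover that few-shot would have been enough.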

Next: creative writing — where thinking mode is sometimes a tax with no benefit.
