Reasoning modes — when to think, when to skip

What 'thinking' actually means across models

4 min read

Every major vendor now ships a reasoning mode. Anthropic calls it extended thinking. OpenAI shipped a separate model family for it (the o-series — o1, o3, o4-mini). Google calls Gemini's variant deep-think or thinking. The marketing language is similar across all three. The actual mechanism is similar too. The cost profile and the right use case differ.

The shared mechanism

In all three implementations, "thinking" means the model produces a hidden reasoning trace before producing the user-visible answer. The trace can be very long — hundreds or thousands of tokens of internal monologue. You pay for those tokens (they count toward output cost on Anthropic and as "reasoning tokens" on OpenAI), and the user-visible answer arrives later because the trace is generated first.

The trace is real chain-of-thought. The model is allowed to backtrack, contradict itself, try multiple approaches, and discard the failures. The visible answer is only what the model ends up confident in.

This is different from prompt-engineered chain-of-thought (the "think step by step" pattern from the Foundation course). Prompt-engineered CoT is a single forward pass with the reasoning written into the visible output. Vendor-shipped thinking modes run a separate hidden reasoning phase first, and the visible answer is generated only after that phase completes.
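To make the "hidden phase" concrete, here is a sketch of separating the thinking trace from the visible answer. The block shapes mirror Anthropic's extended-thinking response format (a list of content blocks with `type: "thinking"` and `type: "text"`), but the response below is a hand-built stand-in, not a live API call.

```python
def split_response(content_blocks):
    """Return (thinking_trace, visible_answer) from a list of content blocks."""
    thinking = [b["thinking"] for b in content_blocks if b["type"] == "thinking"]
    answer = [b["text"] for b in content_blocks if b["type"] == "text"]
    return "\n".join(thinking), "\n".join(answer)

# Hand-built example of what a thinking-enabled response can look like.
# Note the backtracking in the trace: the model tries A, discards it, tries B.
blocks = [
    {"type": "thinking",
     "thinking": "Try approach A... no, that contradicts clue 2. Try B."},
    {"type": "text", "text": "The answer is B."},
]

trace, answer = split_response(blocks)
print(answer)      # only the post-thinking answer is user-visible
print(len(trace))  # the trace is billed even if the user never sees it
```

The key point the sketch illustrates: you are charged for both strings, but only the second one reaches the user.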

What each vendor calls it

| Vendor | Mode name | API surface | Pricing model |
| --- | --- | --- | --- |
| Anthropic | Extended thinking | `thinking: { type: "enabled", budget_tokens: N }` parameter on a regular Claude model call | Thinking tokens billed as output tokens |
| OpenAI | o-series reasoning | A separate model name (o1, o3, o4-mini) with its own endpoints | Reasoning tokens billed as a separate "reasoning" line on the bill |
| Google | Thinking / deep-think | `thinking_config` parameter on Gemini 2.5 Pro | Thinking tokens billed as output tokens |

⚠ Pricing and API details change frequently. The table above is for illustration only and may be out of date. Always verify current pricing directly with the provider before making cost decisions: Anthropic · OpenAI · Google Gemini · Google Vertex AI · AWS Bedrock · Azure OpenAI.
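The table can be restated as request payloads. The sketch below shows how "thinking" is switched on per vendor; the parameter names follow each vendor's documented shapes, but the model names are placeholders and the exact fields should be checked against current API references before use.

```python
# Anthropic: thinking is a flag on a normal model call.
anthropic_request = {
    "model": "claude-sonnet-4-5",  # placeholder model name
    "max_tokens": 4096,
    "thinking": {"type": "enabled", "budget_tokens": 2048},
    "messages": [{"role": "user", "content": "Solve the puzzle..."}],
}

# OpenAI: thinking is a different model entirely -- you select an
# o-series model rather than flipping a flag on a GPT-4o call.
openai_request = {
    "model": "o3",
    "input": "Solve the puzzle...",
}

# Google: thinking is a config block on a Gemini 2.5 Pro call.
google_request = {
    "model": "gemini-2.5-pro",
    "config": {"thinking_config": {"thinking_budget": 2048}},
    "contents": "Solve the puzzle...",
}
```

Notice the structural difference: two vendors add a parameter, one swaps the model name. That distinction matters when you build an escalation path, because "retry with thinking" means editing a flag in two cases and routing to a different model in the third.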

Three takeaways from that table. First, on Anthropic and Google, thinking is a flag on a normal model — same model, more compute. On OpenAI, thinking is a different model entirely (o-series), and you cannot enable thinking on a base GPT-4o call. Second, all three add a measurable latency tax — thinking-mode responses can take 10–60+ seconds for hard problems, vs 1–3 seconds for the same prompt on the base model. Third, all three add a measurable cost tax — thinking tokens are not free, and you are paying for the hidden trace whether or not the visible answer changed.
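The cost tax is easy to quantify. The back-of-envelope calculation below compares the same prompt with and without thinking; all prices and token counts are made-up placeholders for illustration, so substitute real numbers from your provider's current price sheet.

```python
def call_cost(input_tokens, output_tokens, thinking_tokens,
              price_in_per_m, price_out_per_m):
    # Thinking tokens are billed at the output rate on Anthropic and Google.
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * price_in_per_m
            + billed_output * price_out_per_m) / 1_000_000

PRICE_IN, PRICE_OUT = 3.00, 15.00  # hypothetical $/million tokens

base = call_cost(500, 300, 0, PRICE_IN, PRICE_OUT)
with_thinking = call_cost(500, 300, 4000, PRICE_IN, PRICE_OUT)

print(f"base:       ${base:.4f}")           # $0.0060
print(f"thinking:   ${with_thinking:.4f}")  # $0.0660
print(f"multiplier: {with_thinking / base:.1f}x")  # 11.0x
```

Even with identical visible output (300 tokens), a 4,000-token hidden trace multiplies the cost of this call by an order of magnitude. That is the tax you are paying whether or not the answer improved.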

When the latency and cost are worth it

The honest answer is: less often than the marketing suggests. Most production prompts are tone rewrites, classifications, summarisations, and structured extractions — tasks where the base model already has the answer in one forward pass. Adding thinking to those tasks is paying for a feature you do not use.

Thinking pays back when the task has true intermediate reasoning — multi-step puzzles, code generation that requires careful state-tracking, math word problems with several sub-calculations, planning tasks where the answer depends on a long chain of "if X then Y". The next two lessons run a logic puzzle through the three base models so you can see what "without thinking" looks like, then we compare to the cost of escalating to thinking mode.
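That split between one-pass tasks and true-reasoning tasks can be encoded as a routing heuristic. The task labels below are illustrative (in production you would classify the incoming request, for example with a cheap model call, before routing), and the default-to-cheap behaviour is one reasonable policy, not the only one.

```python
# Tasks where the base model typically answers in one forward pass.
ONE_PASS_TASKS = {"tone_rewrite", "classification", "summarisation", "extraction"}

# Tasks with true intermediate reasoning, where thinking tends to pay back.
REASONING_TASKS = {"multi_step_puzzle", "stateful_codegen",
                   "math_word_problem", "planning"}

def needs_thinking(task_type: str) -> bool:
    """Route a task to thinking mode only when the tax is likely worth it."""
    if task_type in REASONING_TASKS:
        return True
    if task_type in ONE_PASS_TASKS:
        return False
    # Unknown task: take the cheap path first, escalate on failure.
    return False

print(needs_thinking("classification"))     # False -> base model, fast and cheap
print(needs_thinking("multi_step_puzzle"))  # True  -> worth the latency and cost
```

The "escalate on failure" default reflects the point above: most production prompts do not need thinking, so the cheap path should be the one you reach for first.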

Next: a logic puzzle across all three base models, no thinking enabled. The output gap is wider than you would expect.

