Reasoning Models
Dedicated reasoning models — how they work, when to use them, and what they cost
Reasoning models are a different beast from standard LLMs. Where chain-of-thought prompting asks a regular model to show its work, reasoning models are specifically trained to generate long internal reasoning traces before producing an answer. The result is meaningfully better performance on hard problems — at meaningfully higher cost and latency.
The Landscape
The major reasoning models as of early 2025:
- OpenAI o1 / o3 / o4-mini — the o-series. o1 was the first widely available reasoning model. o3 pushed performance further. o4-mini offers a cheaper, faster option that retains most of the reasoning benefit.
- Claude with extended thinking — Anthropic's approach. Rather than a separate model, extended thinking is a mode that gives Claude a scratchpad for internal reasoning before responding.
- DeepSeek-R1 — an open-weight reasoning model that demonstrated you don't need a closed API to get strong reasoning. Notable for being trainable and self-hostable.
- Gemini thinking models — Google's entry, following the same pattern of dedicated reasoning traces.
How They Differ from Standard LLMs
Standard LLMs generate answers token by token, left to right. Whatever reasoning happens is implicit in the generation process. Reasoning models add an explicit thinking phase:
- The model receives the prompt.
- It generates a (sometimes very long) internal reasoning trace — exploring approaches, checking its work, backtracking.
- It produces the final answer.
The key differences:
- Thinking tokens are real compute. They count toward your bill and your latency budget. A hard problem might generate thousands of thinking tokens before a short answer.
- You often can't see the full chain. OpenAI's o-series shows a summary; Claude's extended thinking shows the raw trace. DeepSeek-R1 shows everything.
- They self-correct more reliably. The training process teaches these models to notice and fix mistakes during the thinking phase, not just generate plausible-looking reasoning.
Cost and Latency Tradeoffs
Reasoning models are expensive. Expect:
- 2-10x the token usage compared to a standard model on the same task, because of thinking tokens.
- Higher per-token prices on some providers (OpenAI charges differently for o-series reasoning tokens).
- Significantly higher latency — seconds to minutes for hard problems, versus sub-second for standard models.
The cost is justified when the task is genuinely hard. It's wasted when the task is easy.
When to Use Reasoning Models
Reasoning models earn their cost on:
- Complex math and logic — multi-step proofs, competition-level problems.
- Hard code generation — algorithmic problems, subtle bugs, complex refactors.
- Constrained planning — scheduling, resource allocation, multi-step workflows.
- Analysis requiring multiple perspectives — legal reasoning, medical differential diagnosis, architecture decisions.
- Tasks where correctness matters more than speed — you'd rather wait 30 seconds for a right answer than get a wrong one instantly.
When to Use CoT Prompting Instead
Don't default to reasoning models. Standard models with chain-of-thought prompting are often sufficient and much cheaper:
- The problem is moderately hard — needs some reasoning but not deep search.
- You need low latency — sub-second responses.
- You're running at high volume — the cost difference at scale is enormous.
- The task benefits from few-shot examples — reasoning models sometimes ignore examples because their training pushes them toward their own reasoning style.
A good heuristic: try the task with a standard model and CoT first. If the accuracy isn't good enough, switch to a reasoning model. Don't start with the expensive option.
Routing Between Models
The most cost-effective production pattern is a router that sends easy tasks to fast models and hard tasks to reasoning models:
- Classifier-based routing — train a small model to predict task difficulty; route accordingly.
- Confidence-based routing — run the fast model first; if its confidence is low, escalate to the reasoning model.
- Domain-based routing — certain task types always go to reasoning models (math, code review); others never do (summarization, translation).
The savings are substantial. In most workloads, 80-90% of requests are easy enough for a fast model.
Working with Extended Thinking (Claude)
Claude's extended thinking mode has specific patterns worth knowing:
- Budget control — you can set a maximum thinking budget. Start low and increase only when needed.
- The thinking trace is visible — unlike o-series summaries, you see the raw reasoning. Use it for debugging.
- Thinking works with tools — the model can reason about tool results within its thinking phase.
- Don't over-prompt — reasoning models already know to think step by step. Adding "think carefully" can actually interfere with their trained reasoning patterns.
Working with DeepSeek-R1
DeepSeek-R1 is notable for being open-weight:
- Self-hostable — run it on your own infrastructure for privacy-sensitive reasoning tasks.
- Fine-tunable — you can adapt the reasoning style to your domain.
- Full trace visibility — no summarization, you see every reasoning step.
- Smaller distilled variants — R1-distill models give you some reasoning benefit at lower cost.
The tradeoff: self-hosting means you manage the infrastructure, and the base model is less capable than the top closed models on the hardest tasks.
Practical Guidelines
- Benchmark on your actual tasks, not public benchmarks. Reasoning model advantages vary enormously by domain.
- Set thinking budgets where the API allows it. Unbounded thinking on easy tasks wastes money.
- Don't chain reasoning models in multi-step pipelines unless each step genuinely needs deep reasoning. Use fast models for the easy steps.
- Log thinking traces for debugging, even if you don't show them to users. They're invaluable for understanding failures.
- Watch for reasoning model overconfidence — when they reason their way to a wrong answer, they're often very confident about it. Always pair with verification where possible.