Steven's Knowledge

Reasoning Tradeoffs

Fast thinking vs slow thinking — when to spend compute on reasoning and when to skip it

Every reasoning technique costs something: tokens, latency, money. The art isn't knowing how to make a model think harder — it's knowing when to. Most requests hitting your API don't need chain of thought, extended thinking, or self-reflection. The ones that do need it really need it. Getting the routing right is the single biggest lever for cost-efficient AI systems.

Fast Thinking vs Slow Thinking

Borrow Kahneman's framing: models have a System 1 (fast, pattern-matching, cheap) and a System 2 (slow, deliberate, expensive). The parallel isn't perfect, but it's useful:

  • System 1 tasks: sentiment analysis, classification, simple extraction, translation, summarization. The model knows the answer immediately. Adding reasoning adds cost without adding accuracy.
  • System 2 tasks: multi-step math, code with tricky logic, constrained planning, problems requiring search over possibilities. These genuinely benefit from more thinking.

The mistake most teams make is applying System 2 everywhere. It feels safer — "let the model think" — but it's wasteful and sometimes counterproductive.

When NOT to Use Reasoning

Be explicit about when to skip it:

  • Classification tasks: sentiment, intent detection, toxicity scoring. These are pattern recognition, not reasoning. CoT typically adds little or no accuracy while roughly doubling the token count.
  • Simple extraction — pulling a name, date, or number from text. The model sees it or it doesn't; no chain will help.
  • Translation — unless the text contains domain-specific ambiguity, standard generation is better. CoT in translation produces awkward, over-literal output.
  • Creative writing — step-by-step reasoning kills spontaneity. The output reads like a committee wrote it.
  • High-volume, low-stakes tasks — if you're classifying millions of support tickets, the 2x cost of CoT multiplied by volume is enormous, and the marginal accuracy gain is negligible.

Rule of thumb: if a human expert could answer in under 5 seconds, the model probably doesn't need reasoning either.

Compute-Optimal Reasoning

The goal is matching reasoning effort to task difficulty — spending just enough compute and no more.

Strategies:

  1. Tiered models — use a fast, cheap model for easy tasks and a reasoning model for hard ones. This is the biggest win.
  2. Adaptive thinking budgets — for APIs that support it (Claude extended thinking), set the thinking budget based on estimated task complexity.
  3. Early stopping — if the model's intermediate reasoning already looks confident and correct, stop generating more thinking tokens.
  4. Self-consistency with adaptive N — start with 1 sample. If confidence is high, stop. If low, sample more. Don't always sample 10 times.
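Strategy 4 can be sketched roughly as follows. This is a minimal illustration, not a reference implementation. It uses agreement between samples as the confidence signal, so it needs at least two samples before it can stop; stopping after a single sample, as described above, would need some separate confidence estimate. The `sample` callable, the function name, and the thresholds are all hypothetical.

```python
from collections import Counter
from typing import Callable


def adaptive_self_consistency(
    sample: Callable[[], str],
    max_samples: int = 10,
    min_samples: int = 2,
    agreement: float = 0.8,
) -> tuple[str, int]:
    """Sample until the leading answer's vote share reaches `agreement`.

    `sample` is a hypothetical callable that runs the model once and
    returns its final answer as a normalized string.
    Returns (majority answer, number of samples actually drawn).
    """
    votes: Counter[str] = Counter()
    for n in range(1, max_samples + 1):
        votes[sample()] += 1
        answer, count = votes.most_common(1)[0]
        # Stop early once enough samples agree with each other.
        if n >= min_samples and count / n >= agreement:
            return answer, n
    # Budget exhausted: fall back to the plain majority vote.
    return votes.most_common(1)[0][0], max_samples
```

With these defaults, an easy question whose samples agree stops after two calls, while a contested one runs up to `max_samples`. The cap is what keeps the worst case bounded.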

Reasoning Budget Management

Treat reasoning compute like a real budget:

  • Set per-request thinking caps. Claude's extended thinking takes a budget_tokens value that caps how many tokens the model may spend on thinking. Use it.
  • Track reasoning token spend separately from output tokens in your monitoring. A sudden spike in thinking tokens per request signals something changed.
  • Budget by task type. Math problems get generous thinking budgets; summarization gets none. Make this explicit in your routing logic.
  • A/B test reasoning levels. Run the same workload with and without CoT (or with different thinking budgets) and measure the actual quality difference. Often it's smaller than you expect.
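One way to make per-task budgets and separate token tracking explicit is sketched below. The task names, budget values, and `SpendTracker` class are hypothetical. The `thinking` dictionary follows the shape of Anthropic's Messages API as I understand it (`type` plus `budget_tokens`, with `max_tokens` larger than the budget); check the current documentation before relying on it. How you obtain a per-request thinking token count depends on your provider, so the tracker simply takes it as an input.

```python
from collections import defaultdict

# Hypothetical budgets in thinking tokens; 0 means thinking is disabled.
THINKING_BUDGETS = {
    "summarization": 0,
    "classification": 0,
    "code_review": 4_000,
    "math": 16_000,
}
DEFAULT_BUDGET = 0      # unknown task types default to less reasoning
OUTPUT_TOKENS = 1_024   # hypothetical cap on the visible answer


def request_params(task_type: str) -> dict:
    """Build the budget-related request parameters for one task type."""
    budget = THINKING_BUDGETS.get(task_type, DEFAULT_BUDGET)
    params: dict = {"max_tokens": OUTPUT_TOKENS}
    if budget > 0:
        params["thinking"] = {"type": "enabled", "budget_tokens": budget}
        # Leave room for the answer on top of the thinking budget.
        params["max_tokens"] = budget + OUTPUT_TOKENS
    return params


class SpendTracker:
    """Accumulates thinking and output tokens separately per task type."""

    def __init__(self) -> None:
        self.totals = defaultdict(
            lambda: {"requests": 0, "thinking": 0, "output": 0}
        )

    def record(self, task_type: str, thinking_tokens: int, output_tokens: int) -> None:
        entry = self.totals[task_type]
        entry["requests"] += 1
        entry["thinking"] += thinking_tokens
        entry["output"] += output_tokens

    def mean_thinking(self, task_type: str) -> float:
        """Average thinking tokens per request; a jump here is worth an alert."""
        entry = self.totals[task_type]
        return entry["thinking"] / entry["requests"] if entry["requests"] else 0.0
```

Keeping the budgets in one table means an A/B test of reasoning levels is a one-line change, and the tracker gives you the per-request number to compare across arms.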

Test-Time Compute Scaling

A key insight from reasoning model research: you can trade more compute at inference time for better results, similar to how more training compute improves base models. This is test-time compute scaling.

How it manifests:

  • More thinking tokens — giving the model a larger thinking budget improves accuracy on hard problems, with diminishing returns.
  • More samples — generating multiple attempts and selecting the best one (via self-consistency, verifier, or best-of-N).
  • More refinement iterations — each generate-critique-revise cycle uses compute and improves quality.

The scaling is real but sublinear. Doubling your thinking budget doesn't double your accuracy. The practical implication: there's a sweet spot for each task difficulty, and overshooting it wastes money.

Scaling curves to know:

  • Easy tasks: flat. More compute doesn't help.
  • Medium tasks: steep initially, then flat. A moderate thinking budget captures most of the gain.
  • Hard tasks: gradual improvement over a wide range. This is where big thinking budgets pay off.
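The three curve shapes suggest a simple way to pick a budget from your own measurements: keep increasing it until the accuracy gain per extra thousand tokens drops below a threshold you choose. The sketch below assumes you have already measured accuracy at a few budgets. The curves in it are invented for illustration and are not benchmark results; the function name and threshold are hypothetical.

```python
def sweet_spot(curve: list[tuple[int, float]], min_gain_per_1k: float = 0.01) -> int:
    """Return the largest thinking budget that is still worth paying for.

    `curve` is a list of (thinking_budget_tokens, accuracy) pairs sorted
    by budget. Walk up the curve and stop at the first step whose accuracy
    gain per extra 1,000 tokens falls below `min_gain_per_1k`.
    """
    best = curve[0][0]
    for (b0, a0), (b1, a1) in zip(curve, curve[1:]):
        gain_per_1k = (a1 - a0) / ((b1 - b0) / 1000)
        if gain_per_1k < min_gain_per_1k:
            break
        best = b1
    return best


# Invented curves that mimic the three shapes described above.
EASY = [(0, 0.95), (1_000, 0.95), (4_000, 0.95)]
MEDIUM = [(0, 0.70), (1_000, 0.82), (4_000, 0.86), (16_000, 0.87)]
HARD = [(0, 0.30), (1_000, 0.36), (4_000, 0.48), (16_000, 0.66)]
```

On these made-up curves the function picks no thinking for the easy task, a moderate budget for the medium one, and the largest budget for the hard one. The threshold encodes how much you are willing to pay for one more point of accuracy.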

The Cost Landscape

Put rough numbers on it. For a typical workload, relative to a standard model call with no CoT:

Approach                    | Relative cost | Relative latency | When to use
----------------------------|---------------|------------------|----------------------------------------
Standard model, no CoT      | 1x            | 1x               | Simple tasks, high volume
Standard model + CoT        | 2-3x          | 2-3x             | Moderate reasoning tasks
Reasoning model, low budget | 3-5x          | 3-5x             | Hard tasks, latency-tolerant
Reasoning model, high budget| 5-15x         | 5-20x            | Very hard tasks, correctness-critical
Self-consistency (N=5)      | 5x            | 1x (parallel)    | When accuracy matters, latency doesn't
Generate + verify pipeline  | 2-4x          | 2-3x             | When external validation exists
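To see what these multipliers mean at volume, the arithmetic can be sketched as below. The cost multipliers are copied from the table above; the approach keys, function names, and the baseline price per request are hypothetical. Latency is not modeled.

```python
# (low, high) relative cost multipliers, taken from the table above.
COST_MULTIPLIER = {
    "standard": (1, 1),
    "standard_cot": (2, 3),
    "reasoning_low": (3, 5),
    "reasoning_high": (5, 15),
    "self_consistency_n5": (5, 5),
    "generate_verify": (2, 4),
}


def cost_range(approach: str, requests: float, base_cost: float) -> tuple[float, float]:
    """Low and high total cost for `requests` calls using one approach.

    `base_cost` is the (hypothetical) cost of one standard, no-CoT request.
    """
    low, high = COST_MULTIPLIER[approach]
    baseline = requests * base_cost
    return baseline * low, baseline * high


def blended_cost_range(
    mix: dict[str, float], requests: float, base_cost: float
) -> tuple[float, float]:
    """Total cost range when traffic is split across approaches.

    `mix` maps approach name to its share of traffic; shares should sum to 1.
    """
    parts = [cost_range(a, requests * share, base_cost) for a, share in mix.items()]
    return sum(p[0] for p in parts), sum(p[1] for p in parts)
```

For example, with one million requests and an assumed baseline of 0.001 per request, sending everything to a high-budget reasoning model costs between 5,000 and 15,000, while routing 90% to the standard model and 10% to the reasoning model costs between 1,400 and 2,400.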

Building a Reasoning Strategy

For a production system, the full strategy looks like:

  1. Classify tasks by difficulty — build a rubric or train a classifier.
  2. Assign reasoning levels — none, light CoT, reasoning model, reasoning model + verification.
  3. Set budgets per level — thinking token caps, sample counts, iteration limits.
  4. Monitor and adjust — track accuracy and cost per reasoning level. Move the boundaries as models improve.
  5. Default to less reasoning — when in doubt, start with the cheaper option and escalate only on evidence.
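The five steps above can be sketched as a small router. This is an illustration under stated assumptions, not a production design: the difficulty score is assumed to come from your own rubric or classifier, the cutoffs are arbitrary, and `run` and `accept` are hypothetical callables standing in for your model call and your quality check.

```python
from enum import IntEnum
from typing import Callable


class Level(IntEnum):
    NONE = 0                # standard model, no CoT
    LIGHT_COT = 1           # standard model with a short chain of thought
    REASONING = 2           # reasoning model
    REASONING_VERIFIED = 3  # reasoning model plus verification


def initial_level(difficulty: float) -> Level:
    """Map a difficulty score in [0, 1] to a starting level.

    The cutoffs are hypothetical; step 4 (monitor and adjust) is where
    they would be tuned.
    """
    if difficulty < 0.3:
        return Level.NONE
    if difficulty < 0.6:
        return Level.LIGHT_COT
    if difficulty < 0.85:
        return Level.REASONING
    return Level.REASONING_VERIFIED


def solve(
    task: str,
    difficulty: float,
    run: Callable[[str, Level], str],
    accept: Callable[[str], bool],
) -> tuple[str, Level]:
    """Start at the cheapest plausible level and escalate only on evidence."""
    level = initial_level(difficulty)
    while True:
        answer = run(task, level)
        # Stop when the answer passes the check or nothing stronger is left.
        if accept(answer) or level == Level.REASONING_VERIFIED:
            return answer, level
        level = Level(level + 1)
```

Logging the returned level alongside accuracy and cost gives you the data needed to move the cutoffs over time. Escalation does mean a hard task misjudged as easy pays for the failed cheap attempts too, so the `accept` check should itself be cheap.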

The best AI teams aren't the ones using the most powerful reasoning everywhere. They're the ones using the right amount of reasoning for each task.
