Steven's Knowledge

Window Selection

Choosing between 4K, 32K, 128K, and 1M context windows — cost, latency, and the long context vs RAG tradeoff

"Just use the biggest context window" sounds right until you see the bill. Window selection is an engineering decision that trades off cost, latency, reliability, and architectural complexity. The right choice depends on what you're actually doing.

The Context Window Landscape

As of mid-2025, the practical tiers are:

  • 4K-8K — the legacy range. Still fine for simple chat, classification, short extraction. Cheap and fast.
  • 32K — the sweet spot for many production workloads. Fits a substantial document, a long conversation, or a moderate codebase.
  • 128K — enough for a short book, a full legal contract, or an entire day's conversation history.
  • 200K-1M — frontier territory. Entire codebases, multi-document corpora, long audio transcripts. Available from Anthropic (200K), Google (1M+), and others.
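As a rough illustration of choosing among the tiers listed above, the sketch below picks the smallest tier that holds an input plus some room for the response. The tier sizes are rounded, and `smallest_window` and `reserved_output_tokens` are hypothetical names introduced here; real limits vary by provider and model.

```python
from bisect import bisect_left
from typing import Optional

# Illustrative tier sizes in tokens, rounded from the tiers above; not provider limits.
TIERS = [8_000, 32_000, 128_000, 1_000_000]


def smallest_window(input_tokens: int, reserved_output_tokens: int = 1_000) -> Optional[int]:
    """Return the smallest tier that holds the input plus reserved output, or None."""
    needed = input_tokens + reserved_output_tokens
    index = bisect_left(TIERS, needed)  # index of the first tier >= needed
    return TIERS[index] if index < len(TIERS) else None


print(smallest_window(5_000))      # fits the smallest tier
print(smallest_window(40_000))     # too big for 32K, so the 128K tier
print(smallest_window(2_000_000))  # None: no tier fits, retrieval is required
```

Reserving output tokens up front matters because the window is usually shared between the prompt and the response.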

Cost and Latency Scaling

Longer context is more expensive on every axis: token cost grows in proportion to input length, latency grows with it, and answer quality can get worse rather than better.

Cost: most providers charge per token for both input and output, so at a flat rate a 128K-token input costs 16x what an 8K input costs. For high-volume workloads, this adds up fast. Prompt caching helps (repeated prefixes are cheaper on subsequent calls), but you still pay full price for the first fill.
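The arithmetic behind the 16x figure can be sketched as follows. The price and the cache discount are placeholder assumptions rather than any provider's real rates, and the caching model is deliberately simplified (it ignores cache-write surcharges and expiry).

```python
def input_cost(tokens: int, price_per_million: float) -> float:
    """Cost of one call's input at a flat per-token rate."""
    return tokens / 1_000_000 * price_per_million


def cost_with_cache(tokens: int, price_per_million: float,
                    calls: int, cached_price_factor: float) -> float:
    """First call pays full price; later calls pay a discounted rate for the same prefix."""
    if calls <= 0:
        return 0.0
    first_fill = input_cost(tokens, price_per_million)
    return first_fill + (calls - 1) * first_fill * cached_price_factor


PRICE = 3.00         # hypothetical USD per million input tokens
CACHE_FACTOR = 0.10  # hypothetical: cached reads billed at 10% of the normal rate

small = input_cost(8_000, PRICE)
large = input_cost(128_000, PRICE)
print(f"8K input:   ${small:.3f}")
print(f"128K input: ${large:.3f} ({large / small:.0f}x)")
print(f"100 calls, no cache:   ${100 * large:.2f}")
print(f"100 calls, with cache: ${cost_with_cache(128_000, PRICE, 100, CACHE_FACTOR):.2f}")
```

Plugging in your own provider's rates and call volume is usually enough to tell whether caching alone closes the gap.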

Latency: time-to-first-token (TTFT) increases with context length because the model has to process the full input before it can generate anything. At 128K tokens, expect noticeably longer TTFT than at 8K. Time-per-output-token is less affected, but when the input is long and the output is short, input processing can still dominate the total wall-clock time.
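One way to see this effect in your own setup is to time a streamed response. The sketch below measures TTFT and total time for any iterable of text chunks; `fake_stream` is a stand-in that simulates input processing with a sleep, and in practice you would pass your client's streaming iterator instead.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[str]) -> Tuple[float, float, str]:
    """Return (ttft_seconds, total_seconds, text) for an iterable of text chunks."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for chunk in chunks:
        if ttft is None:
            ttft = time.perf_counter() - start  # first chunk has arrived
        parts.append(chunk)
    total = time.perf_counter() - start
    return (ttft if ttft is not None else total), total, "".join(parts)


def fake_stream(prefill_seconds: float, n_chunks: int,
                per_chunk_seconds: float) -> Iterator[str]:
    """Simulated model: a prefill delay, then evenly spaced output chunks."""
    time.sleep(prefill_seconds)
    for i in range(n_chunks):
        yield f"tok{i} "
        time.sleep(per_chunk_seconds)


ttft, total, text = measure_stream(fake_stream(0.05, 3, 0.01))
print(f"TTFT {ttft:.3f}s, total {total:.3f}s, text {text!r}")
```

Running the same measurement at several input sizes gives a latency curve for your actual model and provider.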

Attention quality: as covered in the Needle in a Haystack section, attention reliability degrades with length. More context doesn't always mean better answers.

When to Use Long Context vs RAG

This is the central architectural question. Two schools of thought:

"Just Stuff It All In"

Load everything into the context window and let the model figure it out.

Pros:

  • Dead simple. No retrieval pipeline, no embeddings, no vector store.
  • The model has full context — no risk of missing a relevant chunk.
  • Works well for tasks where the model needs to reason across the full document.

Cons:

  • Cost scales with document size, every call.
  • Attention degrades — the model may miss things in the middle.
  • Doesn't scale to document collections. Even 1M tokens isn't enough for a serious knowledge base.
  • Latency increases with every token you stuff in.

"Retrieve Smartly"

Use RAG or similar retrieval to select relevant chunks, then put only those in the context.

Pros:

  • Cost stays bounded regardless of corpus size.
  • Context is focused — less noise, better attention.
  • Scales to millions of documents.
  • Retrieval can be tuned and improved independently of the model.

Cons:

  • Retrieval can miss relevant information (recall failure).
  • More infrastructure: embeddings, vector store, chunking pipeline.
  • Cross-document reasoning is harder when the model only sees fragments.
  • Chunking decisions affect quality significantly.
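To make the retrieval step concrete, here is a toy sketch that ranks chunks against a query and keeps the top `k`. It uses bag-of-words cosine similarity as a stand-in for embeddings and a vector store, so it only matches exact words; the sample chunks and function names are made up for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(chunks, key=lambda c: cosine(q, Counter(tokenize(c))), reverse=True)
    return ranked[:k]


chunks = [
    "Refunds are issued within 14 days of a return request.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]
print(retrieve("How long do refunds take?", chunks, k=1))
```

Because "takes" and "take" do not match word for word, the shipping chunk scores zero here, which is a small example of the recall failure listed above.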

The Practical Middle Ground

The best production systems often use both:

  1. Retrieve first — use RAG to find the most relevant documents or chunks.
  2. Stuff generously — put the retrieved content into a large context window, keeping more context than minimal RAG would.
  3. Add full documents when they fit — if a retrieved chunk comes from a 10-page document, consider including the whole document rather than just the chunk.
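The three steps above might look like the following sketch, which takes retrieval hits as given and packs them into a token budget, upgrading a chunk to its full document when the document fits. `Chunk`, `approx_tokens`, and `build_context` are hypothetical names, and the four-characters-per-token estimate is a crude heuristic; use the model's tokenizer for real budgets.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float  # relevance score from whatever retriever you use


def approx_tokens(text: str) -> int:
    """Crude estimate: about 4 characters per token."""
    return max(1, len(text) // 4)


def build_context(hits: List[Chunk], documents: Dict[str, str],
                  budget_tokens: int) -> Tuple[List[str], int]:
    """Pack hits by relevance, preferring the full source document when it fits."""
    parts: List[str] = []
    used = 0
    full_docs_included = set()
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if hit.doc_id in full_docs_included:
            continue  # the whole document is already in the context
        full = documents[hit.doc_id]
        if used + approx_tokens(full) <= budget_tokens:
            parts.append(full)
            used += approx_tokens(full)
            full_docs_included.add(hit.doc_id)
        elif used + approx_tokens(hit.text) <= budget_tokens:
            parts.append(hit.text)  # fall back to the chunk alone
            used += approx_tokens(hit.text)
    return parts, used


documents = {"a": "x" * 400, "b": "y" * 4000}
hits = [Chunk("b", "y" * 200, 0.9), Chunk("a", "x" * 40, 0.8)]
parts, used = build_context(hits, documents, budget_tokens=200)
print(len(parts), used)
```

In this example document "b" is too large for the budget, so only its chunk is kept, while document "a" fits and is included whole.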

This gets you most of the recall of long context at close to the cost efficiency of retrieval.

Decision Framework

| Scenario | Recommendation |
|---|---|
| Single short document (<8K tokens) | Stuff it. Use the smallest window that fits. |
| Single long document (8K–128K) | Stuff it if reliability tests pass. Otherwise, chunk and Map-Reduce. |
| Multiple documents, one relevant | RAG to find it, then stuff the full document. |
| Multiple documents, all relevant | RAG + generous context. Stuff all if they fit; otherwise prioritize. |
| Entire knowledge base | RAG. Long context is not a replacement for retrieval at scale. |
| Ongoing conversation (chatbot) | Start with full history; switch to summarized + recent when it gets long. |
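The framework above can be encoded as a small lookup function, which is handy as a starting point for a routing layer. The thresholds are the ones in the framework; the `max_window` default, the handling of single documents larger than the window, and the cutoff for a "long" conversation are assumptions added here.

```python
def recommend(n_documents: int, total_tokens: int, all_relevant: bool = True,
              knowledge_base: bool = False, conversation: bool = False,
              max_window: int = 128_000) -> str:
    """Map a workload description to a recommendation from the framework."""
    if conversation:
        # Assumption: "long" means the history no longer fits the window.
        if total_tokens <= max_window:
            return "full history"
        return "summarized history + recent turns"
    if knowledge_base:
        return "RAG"
    if n_documents == 1:
        if total_tokens < 8_000:
            return "stuff, smallest window that fits"
        if total_tokens <= max_window:
            return "stuff if reliability tests pass, else chunk + map-reduce"
        return "chunk + map-reduce"  # assumption: beyond the window, stuffing is not an option
    if not all_relevant:
        return "RAG to find the document, then stuff it whole"
    if total_tokens <= max_window:
        return "stuff all"
    return "RAG + generous context, prioritized"


print(recommend(n_documents=1, total_tokens=5_000))
print(recommend(n_documents=40, total_tokens=900_000, all_relevant=False))
print(recommend(n_documents=0, total_tokens=0, knowledge_base=True))
```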

The Economics

A rough way to think about it: if you're calling the model more than ~100 times per day on the same corpus, the cost of a RAG pipeline (embedding, storage, retrieval) is almost certainly less than the cost of stuffing 128K tokens every call. For low-volume, high-value tasks — a lawyer reviewing one contract — stuffing is fine.
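A back-of-the-envelope break-even calculation makes this rule of thumb easy to check against your own numbers. Every figure below (token price, retrieved context size, fixed daily pipeline cost) is a placeholder assumption, so the break-even point it prints is illustrative only and will move with your inputs.

```python
import math


def daily_stuffing_cost(calls_per_day: int, context_tokens: int,
                        price_per_million: float) -> float:
    """Input cost per day when the full context is sent on every call."""
    return calls_per_day * context_tokens / 1_000_000 * price_per_million


def daily_rag_cost(calls_per_day: int, retrieved_tokens: int,
                   price_per_million: float, fixed_daily_cost: float) -> float:
    """Input cost per day with retrieval, plus a fixed pipeline cost."""
    return fixed_daily_cost + calls_per_day * retrieved_tokens / 1_000_000 * price_per_million


def break_even_calls(context_tokens: int, retrieved_tokens: int,
                     price_per_million: float, fixed_daily_cost: float) -> float:
    """Calls per day above which retrieval is cheaper than stuffing."""
    saved_per_call = (context_tokens - retrieved_tokens) / 1_000_000 * price_per_million
    if saved_per_call <= 0:
        return math.inf  # retrieval never pays for itself
    return fixed_daily_cost / saved_per_call


# Hypothetical inputs: $3 per million tokens, 128K stuffed vs 8K retrieved,
# and $10/day for embedding, storage, and retrieval infrastructure.
calls = break_even_calls(128_000, 8_000, 3.00, 10.00)
print(f"Retrieval is cheaper above roughly {calls:.0f} calls per day")
```

The sketch ignores output tokens and engineering time, both of which can shift the answer for a real system.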

Don't Optimize Prematurely

Start with the simplest approach that works:

  1. Try stuffing everything into a large window.
  2. Measure accuracy, cost, and latency.
  3. If accuracy is fine but cost is too high — add retrieval.
  4. If accuracy is poor — check if it's an attention problem (try restructuring the context) or a retrieval problem (the model needs different information than what you're providing).
  5. If latency is too high — reduce context size via retrieval or compression.
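The measurement step can be as small as the harness sketched below, which runs a list of test questions and reports accuracy, mean latency, and input cost. `answer_fn` is a hypothetical callable wrapping your model call that returns the answer and the number of input tokens used; substring matching is a crude accuracy proxy, and the price is a placeholder.

```python
import time
from typing import Callable, Dict, List, Tuple


def evaluate(answer_fn: Callable[[str], Tuple[str, int]],
             cases: List[Tuple[str, str]],
             price_per_million: float) -> Dict[str, float]:
    """Run each (question, expected) case and summarize accuracy, latency, cost."""
    if not cases:
        raise ValueError("need at least one test case")
    correct, tokens, elapsed = 0, 0, 0.0
    for question, expected in cases:
        start = time.perf_counter()
        answer, input_tokens = answer_fn(question)
        elapsed += time.perf_counter() - start
        correct += int(expected.lower() in answer.lower())  # crude correctness check
        tokens += input_tokens
    return {
        "accuracy": correct / len(cases),
        "mean_latency_s": elapsed / len(cases),
        "input_cost": tokens / 1_000_000 * price_per_million,
    }


def fake_answer(question: str) -> Tuple[str, int]:
    """Stand-in for a real model call; always gives the same answer."""
    return "Paris is the capital of France.", 1_000


cases = [("What is the capital of France?", "Paris"),
         ("What is the capital of Spain?", "Madrid")]
print(evaluate(fake_answer, cases, price_per_million=3.00))
```

Running the same cases against a stuffing variant and a retrieval variant of `answer_fn` gives directly comparable numbers for steps 3 through 5.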

The worst outcome is building a complex RAG pipeline when stuffing the context would have worked perfectly.
