Processing Patterns

Map-Reduce, Refine, Stuff, and MapRerank — the core patterns for processing long documents with LLMs

You have a 200-page contract and need to extract every obligation. You can't just dump it in the prompt and pray — or maybe you can, depending on the model and the window. The right processing pattern depends on the document size, the task, and your tolerance for cost versus accuracy.

The Four Patterns

Stuff

The simplest: concatenate everything into the context and ask your question. No chunking, no orchestration.

When it works — the document fits comfortably in the window with room for the system prompt, instructions, and output.
When it breaks — attention degrades over very long contexts, the model misses details buried in the middle, or you hit the token limit.
The appeal — zero complexity. One call. If it works, ship it.

Try Stuff first whenever the document fits. Move to other patterns only when you have evidence it's failing.
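A minimal sketch of the "does it fit comfortably" check, followed by the single Stuff call. The names `fits_in_window` and `stuff` are hypothetical, and so is the `llm` callable (prompt in, text out), which stands in for whatever client you use. The 4-characters-per-token ratio and the 80% safety margin are rough assumptions, not properties of any particular model; use the model's real tokenizer when you have one.

```python
from typing import Callable


def fits_in_window(
    document: str,
    instructions: str,
    *,
    context_window: int,
    reserved_output: int,
    chars_per_token: float = 4.0,  # assumption: crude heuristic, not a real tokenizer
    safety_margin: float = 0.8,    # assumption: use only 80% of the nominal window
) -> bool:
    """Estimate whether document + instructions fit with room left for the output."""
    estimated_prompt_tokens = (len(document) + len(instructions)) / chars_per_token
    budget = context_window * safety_margin - reserved_output
    return estimated_prompt_tokens <= budget


def stuff(document: str, question: str, llm: Callable[[str], str]) -> str:
    """Stuff: one prompt, one call, no chunking."""
    return llm(f"{question}\n\n{document}")
```

If `fits_in_window` returns False, or the single call demonstrably misses details, that is the evidence for moving to one of the chunked patterns.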

Map-Reduce

Split the document into chunks, process each chunk independently (the "map"), then combine the results (the "reduce").

1. Chunk the document into manageable pieces.
2. Run each chunk through the model with the same prompt.
3. Collect all intermediate outputs.
4. Run a final "reduce" call that synthesizes the chunk outputs into a single answer.

Best for — summarization, extraction, any task where chunk-level answers can be meaningfully merged.
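Tradeoff — the map calls can run in parallel, so it is fast, but each chunk is processed with no awareness of the others. Cross-chunk dependencies get lost.
Cost — N+1 calls instead of 1 (N map calls plus one reduce call). Each call is smaller and cheaper, but total token usage is usually higher because the instructions are repeated for every chunk and the intermediate outputs are read again in the reduce step.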
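The Map-Reduce steps can be sketched as follows. This is an illustration under assumptions, not a reference implementation: `map_reduce` is a hypothetical name, `llm` is a hypothetical callable (prompt in, text out) that you supply, and the prompt layout is made up for the example. The map step uses a thread pool because LLM calls are typically network-bound.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence


def map_reduce(
    chunks: Sequence[str],
    map_prompt: str,
    reduce_prompt: str,
    llm: Callable[[str], str],  # hypothetical: wraps your model client
    max_workers: int = 4,
) -> str:
    # Map: run the same prompt over every chunk independently.
    # Executor.map returns results in input order, so chunk order is preserved.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(lambda chunk: llm(f"{map_prompt}\n\n{chunk}"), chunks))

    # Reduce: one final call that sees only the intermediate outputs.
    joined = "\n\n".join(
        f"[chunk {i + 1}]\n{partial}" for i, partial in enumerate(partials)
    )
    return llm(f"{reduce_prompt}\n\n{joined}")
```

With N chunks this makes N+1 calls. If the joined intermediate outputs are themselves too large for one reduce call, the reduce step has to be applied in stages.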

Refine

Process chunks sequentially, carrying forward a running answer.

1. Process the first chunk, produce an initial answer.
2. Feed the next chunk plus the current answer to the model, ask it to update.
3. Repeat until all chunks are processed.

Best for — tasks where later content should update or refine earlier conclusions. Contract analysis, sequential narrative summarization.
Tradeoff — strictly sequential, so no parallelism. Slower. But the model always has the running context of what it's found so far.
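Risk — the running answer can drift or become stale as the refine chain gets long, because every step rewrites it and early details can be dropped along the way. You may need to re-anchor it, for example by restating the original instructions in each update prompt.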
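A sketch of the Refine loop, under the same assumptions as before: `refine` is a hypothetical name, `llm` is a hypothetical prompt-in, text-out callable, and the prompt wording is illustrative only. The question is repeated in every update prompt as one simple way to keep the running answer anchored.

```python
from typing import Callable, Sequence


def refine(chunks: Sequence[str], question: str, llm: Callable[[str], str]) -> str:
    if not chunks:
        raise ValueError("refine needs at least one chunk")

    # Initial answer from the first chunk alone.
    answer = llm(f"{question}\n\nText:\n{chunks[0]}")

    # Each later chunk is processed together with the running answer.
    # The calls are strictly sequential: step k needs the output of step k-1.
    for chunk in chunks[1:]:
        answer = llm(
            f"{question}\n\n"
            f"Current answer:\n{answer}\n\n"
            f"New text:\n{chunk}\n\n"
            "Update the current answer using the new text. "
            "Keep everything that is still correct."
        )
    return answer
```

This makes one call per chunk, and none of them can run in parallel.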

MapRerank

Map each chunk, then rank the outputs by confidence and pick the best one.

1. Process each chunk independently with the same question.
2. Ask the model to score its own confidence in each answer.
3. Return the highest-confidence answer.

Best for — question-answering over documents where the answer lives in one specific chunk. "Where is the termination clause?" rather than "summarize the contract."
Tradeoff — ignores chunks that don't contain the answer, which is a feature for lookup-style tasks and a bug for holistic ones.
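One way to sketch MapRerank is to have the model end each answer with a score line and parse it. The `SCORE: <0-100>` convention, the function names, and the `llm` callable are all assumptions made for this example; the score is the model's own self-report and should not be read as a calibrated probability.

```python
import re
from typing import Callable, Sequence, Tuple

# Assumed output convention: the answer, then a final line "SCORE: <0-100>".
SCORE_RE = re.compile(r"SCORE:\s*(\d{1,3})\s*$")


def parse_scored_answer(raw: str) -> Tuple[str, int]:
    text = raw.strip()
    match = SCORE_RE.search(text)
    if match is None:
        return text, 0  # unparseable output is treated as lowest confidence
    score = min(int(match.group(1)), 100)
    return text[: match.start()].strip(), score


def map_rerank(
    chunks: Sequence[str], question: str, llm: Callable[[str], str]
) -> Tuple[str, int]:
    if not chunks:
        raise ValueError("map_rerank needs at least one chunk")
    prompt = (
        f"{question}\n\n"
        "Answer using only the text below. "
        "End with a line 'SCORE: <0-100>' giving your confidence.\n\n"
    )
    # Map: ask the same question of every chunk, then keep the top-scored answer.
    scored = [parse_scored_answer(llm(prompt + chunk)) for chunk in chunks]
    return max(scored, key=lambda pair: pair[1])
```

The map step is written sequentially for brevity. The chunk calls are independent, so they can be parallelized the same way as in Map-Reduce.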

Choosing a Pattern

Situation	First choice	Fallback
Fits in window	Stuff	—
Summarization	Map-Reduce	Refine
Sequential analysis	Refine	Map-Reduce
Point lookup / QA	MapRerank	Stuff
Extraction across full doc	Map-Reduce	Refine
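The table can be encoded as a small lookup. The function name and task keys below are hypothetical, and giving the "fits in window" row precedence over the task rows is an assumption consistent with the advice to try Stuff first.

```python
from typing import Dict, Optional, Tuple

# (first choice, fallback) per task, copied from the table above.
PATTERNS: Dict[str, Tuple[str, Optional[str]]] = {
    "summarization": ("Map-Reduce", "Refine"),
    "sequential_analysis": ("Refine", "Map-Reduce"),
    "point_lookup": ("MapRerank", "Stuff"),
    "full_doc_extraction": ("Map-Reduce", "Refine"),
}


def choose_pattern(task: str, fits_in_window: bool) -> Tuple[str, Optional[str]]:
    # Assumption: if the document fits, Stuff wins regardless of task type.
    if fits_in_window:
        return ("Stuff", None)
    return PATTERNS[task]  # raises KeyError for tasks the table does not cover
```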

Chunking Matters

Every non-Stuff pattern depends on how you chunk. Bad chunking breaks all of them.

Respect document structure — split on section headers, paragraph breaks, or page boundaries rather than arbitrary token counts.
Overlap chunks — an overlap of roughly 10-20% of the chunk length reduces the risk of losing information that straddles a boundary.
Keep chunks meaningful — a chunk should be self-contained enough for the model to answer the prompt without needing the previous chunk.
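A sketch of structure-aware chunking with overlap, assuming paragraphs are separated by blank lines. It packs whole paragraphs into chunks and carries trailing paragraphs into the next chunk as overlap. Sizes are measured in characters for simplicity; `chunk_by_paragraph` and its defaults are hypothetical, and `max_chars` is a soft limit, since a single oversized paragraph, or carried overlap plus a long paragraph, can exceed it.

```python
import re
from typing import List


def chunk_by_paragraph(
    text: str, max_chars: int = 2000, overlap_ratio: float = 0.15
) -> List[str]:
    # Split on blank lines so chunk boundaries follow paragraph boundaries.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current: List[str] = []

    for para in paragraphs:
        # Length of the chunk if this paragraph were added ("\n\n" joiners included).
        candidate_len = sum(len(p) for p in current) + len(para) + 2 * len(current)
        if current and candidate_len > max_chars:
            chunks.append("\n\n".join(current))
            # Carry whole trailing paragraphs forward, up to the overlap budget.
            budget = int(max_chars * overlap_ratio)
            carried: List[str] = []
            for prev in reversed(current):
                if len(prev) > budget:
                    break
                carried.insert(0, prev)
                budget -= len(prev)
            current = carried
        current.append(para)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on section headers or page boundaries follows the same shape: swap the `re.split` pattern for one that matches the document's own structure.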

Chunked Summarization Chains

A common production pattern: hierarchical summarization with Map-Reduce.

Level 1 — summarize each chunk into a paragraph.
Level 2 — group the paragraph summaries and summarize again.
Repeat until you have a single summary.

This scales to arbitrarily long documents but introduces compression loss at each level. The final summary will lose detail. If detail matters, keep the intermediate summaries accessible and let the user drill down.
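A sketch of the hierarchical chain that keeps every level, so the intermediate summaries stay available for drill-down. As elsewhere, `llm` is a hypothetical prompt-in, text-out callable, and `hierarchical_summarize`, `group_size`, and the prompt wording are assumptions for illustration.

```python
from typing import Callable, List, Sequence


def hierarchical_summarize(
    chunks: Sequence[str], llm: Callable[[str], str], group_size: int = 4
) -> List[List[str]]:
    """Return all summary levels; the final summary is levels[-1][0]."""
    if not chunks:
        raise ValueError("need at least one chunk")
    if group_size < 2:
        raise ValueError("group_size must be >= 2 or the levels never shrink")

    # Level 1: one paragraph summary per chunk.
    levels: List[List[str]] = [
        [llm(f"Summarize in one paragraph:\n\n{chunk}") for chunk in chunks]
    ]

    # Higher levels: group the previous level's summaries and summarize each group.
    while len(levels[-1]) > 1:
        previous = levels[-1]
        groups = [
            previous[i : i + group_size] for i in range(0, len(previous), group_size)
        ]
        levels.append(
            [
                llm("Summarize these summaries in one paragraph:\n\n" + "\n\n".join(group))
                for group in groups
            ]
        )
    return levels
```

Each level shrinks the list by about a factor of `group_size`, so the number of levels grows only logarithmically with document length. Each level is also another round of compression loss.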

The Trend

Context windows keep growing. Patterns that exist because documents didn't fit — Map-Reduce, Refine — become less necessary as windows expand. But they don't disappear, because even with a 1M-token window, attention reliability, cost, and latency still make chunked processing the right call for many workloads.
