Needle in a Haystack

Attention degradation, position bias, and practical mitigations for long-context reliability

The "Needle in a Haystack" (NIAH) test is simple: hide a specific fact somewhere in a long context, then ask about it. It reveals something uncomfortable — models don't attend to all parts of their context equally, and the longer the context, the worse it gets.

The Problem

NIAH tests show that models reliably find information placed at the beginning or end of the context but struggle with material in the middle. This isn't a bug in one model: the pattern recurs across model families, which suggests it stems from how transformer models are trained and how they use positional information.

Key findings from the research:

Lost in the middle — retrieval accuracy drops significantly for information placed roughly 30% to 70% of the way through the context. This was first demonstrated by Liu et al. (2023) and has been replicated across model families.
Degradation scales with length — as an illustration, a model that's 99% accurate on NIAH at 8K tokens might be 85% at 128K. The curve varies by model, but the downward trend is widely observed.
Task complexity amplifies the effect — simple fact retrieval degrades less than reasoning over retrieved facts. If the model needs to find a fact AND use it in a multi-step argument, the reliability drop is steeper.

Position Bias in Practice

This isn't just a benchmark curiosity. It shows up in real workloads:

Document QA — the answer to a question buried in the middle of a 50-page document gets missed more often than one near the top or bottom.
Multi-document synthesis — when you concatenate multiple documents, information in the middle documents gets less attention.
Code review — important changes in the middle of a large diff are more likely to be overlooked.

Practical Mitigations

Strategic Information Placement

If you control where information appears in the context:

Put critical content first — system prompts, key instructions, and the most important context should be at the top.
Repeat key information — state important facts both at the beginning and near the end.
Recency matters — the model attends strongly to the end of the context. Put the user's question last, and consider restating key constraints right before it.
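
The placement rules above can be sketched as a small prompt builder. This is a minimal illustration, not a prescribed format: the function name `build_prompt`, its parameters, and the "Key constraints" wording are all assumptions made for the example.

```python
def build_prompt(instructions, context_chunks, question, key_constraints):
    """Assemble a prompt: critical content first, question last.

    All names and the layout are hypothetical; adapt them to your own
    prompt format.
    """
    parts = [instructions.strip()]

    # State the key constraints once near the top...
    constraint_lines = "\n".join(f"- {c}" for c in key_constraints)
    if key_constraints:
        parts.append("Key constraints:\n" + constraint_lines)

    # ...then the bulk context, which lands in the middle of the window.
    parts.extend(chunk.strip() for chunk in context_chunks)

    # Restate the constraints right before the question, so they sit
    # near the end of the context as well.
    if key_constraints:
        parts.append("Reminder of key constraints:\n" + constraint_lines)

    # The user's question goes last.
    parts.append("Question: " + question.strip())
    return "\n\n".join(parts)
```

The repetition costs a few extra tokens per request; whether that trade is worthwhile depends on how often constraints are being missed in your workload.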

Retrieval Augmentation

Don't rely on the model to find the needle — retrieve it yourself.

Hybrid approach — use RAG to find relevant chunks, then place those chunks strategically in the context rather than dumping the entire document.
Two-stage retrieval — first retrieve broadly, then use a second pass to re-rank and select the most relevant pieces.
Chunk-and-cite — break the document into labeled chunks, retrieve the relevant ones, and ask the model to cite which chunk it's drawing from. This forces active engagement with the source.
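
As a rough sketch of how these pieces fit together, the example below chains a broad first pass, a re-ranking pass, and a chunk-and-cite prompt. The scoring here is plain word overlap, used only as a stand-in so the example runs without dependencies; a real system would more likely use an embedding or BM25 retriever and a dedicated re-ranker. All function names and the chunk label format are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def split_into_chunks(document, chunk_words=50):
    """Break a document into labeled chunks of about chunk_words words."""
    words = document.split()
    return {
        f"chunk-{i // chunk_words + 1}": " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    }

def broad_retrieve(chunks, query, top_k=10):
    """Stage 1 (toy): keep chunks sharing any query term, best overlap first."""
    query_terms = set(tokenize(query))
    scored = []
    for chunk_id, text in chunks.items():
        overlap = len(query_terms & set(tokenize(text)))
        if overlap > 0:
            scored.append((overlap, chunk_id))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep document order
    return [chunk_id for _, chunk_id in scored[:top_k]]

def rerank(chunks, candidates, query, top_n=3):
    """Stage 2 (toy): re-score candidates by query-term density."""
    query_terms = set(tokenize(query))

    def density(chunk_id):
        tokens = tokenize(chunks[chunk_id])
        counts = Counter(tokens)
        return sum(counts[t] for t in query_terms) / max(len(tokens), 1)

    return sorted(candidates, key=density, reverse=True)[:top_n]

def build_cited_prompt(chunks, selected, question):
    """Chunk-and-cite: show labeled chunks and ask for labels in the answer."""
    blocks = [f"[{chunk_id}]\n{chunks[chunk_id]}" for chunk_id in selected]
    return (
        "Answer using only the chunks below and cite the chunk label "
        "for each claim.\n\n"
        + "\n\n".join(blocks)
        + f"\n\nQuestion: {question}"
    )
```

Because only the selected chunks reach the model, the context stays short and the relevant material never ends up deep in the middle of a long window.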

Structural Cues

Help the model navigate long contexts:

Section headers and numbering — add explicit structure markers that the model can reference.
Table of contents — prepend a TOC at the start of long documents so the model knows what's where.
XML or markdown tags — wrap distinct sections in tags. Clear boundaries give the model explicit anchors to refer to when locating content.
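
One possible way to combine these cues is sketched below: a table of contents is prepended, and each section is numbered and wrapped in a tag. The `<section id="...">` tag name and the function `format_with_structure` are illustrative choices, not a required schema.

```python
def format_with_structure(sections):
    """Format (title, text) pairs with a TOC, numbering, and section tags.

    The tag name and layout are assumptions for illustration.
    """
    toc_lines = ["Table of contents:"]
    body_blocks = []
    for number, (title, text) in enumerate(sections, start=1):
        heading = f"{number}. {title}"
        toc_lines.append(heading)
        # Each section carries the same number in the TOC, the tag id,
        # and the heading, so the model can cross-reference them.
        body_blocks.append(
            f'<section id="{number}">\n{heading}\n{text.strip()}\n</section>'
        )
    return "\n".join(toc_lines) + "\n\n" + "\n\n".join(body_blocks)
```

A question can then refer to "section 3" directly, and an answer can be asked to name the section id it relied on.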

Redundancy and Verification

Ask twice, different ways — query for the same information with different phrasings and compare answers.
Multi-pass extraction — split the document into overlapping windows and run each separately, then merge results. Brute force but effective.
Self-verification — ask the model to quote the exact passage it's basing its answer on. If it can't, it may be hallucinating rather than retrieving.
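
Two of these checks are mechanical enough to sketch in code: splitting a token sequence into overlapping windows, and confirming that a quoted passage really appears in the source. The function names are hypothetical, and the quote check assumes that exact matching after whitespace normalization is strict enough for your content; looser matching may be needed if the model paraphrases punctuation.

```python
def overlapping_windows(tokens, window_size, overlap):
    """Split tokens into windows of window_size that share overlap items."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    if not 0 <= overlap < window_size:
        raise ValueError("overlap must be in [0, window_size)")
    step = window_size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + window_size])
        if start + window_size >= len(tokens):
            break  # the last window already reaches the end
    return windows

def normalize_whitespace(text):
    return " ".join(text.split())

def quote_is_supported(quote, source):
    """Return True if the quoted passage appears verbatim in the source,
    ignoring differences in whitespace and line breaks."""
    normalized_quote = normalize_whitespace(quote)
    return bool(normalized_quote) and normalized_quote in normalize_whitespace(source)
```

A quote that fails this check is not proof of a wrong answer, but it is a cheap signal that the answer deserves a second look.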

Benchmarking Your Own Model

Don't trust vendor NIAH benchmarks blindly. Run your own:

Create test cases — hide known facts at different positions (10%, 25%, 50%, 75%, 90%) in contexts of different lengths.
Vary the task — test simple retrieval, multi-hop reasoning, and extraction-with-reasoning.
Measure at your actual context lengths — if you're using 32K tokens in production, benchmark at 32K, not just at the vendor's best-case length.
Test with your actual content type — performance varies between code, legal text, conversation transcripts, and technical documentation.
Track over time — model updates can change the reliability curve. Re-run after provider updates.

The result is a heatmap with needle position on one axis, context length on the other, and accuracy as the value in each cell. This is your ground truth for deciding how much to trust the model's attention at different window sizes.
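
A minimal harness for this kind of benchmark might look like the sketch below. It builds test cases by inserting a needle sentence at a given depth and aggregates graded results into a grid of context length by position. Everything here is an assumption for illustration: the names, the use of word counts as a rough proxy for tokens, and the idea that each model answer has already been graded as correct or incorrect elsewhere.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class NiahCase:
    context: str
    question: str
    expected: str
    depth: float       # 0.0 = start of context, 1.0 = end
    length_words: int  # word count, used as a rough proxy for tokens

def build_haystack(filler_sentences, needle, depth, target_words):
    """Repeat filler up to target_words, then insert the needle at depth."""
    if not any(s.split() for s in filler_sentences):
        raise ValueError("filler_sentences must contain some text")
    sentences, word_count, i = [], 0, 0
    while word_count < target_words:
        sentence = filler_sentences[i % len(filler_sentences)]
        sentences.append(sentence)
        word_count += len(sentence.split())
        i += 1
    position = round(depth * len(sentences))  # insert at a sentence boundary
    return " ".join(sentences[:position] + [needle] + sentences[position:])

def build_cases(filler_sentences, needle, question, expected,
                depths=(0.1, 0.25, 0.5, 0.75, 0.9), lengths=(1000, 4000)):
    return [
        NiahCase(build_haystack(filler_sentences, needle, d, n),
                 question, expected, d, n)
        for n in lengths for d in depths
    ]

def accuracy_grid(results):
    """Aggregate (length, depth, correct) triples into accuracy per cell."""
    totals = defaultdict(lambda: [0, 0])  # (length, depth) -> [hits, runs]
    for length, depth, correct in results:
        totals[(length, depth)][0] += int(correct)
        totals[(length, depth)][1] += 1
    return {key: hits / runs for key, (hits, runs) in totals.items()}

def render_grid(grid):
    """Render the grid as text: one row per length, one column per depth."""
    lengths = sorted({length for length, _ in grid})
    depths = sorted({depth for _, depth in grid})
    rows = ["length".rjust(8) + "".join(f"{d:>7.0%}" for d in depths)]
    for length in lengths:
        cells = []
        for d in depths:
            value = grid.get((length, d))
            cells.append("     --" if value is None else f"{value:>7.0%}")
        rows.append(f"{length:>8}" + "".join(cells))
    return "\n".join(rows)
```

Running several needles per cell, rather than one, keeps a single lucky or unlucky retrieval from dominating a cell's accuracy.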

The Honest Assessment

Long context is real and useful, but it's not magic. Models don't read 128K tokens the way a human reads a book — they attend to some parts more than others, and the middle gets shortchanged. Build your systems knowing this. Use long context for coverage, but don't bet correctness on a single fact buried at token 60,000.
