Chunking Strategies

Chunking is the most underrated stage in a RAG pipeline. You can have the best embedding model and the fastest vector DB, but if your chunks are garbage, your retrieval is garbage. The chunk is the atomic unit of retrieval — get it wrong and the model either gets too little context or too much noise.

The Core Tradeoff

Small chunks give precise retrieval but lose surrounding context. Large chunks preserve context but dilute the embedding signal and may exceed what the model can usefully attend to. There is no universal right size — it depends on your documents, your queries, and your model's context window.

Rules of thumb:

256–512 tokens for factoid / Q&A retrieval
512–1024 tokens for longer-form synthesis tasks
Always measure recall on your actual query distribution before committing

Strategies

Fixed-size chunking

Split every N tokens (or characters) with optional overlap. Simple, predictable, works surprisingly well as a baseline.

When to use: you need a baseline fast, or your documents are homogeneous (e.g., plain-text logs).
Weakness: splits mid-sentence, mid-paragraph, mid-thought. The boundaries are meaningless.
Overlap: 10–20% overlap between chunks reduces the chance of splitting critical info across boundaries.
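
A minimal sketch of fixed-size chunking with overlap might look like the following. The function name and defaults are illustrative, and whitespace splitting stands in for a real tokenizer; in practice you would count tokens with the tokenizer that matches your embedding model.

```python
def fixed_size_chunks(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens.

    Each window repeats the last `overlap` tokens of the previous one.
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the input
    return chunks


# Assumption: whitespace "tokens" stand in for real model tokens.
tokens = "the quick brown fox jumps over the lazy dog again".split()
for chunk in fixed_size_chunks(tokens, size=4, overlap=1):
    print(" ".join(chunk))
```

With size=4 and overlap=1, each chunk starts with the final token of the one before it, which is the boundary redundancy the overlap is meant to provide.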

Recursive / hierarchical chunking

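Split by the largest structural unit first (double newline, heading, section break), then recursively split any chunk that's still too large using progressively finer separators. LangChain's RecursiveCharacterTextSplitter popularized this; by default it tries paragraph breaks, then line breaks, then spaces, then individual characters.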

When to use: general-purpose documents with natural structure (articles, docs, code).
Strength: respects paragraph and section boundaries when possible.
Weakness: still falls back to arbitrary splitting for long paragraphs.
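
The idea can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: the function name, the separator list, and the use of character length instead of token count are all assumptions made to keep the example self-contained.

```python
def recursive_split(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    if len(text) <= max_len:
        return [text] if text.strip() else []
    if not separators:
        # No structure left to exploit: fall back to an arbitrary hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = buffer + sep + piece if buffer else piece
        if len(candidate) <= max_len:
            buffer = candidate  # keep merging neighbouring small pieces
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= max_len:
            buffer = piece
        else:
            # This piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks


doc = "Intro paragraph.\n\n" + "word " * 100 + "\n\nClosing paragraph."
for chunk in recursive_split(doc, max_len=120):
    print(len(chunk), repr(chunk[:30]))
```

Short paragraphs survive intact, while the long middle paragraph is broken at word boundaries, which illustrates the fallback behaviour named as the weakness above.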

Semantic chunking

Use an embedding model to detect topic shifts within the document: embed each sentence (or a small window of sentences), then split wherever the cosine similarity between consecutive embeddings drops below a threshold.

When to use: documents where topic boundaries matter (research papers, meeting transcripts, mixed-topic pages).
Strength: chunks are topically coherent.
Weakness: slower (an extra embedding pass over every sentence at ingest time, on top of embedding the final chunks), sensitive to threshold tuning, can produce wildly uneven chunk sizes.
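
A rough sketch of the threshold-based split is shown below. To keep it runnable without a model download, `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the 0.2 threshold is an arbitrary illustrative value; with real embeddings the threshold would need tuning on your own documents.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk wherever adjacent sentences fall below `threshold`."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = [
    "The cat sat on the mat.",
    "The cat chased the mouse.",
    "Interest rates rose sharply.",
    "Rates rose again in March.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

Swapping `toy_embed` for a call to an actual embedding model is the only change needed to try this on real text, though chunk sizes are unbounded here, so a production version would likely add a maximum-length guard.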

Document-structure-aware chunking

Parse the document's actual structure — Markdown headings, HTML tags, PDF layout, table boundaries — and chunk along those boundaries.

When to use: structured documents (technical docs, legal contracts, financial reports).
Strength: chunks align with the author's intended information units.
Weakness: requires per-format parsers. Falls apart on poorly structured documents.
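
For Markdown, a structure-aware splitter can be as small as the sketch below. It is an illustration under simplifying assumptions: only ATX-style `#` headings are recognized, and heading-like lines inside fenced code blocks are not handled. The function name and the returned dictionary shape are hypothetical.

```python
import re

# Matches ATX headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(text):
    """Return one chunk per heading, each carrying its section title."""
    sections, title, body = [], None, []

    def flush():
        # Emit the section collected so far, skipping empty preambles.
        if title is not None or any(line.strip() for line in body):
            sections.append({"title": title, "text": "\n".join(body).strip()})

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    flush()
    return sections


doc = "# Install\nRun the installer.\n\n## Options\nUse --fast.\n"
for section in split_markdown_by_heading(doc):
    print(section)
```

Keeping the title alongside the text means each chunk arrives with the metadata needed for filtered retrieval, at the cost of writing a separate parser like this for every input format.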

Overlap Strategies

Overlap creates redundancy at chunk boundaries so that information straddling a boundary still appears intact in at least one chunk.

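Token overlap: repeat the last N tokens of chunk k at the start of chunk k+1. Simple, effective.
Sentence overlap: repeat the last 1–2 sentences. More semantically meaningful, since the repeated text is a complete thought rather than a fragment.
Sliding window: chunks are really windows that overlap by 50% or more. Highest boundary coverage, but at 50% overlap it roughly doubles the storage and indexing cost, and the cost keeps growing as overlap increases.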

Sweet spot for most use cases: 10–15% overlap by tokens, or 1 sentence overlap.
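
The cost side of overlap is simple arithmetic: with chunk size S and overlap fraction r, each new chunk advances by S(1 - r) tokens, so the chunk count grows by roughly 1/(1 - r). The sketch below works this through for a hypothetical 100,000-token document; the function name and the example numbers are illustrative.

```python
import math


def chunk_count(total_tokens, size, overlap_fraction):
    """Number of fixed-size chunks needed to cover a document."""
    overlap = round(size * overlap_fraction)
    stride = size - overlap  # how far each new chunk advances
    if stride <= 0:
        raise ValueError("overlap_fraction must be below 1.0")
    if total_tokens <= size:
        return 1
    return math.ceil((total_tokens - size) / stride) + 1


baseline = chunk_count(100_000, 500, 0.0)
for fraction in (0.0, 0.10, 0.50):
    count = chunk_count(100_000, 500, fraction)
    print(f"{fraction:.0%} overlap: {count} chunks ({count / baseline:.2f}x)")
```

At 10% overlap the index grows by about a tenth, while a 50% sliding window comes out at close to twice the baseline.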

Practical Patterns

Chunk + parent reference — store small chunks for retrieval, but attach a pointer to the parent section. At generation time, expand to the parent for more context. Best of both worlds.
Multi-scale indexing — index the same document at 2–3 chunk sizes. Retrieve across all scales and deduplicate. Expensive but robust.
Metadata enrichment — attach section title, document title, date, and source to each chunk's metadata. Enables filtered retrieval and helps the model ground its answers.
Chunk quality filtering — after chunking, drop chunks that are too short (<50 tokens), pure boilerplate, or duplicates. Noise in, noise out.
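
Two of these patterns, parent references and quality filtering, can be sketched as follows. The `Chunk` class, the function names, and the in-memory `parents` dictionary are hypothetical; a real pipeline would keep parents in a document store, and whitespace word counts only approximate token counts.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    parent_id: str  # pointer to the section this chunk was cut from


def expand_to_parents(hits, parents):
    """Swap retrieved chunks for their parent sections, without duplicates."""
    seen, context = set(), []
    for chunk in hits:  # hits are assumed to be ordered by relevance
        if chunk.parent_id not in seen:
            seen.add(chunk.parent_id)
            context.append(parents[chunk.parent_id])
    return context


def filter_chunks(chunks, min_tokens=50):
    """Drop chunks that are too short or exact duplicates after normalization."""
    seen, kept = set(), []
    for chunk in chunks:
        normalized = " ".join(chunk.text.split()).lower()
        if len(normalized.split()) < min_tokens or normalized in seen:
            continue
        seen.add(normalized)
        kept.append(chunk)
    return kept


parents = {"sec-1": "Full text of section 1.", "sec-2": "Full text of section 2."}
hits = [Chunk("a", "sec-2"), Chunk("b", "sec-1"), Chunk("c", "sec-2")]
print(expand_to_parents(hits, parents))
```

Deduplicating by parent matters because several small chunks from the same section often rank together, and passing the section to the model twice wastes context.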

Don't Overthink It

Start with recursive chunking at 512 tokens and 10% overlap. Measure retrieval recall on 50–100 representative queries. Only get fancier if recall is clearly the bottleneck. In most pipelines, improving the embedding model or adding reranking will move the needle more than exotic chunking.
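
Measuring recall needs only a small labeled set. The sketch below assumes you have, for each query, the set of chunk IDs that should be retrieved and the ranked list your retriever returned; the function name and data shapes are illustrative. Recall here is the fraction of relevant chunks found in the top k, averaged over queries, which is one of several definitions in use.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Mean fraction of relevant chunk IDs found in each query's top-k results.

    retrieved: dict mapping query -> ranked list of chunk IDs
    relevant:  dict mapping query -> set of chunk IDs judged relevant
    """
    scores = []
    for query, gold in relevant.items():
        if not gold:
            continue  # queries with no labeled answer cannot be scored
        top_k = set(retrieved.get(query, [])[:k])
        scores.append(len(top_k & gold) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0


relevant = {"q1": {"c1", "c2"}, "q2": {"c9"}}
retrieved = {"q1": ["c1", "c7", "c2"], "q2": ["c3", "c4", "c9"]}
print(recall_at_k(retrieved, relevant, k=2))
print(recall_at_k(retrieved, relevant, k=3))
```

Running the same query set against each chunking configuration gives a like-for-like comparison, provided the relevance labels are mapped to source passages rather than to chunk IDs that change when the chunking does.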
