Chunking Strategies
How you split documents determines how well you retrieve them
Chunking is the most underrated stage in a RAG pipeline. You can have the best embedding model and the fastest vector DB, but if your chunks are garbage, your retrieval is garbage. The chunk is the atomic unit of retrieval — get it wrong and the model either gets too little context or too much noise.
The Core Tradeoff
Small chunks give precise retrieval but lose surrounding context. Large chunks preserve context but dilute the embedding signal and may exceed what the model can usefully attend to. There is no universal right size — it depends on your documents, your queries, and your model's context window.
Rules of thumb:
- 256–512 tokens for factoid / Q&A retrieval
- 512–1024 tokens for longer-form synthesis tasks
- Always measure recall on your actual query distribution before committing
Strategies
Fixed-size chunking
Split every N tokens (or characters) with optional overlap. Simple, predictable, works surprisingly well as a baseline.
- When to use: you need a baseline fast, or your documents are homogeneous (e.g., plain-text logs).
- Weakness: splits mid-sentence, mid-paragraph, mid-thought. The boundaries are meaningless.
- Overlap: 10–20% overlap between chunks reduces the chance of splitting critical info across boundaries.
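A minimal sketch of fixed-size chunking with overlap, assuming the document is already tokenized into a list. The function name `fixed_size_chunks` and the whitespace "tokens" in the usage lines are illustrative stand-ins; in a real pipeline you would pass the output of your embedding model's tokenizer.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Usage: whitespace "tokens" stand in for real tokenizer output (an assumption).
words = "one two three four five six seven eight nine ten".split()
for chunk in fixed_size_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
```

Each chunk starts with the last token of the previous one, which is the boundary redundancy the overlap bullet describes.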
Recursive / hierarchical chunking
Split by the largest structural unit first (double newline, heading, section break), then recursively split any chunk that's still too large. LangChain's RecursiveCharacterTextSplitter popularized this.
- When to use: general-purpose documents with natural structure (articles, docs, code).
- Strength: respects paragraph and section boundaries when possible.
- Weakness: still falls back to arbitrary splitting for long paragraphs.
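The recursive idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split`, the separator list, and the use of character length instead of token count are all assumptions made for brevity.

```python
def recursive_split(text, max_len=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first; recurse on pieces that are still too long."""
    text = text.strip()
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No structure left to respect: fall back to an arbitrary hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    if len(pieces) == 1:
        # This separator does not occur, so try the next finer one.
        return recursive_split(text, max_len, finer)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_len:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    # Note: the separator is dropped at chunk boundaries (e.g. a trailing ". ").
    return chunks


doc = "Intro paragraph.\n\n" + "Long sentence number one. " * 5 + "\n\nShort tail."
for chunk in recursive_split(doc, max_len=60):
    print(repr(chunk))
```

The short paragraphs come through intact, while the oversized middle paragraph is split at sentence boundaries. The hard cut only happens when every separator has been exhausted.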
Semantic chunking
Use an embedding model to detect topic shifts within the document: embed each sentence, then split where the cosine similarity between consecutive sentence embeddings drops below a threshold.
- When to use: documents where topic boundaries matter (research papers, meeting transcripts, mixed-topic pages).
- Strength: chunks are topically coherent.
- Weakness: slower (requires an extra sentence-level embedding pass at ingest time, on top of embedding the final chunks), sensitive to threshold tuning, can produce wildly uneven chunk sizes.
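A minimal sketch of threshold-based semantic chunking. To keep it runnable without a model download, `toy_embed` is a bag-of-words stand-in (an assumption, not a real embedding model); in practice you would pass your own sentence-embedding function, adapt `cosine` to its vector type, and tune `threshold` on your data.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding: sparse word counts. Replace with a real model."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(key, 0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk wherever consecutive sentences fall below the threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    groups = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) < threshold:
            groups.append([sentence])    # similarity dropped: topic shift
        else:
            groups[-1].append(sentence)  # same topic: extend current chunk
    return [" ".join(group) for group in groups]


sentences = [
    "Cats purr when they are happy.",
    "Happy cats also knead blankets.",
    "The GPU ran out of memory.",
    "More GPU memory fixed the crash.",
]
for chunk in semantic_chunks(sentences):
    print(chunk)
```

Nothing here bounds chunk size, which is where the uneven-size weakness comes from. A common mitigation is to run a size-based splitter over any chunk that comes out too large.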
Document-structure-aware chunking
Parse the document's actual structure — Markdown headings, HTML tags, PDF layout, table boundaries — and chunk along those boundaries.
- When to use: structured documents (technical docs, legal contracts, financial reports).
- Strength: chunks align with the author's intended information units.
- Weakness: requires per-format parsers. Falls apart on poorly structured documents.
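As one example of a per-format parser, here is a rough sketch for Markdown that chunks along ATX headings (`#` through `######`) and records the heading path for each chunk. The function name `markdown_sections` and the output dict shape are assumptions, and the sketch deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def markdown_sections(text):
    """Return one chunk per heading-delimited section, with its heading path."""
    sections = []
    path = []  # stack of (level, title) for the headings currently in scope
    body = []  # lines collected since the last heading

    def flush():
        content = "\n".join(body).strip()
        if content:
            sections.append({"headings": [title for _, title in path],
                             "text": content})
        body.clear()

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()  # close the previous section under its own heading path
            level = len(match.group(1))
            # Pop headings at the same or deeper level before pushing this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections


md = "# Guide\nIntro text.\n## Install\nRun the installer.\n## Usage\nCall the API.\n"
for section in markdown_sections(md):
    print(section["headings"], "->", section["text"])
```

The heading path doubles as chunk metadata, so a section that is still too long can be split further without losing track of where it came from.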
Overlap Strategies
Overlap creates redundancy at chunk boundaries so that information split across two chunks still appears in at least one.
- Token overlap — repeat the last N tokens of chunk k at the start of chunk k+1. Simple, effective.
- Sentence overlap — repeat the last 1–2 sentences. More semantically meaningful.
- Sliding window — chunks are really windows that overlap by 50%+. Maximum recall, but at 50% overlap the chunk count (and with it storage and indexing cost) roughly doubles, and it grows further as overlap increases.
Sweet spot for most use cases: 10–15% overlap by tokens (the low end of the 10–20% range), or one sentence of overlap.
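The cost of overlap is easy to estimate before indexing anything. The sketch below counts how many chunks a document produces for a given chunk size and overlap, assuming fixed-size windows that advance by `chunk_size - overlap` tokens; the function name `chunk_count` and the 100,000-token document are illustrative.

```python
import math


def chunk_count(n_tokens, chunk_size, overlap):
    """Number of fixed-size windows needed to cover n_tokens with the given overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if n_tokens <= 0:
        return 0
    if n_tokens <= chunk_size:
        return 1
    # Each window after the first contributes (chunk_size - overlap) new tokens.
    return math.ceil((n_tokens - overlap) / (chunk_size - overlap))


baseline = chunk_count(100_000, 512, 0)
for overlap in (0, 51, 256):  # none, about 10%, and 50% of a 512-token chunk
    n = chunk_count(100_000, 512, overlap)
    print(f"overlap={overlap:>3}  chunks={n}  ratio={n / baseline:.2f}")
```

At about 10% overlap the chunk count grows by roughly 11%, while 50% overlap roughly doubles it, which is why modest overlap is the usual default.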
Practical Patterns
- Chunk + parent reference — store small chunks for retrieval, but attach a pointer to the parent section. At generation time, expand to the parent for more context. Best of both worlds.
- Multi-scale indexing — index the same document at 2–3 chunk sizes. Retrieve across all scales and deduplicate. Expensive but robust.
- Metadata enrichment — attach section title, document title, date, and source to each chunk's metadata. Enables filtered retrieval and helps the model ground its answers.
- Chunk quality filtering — after chunking, drop chunks that are too short (<50 tokens), pure boilerplate, or duplicates. Noise in, noise out.
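The chunk + parent reference pattern can be sketched as follows. The `Chunk` dataclass, `build_index`, and `expand_to_parents` are hypothetical names for illustration; the retrieval step itself (vector search over the small chunks) is assumed to happen elsewhere and is simulated here by picking hits by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    parent_id: str  # pointer back to the section this chunk was cut from
    text: str


def build_index(sections, chunk_words=40):
    """Cut each parent section into small chunks that remember their parent."""
    parents = dict(sections)
    chunks = []
    for parent_id, text in sections.items():
        words = text.split()
        for n, start in enumerate(range(0, len(words), chunk_words)):
            piece = " ".join(words[start:start + chunk_words])
            chunks.append(Chunk(f"{parent_id}#{n}", parent_id, piece))
    return parents, chunks


def expand_to_parents(hits, parents):
    """Swap retrieved chunks for their parent sections, deduplicated in rank order."""
    ordered_ids = list(dict.fromkeys(hit.parent_id for hit in hits))
    return [parents[parent_id] for parent_id in ordered_ids]


sections = {
    "install": "Install step. " * 30,  # 60 words, so two 40-word chunks
    "usage": "Usage note. " * 30,
}
parents, chunks = build_index(sections, chunk_words=40)

# Pretend the retriever returned these three small chunks, best first.
hits = [chunks[1], chunks[0], chunks[2]]
context = expand_to_parents(hits, parents)
print(len(hits), "hits expanded to", len(context), "parent sections")
```

Deduplicating by parent matters: several small hits from the same section should add that section to the prompt once, not once per hit.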
Don't Overthink It
Start with recursive chunking at 512 tokens and 10% overlap. Measure retrieval recall on 50–100 representative queries. Only get fancier if recall is clearly the bottleneck. In most pipelines, improving the embedding model or adding reranking will move the needle more than exotic chunking.
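Measuring recall does not need much machinery. Here is a minimal sketch, assuming you have a hand-labeled mapping from each query to the chunk IDs that should be retrieved (`gold`) and the ranked chunk IDs your retriever returned (`results`). The function name, the IDs, and the data are illustrative.

```python
def recall_at_k(results, gold, k=5):
    """Mean over queries of: relevant chunks found in the top k / relevant chunks."""
    scores = []
    for query, relevant in gold.items():
        if not relevant:
            continue  # skip queries with no labeled answer
        top_k = set(results.get(query, [])[:k])
        scores.append(len(top_k & set(relevant)) / len(relevant))
    return sum(scores) / len(scores) if scores else 0.0


gold = {"q1": {"c1", "c2"}, "q2": {"c9"}}
results = {"q1": ["c1", "c7", "c2"], "q2": ["c3", "c4"]}

for k in (2, 3):
    print(f"recall@{k} = {recall_at_k(results, gold, k=k):.2f}")
```

Run the same labeled queries against each chunking configuration and compare the numbers. Note that chunk IDs change when you re-chunk, so the gold labels are easier to maintain if they point at source passages or character spans that you map onto chunks per configuration.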