Reranking
The second-stage filter that turns good retrieval into great retrieval
Vector search gets you candidates. Reranking picks the winners. The two-stage pattern — fast retrieval followed by expensive reranking — is the single most impactful upgrade you can make to a RAG pipeline that already kind of works. If your users complain that "it finds related stuff but not the right stuff," reranking is your answer.
Why Reranking Exists
Embedding-based retrieval uses bi-encoders: query and document are embedded independently, then compared by cosine similarity. This is fast (you index documents once) but lossy — the model never sees query and document together, so it misses nuanced relevance signals.
Cross-encoders see query and document concatenated as a single input. They can attend across both, catching subtle matches that bi-encoders miss. The cost: you can't pre-compute document embeddings, so you can only run cross-encoders on a small candidate set.
Hence the two-stage pattern:
- Retrieve top-k (50–200) candidates with a fast bi-encoder or hybrid search.
- Rerank the candidates with a cross-encoder. Return the top-n (5–20).
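The two steps above can be sketched in a few lines. This is a minimal illustration, not a production design: `two_stage_search`, `word_overlap`, and `overlap_ratio` are hypothetical names, and the two toy scorers merely stand in for a real bi-encoder and cross-encoder. A real first stage would query an index rather than sort the whole corpus.

```python
from typing import Callable, List, Sequence, Tuple


def two_stage_search(
    query: str,
    corpus: Sequence[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    k: int = 100,
    n: int = 10,
) -> List[Tuple[str, float]]:
    """Retrieve the top-k with cheap_score, then rerank them with expensive_score."""
    # Stage 1: rank everything with the fast scorer and keep the k best.
    # (A real system would use an ANN or BM25 index here, not a full sort.)
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:k]
    # Stage 2: run the slow scorer only on the k survivors.
    rescored = [(d, expensive_score(query, d)) for d in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:n]


# Toy stand-ins (assumptions for illustration only).
def word_overlap(query: str, doc: str) -> float:
    """'Fast' scorer: number of distinct words shared by query and document."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    """'Slow' scorer: shared words divided by document length."""
    words = doc.lower().split()
    return word_overlap(query, doc) / len(words) if words else 0.0


docs = [
    "reset your password from the account settings page",
    "password policy and the history of password rules at the company",
    "quarterly revenue report",
]
top = two_stage_search("reset password", docs, word_overlap, overlap_ratio, k=2, n=1)
```

The point of the shape is that `expensive_score` runs k times per query, never once per document in the corpus.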
Reranking Options
Cross-encoders
Fine-tuned BERT-style models trained on relevance judgments. The classic choice.
- Models: ms-marco-MiniLM-L-12-v2, bge-reranker-v2-m3, Jina Reranker v2.
- Tradeoff: high quality, moderate latency (10–50ms for 100 candidates), needs GPU for production throughput.
- When to use: you have GPU budget and want the best quality per dollar.
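As a rough sketch of what cross-encoder reranking looks like in code, the helper below scores (query, document) pairs in batches and sorts by score. The names `rerank` and `fake_predict` are hypothetical, and the model is injected as a `predict` callable so the example runs without downloading weights.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    docs: Sequence[str],
    predict: Callable[[List[Pair]], Sequence[float]],
    top_n: int = 10,
    batch_size: int = 32,
) -> List[Tuple[str, float]]:
    """Score every (query, doc) pair in batches and return the top_n by score."""
    pairs = [(query, d) for d in docs]
    scores: List[float] = []
    # Send pairs to the model in chunks instead of one call per document.
    for start in range(0, len(pairs), batch_size):
        batch = pairs[start:start + batch_size]
        scores.extend(float(s) for s in predict(batch))
    ranked = sorted(zip(docs, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def fake_predict(pairs: List[Pair]) -> List[float]:
    # Hypothetical stand-in for a real model: counts shared lowercase words.
    return [float(len(set(q.lower().split()) & set(d.lower().split()))) for q, d in pairs]


results = rerank(
    "gpu memory error",
    ["cuda out of memory error on gpu", "memory leaks in python", "cooking pasta"],
    fake_predict,
    top_n=2,
    batch_size=2,
)
```

With the sentence-transformers library, `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2").predict` takes a list of (query, document) pairs and should be usable as `predict`. Check the behavior against the version you have installed.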
Cohere Rerank
API-based reranker. Send query + documents, get relevance scores back.
- Strength: zero infrastructure, very good quality, supports long documents (up to 4096 tokens per doc).
- Weakness: API latency, per-call cost, data leaves your network.
- When to use: you want reranking without running models yourself.
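A hedged sketch of what the call might look like with the Cohere Python SDK. The wrapper name `cohere_rerank` is hypothetical, and the default model name is an assumption that may be out of date, so check the current model list. The client is passed in so that the function itself does not depend on credentials.

```python
from typing import Any, List, Sequence, Tuple


def cohere_rerank(
    client: Any,
    query: str,
    docs: Sequence[str],
    top_n: int = 5,
    model: str = "rerank-english-v3.0",  # assumption: verify the current model name
) -> List[Tuple[str, float]]:
    """Call the rerank endpoint and map results back to the original documents."""
    response = client.rerank(
        model=model,
        query=query,
        documents=list(docs),
        top_n=top_n,
    )
    # Each result carries the index of the document in the input list
    # and a relevance score; results come back sorted by relevance.
    return [(docs[r.index], r.relevance_score) for r in response.results]
```

Typical usage would be `cohere_rerank(cohere.Client(api_key), query, candidates)`, where `api_key` is yours to supply. Response field names can shift between SDK versions, so treat this as a starting point.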
ColBERT
A late-interaction model that sits between bi-encoders and cross-encoders. Documents are encoded into per-token embeddings at index time. At query time, token-level matching (MaxSim) approaches cross-encoder quality at close to bi-encoder speed.
- Models: ColBERTv2, typically served through a PLAID index for fast retrieval.
- Tradeoff: higher storage (per-token vectors), but retrieval+reranking in one step.
- When to use: you need cross-encoder quality at bi-encoder latency and can afford the storage.
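To make MaxSim concrete: for each query token, take the best-matching document token, then sum those maxima. The sketch below uses tiny hand-written vectors and assumes the token embeddings are already L2-normalized, so a dot product behaves like cosine similarity. It illustrates the scoring rule only, not the ColBERT codebase.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy 2-d "token embeddings" (made up for illustration, all unit length).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]  # covers both query tokens
doc_b = [[1.0, 0.0], [1.0, 0.0]]              # covers only the first one

score_a = maxsim(query, doc_a)  # 1.0 + 1.0 = 2.0
score_b = maxsim(query, doc_b)  # 1.0 + 0.0 = 1.0
```

Because the document side is just a bag of precomputed token vectors, only the query has to be encoded at request time. That is where the storage cost and the speed both come from.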
RankGPT / LLM-based reranking
Use a large language model to rerank by asking it to sort candidates by relevance. Surprisingly effective.
- Approach: pass the query and N candidates to the LLM, ask it to rank them. Sliding window for large candidate sets.
- Strength: no training data needed, handles complex / multi-hop relevance.
- Weakness: slow, expensive, non-deterministic. The LLM may also return a malformed ranking that drops, repeats, or invents candidates.
- When to use: offline pipelines, complex queries where traditional rerankers struggle, or as a teacher to generate training data for a cross-encoder.
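The sliding-window idea can be sketched as follows: rank the last window of candidates, slide toward the front with some overlap, and let the strongest items bubble up. This is an assumption-laden sketch, not RankGPT's implementation. `rank_window` is a hypothetical callable that would wrap your LLM prompt and return an ordering of the window's positions, and `toy_ranker` replaces it with a deterministic stub.

```python
from typing import Callable, List, Sequence


def sliding_window_rerank(
    query: str,
    candidates: Sequence[str],
    rank_window: Callable[[str, List[str]], List[int]],
    window: int = 4,
    step: int = 2,
) -> List[str]:
    """Rerank back-to-front in overlapping windows so top items move forward."""
    if window < 1 or step < 1:
        raise ValueError("window and step must be positive")
    items = list(candidates)
    end = len(items)
    while end > 0:
        start = max(0, end - window)
        chunk = items[start:end]
        order = rank_window(query, chunk)
        # Guard against a malformed ranking (missing, repeated, or
        # out-of-range positions): keep the current order for this window.
        if sorted(order) != list(range(len(chunk))):
            order = list(range(len(chunk)))
        items[start:end] = [chunk[i] for i in order]
        if start == 0:
            break
        end -= step
    return items


def toy_ranker(query: str, chunk: List[str]) -> List[int]:
    # Hypothetical stand-in for the LLM: a higher trailing number is more relevant.
    return sorted(range(len(chunk)), key=lambda i: int(chunk[i][1:]), reverse=True)


reranked = sliding_window_rerank("q", ["d1", "d2", "d3", "d4", "d5", "d6"], toy_ranker)
```

With `window=4` and `step=2`, each pass carries its two best items into the next window, so a single back-to-front sweep is only guaranteed to get the first couple of positions right. The validity check is the cheap defense against a ranking that is not a true permutation.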
When Reranking Matters
Reranking has the highest ROI when:
- Your queries are natural language and your documents are semi-structured. Bi-encoders struggle with this mismatch.
- You need high precision in the top-5. Reranking reorders candidates that are already retrieved, so it can lift precision@5 substantially even when recall@100 is already good.
- Your retrieval returns "close but not quite" results. This is the classic symptom of hitting the bi-encoder ceiling.
- You are doing multi-hop or complex queries. Cross-encoders handle compositionality better.
Reranking matters less when:
- Your queries are keyword-like and BM25 already nails them.
- Your latency budget is <50ms total.
- Your candidate pool is tiny (<20 results).
Practical Tips
- Retrieve more than you think. Pull 100–200 candidates for the reranker, return 5–10. The reranker's job is to sift, not search.
- Batch your reranking calls. Cross-encoders are GPU-friendly in batches. Don't call one-by-one.
- Cache reranked results for repeated queries. Most production query distributions are highly skewed.
- Monitor reranker lift. On a labeled query set, compare recall@10 of the final results with and without reranking. If the delta is <2%, it's not worth the cost.
- Distill LLM rankings. Use RankGPT to label a training set, then fine-tune a small cross-encoder. LLM quality, cross-encoder cost.
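Measuring reranker lift needs nothing more than a small labeled query set. Here is a minimal sketch, assuming each query record holds the set of relevant document IDs plus the ranked IDs before and after reranking. The record layout and the name `mean_lift` are made up for illustration.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / len(relevant)


def mean_lift(runs: List[Dict], k: int = 10) -> float:
    """Average of (recall@k after reranking) minus (recall@k before) over queries."""
    deltas = [
        recall_at_k(r["after"], r["relevant"], k) - recall_at_k(r["before"], r["relevant"], k)
        for r in runs
    ]
    return sum(deltas) / len(deltas) if deltas else 0.0


# Two toy queries: reranking helps the first and changes nothing for the second.
runs = [
    {"relevant": {"a", "b"}, "before": ["x", "a", "y", "b"], "after": ["a", "b", "x", "y"]},
    {"relevant": {"c"}, "before": ["c", "x"], "after": ["c", "x"]},
]
lift = mean_lift(runs, k=2)  # (0.5 + 0.0) / 2 = 0.25
```

Tracking this number over time, on the same query set, is what tells you whether the reranker is still earning its latency.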