Reranking

Vector search gets you candidates. Reranking picks the winners. The two-stage pattern (fast retrieval followed by expensive reranking) is often the single most impactful upgrade you can make to a RAG pipeline that already kind of works. If your users complain that "it finds related stuff but not the right stuff," reranking is likely your answer.

Why Reranking Exists

Embedding-based retrieval uses bi-encoders: query and document are embedded independently, then compared by cosine similarity. This is fast (you index documents once) but lossy — the model never sees query and document together, so it misses nuanced relevance signals.

Cross-encoders see query and document concatenated as a single input. They can attend across both, catching subtle matches that bi-encoders miss. The cost: you can't pre-compute document embeddings, so you can only run cross-encoders on a small candidate set.

Hence the two-stage pattern:

1. Retrieve top-k (50–200) candidates with a fast bi-encoder or hybrid search.
2. Rerank the candidates with a cross-encoder. Return the top-n (5–20).
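The two stages above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the function names (`retrieve`, `rerank`, `token_overlap`) are made up for this example, the document vectors are assumed to be pre-computed by whatever bi-encoder you use, and `token_overlap` is only a toy stand-in for a real cross-encoder.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Vector, doc_vecs: List[Vector], k: int) -> List[int]:
    """Stage 1: compare the query against pre-computed document vectors
    and keep the indices of the k most similar documents."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


def rerank(
    query: str,
    docs: List[str],
    candidate_ids: List[int],
    score_pair: Callable[[str, str], float],
    n: int,
) -> List[int]:
    """Stage 2: score each (query, document) pair jointly and keep the
    indices of the n best candidates."""
    scored = [(score_pair(query, docs[i]), i) for i in candidate_ids]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:n]]


def token_overlap(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: counts shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))
```

In a real system you would call `retrieve` with k in the 50–200 range and pass a cross-encoder scoring function to `rerank`. The point of the split is that stage 1 only touches vectors you computed once, while stage 2 reads the actual text of a small candidate set.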

Reranking Options

Cross-encoders

Fine-tuned BERT-style models trained on relevance judgments. The classic choice.

Models: ms-marco-MiniLM-L-12-v2, bge-reranker-v2-m3, Jina Reranker v2.
Tradeoff: high quality, moderate latency (roughly 10–50ms for 100 candidates on a GPU), needs GPU for production throughput.
When to use: you have GPU budget and want the best quality per dollar.
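Here is one way to wire up a cross-encoder reranker, assuming the sentence-transformers library and its `CrossEncoder` class. The helper names (`load_cross_encoder_scorer`, `rerank_batched`) and the default batch size are illustrative choices, and the model identifier is the Hugging Face name for the first model listed above. The scorer is passed in as a callable, so you can swap in a different model or a test double.

```python
from typing import Callable, List, Sequence, Tuple

# A scorer takes a list of (query, document) pairs and returns one score per pair.
PairScorer = Callable[[List[Tuple[str, str]]], Sequence[float]]


def load_cross_encoder_scorer(
    model_name: str = "cross-encoder/ms-marco-MiniLM-L-12-v2",
    batch_size: int = 32,
) -> PairScorer:
    """Build a scorer backed by a sentence-transformers CrossEncoder."""
    # Imported lazily so the rest of this module works without the dependency.
    from sentence_transformers import CrossEncoder

    model = CrossEncoder(model_name)

    def score(pairs: List[Tuple[str, str]]) -> Sequence[float]:
        # predict() scores all pairs, batching them internally.
        return model.predict(pairs, batch_size=batch_size)

    return score


def rerank_batched(
    query: str,
    docs: List[str],
    score_pairs: PairScorer,
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Score every candidate in a single call and return the top_n
    (document, score) tuples, best first."""
    pairs = [(query, doc) for doc in docs]
    scores = score_pairs(pairs)
    ranked = sorted(zip(docs, scores), key=lambda item: float(item[1]), reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_n]]
```

Typical use would be `scorer = load_cross_encoder_scorer()` once at startup, then `rerank_batched(query, candidates, scorer, top_n=10)` per request. Loading the model per request would dominate your latency.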

Cohere Rerank

API-based reranker. Send query + documents, get relevance scores back.

Strength: zero infrastructure, very good quality, supports long documents (up to 4096 tokens per doc).
Weakness: API latency, per-call cost, data leaves your network.
When to use: you want reranking without running models yourself.
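A call to the hosted reranker might look like the sketch below. It assumes the official `cohere` Python SDK, whose client exposes a `rerank` method returning results with `index` and `relevance_score` fields. The model name `rerank-english-v3.0` and the wrapper name `cohere_rerank` are assumptions for illustration, so check the current Cohere docs for the model you actually want.

```python
from typing import Any, List, Tuple


def cohere_rerank(
    client: Any,
    query: str,
    docs: List[str],
    top_n: int = 5,
    model: str = "rerank-english-v3.0",  # assumed model name; verify before use
) -> List[Tuple[str, float]]:
    """Send the query and candidate documents to the rerank endpoint and
    return (document, relevance_score) tuples, best first.

    `client` is expected to be a Cohere SDK client, for example:
        import cohere
        client = cohere.Client("YOUR_API_KEY")
    """
    response = client.rerank(
        model=model,
        query=query,
        documents=docs,
        top_n=top_n,
    )
    # Each result carries the index of the document in the original list.
    return [(docs[r.index], r.relevance_score) for r in response.results]
```

Passing the client in, rather than constructing it inside the function, keeps the API key out of this code path and makes it easy to substitute a fake client in tests.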

ColBERT

A late-interaction model — a middle ground between bi-encoders and cross-encoders. Documents are encoded into per-token embeddings at index time. At query time, token-level matching (MaxSim) gives cross-encoder-like quality with bi-encoder-like speed.

Models: ColBERTv2, typically served with a PLAID index for fast retrieval.
Tradeoff: higher storage (per-token vectors), but retrieval+reranking in one step.
When to use: you need cross-encoder quality at bi-encoder latency and can afford the storage.
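The MaxSim operator is simple enough to write out. For each query token embedding, take its best match among the document's token embeddings, then sum those maxima. The sketch below uses plain Python lists and a dot product as the similarity, and assumes every document has at least one token vector. Real ColBERT implementations run this on normalized tensors with heavy indexing tricks, so treat it as an illustration of the scoring rule only.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim(query_tokens: List[Vector], doc_tokens: List[Vector]) -> float:
    """Late-interaction score: for each query token, find the most similar
    document token, then sum those per-token maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank_by_maxsim(
    query_tokens: List[Vector],
    docs_tokens: List[List[Vector]],
) -> List[int]:
    """Return document indices ordered by descending MaxSim score."""
    scores = [maxsim(query_tokens, doc) for doc in docs_tokens]
    return sorted(range(len(docs_tokens)), key=lambda i: scores[i], reverse=True)
```

The storage cost mentioned above falls straight out of the types: every document is a list of vectors rather than a single vector.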

RankGPT / LLM-based reranking

Use a large language model to rerank by asking it to sort candidates by relevance. Surprisingly effective.

Approach: pass the query and N candidates to the LLM, ask it to rank them. Sliding window for large candidate sets.
Strength: no training data needed, handles complex / multi-hop relevance.
Weakness: slow, expensive, non-deterministic. The LLM may hallucinate rankings, for example by dropping, duplicating, or inventing candidate identifiers.
When to use: offline pipelines, complex queries where traditional rerankers struggle, or as a teacher to generate training data for a cross-encoder.
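The sliding-window approach can be sketched as follows. This is a simplified take, not the reference RankGPT code. The `llm` argument is a hypothetical callable that receives the query and a list of passages and replies with a ranking string such as `[2] > [1] > [3]`, and the window and step sizes are arbitrary defaults. The parser is deliberately defensive, since the model's reply may skip, repeat, or invent identifiers.

```python
import re
from typing import Callable, List


def parse_permutation(text: str, n: int) -> List[int]:
    """Turn a reply like '[2] > [1] > [3]' into 0-based positions.

    Out-of-range and duplicate identifiers are dropped, and any position
    the reply omitted is appended in its original order."""
    seen: List[int] = []
    for match in re.findall(r"\[(\d+)\]", text):
        pos = int(match) - 1
        if 0 <= pos < n and pos not in seen:
            seen.append(pos)
    missing = [p for p in range(n) if p not in seen]
    return seen + missing


def sliding_window_rerank(
    query: str,
    docs: List[str],
    llm: Callable[[str, List[str]], str],
    window: int = 4,
    step: int = 2,
) -> List[int]:
    """Rerank docs by ranking overlapping windows from the bottom of the
    list to the top, so strong candidates can bubble upward.

    Returns document indices in their final order."""
    if not 0 < step <= window:
        raise ValueError("step must be between 1 and window")
    order = list(range(len(docs)))
    end = len(order)
    while end > 0:
        start = max(0, end - window)
        ids = order[start:end]
        reply = llm(query, [docs[i] for i in ids])
        perm = parse_permutation(reply, len(ids))
        order[start:end] = [ids[p] for p in perm]
        if start == 0:
            break
        end -= step
    return order
```

Because windows overlap by `window - step` items, a passage that starts near the bottom can climb by that many positions per pass. The number of LLM calls grows roughly as the candidate count divided by `step`, which is where the cost comes from.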

When Reranking Matters

Reranking has the highest ROI when:

Your queries are natural language and your documents are semi-structured. Bi-encoders struggle with this mismatch.
You need high precision in the top-5. Reranking dramatically improves precision@5 even when recall@100 is already good.
Your retrieval returns "close but not quite" results. The classic symptom of bi-encoder ceiling.
You are doing multi-hop or complex queries. Cross-encoders handle compositionality better.

Reranking matters less when:

Your queries are keyword-like and BM25 already nails them.
Your latency budget is under 50ms total.
Your candidate pool is tiny (<20 results).

Practical Tips

Retrieve more than you think. Pull 100–200 candidates for the reranker, return 5–10. The reranker's job is to sift, not search.
Batch your reranking calls. Cross-encoders are GPU-friendly in batches. Don't call one-by-one.
Cache reranked results for repeated queries. Most production query distributions are highly skewed.
Monitor reranker lift. Compare recall@10 with and without reranking. If the delta is <2%, it's not worth the cost.
Distill LLM rankings. Use RankGPT to label a training set, then fine-tune a small cross-encoder. LLM quality, cross-encoder cost.
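To make "monitor reranker lift" concrete, here is a small sketch that computes precision@k and recall@k per query and averages the difference between the reranked and the original ordering. The function names and the dictionary layout (query ID mapped to a ranked list of document IDs, plus a set of relevant IDs per query) are assumptions for this example. You need labeled relevance judgments for it to mean anything.

```python
from typing import Callable, Dict, List, Set

Metric = Callable[[List[str], Set[str], int], float]


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


def mean_lift(
    before: Dict[str, List[str]],
    after: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
    metric: Metric,
) -> float:
    """Average per-query change in the metric (after minus before).

    `before` holds the retrieval-only ordering and `after` the reranked
    ordering, both keyed by query ID."""
    deltas = [
        metric(after[q], relevant[q], k) - metric(before[q], relevant[q], k)
        for q in relevant
    ]
    return sum(deltas) / len(deltas)
```

The value returned is an absolute difference, so a result of 0.02 means two percentage points. Run it on a held-out query set, and rerun it whenever you change the retriever, since a better first stage can shrink the reranker's lift.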
