Indexing Strategies
HNSW, IVF, PQ, ScaNN — choosing and tuning the right index for your scale
You don't search a vector database by brute force. Behind every fast approximate nearest neighbor (ANN) query is an index data structure that trades a little recall for a lot of speed. Picking the right index — and tuning it correctly — is the difference between 5ms p99 and 500ms p99 at the same recall level. This matters a lot less at 100K vectors than at 100M.
The Core Algorithms
HNSW (Hierarchical Navigable Small World)
The default choice for most workloads. Builds a multi-layer graph where each node connects to its approximate nearest neighbors.
- How it works: at query time, start from a random entry point at the top layer, greedily descend through layers, refining the neighbor set at each level.
- Tuning knobs:
M(connections per node, higher = better recall, more memory),ef_construction(build-time effort),ef_search(query-time effort, higher = better recall, more latency). - Strengths: excellent recall-latency tradeoff, no training step, supports incremental inserts.
- Weaknesses: high memory (the full graph must fit in RAM), slow builds for very large datasets.
- Use when: dataset fits in memory, you need low latency, and you want simplicity.
IVF (Inverted File Index)
Clusters vectors into N buckets using k-means, then only searches the closest buckets at query time.
- How it works: at build time, run k-means to create
nlistcentroids. At query time, find thenprobeclosest centroids, scan only those buckets. - Tuning knobs:
nlist(number of clusters, sqrt(N) is a common starting point),nprobe(clusters to search, higher = better recall, more latency). - Strengths: lower memory than HNSW, works well with disk-based storage, composable with PQ.
- Weaknesses: requires a training step (k-means), recall can be uneven across clusters, no incremental insert without re-clustering.
- Use when: dataset is too large for HNSW in RAM, or you need IVF+PQ for extreme compression.
PQ (Product Quantization)
Compresses vectors by splitting them into sub-vectors and quantizing each to the nearest centroid in a learned codebook. Reduces memory by 4-32x.
- How it works: divide each D-dimensional vector into M sub-vectors, learn a codebook of K centroids per subspace, replace each sub-vector with its centroid ID.
- Tuning knobs:
M(number of sub-quantizers, higher = more accurate, more memory),nbits(bits per code, typically 8). - Strengths: massive memory reduction, enables billion-scale search on commodity hardware.
- Weaknesses: lossy — recall drops compared to uncompressed. Requires a training set.
- Use when: you need to fit billions of vectors into limited memory. Almost always combined with IVF (IVF-PQ).
ScaNN (Scalable Approximate Nearest Neighbors)
Google's library. Uses anisotropic vector quantization — quantization that is aware of the inner product geometry, losing less recall than standard PQ at the same compression ratio.
- Strengths: state-of-the-art recall-speed tradeoff on benchmarks, especially for inner product search.
- Weaknesses: less ecosystem support than FAISS-based indexes, tighter coupling to TensorFlow ecosystem.
- Use when: you are in a Google-adjacent stack and want the best benchmark numbers.
Choosing the Right Index
| Scale | Recommended Index | Why |
|---|---|---|
| <100K vectors | Flat (brute force) or HNSW | At this scale, brute force is fast enough. HNSW if you want headroom. |
| 100K – 10M | HNSW | Best recall-latency tradeoff, fits in RAM. |
| 10M – 100M | HNSW (if RAM allows) or IVF-PQ | Memory becomes the constraint. IVF-PQ if you can't afford the RAM. |
| 100M – 1B+ | IVF-PQ or IVF-HNSW-PQ | Must compress. Composite indexes combine coarse search with quantization. |
Tuning for Recall vs Latency
The fundamental tradeoff: more search effort = higher recall = more latency. Tuning is about finding the knee of the curve for your requirements.
HNSW tuning recipe:
- Set
M=16,ef_construction=200as a baseline. - Sweep
ef_searchfrom 50 to 500. Plot recall@10 vs p99 latency. - Pick the lowest
ef_searchthat meets your recall target (usually 0.95+). - If recall is still too low, increase
M(costs more memory) or try a better embedding model.
IVF-PQ tuning recipe:
- Set
nlist = sqrt(N), train on a representative sample. - Sweep
nprobefrom 1 tonlist/10. Plot recall vs latency. - Increase PQ
M(sub-quantizers) if recall is poor at all nprobe values — you're losing too much to compression.
Practical Patterns
- Benchmark on your data. ANN-benchmarks.com numbers don't transfer. Your data distribution, query distribution, and hardware matter.
- Monitor recall in production. Sample queries, run brute-force offline, compare. Recall drifts as data distribution shifts.
- Separate indexing from serving. Build the index offline (can take hours for large datasets), swap it into the serving path. Don't build indexes under query load.
- Pre-filter then search, not search then filter. Metadata filtering after ANN search can destroy recall. Use indexes that support pre-filtering (HNSW with payload indexes in Qdrant, filtered search in Weaviate).
- Re-index periodically. IVF clusters and HNSW graphs degrade as data drifts. Schedule rebuilds.
The Honest Take
For most teams, the vector database handles index selection and tuning for you. Pinecone, Weaviate, and Qdrant all default to HNSW with sensible parameters. You only need to think about this deeply when you are self-hosting at scale, when latency requirements are extreme, or when your recall numbers don't add up. Start with defaults, measure, and tune only what the measurements tell you to.