When a retrieval-augmented generation pipeline starts returning the wrong context, or a recommendation system slows from 30 milliseconds to 300 milliseconds at p99, the root cause is almost always the same — the wrong vector database indexing strategy for the workload. At MinervaDB, we spend a significant portion of every database engineering engagement helping teams pick, tune, and operate vector indexes that hold up under production traffic. This guide walks through the five indexing families that matter most in 2026 — HNSW, IVF, PQ, OPQ, and ScaNN — and explains how we choose between them when latency budgets are tight, recall targets are non-negotiable, and the dataset will not stop growing.
Why Vector Indexing Decides Production Performance
Vector search is the backbone of modern AI applications — semantic search, RAG, recommendation, fraud detection, anomaly detection, and multimodal retrieval. The data structure that makes this possible is the approximate nearest neighbor (ANN) index. A brute-force flat scan over a billion 768-dimensional embeddings is mathematically simple but operationally hopeless. ANN indexes trade a small, tunable amount of recall for orders-of-magnitude reductions in query cost.
The catch is that no single index family wins across every workload. An index optimized for billion-scale offline batch retrieval looks nothing like one tuned for 10-millisecond p99 online inference. We have seen teams adopt the default index in a vector engine, run benchmarks on a 100K-vector sample, and then watch latency collapse the moment production scale arrives. The five algorithms covered here — HNSW, IVF, PQ, OPQ, and ScaNN — are the building blocks every senior database engineer needs to reason about before signing off on a vector platform. For broader context on how we approach analytical and vector workloads end-to-end, the architecture notes published on ChistaDATA are a useful companion read.
HNSW: Graph-Based Search for Low-Latency Workloads
Hierarchical Navigable Small World, introduced by Malkov and Yashunin in 2016, is the index of choice when query latency matters more than memory footprint. HNSW builds a multi-layer proximity graph. The top layers are sparse and contain long-range links; the bottom layer holds every vector with short-range neighbors. A query enters at the top, greedily descends toward the nearest neighbor at each layer, and converges on a high-recall result set in roughly logarithmic time.
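To make the descent concrete, here is a minimal pure-Python sketch of the layer-by-layer greedy walk. It is an illustration only: the function names (`greedy_layer_search`, `hnsw_descend`) and the toy graph are hypothetical, and the sketch keeps a single candidate per layer, whereas real HNSW implementations keep a candidate list on the bottom layer.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def greedy_layer_search(vectors, graph, entry, query):
    """Walk one layer: move to a closer neighbor until no neighbor improves."""
    current = entry
    current_dist = l2(vectors[current], query)
    improved = True
    while improved:
        improved = False
        for neighbor in graph.get(current, ()):
            d = l2(vectors[neighbor], query)
            if d < current_dist:
                current, current_dist = neighbor, d
                improved = True
    return current

def hnsw_descend(vectors, layers, entry, query):
    """Descend from the sparsest layer to the densest one.

    Each layer's result becomes the entry point of the next layer.
    """
    for graph in layers:  # layers[0] is the top (sparsest) layer
        entry = greedy_layer_search(vectors, graph, entry, query)
    return entry

# Toy 1-D dataset: ids 0..5 sit at positions 0, 2, 4, 6, 8, 10.
vectors = [[0.0], [2.0], [4.0], [6.0], [8.0], [10.0]]
top_layer = {0: [3], 3: [0, 5], 5: [3]}  # sparse, long-range links
bottom_layer = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

nearest = hnsw_descend(vectors, [top_layer, bottom_layer], entry=0, query=[7.6])
```

On this toy graph the top layer jumps from node 0 to node 3 in a single hop, and the bottom layer refines the result to node 4, the true nearest neighbor of the query.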
Three parameters drive HNSW behavior, and our consulting teams tune all three explicitly rather than relying on defaults:
- M — the number of bidirectional links per node. Larger M produces a denser graph, raises recall, and costs memory. Most production workloads start at M=16 and move up to M=32 or M=48 only when recall targets demand it.
- efConstruction — the candidate list size during index build. Higher values yield a higher-quality graph but lengthen build time. We typically begin at 200 and benchmark upward in steps of 100.
- efSearch — the dynamic candidate list at query time. This is the lever that trades recall for latency in production. It can be adjusted per query without rebuilding the index, which is invaluable for mixed workloads.
A representative configuration using the FAISS library looks like this:
import faiss
d = 768 # embedding dimensionality
M = 32 # links per node
ef_construction = 400
ef_search = 128
index = faiss.IndexHNSWFlat(d, M)
index.hnsw.efConstruction = ef_construction
index.add(xb) # xb: database vectors to index (HNSW has no separate training step)
index.hnsw.efSearch = ef_search
D, I = index.search(xq, k=10) # xq: query batch
HNSW behaves beautifully on read-heavy, latency-sensitive workloads up to a few hundred million vectors per node. Beyond that, memory consumption — typically 4 × d × N + M × N × 8 bytes — becomes the limiting factor. Authoritative parameter guidance is documented in the Milvus index reference and in the original HNSW paper hosted on arXiv. Our recommendation is to never deploy HNSW at scale without first profiling the recall curve against efSearch on representative production queries — synthetic workloads consistently overstate how well any graph index will perform.
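The memory rule of thumb above is easy to turn into a quick sizing helper. The sketch below applies the 4 × d × N + M × N × 8 estimate literally; the function name `hnsw_memory_bytes` is ours, and the result should be read as a rough planning figure rather than what any particular engine will actually allocate.

```python
def hnsw_memory_bytes(n_vectors, dim, m):
    """Estimate HNSW memory with the 4*d*N + M*N*8 rule of thumb."""
    vector_bytes = 4 * dim * n_vectors  # float32 vector storage
    link_bytes = m * n_vectors * 8      # graph links
    return vector_bytes + link_bytes

# Compare a few values of M for 100 million 768-dimensional vectors.
for m in (16, 32, 48):
    gib = hnsw_memory_bytes(100_000_000, 768, m) / 2**30
    print(f"M={m}: {gib:.1f} GiB")
```

At d=768 the float32 vectors dominate the total, which is why compressing the vectors usually saves more memory than lowering M.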
When We Recommend HNSW
We reach for HNSW when the dataset fits comfortably in RAM, p99 latency targets sit under 20 milliseconds, and the workload is dominated by reads. RAG pipelines for enterprise knowledge bases, real-time personalization, and semantic search APIs are textbook fits. HNSW handles incremental inserts gracefully, although deletes require periodic compaction, a detail that many teams discover the hard way.
IVF: Inverted File Indexes for Scalable Partitioning
Where HNSW optimizes for low-latency reads in memory, IVF (Inverted File) optimizes for scalability and balanced trade-offs. The algorithm runs a coarse k-means over the dataset, producing nlist centroids that define Voronoi cells. Every vector is assigned to the cell whose centroid lies closest. At query time, only the nprobe nearest cells are scanned.
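The cell assignment and probing logic can be sketched in a few lines of pure Python. This is a toy illustration under stated assumptions: the centroids are supplied by hand rather than learned by k-means, and the helper names (`nearest_cells`, `build_inverted_lists`, `ivf_search`) are hypothetical, not part of FAISS or any other library.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_cells(centroids, vector, count):
    """Indices of the `count` centroids closest to `vector`."""
    order = sorted(range(len(centroids)),
                   key=lambda c: sq_dist(centroids[c], vector))
    return order[:count]

def build_inverted_lists(centroids, vectors):
    """Assign every vector id to the cell of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in enumerate(vectors):
        lists[nearest_cells(centroids, vec, 1)[0]].append(vid)
    return lists

def ivf_search(centroids, lists, vectors, query, k, nprobe):
    """Scan only the nprobe nearest cells and return the k closest ids."""
    candidates = [vid
                  for c in nearest_cells(centroids, query, nprobe)
                  for vid in lists[c]]
    candidates.sort(key=lambda vid: sq_dist(vectors[vid], query))
    return candidates[:k]

# Two cells; vector 0 sits just on the cell-0 side of the boundary.
centroids = [[0.0, 0.0], [10.0, 0.0]]
vectors = [[4.9, 0.0], [9.0, 0.0], [1.0, 0.0]]
lists = build_inverted_lists(centroids, vectors)

query = [5.2, 0.0]  # falls in cell 1, but its true neighbor is in cell 0
narrow = ivf_search(centroids, lists, vectors, query, k=1, nprobe=1)
wide = ivf_search(centroids, lists, vectors, query, k=1, nprobe=2)
```

With `nprobe=1` the search returns vector 1 and misses the true nearest neighbor, vector 0, which sits just across the cell boundary; raising `nprobe` to 2 recovers it.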
The two tuning knobs are straightforward but interact with each other:
- nlist: number of partitions. The conventional rule, cited in the FAISS index documentation, is to start at nlist = C × sqrt(N), where C ranges from 4 to 16 depending on dataset characteristics. For 100 million vectors, this places nlist somewhere between 40,000 and 160,000.
- nprobe: number of cells scanned per query. Larger nprobe raises recall but extends latency linearly. Production deployments commonly run nprobe between 8 and 64, depending on the recall target.
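The interaction between the two knobs is easier to see with numbers. The sketch below computes the nlist starting range from the rule above and the expected number of vectors scanned per query; the second figure assumes perfectly balanced cells, which real k-means partitions only approximate, and both function names are ours.

```python
import math

def nlist_range(n_vectors, c_low=4, c_high=16):
    """Starting range for nlist from the rule nlist = C * sqrt(N)."""
    root = math.sqrt(n_vectors)
    return round(c_low * root), round(c_high * root)

def expected_scan_size(n_vectors, nlist, nprobe):
    """Vectors scanned per query, assuming all cells hold N / nlist vectors."""
    return n_vectors * nprobe / nlist

low, high = nlist_range(100_000_000)
scanned = expected_scan_size(100_000_000, nlist=low, nprobe=8)
```

Doubling nlist at a fixed nprobe halves the expected scan size, so the two parameters are best tuned together against a recall target.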
A working FAISS configuration:
import faiss
d = 768
nlist = 16384
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFFlat(quantizer, d, nlist, faiss.METRIC_L2)
index.train(xt) # xt: representative training sample; FAISS k-means warns below ~39 vectors per centroid
index.add(xb)
index.nprobe = 32
D, I = index.search(xq, k=10)
IVF behaves well at billion-scale because the index can be partitioned across shards by cell ID, which makes horizontal scaling straightforward. The cost is recall sensitivity to nprobe — too low and queries miss obvious neighbors that happen to sit in adjacent cells, especially near cell boundaries. We typically pair IVF with a quantization scheme rather than running plain IVFFlat in production, which leads directly to the next section.
When IVF Wins Over HNSW
If the corpus is north of 100 million vectors and the workload tolerates 30–80 millisecond latencies, IVF — particularly IVF combined with PQ or OPQ — almost always beats HNSW on cost per query. The MinervaDB team has migrated several customers off oversized HNSW deployments onto IVF-based indexes, cutting hardware bills by 60% while holding recall above 95%.
Product Quantization (PQ): Compressing High-Dimensional Vectors
Storage is the silent budget killer in any vector platform. A single 768-dimensional float32 embedding consumes 3,072 bytes. A billion of them is 3 TB before any index overhead. Product Quantization, introduced by Jégou, Douze, and Schmid in 2011, solves this by compressing each vector into a compact byte code.
The algorithm is conceptually simple but mathematically elegant:
- Split each d-dimensional vector into m equally sized subvectors.
- For each subvector position, run k-means over the dataset to learn k = 2^nbits centroids (typically 256, requiring 8 bits per code).
- Replace every subvector with the ID of its nearest centroid.
- Store the resulting m-byte code instead of the original vector.
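A 768-dimensional float32 vector split into m=96 subvectors with 8-bit codes shrinks from 3,072 bytes to 96 bytes, a 32× compression ratio. Distance computation no longer requires the full vector: the search builds a small lookup table of query-to-centroid sub-distances at query time, then scores each stored code by summing m table entries. This is known as Asymmetric Distance Computation (ADC), because the query stays uncompressed while the database vectors are quantized. The table is small enough to stay in cache, so scanning is limited mainly by how fast the codes stream from memory.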
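The encoding and lookup-table scoring described above can be sketched in pure Python. The codebooks here are tiny and hand-written purely for illustration (in practice they are learned by k-means, as in the steps listed earlier), and the function names are hypothetical rather than taken from any library.

```python
def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def split(vector, m):
    """Split a vector into m equally sized subvectors (m must divide its length)."""
    step = len(vector) // m
    return [vector[i * step:(i + 1) * step] for i in range(m)]

def pq_encode(vector, codebooks):
    """Replace each subvector with the id of its nearest centroid."""
    subs = split(vector, len(codebooks))
    return [min(range(len(book)), key=lambda c: sq_dist(book[c], sub))
            for book, sub in zip(codebooks, subs)]

def adc_table(query, codebooks):
    """Squared distance from each query subvector to every centroid."""
    subs = split(query, len(codebooks))
    return [[sq_dist(centroid, sub) for centroid in book]
            for book, sub in zip(codebooks, subs)]

def adc_distance(table, code):
    """Approximate squared distance: one lookup per subspace, then a sum."""
    return sum(table[j][c] for j, c in enumerate(code))

# d=4, m=2, two centroids per subspace (hand-picked, not trained).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
code = pq_encode([0.9, 1.2, 0.1, 0.8], codebooks)  # stored instead of the vector
table = adc_table([1.0, 1.0, 0.0, 0.0], codebooks)  # built once per query
approx = adc_distance(table, code)
```

The table is built once per query and reused for every stored code, so scoring a vector costs m lookups and additions regardless of the original dimensionality.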
import faiss
d = 768
m = 96 # number of subvectors (must divide d)
nbits = 8 # 256 centroids per subspace
index = faiss.IndexPQ(d, m, nbits)
index.train(xt)
index.add(xb)
D, I = index.search(xq, k=10)
Production deployments rarely use plain PQ. The pattern that has earned the most production hours across our customer base is IVFPQ — coarse partitioning by IVF, then PQ-compressed residuals inside each partition. The residual is the difference between the vector and the centroid of the cell it landed in, and quantizing residuals rather than raw vectors produces far better recall at the same compression ratio.
import faiss
d = 768
nlist = 8192
m = 96
nbits = 8
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)
index.train(xt)
index.add(xb)
index.nprobe = 32
D, I = index.search(xq, k=10)
The compression isn’t free. Recall drops compared to flat or HNSW indexes, and the recall hit grows when subspaces are correlated. That weakness is exactly what the next algorithm addresses. For teams already running large-scale analytical infrastructure, the operational disciplines we describe in the MinervaDB consulting practice apply directly — capacity planning, retraining cadence, and recall regression testing all matter more than the choice of m and nbits.
Optimized Product Quantization (OPQ): Rotating Before You Quantize
Standard PQ assumes the subspaces produced by naive splitting carry roughly equal variance and are statistically independent. Real embeddings rarely cooperate. Modern transformer embeddings concentrate variance in a handful of dimensions and exhibit strong cross-dimensional correlations. PQ on such vectors wastes codebook capacity on near-empty subspaces while starving the dimensions that actually matter.
Optimized Product Quantization, published by Ge, He, Ke, and Sun at Microsoft Research, fixes this by learning an orthogonal rotation matrix R that redistributes variance across subspaces before quantization. The optimization jointly minimizes quantization error over both R and the subspace codebooks. The result is meaningfully higher recall at identical compression. The original paper, freely available from Microsoft Research, documents the algorithm in full.
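A two-dimensional toy example shows why a rotation helps. The sketch below uses a fixed 45-degree rotation chosen by hand, which is an assumption made for illustration: OPQ learns R from the data jointly with the codebooks, and nothing here reproduces that optimization.

```python
import math

def rotate(point, angle):
    """Apply a 2-D rotation, the simplest example of an orthogonal matrix R."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)

def variance_per_dim(points):
    """Population variance of each coordinate."""
    n = len(points)
    dims = range(len(points[0]))
    means = [sum(p[i] for p in points) / n for i in dims]
    return [sum((p[i] - means[i]) ** 2 for p in points) / n for i in dims]

# Toy data: all of the variance sits in dimension 0.
points = [(-2.0, 0.0), (-1.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
rotated = [rotate(p, math.pi / 4) for p in points]

before = variance_per_dim(points)   # variance concentrated in one dimension
after = variance_per_dim(rotated)   # variance shared by both dimensions

# Orthogonal rotations preserve distances, so neighbor rankings are unchanged.
gap_before = math.dist(points[0], points[3])
gap_after = math.dist(rotated[0], rotated[3])
```

Before the rotation, a one-dimension-per-subspace PQ would spend half of its codebook on a dimension that carries no information; afterwards both subspaces carry equal variance, and pairwise distances are unchanged.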
The FAISS index factory makes OPQ trivial to enable:
import faiss
d = 768
nlist = 8192
m = 96
nbits = 8
# OPQ pre-rotation + IVF coarse quantizer + PQ fine quantizer
index = faiss.index_factory(d, f"OPQ{m},IVF{nlist},PQ{m}x{nbits}")
index.train(xt)
index.add(xb)
faiss.ParameterSpace().set_index_parameter(index, "nprobe", 32)
D, I = index.search(xq, k=10)
In our internal benchmarks on a 100-million-vector corpus of 768-dimensional sentence embeddings, replacing IVFPQ with OPQ+IVFPQ at identical compression raised recall@10 from 82% to 91%, with build time increasing roughly 35% and query latency essentially unchanged. The training cost is paid once; the recall benefit compounds across every query for the lifetime of the index. We default to OPQ whenever PQ is on the table — the operational case for plain PQ has nearly disappeared.
When OPQ Doesn’t Help
If embeddings are already approximately isotropic — for example, vectors normalized and rotated by an upstream PCA-whitening step — OPQ adds little. We always profile the variance distribution across dimensions before committing to OPQ. A quick eigenvalue analysis of the embedding covariance matrix reveals whether rotation will pay off.
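A cheap first pass on the variance profile needs nothing more than per-dimension variances grouped by PQ subspace. The sketch below is a simplified proxy, not the eigenvalue analysis itself: it exposes variance imbalance across subspaces but says nothing about cross-dimension correlation. The function name is ours, and it assumes m divides the dimensionality and the data is not constant.

```python
def subspace_variance_share(vectors, m):
    """Fraction of total variance falling in each of m contiguous subspaces."""
    n, d = len(vectors), len(vectors[0])
    step = d // m  # assumes m divides d
    means = [sum(v[i] for v in vectors) / n for i in range(d)]
    var = [sum((v[i] - means[i]) ** 2 for v in vectors) / n for i in range(d)]
    total = sum(var)
    return [sum(var[j * step:(j + 1) * step]) / total for j in range(m)]

# Toy sample: the first two dimensions carry almost all of the variance.
sample = [
    [3.0, -3.0, 0.1, 0.0],
    [-3.0, 3.0, -0.1, 0.0],
    [3.0, 3.0, 0.0, 0.1],
    [-3.0, -3.0, 0.0, -0.1],
]
shares = subspace_variance_share(sample, m=2)
```

Shares that sit close to 1/m for every subspace suggest the embeddings are already balanced; a heavily skewed profile like the one in this toy sample is the case where a learned rotation is more likely to pay off.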
ScaNN: Anisotropic Quantization for Inner-Product Search
ScaNN, released by Google Research in 2020 and detailed on the Google Research blog, takes a different angle. The team observed that for Maximum Inner Product Search (MIPS) — the similarity metric used by most modern recommendation and dense retrieval systems — the conventional quantization objective is wrong. Standard PQ and OPQ minimize the L2 distance between the original and reconstructed vector. But for inner-product search, the only error that hurts ranking is the error along the direction of the query vector.
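ScaNN introduces anisotropic vector quantization, which splits each quantization residual into a component parallel to the original datapoint and a component orthogonal to it, and weights the loss to penalize the parallel component more heavily. The reasoning is that the queries scoring highest against a datapoint point in roughly the same direction as that datapoint, so parallel error is what distorts the inner products that matter for ranking. The downstream effect is that, for the same code length, ScaNN produces noticeably tighter inner-product approximations than OPQ. Published benchmarks on the ann-benchmarks suite have shown ScaNN leading the throughput-recall Pareto frontier on several MIPS workloads.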
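The idea of weighting parallel and orthogonal error differently can be illustrated with a small loss function, where "parallel" means along the direction of the original datapoint. This is a sketch under assumptions: the weights are passed in as plain constants chosen for the example, whereas ScaNN derives them from its quantization threshold, and the function name is hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def anisotropic_loss(x, x_quantized, w_parallel, w_orthogonal):
    """Weighted quantization loss.

    The residual x - x_quantized is split into a component parallel to the
    datapoint x and a component orthogonal to it; each is weighted separately.
    """
    residual = [a - b for a, b in zip(x, x_quantized)]
    scale = dot(residual, x) / dot(x, x)
    parallel = [scale * a for a in x]
    orthogonal = [r - p for r, p in zip(residual, parallel)]
    return (w_parallel * dot(parallel, parallel)
            + w_orthogonal * dot(orthogonal, orthogonal))

x = [1.0, 0.0]
candidate_a = [0.8, 0.0]  # error lies along x
candidate_b = [1.0, 0.2]  # error lies orthogonal to x

# Both candidates have the same plain L2 error, but different weighted loss.
loss_a = anisotropic_loss(x, candidate_a, w_parallel=4.0, w_orthogonal=1.0)
loss_b = anisotropic_loss(x, candidate_b, w_parallel=4.0, w_orthogonal=1.0)
```

An L2 objective cannot tell the two candidates apart, while the weighted loss prefers `candidate_b`, which preserves the inner product with any query pointing in the same direction as x.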
The architecture has three stages: a coarse partitioning step that selects candidate partitions, anisotropic quantization that ranks vectors within those partitions, and a final exact re-scoring step over the top candidates. The combination delivers high recall and low latency simultaneously, which is rare in the ANN literature.
import scann
# Inputs:
# dataset: NxD numpy array of normalized embeddings
# queries: QxD numpy array of normalized query vectors
searcher = (
scann.scann_ops_pybind.builder(dataset, 10, "dot_product")
.tree(num_leaves=2000, num_leaves_to_search=100, training_sample_size=250000)
.score_ah(dimensions_per_block=2, anisotropic_quantization_threshold=0.2)
.reorder(reordering_num_neighbors=100)
.build()
)
neighbors, distances = searcher.search_batched(queries, final_num_neighbors=10)
The trade-off is operational. ScaNN is a high-performance library, not a turnkey database. Production deployments typically wrap it in a serving layer that handles persistence, replication, sharding, and updates — concerns the library deliberately leaves to the operator. Teams that need a managed experience often run ScaNN behind a custom service or choose a vector engine that integrates ScaNN as one of several backend options.
Choosing the Right Index: A Practitioner’s Decision Framework
The five algorithms cover overlapping ground, and the right choice is workload-specific rather than universal. We use a short decision table when scoping vector platform engagements:
| Workload Profile | Dataset Size | Latency Target | Recommended Index |
|---|---|---|---|
| Low-latency RAG / semantic search | < 50M vectors | < 20 ms p99 | HNSW (in-memory) |
| High-throughput recommendation | 50M – 500M vectors | 20–50 ms p99 | OPQ + IVFPQ |
| Billion-scale offline retrieval | > 1B vectors | 100 ms+ batch | IVFPQ on sharded nodes |
| Inner-product / MIPS workloads | 10M – 1B vectors | 10–30 ms p99 | ScaNN |
| Hybrid graph + compression | 100M – 1B vectors | 30–80 ms p99 | HNSW + PQ |
The framework is a starting point. Real engagements always include a recall-latency benchmark on a representative production sample before committing to a production rollout. We have seen the wrong default index choice destroy an otherwise sound architecture. For deeper context on how indexing choices interact with broader analytical platform design, the engineering notes published by the ChistaDATA team cover related territory on columnar and real-time analytics workloads.
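The recall half of such a benchmark reduces to comparing the approximate results against exact (brute-force) neighbors for the same queries. A minimal sketch of recall@k, assuming both result sets are lists of id lists in query order; the function name and sample ids are illustrative only:

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of true top-k neighbors returned, averaged over all queries."""
    hits = 0
    for approx, exact in zip(approx_ids, exact_ids):
        hits += len(set(approx[:k]) & set(exact[:k]))
    return hits / (k * len(exact_ids))

# Two queries, k=3: the first query misses one of its true neighbors.
approx_results = [[1, 2, 3], [4, 5, 6]]
exact_results = [[1, 2, 9], [6, 5, 4]]
recall = recall_at_k(approx_results, exact_results, k=3)
```

Order within the top k does not affect the score, so the metric measures whether the right neighbors were found, not how they were ranked.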
Hybrid Indexes: The Quiet Default
In practice, the production-grade index is almost never a single algorithm. The compositions that show up most often in MinervaDB engagements are:
- OPQ + IVF + PQ: variance-balanced compression with partitioned scanning. The workhorse for 100M–1B vector workloads.
- HNSW + Scalar Quantization (SQ8): graph search with a 4× reduction in vector storage from quantizing each float32 component to an 8-bit code. The graph links themselves are not compressed. A good compromise when memory is tight but graph latency is non-negotiable.
- IVF + HNSW coarse quantizer: replaces the flat coarse quantizer of IVF with an HNSW index over centroids, accelerating cell selection for very large nlist values.
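Of the three, scalar quantization is the simplest to show end to end. The sketch below is a generic per-dimension min/max scheme written for illustration; it is not the exact encoding used by any particular engine, and the function names are ours.

```python
def sq8_train(vectors):
    """Learn the per-dimension min and max that the 8-bit codes must cover."""
    dims = range(len(vectors[0]))
    lo = [min(v[i] for v in vectors) for i in dims]
    hi = [max(v[i] for v in vectors) for i in dims]
    return lo, hi

def sq8_encode(vector, lo, hi):
    """Map each float to an integer code in 0..255 (one byte instead of four)."""
    codes = []
    for x, a, b in zip(vector, lo, hi):
        span = (b - a) or 1.0  # guard against constant dimensions
        q = round((x - a) / span * 255)
        codes.append(max(0, min(255, q)))  # clamp values outside the range
    return bytes(codes)

def sq8_decode(codes, lo, hi):
    """Reconstruct approximate floats from the byte codes."""
    return [a + c / 255 * ((b - a) or 1.0) for c, a, b in zip(codes, lo, hi)]

training = [[0.0, -1.0], [1.0, 1.0], [0.5, 0.0]]
lo, hi = sq8_train(training)
codes = sq8_encode([0.5, 0.0], lo, hi)  # 2 bytes instead of 8
approx = sq8_decode(codes, lo, hi)
```

The reconstruction error per dimension is bounded by half a quantization step, which is why SQ8 usually costs far less recall than PQ at the price of a much smaller compression ratio.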
Operational Concerns: Memory, Persistence, and Rebuilds
Algorithm selection is half the battle. The other half is everything the academic literature ignores — how the index behaves in production over months and years.
Memory Footprint
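HNSW is memory-hungry: typically 600 MB to 1.6 GB per million 128-dimensional vectors, depending on M and on engine overhead. The 4 × d × N + M × N × 8 estimate accounts for about 640 MB per million at M=16 and about 900 MB at M=48; anything beyond that comes from upper layers, IDs, and engine-specific overhead. IVFPQ with m=64 and 8 bits per code consumes roughly 64 bytes per vector plus codebook and centroid overhead, a 30–50× reduction versus float32 storage of 768-dimensional embeddings (3,072 bytes each). OPQ adds the rotation matrix, which is negligible at d=768 (about 2.4 MB), but should be budgeted explicitly. Capacity planning that ignores these multipliers leads to OOM events at scale.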
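The compressed-side figures are simple enough to check in code. The helpers below reproduce the arithmetic for PQ code size, compression ratio against float32, and the OPQ rotation matrix; the names are ours, and the figures exclude codebooks, centroids, and vector IDs.

```python
def pq_code_bytes(m, nbits=8):
    """Bytes per PQ code: m sub-quantizers at nbits each, rounded up."""
    return (m * nbits + 7) // 8

def compression_ratio(dim, m, nbits=8):
    """Raw float32 size divided by PQ code size."""
    return (4 * dim) / pq_code_bytes(m, nbits)

def opq_rotation_bytes(dim):
    """Size of a dense dim x dim float32 rotation matrix."""
    return 4 * dim * dim

ratio_m64 = compression_ratio(768, 64)
ratio_m96 = compression_ratio(768, 96)
rotation_mb = opq_rotation_bytes(768) / 1e6
```

For 768-dimensional embeddings, m=64 and m=96 bracket most of the 30–50× range, and the rotation matrix comes to roughly 2.4 MB.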
Index Build Time
Training is the surprise cost. IVFPQ training on 10 million vectors with nlist=16384 and m=96 typically runs in 30–90 minutes on a 32-core machine, dominated by k-means. OPQ adds 25–40% on top. HNSW does not require training in the same sense, but insertion at billion-scale takes hours, and how well it parallelizes depends on the engine: some build the graph on a single thread per index by default. The open-source ScaNN library runs on CPU, so its training time scales with leaf count, dataset size, and available cores.
Updates, Deletes, and Drift
Most ANN indexes handle inserts more gracefully than deletes. HNSW supports marking nodes deleted, but the graph still traverses ghost edges, and recall degrades as the deletion fraction grows beyond 10–15%. IVF-based indexes handle deletes by removing entries from inverted lists but require periodic rebuilds to rebalance cells. We schedule full rebuilds on a fixed cadence — typically every 30 to 90 days — and trigger out-of-band rebuilds whenever the deletion fraction crosses an alarm threshold.
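A rebuild policy of this kind fits in one small function. The sketch below is a hypothetical helper, not a feature of any engine: the default thresholds (10% deleted, 90 days) are simply values picked from the ranges mentioned above and should be tuned per workload.

```python
def should_rebuild(deleted, total, days_since_rebuild,
                   max_deleted_fraction=0.10, max_age_days=90):
    """Return the reason a rebuild is due, or None if the index can stay as is."""
    if total > 0 and deleted / total >= max_deleted_fraction:
        return "deletion fraction over threshold"
    if days_since_rebuild >= max_age_days:
        return "scheduled cadence reached"
    return None

# Example: 12% of entries deleted only five days after the last rebuild.
reason = should_rebuild(deleted=12, total=100, days_since_rebuild=5)
```

Checking the deletion fraction first means an out-of-band rebuild is reported even when the scheduled one is still weeks away.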
Embedding drift is the other silent failure mode. When the upstream embedding model is retrained or replaced, the entire index must be rebuilt. We treat embedding versioning as a first-class concern in every vector platform we operate, with explicit index aliases and dual-write windows during cutover. Teams that skip this discipline end up serving stale embeddings against fresh queries and watching relevance metrics collapse without explanation.
Persistence and Recovery
Most vector indexes serialize cleanly to disk, but the persistence story varies widely across engines. We always validate restore-from-cold-storage timing as part of pre-production sign-off. A 200 GB index that takes 90 minutes to pull from S3 and load into memory is a recovery-time-objective problem that no algorithm choice will fix.
Key Takeaways
- HNSW is the right default for in-memory, low-latency workloads under 50 million vectors, and the parameters that matter most are M and efSearch.
- IVF dominates at scale by partitioning the search space; pair it with PQ or OPQ rather than running plain IVFFlat in production.
- PQ delivers 30–50× compression and cache-friendly distance computation, but standard PQ wastes capacity on correlated subspaces.
- OPQ adds a learned rotation that redistributes variance, typically delivering 5–10 percentage points of recall improvement at identical compression.
- ScaNN is the strongest choice for inner-product search workloads thanks to anisotropic quantization, though it requires more operational engineering than turnkey vector engines.
- Real-world deployments use hybrid indexes — OPQ+IVFPQ, HNSW+SQ8, and IVF with HNSW coarse quantizers are the most common production patterns.
- Operational discipline — capacity planning, rebuild cadence, embedding versioning, and recovery testing — matters more than the choice of algorithm in the long run.
How MinervaDB Can Help
At MinervaDB, we operate vector platforms for customers running everything from sub-millisecond recommendation systems to billion-scale RAG pipelines. Our database engineering team designs the index strategy, sizes the hardware, builds the benchmark harness, runs the recall regression tests, and operates the platform 24×7 once it is live. Whether the workload sits on Milvus, pgvector, or a custom FAISS-backed service, we bring the same discipline we apply to PostgreSQL, MySQL, and analytical engines — capacity engineering, observability, and operational rigor. If a vector search workload is approaching production scale or already missing latency and recall targets, schedule a consultation with our database engineering team and we will help map the right index strategy to the workload.
Frequently Asked Questions
Which vector database indexing algorithm offers the best recall?
A brute-force flat index delivers 100% recall by definition. Among approximate methods, HNSW with high M and efSearch values reaches 98–99% recall on most embedding distributions, followed closely by OPQ+IVFPQ with generous nprobe settings. ScaNN matches or exceeds OPQ recall on inner-product workloads. The right metric is always recall@k on a representative query workload, not theoretical worst-case bounds.
How much memory does HNSW require versus IVFPQ?
HNSW stores full float32 vectors plus graph edges, typically 600 MB to 1.6 GB per million 128-dimensional vectors at M=16 to M=48. IVFPQ with 64-byte codes (m=64, 8 bits per sub-quantizer) consumes roughly 64 bytes per vector plus codebook overhead. For a 100-million-vector corpus, that is the difference between 60–160 GB and roughly 6.4 GB, which often decides whether a workload fits on a single node.
When should I use OPQ instead of plain PQ?
Always use OPQ when the embeddings come from a deep learning model — transformer outputs, CLIP embeddings, sentence encoders. These embeddings exhibit strong variance imbalance and inter-dimensional correlations that OPQ corrects with a learned rotation. Plain PQ is acceptable only when embeddings are already whitened or otherwise isotropic, which is rare in production AI workloads.
Is ScaNN production-ready for enterprise workloads?
The ScaNN library is mature and battle-tested inside Google, and the algorithm has led ANN benchmarks on several inner-product workloads. ScaNN itself, however, is a search library rather than a database: it lacks built-in replication, sharding, persistence guarantees, and update primitives. Production deployments wrap ScaNN in a serving layer or adopt a vector engine that integrates ScaNN as a backend.
How often should a vector index be rebuilt?
The right rebuild cadence depends on data churn and deletion fraction. Most production deployments we operate run full rebuilds every 30 to 90 days, with out-of-band rebuilds triggered when the deletion fraction crosses 10–15% or when the upstream embedding model is retrained. Embedding model changes always require a full rebuild — incremental updates are not safe across embedding versions.
Can I combine HNSW and PQ in the same index?
Yes, and the combination is one of the most useful hybrid patterns. HNSW+PQ stores PQ codes at each graph node instead of full vectors, cutting memory consumption by an order of magnitude while preserving the graph traversal advantages of HNSW. Recall drops slightly compared to HNSW over float32 vectors, but the memory savings often make the difference between fitting the index on one node versus sharding across many.