pgvector Tuning in 2024, HNSW and IVFFlat in Production

November 13, 2024 · 8 min read · by Muhammad Amal · programming

TL;DR — HNSW with m=16, ef_construction=64 is the right starting point for most workloads. Tune ef_search per query class for the recall you need. IVFFlat is for 100M+ vectors. Quantize before you shard.

A year ago I’d hedged on pgvector for serious production work. As of pgvector 0.7.4 on Postgres 17, that hedge is gone. I’ve got HNSW indexes under three different workloads in production — semantic search, RAG context retrieval, and a deduplication pipeline — and they hold up.

The reason most pgvector posts age badly is that they treat the library as a black box with magic parameters. It’s not. HNSW and IVFFlat are well-understood algorithms with predictable trade-offs. Once you internalize the trade-offs, the config writes itself.

This post is the production tuning notes I wish I’d had a year ago. I’ll cover index choice, parameter sizing, recall measurement, memory budgeting, and the operational issues nobody warns you about.

Recall is the metric that matters

Approximate nearest neighbor (ANN) indexes trade recall for speed. A 100% recall result is what you’d get from an exhaustive scan. Anything less means you missed some true neighbors. For most production workloads, 95-98% recall is the sweet spot. Below 90%, users notice. Above 99%, you’re paying for negligible quality gains.

To measure recall, you compare ANN results to ground truth. Easy in pgvector:

-- SET LOCAL only takes effect inside a transaction
BEGIN;

-- ground truth: exact search, index scans disabled
SET LOCAL enable_indexscan = off;
SELECT id FROM docs ORDER BY embedding <=> $1 LIMIT 10;

-- ANN result: index scans back on
SET LOCAL enable_indexscan = on;
SELECT id FROM docs ORDER BY embedding <=> $1 LIMIT 10;

COMMIT;

Run both for 100 sample queries. For each query, count how many of the 10 ANN ids also appear in the ground-truth 10 and divide by 10. The average over all queries is your recall@10. If you skip this step, you have no idea what your users are getting.
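
The recall arithmetic is easy to get subtly wrong, so here is a minimal sketch of it in Python. It assumes you have already collected the id lists from the two queries above; the function names and sample ids are made up for illustration.

```python
from typing import List, Sequence, Tuple


def recall_at_k(ann_ids: Sequence[int], exact_ids: Sequence[int], k: int = 10) -> float:
    """Fraction of the exact top-k that the ANN query also returned."""
    truth = set(exact_ids[:k])
    if not truth:
        raise ValueError("exact_ids must not be empty")
    return len(truth.intersection(ann_ids[:k])) / len(truth)


def mean_recall(pairs: Sequence[Tuple[Sequence[int], Sequence[int]]], k: int = 10) -> float:
    """Average recall@k over (ann_ids, exact_ids) pairs, one pair per sample query."""
    if not pairs:
        raise ValueError("need at least one sample query")
    return sum(recall_at_k(ann, exact, k) for ann, exact in pairs) / len(pairs)


# Two hypothetical sample queries with k = 5.
samples: List[Tuple[List[int], List[int]]] = [
    ([1, 2, 3, 4, 5], [1, 2, 3, 4, 5]),  # ANN matched every true neighbor
    ([1, 2, 3, 9, 8], [1, 2, 3, 4, 5]),  # ANN missed two of five
]
print(mean_recall(samples, k=5))
```

Order inside the top-k is ignored on purpose: recall only asks whether the true neighbors showed up, not where.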

HNSW parameters that actually matter

CREATE INDEX docs_emb_hnsw
  ON docs USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

m is the maximum number of neighbors per node in the upper layers. Higher m means denser graph, higher recall, larger index, slower builds. Reasonable values are 12-32. I default to 16 and only go higher if recall measurements demand it.

ef_construction is the candidate list size during build. Higher means a better-shaped graph at the cost of build time. 64 is fine; 200 is justifiable for indexes that get built once and queried billions of times. Building HNSW on a million 1536-dim vectors with ef_construction = 64 takes about 20 minutes on a 16-core box. With 200 it’s an hour.

ef_search is the runtime knob:

SET LOCAL hnsw.ef_search = 100;  -- default 40

This is the one you actually tune. Higher means more graph traversal, higher recall, slower query. I usually run two values — one for hot paths (recall@10 = 95%, ef_search around 60) and one for cold or batch queries (recall@10 = 99%, ef_search around 200).
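
One way to pick those two values is to measure recall at a handful of ef_search settings and take the smallest one that clears each target. A small sketch, assuming you already have the measurements; the numbers in `measured` are hypothetical placeholders, not benchmark results.

```python
from typing import Dict, Optional


def smallest_ef_search(measured: Dict[int, float], target_recall: float) -> Optional[int]:
    """Return the lowest ef_search whose measured recall meets the target, or None."""
    passing = [ef for ef, recall in measured.items() if recall >= target_recall]
    return min(passing) if passing else None


# Hypothetical recall@10 measurements from a held-out query set.
measured = {40: 0.91, 60: 0.95, 100: 0.975, 200: 0.99}

hot_path = smallest_ef_search(measured, 0.95)
batch = smallest_ef_search(measured, 0.99)
print(hot_path, batch)
```

A `None` result means no measured setting reached the target, which is the signal to revisit m or ef_construction rather than keep raising ef_search.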

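Parallel HNSW builds have been in pgvector since 0.6.0, so they are available in 0.7.4 and do not depend on Postgres 17. Set max_parallel_maintenance_workers = 8 before the CREATE INDEX and build time drops 3-4x. Builds are also much faster when the graph fits in maintenance_work_mem.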

IVFFlat is for when HNSW won’t fit

If your vectors and HNSW graph fit in RAM, use HNSW. If they don’t, IVFFlat is the next step.

CREATE INDEX docs_emb_ivf
  ON docs USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 1000);

lists is the number of clusters. The rule of thumb from the pgvector README: rows / 1000 for up to 1M rows, sqrt(rows) above that. Underprovisioning lists makes queries slow because each list scan is large. Overprovisioning makes recall worse because the closest lists to a query may not contain the true neighbors.

At query time:

SET LOCAL ivfflat.probes = 10;  -- scan top 10 lists

More probes = higher recall, slower query. Same trade-off as HNSW’s ef_search. For most workloads, probes = sqrt(lists) gives 95% recall.
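
The pgvector README suggests rows / 1000 lists for tables up to 1M rows and sqrt(rows) beyond that, with sqrt(lists) as a first value for probes. The sketch below just encodes those starting points; the function names are mine, and the outputs are where to begin measuring, not final settings.

```python
import math


def starting_lists(rows: int) -> int:
    """Starting point for lists: rows / 1000 up to 1M rows, sqrt(rows) above."""
    if rows <= 0:
        raise ValueError("rows must be positive")
    if rows <= 1_000_000:
        return max(1, rows // 1000)
    return max(1, round(math.sqrt(rows)))


def starting_probes(lists: int) -> int:
    """Starting point for ivfflat.probes: sqrt(lists)."""
    if lists <= 0:
        raise ValueError("lists must be positive")
    return max(1, round(math.sqrt(lists)))


for rows in (200_000, 1_000_000, 100_000_000):
    lists = starting_lists(rows)
    print(rows, lists, starting_probes(lists))
```

Measure recall at the suggested probes value and adjust from there, the same way you would with ef_search.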

The IVFFlat killer feature is memory efficiency. A million 1536-dim vectors with IVFFlat needs ~6 GB. HNSW on the same data needs ~12 GB. At 100 million vectors, IVFFlat is the only option that fits on commodity hardware.

The killer downside: IVFFlat clusters are built from a snapshot. As you insert new vectors, they’re added to whichever list their nearest centroid lives in, but the centroids don’t move. After enough inserts, clusters become unbalanced and recall degrades. Rebuild quarterly or after major ingests.

Memory budgeting

This is what most posts skip and what kills the most pgvector deployments.

A vector is 4 * dim bytes. A 1536-dim OpenAI embedding is 6 KB. A million of them is 6 GB just for the raw vectors. The HNSW index keeps its own copy of every vector plus the neighbor links. At m=16 the links are roughly a few hundred bytes per vector, small next to the 6 KB copy, so the index adds roughly another 6 GB. Index plus data: 12 GB per million vectors.

For Postgres to keep the hot path in cache, you want roughly 1.5x that in shared_buffers plus OS cache. So a million 1536-dim vectors really wants 18-24 GB of memory dedicated to vector workload, separate from your relational data.
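
To run the same arithmetic for your own row count and dimension, a back-of-the-envelope estimator along these lines is enough. The per-vector sizes are assumptions, not exact pgvector internals: 4 * dim + 8 bytes per stored vector, a full vector copy inside the index, and about 2 * m six-byte neighbor pointers per vector. Page and tuple overhead is ignored.

```python
from typing import Dict


def vector_memory_gb(rows: int, dim: int, m: int = 16, cache_factor: float = 1.5) -> Dict[str, float]:
    """Rough memory estimate in decimal GB for float32 vectors under an HNSW index."""
    vector_bytes = 4 * dim + 8      # assumed on-disk size of one vector
    link_bytes = 2 * m * 6          # assumed neighbor-link cost per vector
    table = rows * vector_bytes
    index = rows * (vector_bytes + link_bytes)
    total = table + index
    return {
        "table_gb": table / 1e9,
        "index_gb": index / 1e9,
        "total_gb": total / 1e9,
        "cache_target_gb": cache_factor * total / 1e9,
    }


estimate = vector_memory_gb(rows=1_000_000, dim=1536)
for name, gb in estimate.items():
    print(f"{name}: {gb:.1f}")
```

For a million 1536-dim vectors this lands near 12.5 GB for table plus index and about 18.7 GB with the 1.5x headroom, in line with the figures above. Real indexes run somewhat larger because of page overhead.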

Two ways to reduce this in pgvector 0.7.4:

Half-precision vectors (halfvec): half the bytes, minor recall impact for most embedding models.

ALTER TABLE docs ADD COLUMN embedding_h halfvec(1536);
UPDATE docs SET embedding_h = embedding::halfvec(1536);
CREATE INDEX ON docs USING hnsw (embedding_h halfvec_cosine_ops);
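
To get a feel for how little half precision moves a vector, you can round-trip some values through IEEE 754 half floats with Python's struct module. This only illustrates the rounding; it is not pgvector's code path, and the sample vector is made up.

```python
import math
import struct
from typing import List, Sequence


def to_half(vec: Sequence[float]) -> List[float]:
    """Round each component to IEEE 754 half precision and back to a Python float."""
    return [struct.unpack("<e", struct.pack("<e", x))[0] for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embedding components in a typical magnitude range.
vec = [0.0123, -0.0456, 0.0789, 0.1011, -0.1213, 0.1415]
half = to_half(vec)

max_abs_error = max(abs(x - y) for x, y in zip(vec, half))
print(max_abs_error, cosine_similarity(vec, half))
```

The rounded vector points in almost exactly the same direction as the original, which is why recall usually barely moves. Verify that on your own embeddings before switching.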

Binary quantization (bit): 32x smaller, used as a coarse first stage before reranking. Combine with the <~> Hamming distance operator. Two-stage retrieval (bit search, then rerank top-K with full-precision cosine) is the trick that scales pgvector to 100M+ vectors on a single box.
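
The two-stage idea is easier to see outside the database. The sketch below is a pure-Python model of it, not pgvector code: stage one shortlists by Hamming distance over sign bits, stage two reranks the shortlist by full-precision cosine distance. The quantization rule (positive component maps to 1), the `oversample` parameter, and the toy corpus are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple


def binary_quantize(vec: Sequence[float]) -> List[int]:
    """One bit per dimension: 1 where the component is positive, else 0."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of positions where the two bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def two_stage_search(query: Sequence[float],
                     corpus: Sequence[Tuple[str, Sequence[float]]],
                     k: int, oversample: int = 4) -> List[str]:
    """Shortlist k * oversample rows by Hamming distance, then rerank by cosine."""
    q_bits = binary_quantize(query)
    shortlist = sorted(
        corpus, key=lambda item: hamming(q_bits, binary_quantize(item[1]))
    )[: k * oversample]
    reranked = sorted(shortlist, key=lambda item: cosine_distance(query, item[1]))
    return [doc_id for doc_id, _ in reranked[:k]]


corpus = [
    ("a", [1.0, 0.9, -0.2, 0.1]),
    ("b", [0.9, 1.0, -0.1, 0.3]),
    ("c", [-1.0, -0.8, 0.5, -0.2]),
    ("d", [1.0, -1.0, 1.0, -1.0]),
]
print(two_stage_search([1.0, 1.0, -0.1, 0.2], corpus, k=2, oversample=2))
```

The oversample factor is the knob that matters here: the coarse stage has to pull in enough candidates that the true neighbors survive to the rerank.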

Operational issues

The things you only learn after running this in production for a few months.

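Index builds block writes. A plain CREATE INDEX locks the table against writes for the entire build, which on a large HNSW index is minutes to hours. Use CREATE INDEX CONCURRENTLY for HNSW. It works in 0.7.4. IVFFlat also supports CONCURRENTLY but the build is slower.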

VACUUM FULL is the wrong tool. It rewrites the table and rebuilds every index under an ACCESS EXCLUSIVE lock, so reads and writes are blocked for the duration. Plain VACUUM cleans dead tuples but won’t reduce index size meaningfully. REINDEX CONCURRENTLY is the right knob.

Distance operator matters. <-> is L2 (Euclidean), <#> is negative inner product, <=> is cosine. The operator must match the index’s op_class. Mixing them silently falls back to a seq scan.

Connection pool interaction. PgBouncer 1.23 in transaction mode passes pgvector queries through fine, but SET LOCAL hnsw.ef_search only applies within an explicit transaction. Wrap your queries in BEGIN/COMMIT, or call set_config('hnsw.ef_search', '80', true) in the same transaction when you need to pass the value as a bind parameter. See connection pooling with PgBouncer for the broader pattern.
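
In application code, the transaction-scoped setting can be wrapped in a small helper. This is a sketch under assumptions: `cursor` is a DB-API style cursor (psycopg-like `%s` placeholders) on a connection in autocommit mode, so the explicit BEGIN and COMMIT delimit the transaction. The helper name is hypothetical.

```python
from typing import Any, List, Sequence


def run_with_ef_search(cursor: Any, ef_search: int, sql: str,
                       params: Sequence[Any] = ()) -> List[Any]:
    """Run one ANN query with a transaction-local hnsw.ef_search."""
    if not isinstance(ef_search, int) or ef_search < 1:
        raise ValueError("ef_search must be a positive integer")
    cursor.execute("BEGIN")
    try:
        # set_config(name, value, is_local) with is_local = true behaves like
        # SET LOCAL, and unlike SET it accepts a bind parameter.
        cursor.execute(
            "SELECT set_config('hnsw.ef_search', %s, true)", (str(ef_search),)
        )
        cursor.execute(sql, params)
        rows = cursor.fetchall()
        cursor.execute("COMMIT")
        return rows
    except Exception:
        # Leave the pooled connection clean for the next client.
        cursor.execute("ROLLBACK")
        raise
```

Because the setting is transaction-local, it disappears at COMMIT or ROLLBACK and cannot leak to whichever client PgBouncer hands the server connection to next.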

Filtering kills ANN performance. WHERE category = 'news' ORDER BY embedding <=> $1 LIMIT 10 looks reasonable. It’s awful. The HNSW index returns approximate nearest neighbors, then Postgres filters. If most of the top neighbors don’t match the filter, you get few rows or none. The fix is iterative scanning with hnsw.iterative_scan = relaxed_order (added in pgvector 0.8.0, so it needs an upgrade from 0.7.4) or a partial index per category.

SET LOCAL hnsw.iterative_scan = relaxed_order;
SET LOCAL hnsw.max_scan_tuples = 20000;
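
A toy model makes the difference concrete. The code below is a conceptual sketch, not how pgvector implements either mode: rows are assumed to be already sorted by distance to the query, a single pass looks at ef_search candidates and then filters, and the iterative version keeps pulling batches until the limit is met or a scan budget runs out. All names and data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Row = Tuple[int, str]  # (id, category), already sorted by distance to the query


def post_filter(rows: Sequence[Row], keep: Callable[[Row], bool],
                ef_search: int, limit: int) -> List[int]:
    """Single pass: take ef_search candidates, then apply the filter."""
    return [r[0] for r in rows[:ef_search] if keep(r)][:limit]


def iterative_filter(rows: Sequence[Row], keep: Callable[[Row], bool],
                     ef_search: int, limit: int, max_scan_tuples: int) -> List[int]:
    """Pull further batches until limit rows pass or the budget is spent."""
    out: List[int] = []
    scanned = 0
    budget = min(len(rows), max_scan_tuples)
    while len(out) < limit and scanned < budget:
        end = min(scanned + ef_search, budget)
        out.extend(r[0] for r in rows[scanned:end] if keep(r))
        scanned = end
    return out[:limit]


# Only every 20th row matches the filter.
rows = [(i, "news" if i % 20 == 0 else "other") for i in range(1000)]
is_news = lambda r: r[1] == "news"

print(post_filter(rows, is_news, ef_search=40, limit=10))
print(iterative_filter(rows, is_news, ef_search=40, limit=10, max_scan_tuples=20000))
```

With a 5% match rate, the single pass finds two rows out of the ten requested, while the iterative version keeps going until it has all ten. The scan budget is what stops a very selective filter from turning into a full walk of the index.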

The pgvector project’s README and the linked PostgreSQL docs are the canonical references when these knobs change.

A real production config

For context, here’s the actual config I run for a 12M vector RAG corpus, 1536-dim, p99 < 30 ms target.

-- index
CREATE INDEX CONCURRENTLY docs_emb_hnsw
  ON docs USING hnsw (embedding_h halfvec_cosine_ops)
  WITH (m = 16, ef_construction = 100);

-- per-transaction knobs in app (SET LOCAL needs an open transaction;
-- the iterative scan settings need pgvector 0.8.0+)
BEGIN;
SET LOCAL hnsw.ef_search = 80;
SET LOCAL hnsw.iterative_scan = relaxed_order;
SET LOCAL hnsw.max_scan_tuples = 20000;

-- query
SELECT id, title, content
FROM docs
WHERE tenant_id = $1 AND lang = $2
ORDER BY embedding_h <=> $3::halfvec(1536)
LIMIT 20;
COMMIT;

Memory: shared_buffers = 24GB, work_mem = 64MB. The box is a single 32-core, 96 GB RAM, NVMe. Recall@20 measures at 96-97% on a held-out evaluation set. p99 is 24 ms.

Common Pitfalls

A grab bag of footguns.

Using cosine when you should use inner product. If your model normalizes outputs (most do), inner product and cosine return identical rankings but inner product is faster. Check your embedding API docs.
Forgetting to ANALYZE after bulk load. The planner needs stats to decide between seq scan and index scan. Without them, small tables seq scan even after the index is built.
Re-embedding without re-indexing. New model = new vector space = throw away the index. There’s no incremental “rotate” for embeddings.
Mixing dimensions. A column declared vector(1536) won’t accept 768-dim vectors, but vector (unspecified) accepts anything and breaks indexing.
Brittle benchmarks. Don’t benchmark with one query repeated. ANN performance depends on query density in the index. Use a held-out set of real queries.
work_mem too low for sort. HNSW returns rows in approximate distance order, and any post-filter that requires sorting will spill. Set work_mem = 64MB minimum for vector workloads.
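
The first pitfall in the list is easy to verify for yourself. The sketch below normalizes a few made-up vectors and ranks them once by cosine distance and once by negative inner product (the quantity the <#> operator returns). The vectors and names are illustrative only.

```python
import math
from typing import Dict, List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def negative_inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return -sum(x * y for x, y in zip(a, b))


docs: Dict[str, List[float]] = {
    "a": normalize([3.0, 4.0]),
    "b": normalize([1.0, 0.0]),
    "c": normalize([0.2, 2.0]),
    "d": normalize([-1.0, 1.0]),
}
query = normalize([1.0, 1.0])

by_cosine = sorted(docs, key=lambda k: cosine_distance(query, docs[k]))
by_inner = sorted(docs, key=lambda k: negative_inner_product(query, docs[k]))
print(by_cosine, by_inner)
```

Both orderings come out identical because, for unit-length vectors, cosine distance is just one plus the negative inner product. The inner product skips the norm computation, which is where the speedup comes from.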

The most common production mistake by a margin is treating the embedding model and the index as separate concerns. They’re coupled. Change either, re-validate recall.

Wrapping Up

pgvector 0.7.4 is the version where I’d stop reaching for Pinecone, Weaviate, or Qdrant for any project under 100M vectors. The operational simplicity of having your relational data and your vector data in the same transactionally consistent store is worth a lot, and you don’t pay much for it in performance anymore.

Tune for recall before you tune for latency. Measure both with real queries, not benchmarks. Quantize before you shard, and shard before you migrate off Postgres.
