background-shape
Scaling pgvector to Billion Vector Workloads, A Hands On Guide
June 4, 2025 · 10 min read · by Muhammad Amal programming

TL;DR — pgvector 0.8 with HNSW, binary quantization, and partitioned tables can comfortably serve a billion vectors on a single Postgres 17 box. The trick is sizing your index correctly, using the right distance metric, and not letting maintenance_work_mem lie to you.

There’s still a persistent belief that Postgres can’t handle billion-scale vector search, that you need a dedicated vector database like Pinecone or Milvus once you cross some magic threshold. I’ve been running pgvector at hundreds of millions of vectors in production for the past year, and the limits keep moving up. With pgvector 0.8 (released early 2025) and Postgres 17, a single beefy node can serve a billion vectors with p99 latency under 50ms.

This guide is the playbook I wish I’d had when I started scaling up. It covers index choice, quantization, partitioning, and the specific gotchas that turn a smooth migration into a weekend on-call. If you’re still on pgvector 0.7 or earlier, upgrade first. The performance gains in 0.8 are not marginal.

I assume you already know what HNSW is and have a working pgvector deployment. If not, my earlier post on pgvector tuning in 2024 covers the basics.

1. Choose Your Index Like You Mean It

The first decision is HNSW versus IVFFlat. In pgvector 0.8 it’s almost always HNSW. IVFFlat still has a niche for write-heavy workloads with weak recall requirements, but for read-dominated production search, HNSW wins on every dimension that matters except build memory.

HNSW parameters that matter

CREATE INDEX ON embeddings USING hnsw (embedding vector_cosine_ops)
WITH (m = 32, ef_construction = 128);
  • m: graph connectivity. Higher means better recall but bigger index. For billion-scale I use 32. For smaller datasets, 16 is fine.
  • ef_construction: build-time exploration. Higher means slower build but better index quality. 128 is my floor at scale.

The query-time knob is hnsw.ef_search, which is per-session:

SET hnsw.ef_search = 100;  -- higher = better recall, slower
SELECT id FROM embeddings
ORDER BY embedding <=> '[0.1, 0.2, ...]'::vector
LIMIT 10;

At a billion vectors with m=32, I run ef_search = 100 for typical queries and 200 for high-recall use cases. Beyond 400 you’re paying a lot of latency for marginal recall improvements.

Index size math

A single 768-dimensional vector at float32 is 3 KB. A billion of them is 3 TB raw. The HNSW index adds roughly 4 * m bytes per vector for graph edges, so another 128 GB. That’s before you talk about WAL, the heap, and any other indexes.

+---------------------------+
|   Heap                    |  ~3 TB (raw vectors)
+---------------------------+
|   HNSW index              |  ~128 GB (graph edges)
+---------------------------+
|   TOAST overflow          |  ~100 GB
+---------------------------+
|   WAL during build        |  ~500 GB (transient)
+---------------------------+

This is why people reach for vector databases. But you don’t have to store float32. That’s the secret weapon.

2. Quantization Changes The Math

pgvector 0.8 supports three storage formats for HNSW indexes: vector (float32), halfvec (float16), and bit (binary). Each is a 2x or 32x size reduction.

Half-precision is essentially free

-- Migrate to halfvec for 2x size reduction
ALTER TABLE embeddings ADD COLUMN embedding_half halfvec(768);
UPDATE embeddings SET embedding_half = embedding::halfvec(768);
CREATE INDEX ON embeddings USING hnsw (embedding_half halfvec_cosine_ops)
WITH (m = 32, ef_construction = 128);

Recall loss is under 0.5% on every embedding model I’ve tested (OpenAI text-embedding-3, Cohere embed-v3, BGE). Latency improves because the index fits in less of shared_buffers. There’s no downside for most production workloads.

Binary quantization for the truly large

For billion-vector workloads, binary quantization is the move. You drop from 3 KB per vector to 96 bytes (768 bits). That’s a 32x reduction. Recall drops, but you can recover most of it with a reranking step.

ALTER TABLE embeddings ADD COLUMN embedding_bin bit(768);
UPDATE embeddings SET embedding_bin = binary_quantize(embedding);
CREATE INDEX ON embeddings USING hnsw (embedding_bin bit_hamming_ops)
WITH (m = 32, ef_construction = 128);

The pattern is a two-stage retrieval. First, search the binary index for the top 200 candidates. Then rerank using the full-precision vectors stored in the heap.

WITH candidates AS (
    SELECT id, embedding
    FROM embeddings
    ORDER BY embedding_bin <~> binary_quantize('[0.1, ...]'::vector)
    LIMIT 200
)
SELECT id, embedding <=> '[0.1, ...]'::vector AS distance
FROM candidates
ORDER BY distance
LIMIT 10;

Recall on this two-stage retrieval is typically 95-98% of full-precision HNSW for a fraction of the storage and memory cost.

3. Partition By Tenant Or Time

A single billion-row table works, but it’s hostile to operations. Backups take forever, vacuum is expensive, and schema changes are painful. Partition.

Multi-tenant partitioning

If your vectors belong to tenants, partition by tenant. Each tenant gets its own partition, its own index, and its own maintenance window.

CREATE TABLE embeddings (
    id bigint NOT NULL,
    tenant_id int NOT NULL,
    embedding vector(768),
    created_at timestamptz DEFAULT now()
) PARTITION BY HASH (tenant_id);

CREATE TABLE embeddings_p0 PARTITION OF embeddings
    FOR VALUES WITH (modulus 16, remainder 0);
-- ... and so on for p1 through p15

Hash partitioning by tenant gives you 16 roughly equal partitions. Each one indexes independently. A query with WHERE tenant_id = $1 will partition-prune to a single partition automatically.

Time-based partitioning

For workloads where recency matters (most of them), partition by month:

CREATE TABLE embeddings (
    id bigint NOT NULL,
    embedding vector(768),
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE embeddings_2025_06 PARTITION OF embeddings
    FOR VALUES FROM ('2025-06-01') TO ('2025-07-01');

Older partitions can move to cheaper storage or get dropped entirely. Active partitions stay hot in shared_buffers.

4. Build Indexes Without Blowing Up

Building an HNSW index on a billion vectors is the most operationally tricky part of this whole exercise. The naive approach will OOM your box.

Memory sizing

HNSW build memory is roughly (vector_size + 4*m) * row_count. For 768-dim float32 vectors at m=32 with a billion rows, that’s about 3.5 TB of memory. You don’t have 3.5 TB of memory. So you need to either use quantization (much smaller working set), partition the build, or both.

-- For build phase only
SET maintenance_work_mem = '64GB';
SET max_parallel_maintenance_workers = 8;

pgvector 0.8 supports parallel HNSW build, which is a huge win. With 8 workers and 64 GB of maintenance_work_mem, a 100 million vector index builds in roughly 90 minutes on a modern box.

Build per partition

For partitioned tables, build indexes one partition at a time. This keeps memory predictable:

DO $$
DECLARE
    partition_name text;
BEGIN
    FOR partition_name IN
        SELECT inhrelid::regclass::text
        FROM pg_inherits
        WHERE inhparent = 'embeddings'::regclass
    LOOP
        EXECUTE format(
            'CREATE INDEX CONCURRENTLY ON %s USING hnsw (embedding vector_cosine_ops) WITH (m = 32, ef_construction = 128)',
            partition_name
        );
        RAISE NOTICE 'Built index on %', partition_name;
    END LOOP;
END $$;

CREATE INDEX CONCURRENTLY doesn’t block writes during the build, which matters when you’re rebuilding indexes on a live system.

5. Query Patterns That Don’t Suck

A vector search is rarely just a vector search. You almost always have filters. The trick is making sure your filters and your index cooperate.

Pre-filter vs post-filter

The naive query is:

SELECT id, embedding <=> $1 AS distance
FROM embeddings
WHERE tenant_id = $2 AND created_at > now() - interval '30 days'
ORDER BY embedding <=> $1
LIMIT 10;

This is post-filter. pgvector walks the HNSW graph, collecting candidates, then filters them. If your filter is selective, you might exhaust the HNSW search before finding 10 matching results.

pgvector 0.8 introduced iterative_scan which dramatically helps:

SET hnsw.iterative_scan = 'relaxed_order';
SET hnsw.max_scan_tuples = 20000;

With iterative scan, pgvector keeps expanding the search until it finds enough matching rows or hits the limit. For low-selectivity filters this is fine. For high-selectivity filters (e.g., a single tenant with 0.01% of rows), you want to push the filter into the partition.

Combining vector similarity with full-text search is a common pattern. The combined query looks like:

WITH vector_hits AS (
    SELECT id, embedding <=> $1 AS dist
    FROM embeddings
    ORDER BY dist
    LIMIT 100
),
text_hits AS (
    SELECT id, ts_rank(tsv, plainto_tsquery($2)) AS rank
    FROM documents
    WHERE tsv @@ plainto_tsquery($2)
    LIMIT 100
)
SELECT COALESCE(v.id, t.id) AS id,
       (1.0 / (60 + COALESCE(v.dist, 1))) +
       (COALESCE(t.rank, 0) * 0.5) AS score
FROM vector_hits v
FULL OUTER JOIN text_hits t USING (id)
ORDER BY score DESC
LIMIT 10;

This is Reciprocal Rank Fusion. It’s dumb, it works, and it’s what most production systems use. I wrote about it in detail in hybrid retrieval with pgvector.

6. Monitoring Vector Search In Production

The standard Postgres monitoring tools work, but you want vector-specific metrics on top.

Index health

SELECT
    schemaname, indexrelname,
    pg_size_pretty(pg_relation_size(indexrelid)) AS size,
    idx_scan, idx_tup_read, idx_tup_fetch
FROM pg_stat_user_indexes
WHERE indexrelname LIKE '%hnsw%' OR indexrelname LIKE '%ivfflat%'
ORDER BY pg_relation_size(indexrelid) DESC;

If idx_scan is zero, the planner isn’t using your index. Usually because the query has a filter that confuses it or the LIMIT is too high.

Recall measurement

Recall is the percentage of true nearest neighbors your approximate search returns. You need to measure it. The standard pattern is to keep a small “ground truth” sample and compare:

WITH ground_truth AS (
    SELECT id FROM embeddings
    ORDER BY embedding <=> '[query_vec]'::vector
    LIMIT 10
),
approx AS (
    SELECT id FROM embeddings
    ORDER BY embedding <=> '[query_vec]'::vector
    LIMIT 10
)
SELECT count(*)::float / 10 AS recall_at_10
FROM ground_truth gt
WHERE gt.id IN (SELECT id FROM approx);

The ground truth query bypasses the index using a sequential scan. It’s slow. Run it on a sample of queries periodically (say 100 queries per day) and track recall over time. If recall degrades, your ef_search is too low or your index needs a rebuild.

Official docs for vector operators live at the pgvector repo, but the Postgres extension documentation covers the broader extension landscape.

Common Pitfalls

1. Building HNSW without enough memory

If maintenance_work_mem is too small, the HNSW build will spill to disk and take 10x longer. Check before you build. The build will tell you in the logs if it’s running out.

2. Using vector_l2_ops when you meant vector_cosine_ops

The L2 distance gives different results than cosine distance. If your embeddings are normalized (most are), the rankings are similar but not identical. Pick one and stick with it. Using the wrong operator class in the index means your queries silently produce wrong results.

3. Ignoring vacuum on vector tables

A vector table that gets updated will accumulate dead tuples and the HNSW index will degrade. Vacuum is even more important for vector tables than regular tables. Tune autovacuum aggressively per table.

4. Trusting recall numbers from benchmarks

Public benchmarks use specific datasets (SIFT, GIST) that don’t match your embedding distribution. Always measure recall on your data. The same m and ef_search settings can give 99% recall on one embedding model and 85% on another.

Troubleshooting

Query latency p99 spiking

Almost always either eviction from shared_buffers (index is bigger than RAM) or autovacuum running on the table. Check pg_stat_io for eviction rates and pg_stat_progress_vacuum for active vacuums.

Recall dropped after a batch insert

HNSW indexes do degrade with heavy inserts. The graph quality drops slowly over months. The fix is REINDEX CONCURRENTLY:

REINDEX INDEX CONCURRENTLY embeddings_embedding_idx;

On large tables this takes hours but doesn’t block writes.

OOM during query

If a vector search OOMs, it’s almost always work_mem being too small for a sort and the iterative scan being too greedy. Lower hnsw.max_scan_tuples and check whether your filter selectivity is what you think.

Wrapping Up

A billion vectors in Postgres 17 with pgvector 0.8 is not a stunt. It’s a production-ready architecture if you respect the constraints: quantize aggressively, partition for operations, build indexes carefully, and measure recall continuously. The “you need a vector database past 100 million vectors” advice is a year out of date.

What’s next: if you’re combining vectors with structured filters at scale, look into Citus for sharding across multiple Postgres nodes. I covered that in the Citus sharded multi-tenant Postgres walkthrough.