background-shape
Hybrid Retrieval with pgvector and BM25, A Practical Walkthrough
November 16, 2023 · 7 min read · by Muhammad Amal ai

TL;DR — Hybrid retrieval (dense + sparse) consistently beats pure vector search on internal knowledge bases by 15-25 points on hit-rate. / Postgres with pgvector 0.5 and built-in full-text search gets you most of the way without a separate vector database. / Reciprocal Rank Fusion is the boring, correct way to combine the two score systems.

In my first RAG post this month I claimed hybrid retrieval was the single biggest quality improvement available to most teams. A few people asked for the implementation details. This post is the walkthrough.

I’ll use Postgres 15 with pgvector 0.5 and Postgres’s built-in tsvector full-text search. The reason for the Postgres-only stack is operational: most teams already run Postgres, and bringing in a second datastore for vectors is a real ongoing cost that most internal tools cannot justify. Pinecone and Weaviate are great. They are also another thing to monitor, back up, and pay for.

Why Hybrid Beats Pure Vector

Vector search captures semantic similarity well and exact lexical matches poorly. If a user asks “how do I configure the OIDC provider for the staging cluster?” and your wiki page is titled “Setting up OIDC for staging”, pure vector retrieval is fine.

But if the user asks “what’s the runbook for ERR_PAYMENTS_4012?” — that error code is a token the embedding model has never seen in this exact form. Its semantic neighborhood is noise. Meanwhile, the wiki page that explicitly contains the string ERR_PAYMENTS_4012 is one BM25 query away.

Real internal queries are a mix of these. Hybrid retrieval handles both. Pure vector handles neither well across the full distribution.

The Schema

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id BIGSERIAL PRIMARY KEY,
    doc_id TEXT NOT NULL,
    chunk_index INT NOT NULL,
    content TEXT NOT NULL,
    metadata JSONB NOT NULL DEFAULT '{}',
    embedding vector(1536) NOT NULL,
    tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

CREATE INDEX chunks_tsv_idx ON chunks USING GIN (tsv);

CREATE INDEX chunks_metadata_idx ON chunks USING GIN (metadata);

A few notes worth dwelling on.

The HNSW index is new in pgvector 0.5. If you’re on 0.4, you only have IVFFlat, which requires training on representative data and degrades when the corpus grows. HNSW does not. Upgrade.

The tsvector column is a generated column, automatically maintained from the content. You don’t have to remember to update it. The GIN index makes BM25-style queries fast.

The JSONB metadata column carries everything else: document type, team, ACL groups, source URL, version hash. GIN index on it lets you filter by metadata fields efficiently.

The Hybrid Query

The pattern is: run both searches, take top-N from each, fuse the rankings, return top-K from the fused list.

WITH semantic_search AS (
    SELECT
        id,
        content,
        metadata,
        RANK() OVER (ORDER BY embedding <=> $1::vector) AS rank
    FROM chunks
    WHERE metadata @> $2::jsonb
    ORDER BY embedding <=> $1::vector
    LIMIT 20
),
keyword_search AS (
    SELECT
        id,
        content,
        metadata,
        RANK() OVER (ORDER BY ts_rank_cd(tsv, websearch_to_tsquery('english', $3)) DESC) AS rank
    FROM chunks
    WHERE tsv @@ websearch_to_tsquery('english', $3)
      AND metadata @> $2::jsonb
    ORDER BY ts_rank_cd(tsv, websearch_to_tsquery('english', $3)) DESC
    LIMIT 20
)
SELECT
    COALESCE(s.id, k.id) AS id,
    COALESCE(s.content, k.content) AS content,
    COALESCE(s.metadata, k.metadata) AS metadata,
    COALESCE(1.0 / (60 + s.rank), 0) + COALESCE(1.0 / (60 + k.rank), 0) AS rrf_score
FROM semantic_search s
FULL OUTER JOIN keyword_search k ON s.id = k.id
ORDER BY rrf_score DESC
LIMIT 10;

$1 is the query embedding, $2 is the metadata filter (e.g., {"team": "payments"}), $3 is the raw text query for full-text search.

websearch_to_tsquery is the friendlier query parser. It handles quoted phrases and the OR keyword in natural language input without forcing your users to learn Postgres tsquery syntax.

The constant 60 in the RRF formula is the convention from the original Reciprocal Rank Fusion paper. It dampens the contribution of items past the top few. You can tune it but I’ve never seen meaningful improvement from doing so.

Why RRF and Not Score Normalization

You might think: just normalize the cosine similarity scores and the BM25 scores into [0, 1] and add them with weights. People do this. It’s worse.

Cosine similarity scores cluster in a narrow range (most relevant chunks are 0.78-0.89 for ada-002 in my experience). BM25 scores have a long tail. Normalizing across batches is unstable because the score distributions depend on the query. Weighting becomes a hyperparameter you have to retune as the corpus grows.

RRF uses only the rank, not the score. It’s distribution-agnostic. It’s also boring, robust, and well-understood. Start here. If you have a specific reason to do something fancier later, you’ll know what it is.

Calling It From Python

LlamaIndex 0.8 doesn’t have a first-class hybrid retriever for pgvector out of the box. You can write a custom retriever:

# llama-index==0.8.68, psycopg2-binary==2.9.9
from llama_index.retrievers import BaseRetriever
from llama_index.schema import NodeWithScore, TextNode
import psycopg2
from psycopg2.extras import Json

class PGHybridRetriever(BaseRetriever):
    def __init__(self, embed_model, conn_str, top_k=10):
        self.embed_model = embed_model
        self.conn_str = conn_str
        self.top_k = top_k

    def _retrieve(self, query_bundle):
        embedding = self.embed_model.get_query_embedding(query_bundle.query_str)
        metadata_filter = query_bundle.custom_embedding_strs or {}

        with psycopg2.connect(self.conn_str) as conn:
            with conn.cursor() as cur:
                cur.execute(HYBRID_SQL, (
                    embedding,
                    Json(metadata_filter),
                    query_bundle.query_str,
                ))
                rows = cur.fetchall()

        return [
            NodeWithScore(
                node=TextNode(text=row[1], metadata=row[2]),
                score=row[3],
            )
            for row in rows[:self.top_k]
        ]

Wrap it in the standard query engine pipeline and you’re done. The benchmark numbers I shared in the LlamaIndex post came from a setup essentially identical to this one.

When Postgres Isn’t Enough

I want to be honest about where this stack hits limits.

Past about 5 million chunks, HNSW build times on a single Postgres instance become painful. You can throw hardware at it. You can partition. At some point a purpose-built vector database with horizontal scaling pays off.

If you need cross-region replication of the vector index with low latency, Postgres logical replication of HNSW indexes is rough. Most managed vector databases handle this better.

If you need approximate nearest neighbor with sub-10ms p99 at scale, dedicated systems are tuned for it more than pgvector is. For internal tools serving low-QPS traffic, pgvector is comfortably fast enough.

For a corpus of a few hundred thousand chunks, served to a few hundred internal users, Postgres is the right answer. The pgvector repo is well-maintained and the upgrade path stays open if you outgrow it.

Common Pitfalls

Forgetting to ANALYZE after bulk ingest. Postgres’s query planner needs statistics to choose between the HNSW scan and a sequential scan with filtering. Run ANALYZE chunks; after large ingests.

Cosine vs. L2 vs. inner product. OpenAI embeddings are normalized; cosine and inner product give equivalent rankings, but the operator matters for index selection. vector_cosine_ops matches <=> in queries. Mismatching them silently disables the index.

Token language for tsvector. I’m using 'english'. If your content is multilingual, you’ll need 'simple' or per-document language detection. 'simple' is also fine if your corpus is largely identifiers and code where stemming hurts more than it helps.

HNSW parameters. m=16, ef_construction=64 are reasonable defaults. Higher ef_construction improves recall at the cost of build time. Set hnsw.ef_search at query time if you want to trade recall for latency.

Metadata filter selectivity. If your filter is highly selective (e.g., team=payments returns 2% of corpus), the planner may prefer a scan over the HNSW index. Generally fine. If it’s the wrong choice, you can force ordering with planner hints or by structuring the query differently.

Wrapping Up

Hybrid retrieval is the single best lever you have on RAG quality and Postgres is a credible place to implement it. Schema is straightforward, query is fifteen lines of SQL, fusion is a one-liner. The operational story — backups, monitoring, replication — is the same as any other Postgres workload, which is to say: solved.

If you’re choosing infrastructure for an internal RAG system this month, start here. Move to a dedicated vector store when you have concrete evidence you need to.

What’s Next

Next I want to write up the evaluation side properly — building a retrieval eval set, what metrics to track, and how to wire it into CI so a model or chunking change doesn’t silently regress quality.