Hybrid Search, BM25 Plus Vectors for Better RAG Recall

February 14, 2024 · 6 min read · by Muhammad Amal ai

TL;DR — Pure vector search loses to BM25 on exact-match queries and acronyms. Pure BM25 loses on paraphrases. Combining both via Reciprocal Rank Fusion catches what either misses. Cheap to add, hard to skip in 2024.

When RAG broke for me the first time in production, the failing queries shared a pattern: they contained specific identifiers — error codes, product SKUs, version numbers — that the embedding model treated as ordinary tokens. The vector search ranked semantically similar but not exact matches above the actual answers. Sound familiar?

This is the BM25-vs-vector gap. In the framing post I flagged it. Today it gets the full treatment.

What each side is good at

Vector search excels at:

Paraphrases. “How do I reset my password” and “I forgot my login credentials” map to similar vectors.
Conceptual queries. “Performance issues during peak hours” matches chunks about throughput, latency, scaling — even without those exact words.
Cross-language retrieval (with multilingual embedding models).

BM25 (or any keyword index) excels at:

Exact matches. “Error E_AUTH_002” hits chunks containing exactly that string.
Acronyms, product names, codes — terms the embedding model wasn’t well-trained on.
Rare terms. The IDF component of BM25 boosts terms that are uncommon in the corpus, which is exactly what you want for distinctive queries.
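
To see the IDF effect on an identifier query, here is a minimal sketch of the classic Okapi BM25 formula in plain Python. The three documents, the whitespace tokenizer, and the k1/b values are illustrative assumptions, not a production setup; a real index would use a proper analyzer.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each doc against the query with the Okapi BM25 formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many docs does each term appear?
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a larger IDF, so they dominate the score.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up mini corpus: only the first doc contains the exact error code.
docs = [
    "login fails with error E_AUTH_002 after token refresh",
    "login error when the password is wrong",
    "reset your password from the login page",
]
scores = bm25_scores("error E_AUTH_002", docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The document containing the rare token E_AUTH_002 wins clearly over the one that only shares the common word "error", and the third document scores zero because it shares no query term.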

The upshot: most production RAG systems see a mix of query types. Some users ask natural-language questions; some search for specific identifiers. Trying to serve both with one retrieval method leaves recall on the table.

What RRF does

Reciprocal Rank Fusion is the standard merging strategy for multiple ranked lists. The formula is simple:

score(d) = sum over rankers of 1 / (k + rank_i(d))

Where k is a constant (usually 60) and rank_i(d) is document d’s rank in ranker i’s output. Documents not in a ranker’s top-K contribute zero from that ranker.

Why it works: it uses only ranks, so it sidesteps raw scores that aren’t comparable across rankers (BM25 scores are unbounded and corpus-dependent, often 0 to 30 or more; cosine similarity is bounded in [-1, 1]). It rewards documents that appear high in multiple rankers. It doesn’t require tuning weights per ranker, which is the headache that kills naive score-merging.

# RRF in vanilla Python
def rrf(rankings, k=60):
    """rankings: list of lists of doc IDs, ordered by rank"""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

That’s the whole algorithm. The trick is generating the input rankings well.
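
To make the arithmetic concrete, here is a small worked run of the same function. The doc IDs and the two rankings are made up for illustration: "a" stands for a paraphrase hit only the dense ranker finds, "c" for an exact-match hit only BM25 finds, and "b" is ranked second by both.

```python
def rrf(rankings, k=60):
    """Same fusion as above: sum 1 / (k + rank) over every list a doc appears in."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Hypothetical rankings from the two retrievers, best match first.
dense_ranking = ["a", "b", "d"]
bm25_ranking = ["c", "b", "e"]

fused = rrf([dense_ranking, bm25_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

With k=60, "b" scores 1/62 + 1/62, which beats the 1/61 that "a" and "c" each get for a single first place. Agreement between rankers outweighs one ranker's top pick, and that is the behavior you want from the fusion step.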

Implementation with Qdrant native hybrid

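Qdrant has supported sparse vectors since v1.7, and its Query API (v1.10 and later) fuses dense and sparse results server-side with RRF. The cleanest implementation in 2024 if you’re self-hosting:

# Qdrant hybrid search with server-side RRF (Query API: needs Qdrant 1.10+ and a matching qdrant-client)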
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config={
        "dense": models.VectorParams(size=1536, distance=models.Distance.COSINE),
    },
    sparse_vectors_config={"sparse": models.SparseVectorParams()},
)

def index_chunk(chunk_id, text, dense_vec, sparse_indices, sparse_values, metadata):
    client.upsert(
        collection_name="docs",
        points=[models.PointStruct(
            id=chunk_id,
            vector={
                "dense": dense_vec,
                "sparse": models.SparseVector(indices=sparse_indices, values=sparse_values),
            },
            payload=metadata,
        )],
    )

def hybrid_search(query, dense_vec, sparse_indices, sparse_values, top_k=10):
    return client.query_points(
        collection_name="docs",
        prefetch=[
            models.Prefetch(query=dense_vec, using="dense", limit=50),
            models.Prefetch(
                query=models.SparseVector(indices=sparse_indices, values=sparse_values),
                using="sparse",
                limit=50,
            ),
        ],
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=top_k,
    )

The sparse vector needs a generator. SPLADE is the obvious choice; bge-m3 (from the embedding post) can also output sparse representations from the same forward pass.
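
If you just want to see the shape of the indices/values pair that index_chunk and hybrid_search expect, here is a toy stand-in. To be clear, this is not SPLADE or bge-m3: it is a hashed term-count encoder I made up for illustration, and the function name and vocab_size are assumptions. It is only useful for wiring up the pipeline before plugging in a real sparse model.

```python
import zlib
from collections import Counter

def toy_sparse_vector(text, vocab_size=2**20):
    """Stand-in sparse encoder: hashed term counts, NOT SPLADE or bge-m3.

    Returns (indices, values) as two parallel lists with unique, sorted indices.
    """
    counts = Counter(text.lower().split())
    weights = {}
    for term, count in counts.items():
        # crc32 is stable across processes, unlike Python's built-in hash().
        idx = zlib.crc32(term.encode("utf-8")) % vocab_size
        # Accumulate so that a hash collision never produces a duplicate index.
        weights[idx] = weights.get(idx, 0.0) + float(count)
    indices = sorted(weights)
    values = [weights[i] for i in indices]
    return indices, values

indices, values = toy_sparse_vector("error E_AUTH_002 on login error")
print(indices, values)
```

The same encoder has to run at index time and at query time, otherwise the indices won't line up. That holds for a real sparse model too.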

Implementation with Elasticsearch or OpenSearch

If you’re already running Elasticsearch 8.14+ or OpenSearch 2.11+, hybrid is built in. The request below uses Elasticsearch’s retriever syntax.

POST /docs/_search
{
  "retriever": {
    "rrf": {
      "retrievers": [
        {
          "standard": {
            "query": { "match": { "content": "tier-3 SLA" } }
          }
        },
        {
          "knn": {
            "field": "embedding",
            "query_vector": [/* 1536 floats */],
            "k": 50,
            "num_candidates": 100
          }
        }
      ],
      "rank_window_size": 50,
      "rank_constant": 60
    }
  },
  "size": 10
}

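The retriever API, which includes native RRF, arrived in Elasticsearch 8.14 as a technical preview, so the request above needs 8.14 or later; earlier 8.x releases exposed RRF through the rank parameter of the search API instead. OpenSearch’s neural search plugin covers the same ground with a hybrid query run through a search pipeline, so the syntax differs.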

This is the path of least resistance if you already index documents in Elastic for full-text search; you just add a dense_vector field and you’re done.
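
In application code you will usually build that request per query rather than hand-write JSON. A minimal sketch, assuming the same field names as the request above ("content" and "embedding"); the helper name and its defaults are mine, and sending the body is left to whichever client you already use.

```python
import json

def rrf_retriever_body(query_text, query_vector, size=10, window=50, rank_constant=60):
    """Build the same retriever request body as the JSON above for one query."""
    return {
        "retriever": {
            "rrf": {
                "retrievers": [
                    # Keyword side: a plain match query on the text field.
                    {"standard": {"query": {"match": {"content": query_text}}}},
                    # Vector side: kNN over the dense_vector field.
                    {"knn": {
                        "field": "embedding",
                        "query_vector": list(query_vector),
                        "k": window,
                        "num_candidates": 2 * window,
                    }},
                ],
                "rank_window_size": window,
                "rank_constant": rank_constant,
            }
        },
        "size": size,
    }

# Placeholder vector; in practice this comes from your embedding model.
body = rrf_retriever_body("tier-3 SLA", [0.0] * 1536)
print(json.dumps(body)[:120])
```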

Implementation with pgvector

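pgvector doesn’t ship hybrid out of the box, but Postgres has tsvector full-text search, which can stand in for BM25. It is a rougher signal: ts_rank scores on term frequency and has no corpus-wide IDF component. Roll your own RRF in SQL: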

WITH dense AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY embedding <=> $1::vector) AS rnk
  FROM doc_chunks
  ORDER BY embedding <=> $1::vector
  LIMIT 50
),
sparse AS (
  SELECT id, ROW_NUMBER() OVER (ORDER BY ts_rank(content_tsv, query) DESC) AS rnk
  FROM doc_chunks, plainto_tsquery('english', $2) AS query
  WHERE content_tsv @@ query
  ORDER BY ts_rank(content_tsv, query) DESC
  LIMIT 50
)
-- Each side contributes 1 / (60 + rank); a chunk missing from one list contributes 0 there.
SELECT doc.id, doc.content,
       COALESCE(1.0 / (60 + dense.rnk), 0.0)
     + COALESCE(1.0 / (60 + sparse.rnk), 0.0) AS rrf_score
FROM dense
FULL OUTER JOIN sparse ON dense.id = sparse.id
JOIN doc_chunks doc ON doc.id = COALESCE(dense.id, sparse.id)
ORDER BY rrf_score DESC
LIMIT 10;

Less elegant but it works, and you don’t add a new dependency. The content_tsv is a generated column of to_tsvector('english', content). Index it with a GIN index.

Numbers from the field

A few real-world results from systems I’ve shipped:

Engineering wiki RAG, ~500K chunks. Pure vector top-5 precision: 71%. BM25 top-5: 58%. Hybrid + RRF top-5: 81%. Hybrid beat both because each retriever caught queries the other missed.
Customer support corpus, ~120K chunks. Vector-only: 67% answer-relevance on a Ragas eval. Hybrid: 78%. Most of the lift came from queries containing product names.
Legal document corpus, ~50K chunks of contracts. Vector-only struggled with section references (“Section 4.2(b)”). Hybrid caught these reliably.

You don’t need to take these numbers as gospel: your corpus will be different. Run the comparison on your eval set. The pattern usually holds: hybrid wins by 5-15 percentage points on top-K precision/recall.
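
Running that comparison doesn’t take much code. Here is a minimal sketch that scores three retrievers on the same labeled queries. Everything in it is hypothetical: the query and doc IDs, the retriever outputs, and the choice of hit rate at k as the metric (swap in precision or recall at k if that is what you track). The toy data is there to show the mechanics, not to report results.

```python
def hit_rate_at_k(results, relevant, k=5):
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = 0
    for query_id, ranked_ids in results.items():
        if set(ranked_ids[:k]) & relevant[query_id]:
            hits += 1
    return hits / len(results)

# Hypothetical eval set: query ID -> IDs of chunks a human marked relevant.
relevant = {"q1": {"d1"}, "q2": {"d7"}, "q3": {"d4"}}

# Hypothetical retriever outputs (ranked doc IDs per query).
runs = {
    "vector": {"q1": ["d1", "d2"], "q2": ["d3", "d5"], "q3": ["d4", "d9"]},
    "bm25":   {"q1": ["d8", "d2"], "q2": ["d7", "d5"], "q3": ["d6", "d9"]},
    "hybrid": {"q1": ["d1", "d8"], "q2": ["d7", "d3"], "q3": ["d4", "d6"]},
}

report = {name: hit_rate_at_k(results, relevant, k=2) for name, results in runs.items()}
for name, rate in report.items():
    print(f"{name}: {rate:.2f}")
```

Keep the eval set and k fixed across runs so the only thing that changes is the retriever.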

When pure vector is enough

A few cases where the BM25 layer doesn’t earn its keep:

Your corpus is uniformly natural-language prose with no special tokens (product names, codes, IDs).
Queries are exclusively conversational, never identifier-based.
You’re cost- or complexity-constrained and a recall gain of around 5 points isn’t worth the operational tax of a second index.

These are minority cases. Most production systems benefit from hybrid.

Common Pitfalls

Tuning weights instead of using RRF. Weighted score-merging requires tuning per corpus. RRF has a single constant, k, that rarely needs tuning, and it works almost as well. Start with RRF; only complicate if you have a specific reason.
Picking k carelessly. Higher k flattens RRF (less reward for top ranks). Start at 60, which is the empirical default from the original paper. Don’t tune unless your eval shows it helps.
Running BM25 over poorly-preprocessed text. Tokenization matters. Stemming and stop-word handling affect recall. Use the language analyzer that matches your corpus.
Skipping the sparse vector for hybrid in Qdrant. Hybrid in Qdrant uses sparse vectors, not a separate full-text index. Generate them with SPLADE or bge-m3.
Re-ranking on top of hybrid without re-evaluating. Adding a re-ranker after hybrid is great (next week’s post). Don’t assume the hybrid settings (k, candidate limits per retriever) stay optimal once you add re-ranking; re-evaluate end-to-end.
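
On the k pitfall above, the flattening is easy to see with a few lines of arithmetic. This sketch compares what one ranker contributes for a document at rank 1 versus rank 10 under different k values; the three k values are just sample points I picked.

```python
def rrf_contribution(rank, k):
    """Score a single ranker contributes for a document at the given rank."""
    return 1.0 / (k + rank)

ratios = {}
for k in (1, 60, 1000):
    top = rrf_contribution(1, k)
    tenth = rrf_contribution(10, k)
    # A ratio near 1.0 means rank 1 and rank 10 are rewarded almost equally.
    ratios[k] = top / tenth
    print(f"k={k}: rank1/rank10 = {ratios[k]:.2f}")
# prints 5.50 for k=1, 1.15 for k=60, 1.01 for k=1000
```

At k=1 a first place is worth 5.5 times a tenth place; at k=60 the gap shrinks to about 15%; at k=1000 position within a list barely matters and fusion mostly counts how many lists a document appears in.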

Wrapping Up

Hybrid search is the cheapest 5-15 point recall improvement in 2024 RAG. The implementations are widely available: Qdrant native, Elasticsearch’s retriever API, OpenSearch neural plugin, or a hand-rolled RRF in Postgres. For most production systems there’s little reason to skip it.

Next post covers a thornier problem: how to do all of this with per-user access control without re-indexing per tenant. The Elasticsearch retriever docs have more on their flavor of RRF if you want the canonical reference.

Related posts

Choosing a Vector Database, Pinecone vs Qdrant vs pgvector

Evaluating RAG, Beyond Vibes-Based Testing

Re-ranking and Reciprocal Rank Fusion in RAG Pipelines

Securing RAG, Per-User Document Access Without Re-indexing

Chunking Strategies for RAG That Survive Real Documents

Embedding Models in 2024, OpenAI vs Cohere vs Open Source

Why Naive RAG Fails in Production, A 2024 Reality Check

Putting a RAG Evaluation Pipeline in CI, The Setup I Actually Use
