Re-ranking and Reciprocal Rank Fusion in RAG Pipelines

February 21, 2024 · 6 min read · by Muhammad Amal ai

TL;DR — A re-ranker over the top-50 from retrieval cleans up most of what hybrid search alone misses. Cohere Rerank 3 is the easy commercial pick; BGE-reranker-v2-m3 is the open-source one. Both add 100-300ms but are the highest-leverage precision boost in the stack.

In the hybrid search post I waved at re-ranking as the next layer. Today’s the deep dive. Re-ranking is the single highest-leverage precision improvement available to a 2024 RAG pipeline and one of the most under-deployed pieces. If you’ve optimized chunking and hybrid retrieval and you’re still hitting an accuracy ceiling, this is probably your gap.

Bi-encoders vs cross-encoders

The reason re-ranking exists is the asymmetry between bi-encoders and cross-encoders.

A bi-encoder (your embedding model) processes the query and the document independently. It produces a vector for each. Similarity is a cheap operation on the vectors. You can index millions of document vectors and search them in milliseconds. This is what your vector DB does.
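
To make the "cheap operation on the vectors" concrete, here is a minimal sketch using cosine similarity over toy three-dimensional vectors. The vectors and the brute-force scan are illustrative assumptions; a real vector DB would use an approximate nearest-neighbor index over much larger embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_n(query_vec, doc_vecs, n):
    """Return (doc_index, score) pairs for the n most similar documents."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]

# Toy vectors standing in for precomputed document embeddings.
doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]
print(top_n(query_vec, doc_vecs, n=2))
```

The document vectors are computed once at index time; only the query is embedded per request, which is why this stage stays fast.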

A cross-encoder processes the query and a document together — concatenated as one input — and produces a single relevance score. It sees the interaction between specific query terms and specific document terms. It’s a different and richer signal. It’s also expensive: O(N) inferences per query for N candidates, with no way to precompute.

The combination that works in production: use the bi-encoder to retrieve a top-N candidate set (say N=50 or N=100); use the cross-encoder to re-rank those candidates and keep top-K (say K=5) for the LLM.

You get the speed of vector search (only the top-N go to the slow model) and the precision of cross-encoding (only the top-K go to the LLM).
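
The two-stage shape can be sketched without any model at all. In the example below, `word_overlap` and `phrase_aware` are hypothetical toy scorers standing in for the bi-encoder and the cross-encoder; the point is only that the expensive scorer runs N times, not once per document in the corpus.

```python
def retrieve_then_rerank(query, docs, cheap_score, expensive_score, n=50, k=5):
    """Two-stage ranking: a cheap scorer picks n candidates, an expensive one orders them."""
    # Stage 1: stand-in for vector search over the whole corpus.
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:n]
    # Stage 2: stand-in for the cross-encoder; it only ever sees the n candidates.
    reranked = sorted(candidates, key=lambda d: expensive_score(query, d), reverse=True)
    return reranked[:k]

def word_overlap(query, doc):
    """Cheap stage-1 stand-in: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

call_log = []  # records every document the expensive scorer sees

def phrase_aware(query, doc):
    """Expensive stage-2 stand-in: overlap plus a bonus for an exact phrase match."""
    call_log.append(doc)
    bonus = 2 if query.lower() in doc.lower() else 0
    return word_overlap(query, doc) + bonus

docs = [
    "reset your password in settings",
    "password reset email not arriving",
    "how to reset password from the login page",
    "billing and invoices",
    "password policy for admins",
]
best = retrieve_then_rerank("reset password", docs, word_overlap, phrase_aware, n=3, k=1)
print(best, len(call_log))
```

The cheap scorer cannot tell the first three documents apart; the second stage can, and it did so with three calls instead of five.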

What re-ranking buys you

Numbers from real systems I’ve worked on:

Internal documentation RAG, ~200K chunks. Hybrid retrieval top-5 precision: 78%. Hybrid + Cohere Rerank 3 top-5: 91%. Latency cost: ~140ms.
Code search, ~1M code chunks. Hybrid top-10 NDCG: 0.67. With BGE-reranker-v2-m3: 0.81. Latency cost: ~280ms (CPU inference, no GPU).
Customer support corpus, ~50K chunks. Hybrid faithfulness on Ragas: 0.72. With Cohere Rerank: 0.86.

The pattern: a 10-15 point bump in precision metrics, with a 100-300ms latency tax. For a typical RAG query that already takes 1-3 seconds (retrieval + LLM), that’s a worthwhile trade.
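
For reference, the two ranking metrics quoted above can be computed roughly as follows. This is a sketch, not the exact harness behind those numbers: the doc ids and labels are made up, and the NDCG variant uses linear gains (some implementations use 2^grade - 1 instead).

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are labeled relevant."""
    top = ranked_ids[:k]
    return sum(1 for doc_id in top if doc_id in relevant_ids) / k

def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with graded gains (dict of doc_id -> grade, missing ids count as 0)."""
    def dcg(grades):
        # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
        return sum(g / math.log2(pos + 2) for pos, g in enumerate(grades))
    actual = dcg([gains.get(doc_id, 0) for doc_id in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0

# Hypothetical rankings for one query, before and after re-ranking.
before = ["d4", "d1", "d7", "d2", "d9"]
after = ["d1", "d2", "d4", "d7", "d9"]
relevant = {"d1", "d2", "d4"}
gains = {"d1": 3, "d2": 2, "d4": 1}

print(precision_at_k(before, relevant, 3), precision_at_k(after, relevant, 3))
print(ndcg_at_k(before, gains, 5), ndcg_at_k(after, gains, 5))
```

Average these over a labeled query set to get a single before/after number for your own corpus.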

Cohere Rerank 3

Cohere’s Rerank 3 (released late 2023, available through their API and AWS Bedrock) is what I default to when the project allows commercial APIs:

# Cohere Rerank 3 — February 2024
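import os

import cohere

# Read the key from the environment rather than hard-coding it.
COHERE_API_KEY = os.environ["COHERE_API_KEY"]
co = cohere.Client(api_key=COHERE_API_KEY)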

def rerank(query, candidates, top_n=5):
    """candidates: list of strings (the chunk texts)"""
    response = co.rerank(
        model="rerank-english-v3.0",  # or "rerank-multilingual-v3.0" for non-English corpora
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    return [
        {"index": r.index, "score": r.relevance_score, "text": candidates[r.index]}
        for r in response.results
    ]

What you get:

Best-in-class accuracy on most benchmarks I’ve tested
Multilingual support out of the box
Reasonable rate limits and latency from Cohere’s hosted API
Available through AWS Bedrock if you need that path for compliance

What you pay:

$2 per 1000 search units (1 query × up to 100 docs). For a system doing 10K queries/day with 50-doc rerank, that’s 10K search units, or about $20 per day. Still modest for a production system.
API dependency. Same compliance and latency calculus as any external API.
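
The pricing arithmetic is easy to get wrong, so here is a small worked sketch. It takes the quoted $2 per 1000 search units at face value and assumes a request with more than 100 documents is billed as multiple units; check the current Cohere pricing page before relying on either number.

```python
import math

def rerank_cost_per_day(queries_per_day, docs_per_query,
                        price_per_1k_units=2.00, docs_per_unit=100):
    """Estimated daily spend, assuming one search unit covers up to docs_per_unit docs."""
    units_per_query = math.ceil(docs_per_query / docs_per_unit)
    units_per_day = queries_per_day * units_per_query
    return units_per_day / 1000 * price_per_1k_units

print(rerank_cost_per_day(10_000, 50))    # 10K queries, 50 docs each
print(rerank_cost_per_day(10_000, 150))   # same traffic, 150 docs each
```

Under these assumptions, going from 50 to 100 candidates per query costs nothing extra, while crossing 100 doubles the bill.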

BGE-reranker-v2-m3

BAAI’s BGE-reranker-v2-m3 (late 2023) is the open-source counterpart. Multi-lingual, similar architecture and accuracy profile to Cohere’s offering, runnable on CPU for low traffic or GPU for production scale.

# BGE-reranker-v2-m3 — self-hosted
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

def rerank(query, candidates, top_n=5):
    pairs = [[query, c] for c in candidates]
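    scores = reranker.compute_score(pairs)
    # Some FlagEmbedding versions return a bare float for a single pair.
    if isinstance(scores, (int, float)):
        scores = [scores]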
    ranked = sorted(
        zip(range(len(candidates)), scores, candidates),
        key=lambda x: x[1],
        reverse=True,
    )
    return [
        {"index": idx, "score": s, "text": text}
        for idx, s, text in ranked[:top_n]
    ]

On an A10G GPU, this reranks 50 candidates in ~80ms. On CPU, ~400ms. For a moderate-traffic system, a small GPU pod handles the load.

I run BGE-reranker when:

Data can’t leave the network (compliance).
Costs at scale tip toward self-hosting (consistently high QPS).
I want the same model for embeddings and re-ranking — the m3 family pairs naturally with bge-m3 embeddings.

Pipeline shape

The composition that’s become standard:

query
  ↓
embed (dense) + sparse_encode
  ↓
hybrid_retrieval(top_n=50)   ← bi-encoder
  ↓
rerank(top_k=5)              ← cross-encoder
  ↓
prompt with top_k chunks
  ↓
LLM
  ↓
answer + citations

The retrieval stage casts a wide net (top-50). The re-ranker tightens it (top-5). The LLM only sees the cleanest candidates. Each stage is doing what it’s best at.

# End-to-end RAG with hybrid + rerank — February 2024
def answer(query, user):
    dense_vec = embed_model.encode(query)
    sparse_idx, sparse_val = sparse_encode(query)

    candidates = hybrid_search(
        query, dense_vec, sparse_idx, sparse_val,
        top_k=50,
        acl_filter=user.acl_filter,
    )
    reranked = rerank(query, [c.text for c in candidates], top_n=5)
    top_chunks = [candidates[r["index"]] for r in reranked]

    response = llm.complete(
        system=SYSTEM_PROMPT,
        prompt=build_prompt(query, top_chunks),
    )
    audit_log(user, query, top_chunks, response)
    return response, top_chunks

That’s the production shape. Hybrid for recall, re-ranker for precision, ACLs at retrieval, audit on top.
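
The `hybrid_search` call above is not shown in this post. One common way to merge a dense ranking and a sparse ranking into a single candidate list is Reciprocal Rank Fusion. The sketch below is an assumption about what that fusion step could look like, not the implementation behind the function above; `k=60` is the constant conventionally used for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=50):
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank) per doc."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; only ranks matter, raw retriever scores are ignored.
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]

dense_ranking = ["a", "b", "c"]    # hypothetical ids from vector search
sparse_ranking = ["c", "a", "d"]   # hypothetical ids from BM25
print(reciprocal_rank_fusion([dense_ranking, sparse_ranking]))
```

Because RRF only looks at ranks, it sidesteps the problem of dense and sparse scores living on different scales. The fused list is what would then go to the re-ranker.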

Where re-ranking goes wrong

Three failure modes I’ve seen:

Reranking too few candidates. If your bi-encoder returns top-10 and you re-rank those, the re-ranker can only reorder what was already retrieved; a relevant chunk sitting at position 30 never reaches it. The whole point is for re-ranking to fix mistakes the bi-encoder made, so give it 50 to choose from, not 10.

Reranking too many. Reranking 1000 candidates is expensive and gives diminishing returns. 50-100 hits the sweet spot for most workloads. Above that, marginal precision gains aren’t worth the latency.

Mixing rerankers and embedding models incoherently. Some embedding/reranker pairs are tuned together (Cohere embed + Cohere Rerank; bge-m3 embed + bge-reranker-v2-m3). Some aren’t. For pairs that aren’t, test the combination on your own data before trusting it.
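
For the first two failure modes, a quick way to pick the candidate count is to measure first-stage recall at several values of N, since the re-ranker cannot surface a chunk that retrieval never returned. A minimal sketch with made-up doc ids and labels:

```python
def recall_at_n(retrieved_ids, relevant_ids, n):
    """Share of the labeled-relevant docs that appear in the first n retrieved."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:n]) & set(relevant_ids))
    return hits / len(relevant_ids)

# Hypothetical first-stage ranking d1..d100 and four labeled-relevant docs.
retrieved = [f"d{i}" for i in range(1, 101)]
relevant = {"d3", "d18", "d42", "d77"}

for n in (10, 50, 100):
    print(n, recall_at_n(retrieved, relevant, n))
```

Averaged over an eval set, the N where this curve flattens is a reasonable place to stop; past it, extra candidates add latency without giving the re-ranker anything new to promote.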

When to skip re-ranking

A few cases where it’s not worth the latency:

Very small corpora (<1K chunks) where retrieval is already near-perfect.
Conversational chatbot use cases with tight latency budgets — re-ranking adds 100-300ms which compounds in multi-turn flows.
Cost-sensitive prototypes where hybrid alone hits “good enough” for the demo.

For anything production-grade with >50K chunks and accuracy-sensitive users, re-ranking pays for itself.

Common Pitfalls

Re-ranking on a tiny candidate set. Make sure you’re casting a wide net at the bi-encoder stage. Top-50 to the reranker is a reasonable default.
Picking a reranker without a multilingual story. If your corpus has multiple languages, English-only rerankers underperform badly. Use the m3 family or Cohere’s multilingual model.
Skipping the eval. A reranker that helps on Benchmark X may regress your domain. Build a small labeled eval set and measure.
Treating re-ranker scores as calibrated probabilities. They aren’t. A score of 0.7 vs 0.6 is meaningful as relative ranking; the absolute value isn’t.
Batching incorrectly. API-based rerankers batch by query, not by candidate. Sending 50 candidates for one query is one API call, not 50.

Wrapping Up

Re-ranking is the cheapest precision boost in the modern RAG stack. The two solid 2024 options — Cohere Rerank 3 commercial, BGE-reranker-v2-m3 open-source — both add 10-15 points to top-K precision for 100-300ms of latency. There’s not much reason not to do it.

Last post of the month is the one that lets you measure all of this — evaluation frameworks that turn “vibes” into numbers. The Cohere Rerank docs cover their API in detail if you want the canonical reference.
