Re-ranking and Reciprocal Rank Fusion in RAG Pipelines
February 21, 2024 · 6 min read · by Muhammad Amal ai

TL;DR — A re-ranker over the top-50 from retrieval cleans up most of what hybrid search alone misses. Cohere Rerank 3 is the easy commercial pick; BGE-reranker-v2-m3 is the open-source one. Both add 100-300ms but are the highest-leverage precision boost in the stack.

In the hybrid search post I waved at re-ranking as the next layer. Today’s the deep dive. Re-ranking is the single highest-leverage precision improvement available to a 2024 RAG pipeline and one of the most under-deployed pieces. If you’ve optimized chunking and hybrid retrieval and you’re still hitting an accuracy ceiling, this is probably your gap.

Bi-encoders vs cross-encoders

The reason re-ranking exists is the asymmetry between bi-encoders and cross-encoders.

A bi-encoder (your embedding model) processes the query and the document independently. It produces a vector for each. Similarity is a cheap operation on the vectors. You can index millions of document vectors and search them in milliseconds. This is what your vector DB does.

A cross-encoder processes the query and a document together — concatenated as one input — and produces a single relevance score. It sees the interaction between specific query terms and specific document terms. It’s a different and richer signal. It’s also expensive: O(N) inferences per query for N candidates, with no way to precompute.

The combination that works in production: use the bi-encoder to retrieve a top-N candidate set (say N=50 or N=100); use the cross-encoder to re-rank those candidates and keep top-K (say K=5) for the LLM.

You get the speed of vector search (only the top-N go to the slow model) and the precision of cross-encoding (only the top-K go to the LLM).
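The two-stage shape can be sketched without any model at all. The snippet below is a toy illustration, not a real retriever: `word_overlap` stands in for the bi-encoder (cheap, scores query and document from independent features) and `phrase_aware` stands in for the cross-encoder (looks at query and document together). Both scorers and the function name `retrieve_then_rerank` are hypothetical.

```python
from typing import Callable, List, Tuple


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for a bi-encoder: fraction of query words found in the doc."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)


def phrase_aware(query: str, doc: str) -> float:
    """Toy stand-in for a cross-encoder: also rewards the exact query phrase."""
    bonus = 1.0 if query.lower() in doc.lower() else 0.0
    return word_overlap(query, doc) + bonus


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    expensive_score: Callable[[str, str], float],
    n: int = 50,
    k: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1: the cheap scorer sees the whole corpus and keeps the top n.
    candidates = sorted(corpus, key=lambda d: cheap_score(query, d), reverse=True)[:n]
    # Stage 2: the expensive scorer only ever sees those n candidates.
    rescored = [(d, expensive_score(query, d)) for d in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:k]


corpus = [
    "password reset is disabled for shared accounts",
    "to reset password open the login page",
    "billing questions go to finance",
]
top = retrieve_then_rerank("reset password", corpus, word_overlap, phrase_aware, n=2, k=1)
```

The first two documents tie under the cheap scorer, and only the second stage separates them. That is the gap a real cross-encoder closes on real text.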

What re-ranking buys you

Numbers from real systems I’ve worked on:

  • Internal documentation RAG, ~200K chunks. Hybrid retrieval top-5 precision: 78%. Hybrid + Cohere Rerank 3 top-5: 91%. Latency cost: ~140ms.
  • Code search, ~1M code chunks. Hybrid top-10 NDCG: 0.67. With BGE-reranker-v2-m3: 0.81. Latency cost: ~280ms (CPU inference, no GPU).
  • Customer support corpus, ~50K chunks. Hybrid faithfulness on Ragas: 0.72. With Cohere Rerank: 0.86.

The pattern: a 10-15 point bump in precision metrics, with a 100-300ms latency tax. For a typical RAG query that already takes 1-3 seconds (retrieval + LLM), that’s a worthwhile trade.
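For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two ranking metrics quoted above. It assumes the common textbook definitions (precision@k as the share of the top k that is relevant, NDCG@k with a log2 discount). The post does not state the exact definitions behind its numbers, so treat these as one reasonable reading.

```python
import math
from typing import Dict, List, Set


def precision_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-k results that are labeled relevant."""
    return sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant) / k


def ndcg_at_k(ranked_ids: List[str], gains: Dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(pos + 2)
        for pos, doc_id in enumerate(ranked_ids[:k])
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical labels for one query.
p5 = precision_at_k(["a", "b", "c", "d", "e"], {"a", "c"}, k=5)
perfect = ndcg_at_k(["a", "b"], {"a": 2.0, "b": 1.0}, k=2)
swapped = ndcg_at_k(["b", "a"], {"a": 2.0, "b": 1.0}, k=2)
```

Averaging either metric over a labeled query set, once with and once without the re-ranker, gives a before/after comparison of the kind reported here.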

Cohere Rerank 3

Cohere’s Rerank 3 (available through their API and AWS Bedrock) is what I default to when the project allows commercial APIs:

# Cohere Rerank 3 — February 2024
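import os

import cohere

# Read the key from the environment rather than hard-coding it.
co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

def rerank(query, candidates, top_n=5):
    """candidates: list of strings (the chunk texts)"""
    # One API call scores every candidate against the query.
    response = co.rerank(
        model="rerank-multilingual-v3.0",  # English-only variant: "rerank-english-v3.0"
        query=query,
        documents=candidates,
        top_n=top_n,
    )
    # r.index points back into `candidates`, so callers can recover the original chunk.
    return [
        {"index": r.index, "score": r.relevance_score, "text": candidates[r.index]}
        for r in response.results
    ]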

What you get:

  • Best-in-class accuracy on most benchmarks I’ve tested
  • Multilingual support out of the box
  • Reasonable rate limits and latency from Cohere’s hosted API
  • Available through AWS Bedrock if you need that path for compliance

What you pay:

  • $2 per 1000 search units (1 query × up to 100 docs). For a system doing 10K queries/day with 50-doc rerank, that’s 10K search units, or $20 per day (about $600 per month). Still cheap for the precision gain.
  • API dependency. Same compliance and latency calculus as any external API.
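The pricing arithmetic is easy to get wrong by a factor of ten, so here it is as a small calculation. This is a sketch under the pricing stated above ($2 per 1000 search units, one unit per query of up to 100 documents). The function name and the rounding rule for queries above 100 documents are assumptions, and it ignores any provider-side rules about splitting long documents.

```python
import math


def daily_rerank_cost(
    queries_per_day: int,
    docs_per_query: int,
    usd_per_1k_units: float = 2.0,
    docs_per_unit: int = 100,
) -> float:
    """Estimated daily spend in USD for an API re-ranker billed per search unit."""
    # Assumption: a query with more than `docs_per_unit` docs is billed in whole units.
    units_per_query = math.ceil(docs_per_query / docs_per_unit)
    return queries_per_day * units_per_query * usd_per_1k_units / 1000


cost_50_docs = daily_rerank_cost(10_000, 50)
cost_150_docs = daily_rerank_cost(10_000, 150)
```

Under these assumptions, 10K queries per day at 50 documents each comes to 10K search units, and going past 100 documents per query doubles the bill.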

BGE-reranker-v2-m3

BAAI’s BGE-reranker-v2-m3 is the open-source counterpart. It is multilingual, has a similar accuracy profile to Cohere’s offering, and runs on CPU for low traffic or GPU for production scale.

# BGE-reranker-v2-m3 — self-hosted
from FlagEmbedding import FlagReranker

reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

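def rerank(query, candidates, top_n=5):
    if not candidates:
        return []
    # The cross-encoder scores each (query, chunk) pair jointly.
    pairs = [[query, c] for c in candidates]
    scores = reranker.compute_score(pairs)
    # Some FlagEmbedding versions return a bare float for a single pair.
    if isinstance(scores, (int, float)):
        scores = [scores]
    # Sort by score, highest first, keeping each chunk's original index.
    ranked = sorted(
        zip(range(len(candidates)), scores, candidates),
        key=lambda x: x[1],
        reverse=True,
    )
    return [
        {"index": idx, "score": s, "text": text}
        for idx, s, text in ranked[:top_n]
    ]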

On an A10G GPU, this reranks 50 candidates in ~80ms. On CPU, ~400ms. For a moderate-traffic system, a small GPU pod handles the load.
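To turn those latencies into a rough capacity estimate, a back-of-the-envelope sketch follows. It assumes each worker handles one re-rank request at a time with no batching across queries, and that you only want to run workers at a fraction of their theoretical maximum. The `headroom` value and both function names are hypothetical; real capacity depends on batching, chunk length, and hardware.

```python
import math


def max_sequential_qps(latency_ms: float) -> float:
    """Queries per second one worker can serve if requests run back to back."""
    return 1000.0 / latency_ms


def workers_needed(target_qps: float, latency_ms: float, headroom: float = 0.5) -> int:
    """Workers required if each one is only loaded to `headroom` of its maximum."""
    return math.ceil(target_qps / (max_sequential_qps(latency_ms) * headroom))


gpu_qps = max_sequential_qps(80)    # ~80 ms per 50 candidates on GPU
cpu_qps = max_sequential_qps(400)   # ~400 ms per 50 candidates on CPU
gpu_workers = workers_needed(10, 80)
cpu_workers = workers_needed(10, 400)
```

At a target of 10 queries per second, this estimate gives 2 GPU workers against 8 CPU workers, which is the kind of gap that makes a small GPU pod the simpler option.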

I run BGE-reranker when:

  • Data can’t leave the network (compliance).
  • Costs at scale tip toward self-hosting (consistently high QPS).
  • I want embeddings and re-ranking from the same model family: the m3 rerankers pair naturally with bge-m3 embeddings.

Pipeline shape

The composition that’s become standard:

query
  ↓ embed (dense) + sparse_encode
  ↓ hybrid_retrieval(top_n=50)   ← bi-encoder
  ↓ rerank(top_k=5)              ← cross-encoder
  ↓ prompt with top_k chunks
  ↓ LLM
  ↓ answer + citations

The retrieval stage casts a wide net (top-50). The re-ranker tightens it (top-5). The LLM only sees the cleanest candidates. Each stage is doing what it’s best at.

# End-to-end RAG with hybrid + rerank — February 2024
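def answer(query, user):
    # Encode the query once for each retrieval signal.
    dense_vec = embed_model.encode(query)
    sparse_idx, sparse_val = sparse_encode(query)

    # Stage 1: wide net. ACLs are applied here, so the re-ranker and the LLM
    # never see chunks the user is not allowed to read.
    candidates = hybrid_search(
        query, dense_vec, sparse_idx, sparse_val,
        top_k=50,
        acl_filter=user.acl_filter,
    )
    # Stage 2: the cross-encoder keeps the 5 best of the 50.
    reranked = rerank(query, [c.text for c in candidates], top_n=5)
    # Map re-ranker indices back to the full candidate objects.
    top_chunks = [candidates[r["index"]] for r in reranked]

    response = llm.complete(
        system=SYSTEM_PROMPT,
        prompt=build_prompt(query, top_chunks),
    )
    audit_log(user, query, top_chunks, response)
    return response, top_chunks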

That’s the production shape. Hybrid for recall, re-ranker for precision, ACLs at retrieval, audit on top.
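The `hybrid_search` call above is left undefined in this post. One common way to merge a dense ranking and a sparse ranking into a single candidate list is reciprocal rank fusion (RRF), the technique named in the title. The sketch below is an assumption about how that step could look, not a description of what `hybrid_search` actually does. The constant `k=60` is the conventional default, and the function name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 50
) -> List[str]:
    """Fuse several ranked lists of doc ids: score(d) = sum of 1 / (k + rank of d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; documents missing from a list simply add nothing.
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:top_n]


dense_hits = ["a", "b", "c"]    # hypothetical ids from the vector index
sparse_hits = ["c", "a", "d"]   # hypothetical ids from the sparse index
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

RRF only uses ranks, so it sidesteps the problem of dense and sparse scores living on different scales. The fused list is what would then go to the re-ranker.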

Where re-ranking goes wrong

Three failure modes I’ve seen:

Reranking too few candidates. If your bi-encoder returns top-10 and you re-rank those, the re-ranker can only reorder what the bi-encoder already surfaced; a relevant chunk sitting at rank 30 never reaches it. The whole point is for re-ranking to fix mistakes the bi-encoder made, so give it 50 to choose from, not 10.

Reranking too many. Reranking 1000 candidates is expensive and gives diminishing returns. 50-100 hits the sweet spot for most workloads. Above that, marginal precision gains aren’t worth the latency.
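One way to pick the candidate count for your own corpus is to measure first-stage recall at several cutoffs and see where it flattens. Recall@N of the retrieval stage is a ceiling on what the re-ranker can deliver, since it cannot promote a chunk it never sees. The sketch below assumes a small labeled eval set of (ranked ids, relevant ids) pairs; the data and function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def recall_at_n(ranked_ids: List[str], relevant: Set[str], n: int) -> float:
    """Share of the relevant docs that appear in the top n retrieved."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:n]) & relevant) / len(relevant)


def recall_curve(
    runs: Sequence[Tuple[List[str], Set[str]]],
    cutoffs: Sequence[int] = (10, 25, 50, 100),
) -> Dict[int, float]:
    """Mean recall@N across eval queries for each candidate-set size."""
    return {
        n: sum(recall_at_n(ranked, rel, n) for ranked, rel in runs) / len(runs)
        for n in cutoffs
    }


# Two hypothetical eval queries over a 100-document first-stage ranking.
ranking = [f"d{i}" for i in range(100)]
runs = [(ranking, {"d3", "d30"}), (ranking, {"d0"})]
curve = recall_curve(runs)
```

In this toy data, recall goes from 0.75 at N=10 to 1.0 at N=50 and gains nothing at N=100, which is the plateau to look for before paying for a larger re-rank.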

Mixing rerankers and embedding models incoherently. Some embedding/reranker pairs are tuned together (Cohere embed + Cohere Rerank; bge-m3 embed + bge-reranker-v2-m3). Some aren’t. Whichever pair you pick, test it end to end on your own data rather than assuming the combination works.

When to skip re-ranking

A few cases where it’s not worth the latency:

  • Very small corpora (<1K chunks) where retrieval is already near-perfect.
  • Conversational chatbot use cases with tight latency budgets — re-ranking adds 100-300ms which compounds in multi-turn flows.
  • Cost-sensitive prototypes where hybrid alone hits “good enough” for the demo.

For anything production-grade with >50K chunks and accuracy-sensitive users, re-ranking pays for itself.

Common Pitfalls

  • Re-ranking on a tiny candidate set. Make sure you’re casting a wide net at the bi-encoder stage. Top-50 to the reranker is a reasonable default.
  • Picking a reranker without a multilingual story. If your corpus has multiple languages, English-only rerankers underperform badly. Use the m3 family or Cohere’s multilingual model.
  • Skipping the eval. A reranker that helps on Benchmark X may regress your domain. Build a small labeled eval set and measure.
  • Treating re-ranker scores as calibrated probabilities. They aren’t. A score of 0.7 vs 0.6 is meaningful as relative ranking; the absolute value isn’t.
  • Batching incorrectly. API-based rerankers batch by query, not by candidate. Sending 50 candidates is one API call, not 50, so don’t loop over candidates.
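On the calibration pitfall: self-hosted cross-encoders such as the BGE rerankers return raw, unbounded scores by default, and a sigmoid is the usual way to squash them into the 0 to 1 range. The sketch below shows that the squashing keeps the ordering intact, which is all you should rely on. It does not turn the scores into probabilities. The function names are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    """Numerically stable logistic function (avoids overflow for large |x|)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def normalize_scores(raw_scores: List[float]) -> List[float]:
    """Map raw re-ranker scores into [0, 1]; order is preserved, calibration is not."""
    return [sigmoid(s) for s in raw_scores]


raw = [4.2, -1.3, 0.0]           # hypothetical raw scores for three chunks
norm = normalize_scores(raw)
order_raw = sorted(range(len(raw)), key=lambda i: raw[i], reverse=True)
order_norm = sorted(range(len(norm)), key=lambda i: norm[i], reverse=True)
```

If you need a cutoff for dropping weak chunks, tune it against your labeled eval set instead of assuming that a normalized 0.5 means "50% likely relevant".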

Wrapping Up

Re-ranking is the cheapest precision boost in the modern RAG stack. The two solid 2024 options — Cohere Rerank 3 commercial, BGE-reranker-v2-m3 open-source — both add 10-15 points to top-K precision for 100-300ms of latency. There’s not much reason not to do it.

Last post of the month is the one that lets you measure all of this — evaluation frameworks that turn “vibes” into numbers. The Cohere Rerank docs cover their API in detail if you want the canonical reference.