Re-ranking and Reciprocal Rank Fusion in RAG Pipelines
TL;DR — A re-ranker over the top-50 from retrieval cleans up most of what hybrid search alone misses. Cohere Rerank 3 is the easy commercial pick; BGE-reranker-v2-m3 is the open-source one. Both add 100-300ms but are the highest-leverage precision boost in the stack.
In the hybrid search post I waved at re-ranking as the next layer. Today’s the deep dive. Re-ranking is the single highest-leverage precision improvement available to a 2024 RAG pipeline and one of the most under-deployed pieces. If you’ve optimized chunking and hybrid retrieval and you’re still hitting an accuracy ceiling, this is probably your gap.
Bi-encoders vs cross-encoders
The reason re-ranking exists is the asymmetry between bi-encoders and cross-encoders.
A bi-encoder (your embedding model) processes the query and the document independently. It produces a vector for each. Similarity is a cheap operation on the vectors. You can index millions of document vectors and search them in milliseconds. This is what your vector DB does.
A cross-encoder processes the query and a document together — concatenated as one input — and produces a single relevance score. It sees the interaction between specific query terms and specific document terms. It’s a different and richer signal. It’s also expensive: O(N) inferences per query for N candidates, with no way to precompute.
The combination that works in production: use the bi-encoder to retrieve a top-N candidate set (say N=50 or N=100); use the cross-encoder to re-rank those candidates and keep top-K (say K=5) for the LLM.
You get the speed of vector search (only the top-N go to the slow model) and the precision of cross-encoding (only the top-K go to the LLM).
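To make the two-stage shape concrete, here is a toy sketch in plain Python. The scoring functions are stand-ins, not real models: `embed` is a bag-of-words vector playing the bi-encoder, and `cross_score` is a hand-written function playing the cross-encoder (it can see word order in the query and document together). All names and the sample documents are made up for illustration.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def embed(text):
    """Stand-in bi-encoder: a bag-of-words vector built from one text in isolation."""
    return Counter(tokenize(text))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cross_score(query, doc):
    """Stand-in cross-encoder: reads query and document together, so it can
    reward query word pairs that appear adjacent and in order in the document."""
    q, d = tokenize(query), tokenize(doc)
    overlap = len(set(q) & set(d))
    doc_bigrams = set(zip(d, d[1:]))
    phrase_hits = sum(1 for bg in zip(q, q[1:]) if bg in doc_bigrams)
    return overlap + 2.0 * phrase_hits

def retrieve_then_rerank(query, docs, doc_vecs, top_n=3, top_k=1):
    q_vec = embed(query)
    # Stage 1: cheap similarity against precomputed vectors, keep top_n candidates.
    candidates = sorted(
        range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]), reverse=True
    )[:top_n]
    # Stage 2: expensive pairwise scoring on those candidates only, keep top_k.
    return sorted(candidates, key=lambda i: cross_score(query, docs[i]), reverse=True)[:top_k]

docs = [
    "reset password",
    "how to reset my password when locked out",
    "billing and invoices overview",
    "my account settings",
]
doc_vecs = [embed(d) for d in docs]  # computed once, offline

print(retrieve_then_rerank("reset my password", docs, doc_vecs, top_n=3, top_k=1))  # [1]
```

With `top_n=1` the second stage has nothing to reorder and the result stays at document 0, the first stage's pick. That is the same "too few candidates" problem described later in this post.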
What re-ranking buys you
Numbers from real systems I’ve worked on:
- Internal documentation RAG, ~200K chunks. Hybrid retrieval top-5 precision: 78%. Hybrid + Cohere Rerank 3 top-5: 91%. Latency cost: ~140ms.
- Code search, ~1M code chunks. Hybrid top-10 NDCG: 0.67. With BGE-reranker-v2-m3: 0.81. Latency cost: ~280ms (CPU inference, no GPU).
- Customer support corpus, ~50K chunks. Hybrid faithfulness on Ragas: 0.72. With Cohere Rerank: 0.86.
The pattern: a 10-15 point bump in precision metrics, with a 100-300ms latency tax. For a typical RAG query that already takes 1-3 seconds (retrieval + LLM), that’s a worthwhile trade.
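For readers who want to reproduce this kind of before/after comparison, the two ranking metrics quoted above can be computed in a few lines. This is a minimal sketch using the standard definitions, not the exact harness behind the numbers in this post; the function names and the sample rankings are illustrative.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are labeled relevant."""
    return sum(1 for d in ranked_ids[:k] if d in relevant_ids) / k

def ndcg_at_k(ranked_ids, gains, k):
    """NDCG@k with graded gains given as {doc_id: gain}; unlabeled ids count as 0."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    actual = dcg([gains.get(d, 0) for d in ranked_ids[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0

# Made-up rankings for one query, before and after re-ranking.
relevant = {"a", "b", "c"}
gains = {"a": 1, "b": 1, "c": 1}
before = ["a", "x", "b", "y", "z"]
after = ["a", "b", "c", "x", "y"]

print(precision_at_k(before, relevant, 5), precision_at_k(after, relevant, 5))  # 0.4 0.6
print(round(ndcg_at_k(before, gains, 5), 3), ndcg_at_k(after, gains, 5))
```

Averaging these per-query values over a labeled query set gives one number per pipeline variant to compare.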
Cohere Rerank 3
Cohere’s Rerank 3 (released late 2023, available through their API and AWS Bedrock) is what I default to when the project allows commercial APIs:
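    # Cohere Rerank 3, February 2024
    import os

    import cohere

    # Read the key from the environment rather than hard-coding it.
    co = cohere.Client(api_key=os.environ["COHERE_API_KEY"])

    def rerank(query, candidates, top_n=5):
        """candidates: list of strings (the chunk texts)"""
        response = co.rerank(
            # Rerank 3 model IDs are "rerank-english-v3.0" and "rerank-multilingual-v3.0".
            model="rerank-english-v3.0",
            query=query,
            documents=candidates,
            top_n=top_n,
        )
        # Each result carries the candidate's original index and a relevance score.
        return [
            {"index": r.index, "score": r.relevance_score, "text": candidates[r.index]}
            for r in response.results
        ]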
What you get:
- Best-in-class accuracy on most benchmarks I’ve tested
- Multilingual support out of the box
- Reasonable rate limits and latency from Cohere’s hosted API
- Available through AWS Bedrock if you need that path for compliance
What you pay:
- $2 per 1000 search units (1 unit = 1 query × up to 100 docs). For a system doing 10K queries/day with 50-doc rerank, that's 10K search units, or about $20 per day. Still cheap for most production budgets.
- API dependency. Same compliance and latency calculus as any external API.
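The pricing arithmetic is easy to get wrong by a factor of ten, so here it is spelled out. The sketch uses the price and unit size quoted above. It also assumes that a query with more than 100 documents is billed as multiple search units; check that rule against current Cohere pricing before relying on it.

```python
import math

PRICE_PER_1K_SEARCH_UNITS = 2.00  # USD, the figure quoted above
DOCS_PER_SEARCH_UNIT = 100        # one unit covers one query with up to 100 docs

def daily_rerank_cost(queries_per_day, docs_per_query):
    # Assumption: queries over the per-unit doc limit are billed as extra units.
    units_per_query = math.ceil(docs_per_query / DOCS_PER_SEARCH_UNIT)
    total_units = queries_per_day * units_per_query
    return total_units / 1000 * PRICE_PER_1K_SEARCH_UNITS

print(daily_rerank_cost(10_000, 50))   # 20.0
print(daily_rerank_cost(10_000, 150))  # 40.0
```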
BGE-reranker-v2-m3
BAAI’s BGE-reranker-v2-m3 (late 2023) is the open-source counterpart. It is multilingual, has a similar architecture and accuracy profile to Cohere’s offering, and runs on CPU for low traffic or GPU for production scale.
    # BGE-reranker-v2-m3, self-hosted
    from FlagEmbedding import FlagReranker

    # use_fp16 speeds up inference at a small cost in precision.
    reranker = FlagReranker("BAAI/bge-reranker-v2-m3", use_fp16=True)

    def rerank(query, candidates, top_n=5):
        # The cross-encoder scores each (query, candidate) pair jointly.
        pairs = [[query, c] for c in candidates]
        scores = reranker.compute_score(pairs)
        # Sort by score, highest first, keeping each candidate's original index.
        ranked = sorted(
            zip(range(len(candidates)), scores, candidates),
            key=lambda x: x[1],
            reverse=True,
        )
        return [
            {"index": idx, "score": s, "text": text}
            for idx, s, text in ranked[:top_n]
        ]
On an A10G GPU, this reranks 50 candidates in ~80ms. On CPU, ~400ms. For a moderate-traffic system, a small GPU pod handles the load.
I run BGE-reranker when:
- Data can’t leave the network (compliance).
- Costs at scale tip toward self-hosting (consistently high QPS).
- I want the same model family for embeddings and re-ranking: the m3 family pairs naturally with bge-m3 embeddings.
Pipeline shape
The composition that’s become standard:
    query
      ↓
    embed (dense) + sparse_encode
      ↓
    hybrid_retrieval(top_n=50)    ← bi-encoder
      ↓
    rerank(top_k=5)               ← cross-encoder
      ↓
    prompt with top_k chunks
      ↓
    LLM
      ↓
    answer + citations
The retrieval stage casts a wide net (top-50). The re-ranker tightens it (top-5). The LLM only sees the cleanest candidates. Each stage is doing what it’s best at.
    # End-to-end RAG with hybrid + rerank, February 2024
    # embed_model, sparse_encode, hybrid_search, llm, build_prompt, and audit_log
    # are application-specific helpers defined elsewhere.
    def answer(query, user):
        dense_vec = embed_model.encode(query)
        sparse_idx, sparse_val = sparse_encode(query)
        # Wide net: top-50 candidates, with ACLs enforced at retrieval time.
        candidates = hybrid_search(
            query, dense_vec, sparse_idx, sparse_val,
            top_k=50,
            acl_filter=user.acl_filter,
        )
        # Tighten to the 5 best; map reranker indices back to candidate objects.
        reranked = rerank(query, [c.text for c in candidates], top_n=5)
        top_chunks = [candidates[r["index"]] for r in reranked]
        response = llm.complete(
            system=SYSTEM_PROMPT,
            prompt=build_prompt(query, top_chunks),
        )
        audit_log(user, query, top_chunks, response)
        return response, top_chunks
That’s the production shape. Hybrid for recall, re-ranker for precision, ACLs at retrieval, audit on top.
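The title mentions Reciprocal Rank Fusion, and `hybrid_search` above is treated as a black box, so here is one common way the dense and sparse result lists could be merged before re-ranking. This is a sketch of standard RRF, not necessarily what the `hybrid_search` in this post does. The function name, the sample doc ids, and the default `k=60` (the constant commonly used with RRF) are assumptions for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=50):
    """rankings: list of ranked doc-id lists, best first.
    Each doc scores sum(1 / (k + rank)) over the lists it appears in, rank starting at 1."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; only ranks are used, so the two retrievers'
    # raw scores never need to be on the same scale.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

dense_hits = ["d3", "d1", "d7", "d2"]    # made-up dense retrieval order
sparse_hits = ["d1", "d9", "d3", "d4"]   # made-up sparse (BM25-style) order

fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused[:4])  # ['d1', 'd3', 'd9', 'd7']
```

Documents that both retrievers rank highly (`d1`, `d3`) rise to the top. The fused list is what would be handed to the re-ranker as its candidate set.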
Where re-ranking goes wrong
Three failure modes I’ve seen:
Reranking too few candidates. If your bi-encoder returns top-10 and you re-rank those, the re-ranker can only choose from what was already ranked. The whole point is for re-ranking to fix mistakes the bi-encoder made — give it 50 to choose from, not 10.
Reranking too many. Reranking 1000 candidates is expensive and gives diminishing returns. 50-100 hits the sweet spot for most workloads. Above that, marginal precision gains aren’t worth the latency.
Mixing rerankers and embedding models incoherently. Some embedding/reranker pairs are tuned together (Cohere embed + Cohere Rerank; bge-m3 embed + bge-reranker-v2-m3). Some aren’t. The integration testing matters.
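The first two failure modes are both about candidate depth, and the right depth can be measured rather than guessed. A re-ranker can only promote documents that made it into the candidate set, so recall@N of the first stage is a ceiling on what it can surface. The sketch below sweeps N over a small labeled set; the function names and the sample data are hypothetical.

```python
def recall_at_n(retrieved_ids, relevant_ids, n):
    """Share of labeled-relevant docs that made it into the top-n candidate set."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:n]) & set(relevant_ids))
    return hits / len(relevant_ids)

def candidate_depth_sweep(eval_set, depths=(10, 25, 50, 100)):
    """eval_set: list of (retrieved_ids, relevant_ids) pairs, one per query.
    Returns mean recall@n per depth, i.e. the ceiling for any re-ranker at that depth."""
    return {
        n: sum(recall_at_n(ids, rel, n) for ids, rel in eval_set) / len(eval_set)
        for n in depths
    }

# Made-up first-stage output: doc ids d0..d99 in retrieval order for two queries.
retrieved = [f"d{i}" for i in range(100)]
eval_set = [
    (retrieved, {"d3", "d40"}),
    (retrieved, {"d8", "d70"}),
]
print(candidate_depth_sweep(eval_set))  # {10: 0.5, 25: 0.5, 50: 0.75, 100: 1.0}
```

Where the curve flattens on your own data is a reasonable place to set N. Past that point, extra candidates mostly add re-ranking latency.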
When to skip re-ranking
A few cases where it’s not worth the latency:
- Very small corpora (<1K chunks) where retrieval is already near-perfect.
- Conversational chatbot use cases with tight latency budgets — re-ranking adds 100-300ms which compounds in multi-turn flows.
- Cost-sensitive prototypes where hybrid alone hits “good enough” for the demo.
For anything production-grade with >50K chunks and accuracy-sensitive users, re-ranking pays for itself.
Common Pitfalls
- Re-ranking on a tiny candidate set. Make sure you’re casting a wide net at the bi-encoder stage. Top-50 to the reranker is a reasonable default.
- Picking a reranker without a multilingual story. If your corpus has multiple languages, English-only rerankers underperform badly. Use an m3-family model or Cohere's multilingual reranker.
- Skipping the eval. A reranker that helps on Benchmark X may regress your domain. Build a small labeled eval set and measure.
- Treating re-ranker scores as calibrated probabilities. They aren’t. A score of 0.7 vs 0.6 is meaningful as relative ranking; the absolute value isn’t.
- Batching incorrectly. API-based rerankers batch by query, not by candidate. Send all 50 candidates for a query in one API call rather than one call per candidate.
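The calibration pitfall is worth a small illustration. Some rerankers emit raw logits, and passing them through a sigmoid gives tidy numbers between 0 and 1, but that only rescales the scores. The ordering is unchanged, and the values are still not probabilities of relevance. The scores below are made up, and `keep_top_k` is a hypothetical helper showing selection by rank instead of by an absolute threshold.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def keep_top_k(scores, k):
    """Select by rank, which is the part of a reranker score that is meaningful."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

raw = {"chunk_a": 4.2, "chunk_b": -1.3, "chunk_c": 0.4}   # made-up raw logits
squashed = {name: sigmoid(score) for name, score in raw.items()}

# Squashing changes the numbers but never the order.
print(keep_top_k(raw, 2))       # ['chunk_a', 'chunk_c']
print(keep_top_k(squashed, 2))  # ['chunk_a', 'chunk_c']
```

If you do need a cutoff (for example, to drop obviously irrelevant chunks), tune it on your own labeled queries for the specific model instead of assuming that 0.5 means "50% likely relevant".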
Wrapping Up
Re-ranking is the cheapest precision boost in the modern RAG stack. The two solid 2024 options, Cohere Rerank 3 on the commercial side and BGE-reranker-v2-m3 as open source, both add 10-15 points to top-K precision for 100-300ms of latency. There’s not much reason not to do it.
Last post of the month is the one that lets you measure all of this — evaluation frameworks that turn “vibes” into numbers. The Cohere Rerank docs cover their API in detail if you want the canonical reference.