Putting a RAG Evaluation Pipeline in CI, The Setup I Actually Use

November 20, 2023 · 7 min read · by Muhammad Amal ai

TL;DR — A RAG system without an eval pipeline is a vibe-driven system. Build the eval before you optimize the model. / Split retrieval metrics (hit-rate, MRR) from generation metrics (faithfulness, answer correctness) — they fail for different reasons. / Run evals in CI. Make a regression in retrieval recall fail the build the same way a unit test does.

In the four posts I’ve written this month I’ve talked around evaluation a lot without showing the setup. The omission is unfair because evaluation is the single most underrated practice in production LLM work.

Without an eval pipeline, every prompt change is a guess. Every chunking experiment is a vibe check. Every model migration is a leap of faith. With an eval pipeline, you get the same boring confidence you have when a unit test suite passes. You can move fast because you can detect when you’ve broken something.

This post is the eval setup I currently use, in enough detail that you can copy it. It’s deliberately not the most sophisticated thing you can build. It’s the simplest thing that catches the regressions that actually happen.

The Two Failure Modes

A RAG system fails in two distinct ways. Retrieval can fail to surface the relevant document. Generation can fail to answer correctly given the right document. The metrics, the fixes, and the eval datasets are different.

When I look at a failing answer, the first question is always: did the right chunk reach the model? If yes, the generation step messed up — the prompt, the model, the chunk packing. If no, the retriever messed up — the chunking, the embedding, the filter.

You need to be able to answer this question automatically across hundreds of examples. That’s what the eval pipeline does.
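The triage question above can be sketched as a small classifier over eval results. This is a minimal illustration, not the pipeline's actual code: the function names, the dict keys (`retrieved_ids`, `expected_ids`, `answer_ok`), and the default cutoff of 5 are all assumptions made for the example.

```python
from collections import Counter

def classify_failure(retrieved_ids, expected_ids, answer_ok, k=5):
    """Label one eval example as 'pass', 'retrieval', or 'generation'."""
    if answer_ok:
        return "pass"
    # Did any expected document reach the model in the top k?
    top_k = retrieved_ids[:k]
    if any(eid in top_k for eid in expected_ids):
        return "generation"  # right chunk was there, the answer was still wrong
    return "retrieval"       # right chunk never reached the model

def triage(examples, k=5):
    """Count outcomes across a list of example dicts (hypothetical shape)."""
    return Counter(
        classify_failure(e["retrieved_ids"], e["expected_ids"], e["answer_ok"], k)
        for e in examples
    )
```

Running `triage` over the whole golden set gives a quick split of where the failures live before you open a single transcript.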

The Golden Dataset

Start with 50 questions. Not 500. Not 5000. Fifty hand-curated questions, each paired with the canonical document or chunk that contains the answer, and a reference answer written by a human.

# evals/golden.py
from dataclasses import dataclass

@dataclass
class GoldenQuestion:
    id: str
    question: str
    expected_doc_ids: list[str]
    reference_answer: str
    tags: list[str]  # e.g., ["runbook", "payments"]

GOLDEN = [
    GoldenQuestion(
        id="gq-001",
        question="What's the on-call rotation for the payments team?",
        expected_doc_ids=["wiki/payments-oncall"],
        reference_answer="Payments on-call rotates weekly on Mondays at 09:00 UTC, configured in PagerDuty schedule 'payments-primary'.",
        tags=["oncall", "payments"],
    ),
    # ...
]

How do you get the 50 questions? Sit with the users. Watch what they ask. Mine the existing Slack channels where people ask the same questions that the bot is supposed to answer. The questions must be real, not synthetic. Synthetic questions from an LLM look plausible but don’t match the distribution of questions real users ask.

Tag them. When evals start failing on a tag, you know where to look. Tags for doc_type, team, and complexity (single-hop vs. multi-hop) have been the most useful for me.
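One way to make tags pay off is to break scores down per tag. The sketch below assumes each result is a `(tags, score)` pair; that shape and the function name are illustrative, not something from the setup described here.

```python
from collections import defaultdict

def scores_by_tag(results):
    """Average a per-question score for every tag.

    results: iterable of (tags, score) pairs, e.g. (["oncall", "payments"], 1.0).
    Returns a dict mapping tag -> mean score across questions carrying that tag.
    """
    buckets = defaultdict(list)
    for tags, score in results:
        for tag in tags:
            buckets[tag].append(score)
    return {tag: sum(vals) / len(vals) for tag, vals in buckets.items()}
```

A tag whose average sits well below the rest is usually the first place to look.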

Retrieval Metrics

Two metrics, both standard: Hit@K and Mean Reciprocal Rank.

def hit_at_k(retrieved_ids: list[str], expected_ids: list[str], k: int) -> float:
    top_k = set(retrieved_ids[:k])
    return 1.0 if any(eid in top_k for eid in expected_ids) else 0.0

def reciprocal_rank(retrieved_ids: list[str], expected_ids: list[str]) -> float:
    for i, rid in enumerate(retrieved_ids):
        if rid in expected_ids:
            return 1.0 / (i + 1)
    return 0.0

Average across the golden set. For MRR@10, pass only the top 10 retrieved IDs to reciprocal_rank. For our internal bot the target is Hit@5 >= 0.85 and MRR@10 >= 0.70. The numbers will vary by corpus. Whatever they are for yours, write them down and treat regressions seriously.
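For concreteness, here is a self-contained sketch of the averaging step. It takes a list of `(retrieved_ids, expected_ids)` pairs, which is an assumed shape for the example, and applies the rank cutoff inside the MRR helper.

```python
def mean_hit_at_k(runs, k=5):
    """Fraction of questions with at least one expected doc in the top k."""
    hits = [
        1.0 if set(retrieved[:k]) & set(expected) else 0.0
        for retrieved, expected in runs
    ]
    return sum(hits) / len(hits)

def mean_reciprocal_rank(runs, k=10):
    """Mean of 1/rank of the first expected doc, counting only the top k."""
    total = 0.0
    for retrieved, expected in runs:
        for rank, rid in enumerate(retrieved[:k], start=1):
            if rid in expected:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(runs)

# Worked example: first question hits at rank 2, second misses entirely.
runs = [(["a", "b", "c"], ["b"]), (["x", "y"], ["z"])]
print(mean_hit_at_k(runs, k=5))          # 0.5
print(mean_reciprocal_rank(runs, k=10))  # 0.25
```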

These retrieval metrics also catch most chunking bugs. If you change your chunker and hit-rate drops, you have a clean signal before you ever look at generation quality.

Generation Metrics

This is harder because there’s no single right answer to a natural-language question. Two practical approaches.

Faithfulness: does the answer only use information present in the retrieved context, or did the model hallucinate? You can score this by asking a separate LLM call: “given this context and this answer, is every claim in the answer supported by the context? Output yes or no.”

Answer correctness: does the answer match the reference? Embed both, compute cosine similarity, threshold. Or use an LLM judge: “compare answer A to reference B. Score from 1-5 on factual agreement.”
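The embedding route for answer correctness reduces to a cosine similarity and a cutoff. The sketch below works on plain lists of floats so it stays independent of any embedding client; the 0.85 threshold is a placeholder assumption that you would tune against your own golden set.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as "no similarity"
    return dot / (norm_a * norm_b)

def is_correct(answer_vec, reference_vec, threshold=0.85):
    """Assumed rule: the answer passes if it is close enough to the reference."""
    return cosine_similarity(answer_vec, reference_vec) >= threshold
```

Embedding similarity rewards topical overlap, so an answer that is on-topic but factually wrong can still pass. That is one reason to pair it with a judge.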

LLM-as-judge has known biases — position bias, verbosity bias, self-preference. It’s still useful when you control for them. Score both directions (A vs B, then B vs A) and average. Use a different model family for judging than for answering when you can.
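Scoring both directions can be wrapped around any judge. In this sketch `judge` is a hypothetical callable that takes two texts in presentation order and returns a numeric score; how it calls the model is left out on purpose.

```python
from typing import Callable

def symmetric_judge_score(
    judge: Callable[[str, str], float],
    answer: str,
    reference: str,
) -> float:
    """Average the judge's score over both presentation orders.

    Averaging the two orderings dampens position bias, because any
    advantage of being shown first applies once to each text.
    """
    forward = judge(answer, reference)
    backward = judge(reference, answer)
    return (forward + backward) / 2.0
```

This doubles the number of judge calls per question, which matters for the cost numbers later in the post.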

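# llama-index==0.8.68
from llama_index import ServiceContext
from llama_index.evaluation import FaithfulnessEvaluator, CorrectnessEvaluator
from llama_index.llms import OpenAI

judge_llm = OpenAI(model="gpt-4-1106-preview", temperature=0.0)
# In the 0.8.x line the evaluators are configured through a ServiceContext
# rather than a bare llm argument.
judge_context = ServiceContext.from_defaults(llm=judge_llm)

faithfulness_eval = FaithfulnessEvaluator(service_context=judge_context)
correctness_eval = CorrectnessEvaluator(service_context=judge_context)

def evaluate(question, retrieved_context, answer, reference):
    # retrieved_context is a list of chunk strings
    f = faithfulness_eval.evaluate(
        query=question,
        response=answer,
        contexts=retrieved_context,
    )
    c = correctness_eval.evaluate(
        query=question,
        response=answer,
        reference=reference,
    )
    # f.score is 0.0 or 1.0; c.score is on a 1-5 scale
    return f.score, c.score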

The LlamaIndex evaluation module docs cover the available evaluators if you want to swap in others. Ragas is another decent option in the same space.

The CI Wiring

Now the part that makes the eval matter: run it on every PR.

# .github/workflows/rag-eval.yml
name: RAG Eval

on:
  pull_request:
    paths:
      - "src/rag/**"
      - "prompts/**"
      - "evals/**"

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install -r requirements.txt
      - name: Run retrieval eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: pytest evals/test_retrieval.py --tb=short
      - name: Run generation eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: pytest evals/test_generation.py --tb=short
      - name: Publish metrics
        if: always()
        run: python evals/publish_results.py

A few details that matter.

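The eval is scoped by path filter, so we don’t burn API credits running it on README changes. Retrieval eval is cheap (no LLM calls beyond embeddings) and runs on every relevant PR. Generation eval is expensive and should run only when the prompt or model changes. The workflow above shares one paths filter between both steps for brevity; to gate them separately, move the generation step into its own workflow with a narrower paths list.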

Use a small, stable subset of the golden set for CI (the GOLDEN_CI list that the tests import) and save the full set for nightly runs. Eight to twelve questions in CI catch gross regressions cheaply.
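One way to keep the CI subset stable is to pin it by explicit ID instead of sampling. The sketch below is an assumption about how a list like `GOLDEN_CI` could be built. The IDs in `CI_IDS`, the helper name, and the trimmed dataclass are all illustrative.

```python
from dataclasses import dataclass

@dataclass
class GoldenQuestion:
    # Trimmed stand-in for the dataclass in evals/golden.py
    id: str
    question: str

# Hypothetical IDs: pick questions that span your tags and rarely need edits.
CI_IDS = ["gq-001", "gq-007"]

def select_ci_subset(golden, ci_ids):
    """Return the pinned questions in ci_ids order, failing loudly on typos."""
    by_id = {gq.id: gq for gq in golden}
    missing = [qid for qid in ci_ids if qid not in by_id]
    if missing:
        raise KeyError(f"CI subset references unknown golden IDs: {missing}")
    return [by_id[qid] for qid in ci_ids]
```

Pinning by ID means the subset only changes when someone edits the list, so a shift in the CI score is never caused by a different sample.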

Use a snapshot of the index, not the live one. The CI run should be deterministic. Fix the embedding model version, the chunker version, the index snapshot. Otherwise you’ll chase non-issues.

Writing the Tests as pytest

# evals/test_retrieval.py
import pytest
from evals.golden import GOLDEN_CI
# Assumed location of the hit_at_k helper from the Retrieval Metrics section.
from evals.metrics import hit_at_k
from rag.pipeline import build_retriever

# Built once at import time against a pinned index snapshot.
retriever = build_retriever(index_snapshot="2023-11-20")

@pytest.mark.parametrize("gq", GOLDEN_CI, ids=lambda g: g.id)
def test_retrieval_hit_at_5(gq):
    nodes = retriever.retrieve(gq.question)
    retrieved_ids = [n.metadata["doc_id"] for n in nodes]
    assert any(eid in retrieved_ids[:5] for eid in gq.expected_doc_ids), \
        f"Expected one of {gq.expected_doc_ids} in top 5, got {retrieved_ids[:5]}"

def test_retrieval_aggregate():
    # Mean Hit@5 across the CI subset must stay above the threshold.
    scores = []
    for gq in GOLDEN_CI:
        nodes = retriever.retrieve(gq.question)
        ids = [n.metadata["doc_id"] for n in nodes]
        scores.append(hit_at_k(ids, gq.expected_doc_ids, 5))
    assert sum(scores) / len(scores) >= 0.80

Per-question parametrized tests give you readable failures. Aggregate tests give you a single quality threshold. Both are useful.

Tracking Over Time

Single-PR pass/fail is necessary. So is the long view. Push eval results to a time series — Datadog, Honeycomb, or a Postgres table you query with a dashboard — keyed by commit SHA. Plot hit-rate, MRR, faithfulness, correctness over the last 90 days.
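As a rough sketch of the "table keyed by commit SHA" idea, the example below uses the standard-library `sqlite3` module as a stand-in for Postgres so that it runs anywhere. The table name, columns, and function names are assumptions for illustration, not the contents of `publish_results.py`.

```python
import sqlite3
import time

def record_metrics(conn, commit_sha, metrics, recorded_at=None):
    """Store one row per metric for a given commit."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS eval_metrics "
        "(commit_sha TEXT, metric TEXT, value REAL, recorded_at REAL)"
    )
    ts = time.time() if recorded_at is None else recorded_at
    conn.executemany(
        "INSERT INTO eval_metrics VALUES (?, ?, ?, ?)",
        [(commit_sha, name, float(value), ts) for name, value in metrics.items()],
    )
    conn.commit()

def metric_history(conn, metric, since_ts):
    """Return (commit_sha, value) pairs for one metric, oldest first."""
    rows = conn.execute(
        "SELECT commit_sha, value FROM eval_metrics "
        "WHERE metric = ? AND recorded_at >= ? ORDER BY recorded_at",
        (metric, since_ts),
    )
    return rows.fetchall()
```

A dashboard query for the last 90 days is then `metric_history(conn, "hit_at_5", time.time() - 90 * 86400)`.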

When someone files “the bot got worse” two weeks after a deploy, you can pull up the graph and either confirm or refute it. Without this you’re back to vibes.

This pairs well with the logging setup from my post on securing internal chatbots — together they give you the operational visibility you need.

Common Pitfalls

Synthetic eval sets. Tempting because they’re fast to generate. Useless because they don’t match user behavior. Hand-curate the first 50.

Drifting golden answers. When the underlying document is edited, the reference answer might be stale. Version your golden set. Re-validate quarterly.

LLM judge variance. Run each judge call with temperature 0. Even then, expect 5-10% noise across runs. Don’t react to single-percentage-point movements.

Cost. Generation evals on a full 50-question set with GPT-4 as judge can run $1-3 per run. CI on every PR adds up. Subset in CI, full set nightly.

Forgetting the negative cases. Include questions the bot should refuse to answer (out of scope, unauthorized) and verify it does. Internal bots are tempted to answer everything.

Evaluating against a moving index. The index changes daily as new docs arrive. Pin a snapshot for eval runs.
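The judge-variance pitfall suggests comparing against a baseline with some slack instead of reacting to every wiggle. A minimal sketch, assuming scores on a 0-1 scale and a tolerance of 0.05 (both are placeholders to adjust for your own noise level):

```python
from statistics import mean

def is_regression(baseline_runs, current_runs, tolerance=0.05):
    """True when the mean score dropped by more than `tolerance`.

    Each argument is a list of scores from repeated runs of the same eval,
    so that a single noisy judge run does not decide the outcome.
    """
    return mean(baseline_runs) - mean(current_runs) > tolerance
```

With these defaults, a drop from an average of 0.91 to 0.90 is ignored, while a drop to 0.81 is flagged.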

Wrapping Up

A 50-question eval set, retrieval and generation metrics, pytest in CI, time-series tracking. That’s it. None of it is novel. Almost no team does it.

The teams that do are the ones who can change models, prompts, and chunking with confidence. The teams that don’t are the ones surprised by silent quality regressions and chasing reports they can’t reproduce.

If you take one thing from the eval posts this month, take this: build the eval before you build the optimization. The order matters.

What’s Next

Next I want to write about the operational side — what to monitor, what to alert on, and what a useful day-to-day dashboard for a RAG system looks like in practice.
