Evaluating RAG, Beyond Vibes-Based Testing
TL;DR — Build a labeled eval set, automate Ragas in CI, and sample production queries to catch drift. Skip the LLM-as-judge on every commit — it’s too expensive. Cache evaluations and use the cheaper metrics for regression gates.
If you’ve followed this month’s RAG arc (vector DBs, embeddings, chunking, hybrid search, security, re-ranking), you have a stack. The closing question is how you know any of those choices were right. Vibes don’t scale, and they ship bugs at 2am.
Evaluation in 2024 has matured. Ragas, TruLens, and DeepEval all exist. They share a metric vocabulary but emphasize different parts of the workflow. This post is the practical guide.
The four metrics that matter
Across all the frameworks, four metrics show up repeatedly:
Faithfulness. Does the generated answer follow from the retrieved chunks? Specifically: are all the factual claims in the answer supported by something in the context? A score below 1.0 means the model is generating content not grounded in the context — hallucination.
Answer relevance. Does the answer address the question? Independent of whether it’s correct. An off-topic answer scores low here.
Context precision. Of the chunks retrieved, how many are actually relevant? Measures the precision of the retrieval/reranking step.
Context recall. Of the relevant chunks that exist in the corpus, how many were retrieved? Requires ground-truth labels.
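Faithfulness and answer relevance are reference-free: they need only the question, the context, and the answer, although in these frameworks an LLM still does the scoring. Context precision and recall need a reference to compare against, either ground-truth labels or an LLM-as-judge standing in for them.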
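When the eval set records which chunk IDs should come back, the two retrieval metrics reduce to set arithmetic. Here is a minimal sketch under that assumption. The function name `retrieval_precision_recall` and the two unrelated chunk IDs are made up for illustration. The framework versions of these metrics are LLM-judged, so their numbers won’t match this label-based variant exactly.

```python
def retrieval_precision_recall(retrieved, expected):
    """Return (precision, recall) for one query, comparing chunk IDs as sets."""
    retrieved_set, expected_set = set(retrieved), set(expected)
    hits = len(retrieved_set & expected_set)
    # Guard against empty inputs so a query with no results scores 0, not an error.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    return precision, recall


precision, recall = retrieval_precision_recall(
    # Two of the four retrieved IDs are hypothetical distractors.
    retrieved=["doc-sla-001#sec3", "doc-faq-007#intro", "doc-sla-001#sec4", "doc-hr-002#pto"],
    expected=["doc-sla-001#sec3", "doc-sla-001#sec4"],
)
print(precision, recall)  # 0.5 1.0
```

Half of what came back was relevant (precision 0.5), and nothing relevant was missed (recall 1.0).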
Ragas, the practical baseline
Ragas (v0.1.x in February 2024) is the framework I default to for CI integration. The API is clean and the metric implementations are sensible.
# Ragas 0.1 — February 2024
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

# retrieved_chunks and llm_response come from your own pipeline run.
samples = Dataset.from_list([
    {
        "question": "What's the SLA for tier-3 support?",
        "contexts": [chunk.text for chunk in retrieved_chunks],
        "answer": llm_response,
        "ground_truth": "Tier-3 support has a 4-hour response SLA during business hours.",
    },
    # ...more samples
])

result = evaluate(
    samples,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result.to_pandas())
Each metric returns a 0-1 score per sample. result.to_pandas() gives you a dataframe you can compare against a baseline.
The catch: Ragas uses an LLM (default: GPT-3.5-turbo, configurable) to compute most of these metrics. Faithfulness involves the LLM checking each generated claim against the context. Context precision uses the LLM to judge relevance. This isn’t free.
Pricing math: 100 eval samples × 4 metrics × maybe 3 LLM calls per metric × ~$0.001 per call = ~$1.20 per full eval run. Not crippling, but if you run it on every commit you’ll notice.
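The same arithmetic as a small function, which makes it easy to plug in your own numbers. This is a back-of-the-envelope sketch: the per-call price and calls-per-metric figures are the rough estimates from the paragraph above, and the 40-commits-a-day cadence is a hypothetical chosen only to show how the cost scales.

```python
def estimate_eval_cost(samples, metrics, calls_per_metric, cost_per_call):
    """Rough cost of one full eval run: every sample x metric x judge call."""
    total_calls = samples * metrics * calls_per_metric
    return total_calls, total_calls * cost_per_call


calls, per_run = estimate_eval_cost(
    samples=100, metrics=4, calls_per_metric=3, cost_per_call=0.001
)
print(calls, round(per_run, 2))  # 1200 1.2

# Hypothetical cadence: running the full eval on 40 commits a day.
print(round(per_run * 40, 2))  # 48.0
```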
TruLens, where it fits
TruLens emphasizes observability: it instruments your RAG pipeline at runtime and computes the same metrics on production traffic. The framework records every request, the retrieved chunks, the LLM response, and the evaluation scores.
# TruLens — wrap a chain for runtime evaluation
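from trulens_eval import TruChain, Feedback, Tru
from trulens_eval.feedback.provider import OpenAI

tru = Tru()
openai_provider = OpenAI()

# Answer relevance: scored on the app's input and final output.
f_relevance = Feedback(openai_provider.relevance).on_input_output()

# Groundedness: retrieved context vs. final output. .on() takes a selector,
# not a lambda; select_context needs a chain with a retriever it can find.
# Where the groundedness function lives has moved between trulens_eval
# releases, so check the version you have installed.
f_groundedness = (
    Feedback(openai_provider.groundedness_measure_with_cot_reasons)
    .on(TruChain.select_context(chain).collect())
    .on_output()
)

tru_chain = TruChain(
    chain,
    app_id="rag-v1",
    feedbacks=[f_relevance, f_groundedness],
)

with tru_chain as recording:
    response = chain.invoke({"query": user_query})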
Now every production query gets scored. You can compare cohorts (v1 vs v2 of your retrieval), filter by score, see which queries failed.
TruLens earns its keep when you want continuous quality monitoring rather than batch evals. The downside: every production query that you instrument costs an extra LLM evaluation. Sample, don’t evaluate everything.
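One way to sample is to hash a request ID and only record the request when the hash falls under the sample rate. This is a sketch, not something TruLens provides: `should_evaluate` and the `req-…` ID format are assumptions for illustration. Hashing rather than calling `random.random()` keeps the decision repeatable, so retries of the same request are treated the same way.

```python
import hashlib


def should_evaluate(request_id: str, sample_rate: float) -> bool:
    """Deterministically pick roughly `sample_rate` of requests by hashing the ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical request IDs; at a 5% rate about 500 of 10,000 should be picked.
picked = sum(should_evaluate(f"req-{i}", 0.05) for i in range(10_000))
print(picked)
```

In the request handler, the instrumented `with tru_chain as recording:` path would run only when `should_evaluate(...)` returns true, and the bare chain would handle everything else.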
DeepEval, where it fits
DeepEval frames evaluation as test cases. It’s pytest-shaped, which is convenient for CI.
# DeepEval — pytest-style
from deepeval import assert_test
from deepeval.metrics import FaithfulnessMetric, AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# rag_pipeline is your own pipeline object; import it from wherever it lives.

def test_tier3_sla():
    question = "What's the SLA for tier-3 support?"
    # Run the pipeline first so last_retrieved_contexts reflects this query.
    answer = rag_pipeline.answer(question)
    test_case = LLMTestCase(
        input=question,
        actual_output=answer,
        retrieval_context=rag_pipeline.last_retrieved_contexts,
        expected_output="4 hours during business hours",
    )
    assert_test(test_case, [
        FaithfulnessMetric(threshold=0.85),
        AnswerRelevancyMetric(threshold=0.8),
    ])
Run it with pytest. The CI gate is “all tests pass.” This is the right shape for “if my retrieval regresses, fail the build.”
DeepEval, TruLens, and Ragas overlap heavily in metrics. Pick by where you want to integrate — Ragas for batch evals, TruLens for runtime observability, DeepEval for pytest-style gates.
Building the eval set
You can’t evaluate well without a labeled eval set. What to get right when building one:
- Coverage. 50-100 questions covering the query types your users actually ask. Don’t make them up; sample from real query logs.
- Difficulty distribution. Include hard cases — ambiguous queries, queries where the answer requires multiple chunks, queries that should be refused because there’s no good answer.
- Ground truth. For each question, the expected answer in 1-2 sentences. Optionally, the chunk IDs that should be retrieved.
- Maintenance. Plan to revisit quarterly. Eval sets rot as the corpus and user behavior shift.
# eval-set.yaml — keep it in the repo
- question: "What's the SLA for tier-3 support?"
  expected_answer: "Tier-3 support has a 4-hour response SLA during business hours."
  expected_chunks: ["doc-sla-001#sec3", "doc-sla-001#sec4"]
  tags: ["support", "sla"]
- question: "Can I deploy from a feature branch?"
  expected_answer: "Only the main branch is deployed automatically. Feature branches require an override."
  expected_chunks: ["doc-deploy-002#feature-branches"]
  tags: ["deploy", "branches"]
Labeling 100 questions takes maybe a day. It’s the highest-leverage day of work you can put in on a RAG project.
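A small sanity check helps keep the file honest as it grows. The sketch below assumes the entries have already been loaded into a list of dicts (for example with PyYAML’s `yaml.safe_load`). The helper names `validate_eval_set` and `tag_counts`, and the choice of required fields, are illustrative assumptions rather than part of any framework.

```python
from collections import Counter

REQUIRED_FIELDS = ("question", "expected_answer", "tags")


def validate_eval_set(entries):
    """Return a list of problems; an empty list means the set looks sane."""
    problems = []
    seen_questions = set()
    for index, entry in enumerate(entries):
        for field in REQUIRED_FIELDS:
            if not entry.get(field):
                problems.append(f"entry {index}: missing {field}")
        # Normalize before comparing so trivial variants count as duplicates.
        question = (entry.get("question") or "").strip().lower()
        if question and question in seen_questions:
            problems.append(f"entry {index}: duplicate question")
        seen_questions.add(question)
    return problems


def tag_counts(entries):
    """Count questions per tag so thin coverage is visible at a glance."""
    return Counter(tag for entry in entries for tag in (entry.get("tags") or []))


entries = [
    {
        "question": "What's the SLA for tier-3 support?",
        "expected_answer": "Tier-3 support has a 4-hour response SLA during business hours.",
        "expected_chunks": ["doc-sla-001#sec3", "doc-sla-001#sec4"],
        "tags": ["support", "sla"],
    },
    {
        "question": "Can I deploy from a feature branch?",
        "expected_answer": "Only the main branch is deployed automatically. Feature branches require an override.",
        "expected_chunks": ["doc-deploy-002#feature-branches"],
        "tags": ["deploy", "branches"],
    },
]
print(validate_eval_set(entries))  # []
print(tag_counts(entries)["sla"])  # 1
```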
The CI integration
The trick is making eval cheap enough to run on every PR without burning budget.
Approach I’ve used:
- Cheap metrics on every PR. Context precision via embedding similarity (no LLM call) and exact-match retrieval against expected_chunks. Fast and free.
- Full LLM-as-judge on a schedule. Nightly run that scores faithfulness and answer relevance on the full eval set. Caches results so consecutive runs with no pipeline changes are free.
- Production sampling. 1-5% of real queries get evaluated via TruLens. Aggregate scores plotted weekly.
- Regression gates. PR fails if cheap metrics drop more than 5% from baseline. Nightly eval failure pages the on-call engineer.
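The regression gate itself can be a few lines. This sketch reads “5% from baseline” as a relative drop, which is an interpretation, and the metric names `chunk_precision` and `chunk_recall` and their scores are invented for the example.

```python
def relative_drop(baseline: float, current: float) -> float:
    """Fractional drop from baseline; negative values mean the score improved."""
    if baseline == 0:
        return 0.0
    return (baseline - current) / baseline


def gate(baseline_scores, current_scores, max_drop=0.05):
    """Return the metrics whose relative drop exceeds max_drop."""
    failures = {}
    for name, baseline in baseline_scores.items():
        drop = relative_drop(baseline, current_scores[name])
        if drop > max_drop:
            failures[name] = round(drop, 3)
    return failures


# Hypothetical scores: baseline from main, current from the PR branch.
baseline = {"chunk_precision": 0.80, "chunk_recall": 0.90}
current = {"chunk_precision": 0.78, "chunk_recall": 0.81}
failures = gate(baseline, current)
print(failures)  # {'chunk_recall': 0.1}
```

A CI script would exit non-zero when `failures` is non-empty. Here precision slipped 2.5%, which passes, while recall fell 10%, which fails the PR.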
# .github/workflows/rag-eval.yml
name: RAG eval
on:
  pull_request:
    paths:
      - "rag/**"
      - "eval-set.yaml"
jobs:
  cheap-metrics:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install ragas pgvector openai
      - run: python eval/run_cheap_metrics.py --baseline main
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
The full eval runs nightly on a schedule with the same script and writes results to a dashboard.
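Caching is what makes the nightly run cheap when nothing has changed. One possible shape, sketched under assumptions: key each score on a hash of everything that could change it (question, contexts, answer, metric name, judge model) and call the judge only on a miss. `cache_key`, `cached_score`, and `fake_judge` are hypothetical names, and the in-memory dict stands in for whatever persistent store you would actually use.

```python
import hashlib
import json


def cache_key(question, contexts, answer, metric, judge_model):
    """Stable key: identical inputs and judge configuration give the same hash."""
    payload = json.dumps(
        {"q": question, "c": contexts, "a": answer, "m": metric, "j": judge_model},
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cached_score(cache, score_fn, question, contexts, answer, metric, judge_model):
    """Call score_fn only on a cache miss; score_fn stands in for the LLM judge."""
    key = cache_key(question, contexts, answer, metric, judge_model)
    if key not in cache:
        cache[key] = score_fn(question, contexts, answer)
    return cache[key]


calls = []

def fake_judge(question, contexts, answer):
    calls.append(question)  # record each call that would have cost money
    return 0.9


cache = {}
args = (
    "What's the SLA for tier-3 support?",
    ["Tier-3 support has a 4-hour response SLA during business hours."],
    "4 hours during business hours",
    "faithfulness",
    "gpt-3.5-turbo",
)
cached_score(cache, fake_judge, *args)
cached_score(cache, fake_judge, *args)
print(len(calls))  # 1
```

Including the judge model in the key means switching judges invalidates old scores instead of silently reusing them.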
Common Pitfalls
- Evaluating against synthetic queries. LLM-generated questions look diverse but cluster around easy cases. Use real production queries.
- Treating faithfulness as correctness. They’re not the same. A faithful answer can still be wrong if the retrieved chunks contained wrong information. Evaluate both.
- LLM-as-judge on every commit. It’s $1+ per run. Sample or cache.
- Skipping the per-cohort analysis. Overall score is meaningless if specific query types regressed. Slice by tag.
- No baseline on the eval set. Compare against a snapshot pinned to a commit, not against absolute targets. “5% drop from main” is more useful than “below 0.85.”
- Eval set rot. Refresh every quarter. Eval sets from 6 months ago test a different system than you’re shipping.
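Slicing by tag takes little code once each eval row carries the tags from the eval set. This is a minimal sketch with invented scores, and `mean_score_by_tag` is a hypothetical helper. A question with several tags counts toward each of them.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_tag(rows):
    """rows: dicts with 'tags' (list of str) and 'score' (float). Returns tag -> mean."""
    buckets = defaultdict(list)
    for row in rows:
        for tag in row["tags"]:
            buckets[tag].append(row["score"])
    return {tag: round(mean(scores), 3) for tag, scores in buckets.items()}


# Invented per-question scores for illustration.
rows = [
    {"tags": ["support", "sla"], "score": 0.95},
    {"tags": ["support"], "score": 0.90},
    {"tags": ["deploy", "branches"], "score": 0.40},
    {"tags": ["deploy"], "score": 0.55},
]
by_tag = mean_score_by_tag(rows)
overall = round(mean(row["score"] for row in rows), 3)
print(overall)  # 0.7
print(by_tag["support"], by_tag["deploy"])  # 0.925 0.475
```

An overall 0.7 looks tolerable, but the split shows support questions doing well while deploy questions fail about half the time.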
The mistake I personally made: I shipped a RAG system without an eval set, called retrieval “good,” and then couldn’t tell when re-ranking changes regressed precision. Built the eval set retroactively at month four. Should have been month zero.
Wrapping Up
Evaluation is what turns RAG from a demo into engineering. Ragas, TruLens, and DeepEval all work. Build a labeled eval set first, wire cheap metrics into PRs, run the LLM-as-judge nightly, sample production traffic for drift detection. This month’s arc — the failure modes, vector DBs, embeddings, chunking, hybrid search, security, re-ranking, and evaluation — adds up to a 2024 production playbook.
March moves to Rust in production, which is a different beast entirely. The Ragas documentation covers each metric in depth if you want the canonical references.