Evaluating LLM Agents, From Vibes to Regression Suites

May 24, 2024 · 8 min read · by Muhammad Amal

TL;DR: Three test types: deterministic checks on tool calls, LLM-as-judge for free-text quality, and end-to-end trajectory matching. Run them on every model upgrade or your agent will silently regress.

Every team I’ve worked with has the same trajectory. They build an agent, eyeball some outputs, ship it. The agent breaks in interesting ways three weeks later. They write a few ad-hoc tests. The tests pass but the agent still feels off. Then a model upgrade lands (gpt-4-turbo to a new snapshot, or claude-3-sonnet to claude-3.5-sonnet whenever that arrives) and the team realizes they have no way to tell if the new model is better, worse, or just differently broken.

The way out of this loop is to treat agents like any other piece of stateful infrastructure. Tests for the deterministic parts. Rubrics for the fuzzy parts. Regression runs on every change. The tooling is good enough in May 2024 that there’s no excuse for not doing this.

This post is the eval setup I’d build today. We’ll cover the three test types, how to write them, how to run them in CI, and the workflow that makes them useful when a new model drops. I use LangSmith for the platform piece, but the code below is plain Python with pytest and the OpenAI client, and every pattern translates to Braintrust, Promptfoo, or a homegrown setup.

Three test types, three things you’re checking

A common mistake is reaching for LLM-as-judge for everything. It’s overkill for the parts you can check deterministically and underpowered for the parts that genuinely need judgment. Use the right tool per test.

Deterministic checks are for anything with a single correct answer. Did the agent call the right tool? Did it pass the right arguments? Did it terminate within N steps? These are unit tests, full stop.

LLM-as-judge is for free-text quality where there isn’t a single right answer but there are clearly better and worse responses. Is the response helpful? Does it follow the format? Does it avoid hedging when it should be confident?

Trajectory matching is for end-to-end runs where the sequence of steps matters. Did the agent take a reasonable path? Did it avoid loops? Did it ask for clarification when it should have?

Deterministic checks first

These are cheap, fast, and catch the most regressions. Start here.

import pytest
from myagent import run_agent

def test_lookup_only_uses_search_tool():
    trace = run_agent("What's the status of order ord_abc1234567?", trace=True)
    tool_calls = [s.tool_name for s in trace.steps if s.tool_name]
    assert tool_calls == ["search_orders"]

def test_refund_requires_human_approval():
    trace = run_agent("Refund order ord_abc1234567 for $40.", trace=True)
    assert any(s.kind == "awaiting_approval" for s in trace.steps)
    assert not any(s.tool_name == "create_refund" for s in trace.steps)

def test_terminates_within_step_budget():
    trace = run_agent("Tell me about my recent orders.", trace=True, max_steps=10)
    assert trace.completed
    assert len(trace.steps) <= 10

The thing to notice is that none of these check the agent’s final text response. They check structural properties of the trajectory. That’s deliberate. Structural assertions are stable across model upgrades; text-content assertions are not.

Run these in CI. They’ll catch the obvious regressions (someone changed a tool name, model picked the wrong tool, infinite loop) within seconds. Don’t gate every PR on the slow eval suite; gate every PR on these.
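
The tests above check which tools were called; the same style extends to the arguments passed. A minimal sketch, assuming each trace step exposes a `tool_args` dict. That field, the `Step` dataclass, and both helper names are hypothetical stand-ins for whatever your own trace object records:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Step:
    # Hypothetical stand-in for one recorded trace step.
    tool_name: Optional[str] = None
    tool_args: dict = field(default_factory=dict)


def find_tool_calls(steps: list[Step], tool_name: str) -> list[Step]:
    """Return every step that called the given tool, in order."""
    return [s for s in steps if s.tool_name == tool_name]


def assert_called_with(steps: list[Step], tool_name: str, expected: dict) -> None:
    """Fail unless some call to tool_name included all expected key/value pairs."""
    calls = find_tool_calls(steps, tool_name)
    assert calls, f"{tool_name} was never called"
    for call in calls:
        if all(call.tool_args.get(k) == v for k, v in expected.items()):
            return
    seen = [c.tool_args for c in calls]
    raise AssertionError(f"no {tool_name} call matched {expected}; saw {seen}")


# Hand-built trace for illustration; a real test would use run_agent(..., trace=True).steps.
steps = [
    Step(),  # a reasoning step with no tool call
    Step(tool_name="search_orders", tool_args={"order_id": "ord_abc1234567", "limit": 1}),
]
assert_called_with(steps, "search_orders", {"order_id": "ord_abc1234567"})
```

Matching on a subset of arguments keeps the assertion stable when the model adds optional parameters you don't care about.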

LLM-as-judge for the fuzzy parts

For evaluating response quality, an LLM judge is the only scalable option. The pattern that works is a rubric with discrete criteria, a strong evaluator model, and a fixed prompt template.

JUDGE_PROMPT = """You are evaluating a customer support agent's response.

User question:
{question}

Agent response:
{response}

Score the response on three axes, each 1-5:
- correctness: factually accurate based on the question
- helpfulness: addresses the user's underlying need
- format: appropriate length, no hedging, no walls of text

Return JSON only:
{{"correctness": int, "helpfulness": int, "format": int, "rationale": str}}
"""

from openai import OpenAI
import json

client = OpenAI()

def judge(question: str, response: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(question=question, response=response)}],
        response_format={"type": "json_object"},
        temperature=0,
    )
    return json.loads(resp.choices[0].message.content)

Four choices matter here. Use a different model for the judge than for the agent, which helps avoid the model rating its own work favorably; the snippet uses gpt-4-turbo as the judge, so swap it if your agent runs on the same model. Fix the judge’s temperature at 0 for reproducibility. Ask for discrete scores, not continuous ones; a 1-5 scale is more reliable than a 0.0-1.0 score. Include a rationale field that you can spot-check; if the rationale doesn’t justify the score, the judge is unreliable.
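
JSON mode guarantees a parseable JSON object, not one that matches the rubric, so it is worth validating the judge's reply before storing scores. A minimal sketch; `validate_judgment` and `AXES` are hypothetical names, and the axes are the three from the prompt above:

```python
AXES = ("correctness", "helpfulness", "format")


def validate_judgment(raw: dict) -> dict:
    """Check a parsed judge reply against the rubric; raise ValueError on any mismatch."""
    clean = {}
    for axis in AXES:
        score = raw.get(axis)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(score, bool) or not isinstance(score, int):
            raise ValueError(f"{axis} must be an integer, got {score!r}")
        if not 1 <= score <= 5:
            raise ValueError(f"{axis} out of range 1-5: {score}")
        clean[axis] = score
    rationale = raw.get("rationale")
    if not isinstance(rationale, str) or not rationale.strip():
        raise ValueError("rationale must be a non-empty string")
    clean["rationale"] = rationale.strip()
    return clean


# Hand-written reply for illustration; in practice this is the dict returned by judge().
ok = validate_judgment(
    {"correctness": 5, "helpfulness": 4, "format": 3, "rationale": "Accurate but long."}
)
```

Treat a validation error as a failed judgment to retry or flag, not as a low score; otherwise malformed replies quietly drag the averages down.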

The honest limitation of LLM-as-judge is that it tracks the model’s biases. A judge that consistently rates verbose responses higher will reward verbosity. You catch this by spot-checking a sample of judged outputs against human ratings periodically. If the agreement drops, your rubric needs work.
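
One way to put a number on that spot check is to compare judge and human scores for the same sampled outputs. A rough sketch, not a calibrated statistic: `agreement_report` is a hypothetical helper, and the ratings in the example are made up for illustration.

```python
from statistics import mean


def agreement_report(judge_scores: list[int], human_scores: list[int]) -> dict:
    """Compare judge and human 1-5 scores for the same sampled outputs."""
    if len(judge_scores) != len(human_scores) or not judge_scores:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [j - h for j, h in zip(judge_scores, human_scores)]
    return {
        # Fraction of samples where judge and human gave the same score.
        "exact_agreement": mean(1 if d == 0 else 0 for d in diffs),
        # Fraction of samples where they differ by at most one point.
        "within_one": mean(1 if abs(d) <= 1 else 0 for d in diffs),
        # Positive bias means the judge scores higher than humans on average.
        "mean_bias": mean(diffs),
    }


# Made-up ratings for illustration only.
report = agreement_report([5, 4, 4, 2, 5], [4, 4, 3, 2, 3])
```

A consistently positive `mean_bias` on one axis is a hint that the judge is rewarding something the rubric did not ask for, such as length.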

Trajectory matching for end-to-end

The hardest tests, and the most valuable. You take a set of canonical scenarios, capture the “good” trajectory once (manually verified), and on each run compare the agent’s new trajectory against the canonical one.

GOLDEN = [
    {
        "id": "order_status_basic",
        "input": "Where is my order ord_abc1234567?",
        "expected_tools": ["search_orders"],
        "expected_max_steps": 3,
        "must_mention": ["ord_abc1234567"],
        "must_not_mention": ["refund", "cancel"],
    },
    {
        "id": "refund_with_approval",
        "input": "Refund order ord_abc1234567 for damaged item, $40.",
        "expected_tools": ["search_orders", "create_refund"],
        "requires_approval_for": ["create_refund"],
        "expected_max_steps": 6,
    },
]

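def check_scenario(scenario, trace):
    """Return a list of failure messages; an empty list means the scenario passed."""
    failures = []
    actual_tools = [s.tool_name for s in trace.steps if s.tool_name]
    # Every expected tool must appear somewhere in the trajectory (order is not checked).
    for t in scenario.get("expected_tools", []):
        if t not in actual_tools:
            failures.append(f"missing tool {t}")
    # Assumption: an approval gate shows up as a step with kind "awaiting_approval",
    # as in test_refund_requires_human_approval.
    if scenario.get("requires_approval_for"):
        if not any(s.kind == "awaiting_approval" for s in trace.steps):
            failures.append("no approval step recorded")
    if len(trace.steps) > scenario.get("expected_max_steps", 999):
        failures.append(f"too many steps: {len(trace.steps)}")
    final = trace.final_response or ""
    # Case-sensitive substring checks on the final text.
    for word in scenario.get("must_mention", []):
        if word not in final:
            failures.append(f"final missing {word!r}")
    for word in scenario.get("must_not_mention", []):
        if word in final:
            failures.append(f"final mentioned forbidden {word!r}")
    return failures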

This isn’t quite testing “did the agent give the right answer.” It’s testing “did the agent take a reasonable path and end up somewhere reasonable.” That’s the right granularity for an agent. Exact-match on final text is too brittle; pure judge-based scoring is too fuzzy. Trajectory shape is the sweet spot.
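
The `check_scenario` function only asks whether each expected tool appears somewhere in the run. If the order of calls matters, or you want to flag loops, two small helpers cover it. This is a sketch under assumptions: both function names are hypothetical, and loop detection assumes you can serialize each call's arguments to a string.

```python
from collections import Counter


def tools_in_order(expected: list[str], actual: list[str]) -> bool:
    """True if expected appears in actual as an ordered, not necessarily contiguous, subsequence."""
    remaining = iter(actual)
    # `tool in remaining` consumes the iterator up to the match, which enforces ordering.
    return all(tool in remaining for tool in expected)


def repeated_calls(calls: list[tuple[str, str]], max_repeats: int = 2) -> list[tuple[str, str]]:
    """Return (tool_name, serialized_args) pairs that occur more than max_repeats times."""
    counts = Counter(calls)
    return [call for call, n in counts.items() if n > max_repeats]


# Extra tools between the expected ones are allowed; reversed order is not.
in_order = tools_in_order(
    ["search_orders", "create_refund"],
    ["search_orders", "get_policy", "create_refund"],
)
reversed_order = tools_in_order(
    ["search_orders", "create_refund"],
    ["create_refund", "search_orders"],
)

# Three identical calls in one run looks like a loop.
calls = [("search_orders", '{"order_id": "ord_abc1234567"}')] * 3 + [("create_refund", "{}")]
loops = repeated_calls(calls, max_repeats=2)
```

Allowing gaps in the sequence keeps the check tolerant of a new model that adds a harmless extra lookup, while still catching a refund issued before the order was found.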

Running it on every change

In CI on every PR, run the deterministic tests. They should pass in under 30 seconds. On every merge to main, run the full suite including LLM-as-judge and trajectory matching. That’s a 5-10 minute job for a few hundred scenarios.

On every model upgrade, rerun the full suite against the new model and compare to the last known-good run. LangSmith makes this easy with its experiment comparison view, but the same pattern works with any platform that stores per-scenario scores.

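def run_suite_against(model: str) -> dict:
    """Run every golden scenario against one model and collect per-scenario results."""
    results = {}
    for scenario in GOLDEN:
        trace = run_agent(scenario["input"], model=model, trace=True)
        results[scenario["id"]] = {
            "failures": check_scenario(scenario, trace),
            # final_response may be missing if the run did not finish; judge an empty string then.
            "judge": judge(scenario["input"], trace.final_response or ""),
            "tokens": trace.total_tokens,
            "latency_ms": trace.latency_ms,
        }
    return results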

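# Pin dated snapshots on both sides; a floating alias such as "gpt-4-turbo"
# can resolve to the same snapshot as the candidate.
baseline = run_suite_against("gpt-4-0125-preview")
candidate = run_suite_against("gpt-4-turbo-2024-04-09")
diff_report(baseline, candidate)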

The diff report is what you read before deciding whether to roll forward. Regressions in failure count are blocking. Regressions in judge scores are flags for human review. Improvements are nice but should be verified against a human-rated sample before you celebrate.
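
The snippet calls `diff_report` without defining it. One possible shape, assuming the result dicts produced by `run_suite_against` and the three judge axes from the rubric; the bucket names are my own, not from any platform:

```python
def diff_report(baseline: dict, candidate: dict) -> dict:
    """Compare two run_suite_against-style result dicts scenario by scenario."""
    report = {"blocking": [], "review": [], "improved": []}
    # Only scenarios present in both runs can be compared.
    for scenario_id in sorted(baseline.keys() & candidate.keys()):
        old, new = baseline[scenario_id], candidate[scenario_id]
        new_failures = sorted(set(new["failures"]) - set(old["failures"]))
        if new_failures:
            # Any failure the baseline did not have blocks the rollout.
            report["blocking"].append((scenario_id, new_failures))
            continue
        for axis in ("correctness", "helpfulness", "format"):
            delta = new["judge"][axis] - old["judge"][axis]
            if delta < 0:
                report["review"].append((scenario_id, axis, delta))
            elif delta > 0:
                report["improved"].append((scenario_id, axis, delta))
    return report


# Hand-written results for illustration only.
baseline = {
    "order_status_basic": {"failures": [], "judge": {"correctness": 5, "helpfulness": 4, "format": 4}},
    "refund_with_approval": {"failures": [], "judge": {"correctness": 4, "helpfulness": 4, "format": 4}},
}
candidate = {
    "order_status_basic": {"failures": [], "judge": {"correctness": 5, "helpfulness": 3, "format": 5}},
    "refund_with_approval": {
        "failures": ["missing tool create_refund"],
        "judge": {"correctness": 2, "helpfulness": 2, "format": 4},
    },
}
report = diff_report(baseline, candidate)
```

In this example the refund scenario lands in `blocking`, the helpfulness drop lands in `review`, and the format gain lands in `improved`, which mirrors the three reactions described above.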

For more context on the agent loop structures these tests exercise, see /blog/react-reflexion-planner-executor-agent-loops/. The LangSmith docs on evaluation cover the platform side in more depth if you’re starting from scratch; the patterns above work without it but the dashboards help.

Common Pitfalls

The traps that turn an eval suite into a false-confidence machine.

Testing only the happy path. Your scenarios should include the user being rude, the user being confused, the user trying to jailbreak. Production has these; your eval set should too.
Letting the eval set drift from production. Periodically sample real conversations (with consent and privacy controls) and add the interesting ones to your eval set. Otherwise your tests get stale.
Trusting LLM-as-judge without spot checks. Pull 20 judged outputs a week and human-rate them. Confirm the judge tracks reality.
No deterministic check budget. If your deterministic tests take 5 minutes, no one runs them locally. Keep them under 30 seconds or split them into a fast and a slow tier.
Single-shot evals on a non-deterministic agent. Temperature 0 helps but doesn’t eliminate variance. Run each scenario 3x and report the worst case for assertions, the median for judge scores.
Evals as a final gate, not a feedback loop. The eval should run when the developer is iterating, not just before merge. Local eval scripts that take a single scenario name are gold.
No cost tracking in evals. A new model with great accuracy at 5x the cost is not a win. Track tokens per scenario alongside quality.
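
The repeated-run advice above (worst case for assertions, median for judge scores) and the token tracking can share one aggregation step. A minimal sketch, assuming each run is a dict shaped like one entry from `run_suite_against`; `aggregate_runs` is a hypothetical name:

```python
from statistics import median


def aggregate_runs(runs: list[dict]) -> dict:
    """Combine repeated runs of one scenario: worst case for failures, median for the rest."""
    if not runs:
        raise ValueError("need at least one run")
    # Worst case: a failure in any run counts against the scenario.
    all_failures = sorted({f for run in runs for f in run["failures"]})
    axes = ("correctness", "helpfulness", "format")
    return {
        "failures": all_failures,
        "judge": {axis: median(run["judge"][axis] for run in runs) for axis in axes},
        "tokens": median(run["tokens"] for run in runs),
    }


# Three made-up runs of the same scenario.
runs = [
    {"failures": [], "judge": {"correctness": 5, "helpfulness": 4, "format": 4}, "tokens": 1200},
    {"failures": ["too many steps: 7"], "judge": {"correctness": 4, "helpfulness": 5, "format": 4}, "tokens": 1500},
    {"failures": [], "judge": {"correctness": 5, "helpfulness": 3, "format": 4}, "tokens": 1100},
]
summary = aggregate_runs(runs)
```

Here one run out of three exceeded the step budget, so the scenario is reported as failing even though the median judge scores look healthy.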

Wrapping Up

Evaluating agents is one of those areas where the discipline is more important than the tools. You can build a perfectly good eval suite with pytest, an LLM judge, and a CSV of scenarios. You can also pay for a fancy platform that gives you dashboards. Either works. What doesn’t work is shipping an agent with no systematic evaluation and hoping you’ll notice when it breaks.

The thing model providers don’t tell you when they ship a new snapshot is that your agent’s behavior will subtly shift. Tool selection changes, response formats drift, edge cases that worked before now don’t. Without a regression suite, you find out from users. With one, you find out in CI. That’s the entire argument.

Start with the deterministic checks. They catch the largest fraction of regressions for the least effort. Add LLM-as-judge once you have something to measure. Add trajectory matching last, when your scenarios are stable enough to be worth canonicalizing. The order matters; trying to start with trajectory matching on a still-evolving agent will burn time you don’t have.
