Generating and Maintaining Test Suites with Agent Loops
TL;DR — A single-shot prompt produces shallow tests / an agent loop reads the coverage report, targets uncovered branches, and iterates until a budget runs out / pytest 8 and coverage.py are the feedback signal that closes the loop.
Ask a model once for tests and you get tests for the obvious path. The happy case is covered, maybe one error. The branch where a retry exhausts its budget, the elif that handles a malformed header, the early return when the cache is cold — those stay dark. Not because the model can’t write them, but because nothing told it they exist.
Ai generated test suites become genuinely useful when you wrap the generation in a loop with a feedback signal. The signal is coverage.py’s branch report. Each iteration: run the suite, parse which branches are still uncovered, hand the model exactly those branches with their surrounding source, ask for tests that hit them. The loop terminates when coverage clears a threshold or a budget runs out. The model stops guessing what to test because coverage tells it precisely where the holes are.
I’ll be blunt about the limitation up front: this generates tests that exercise branches, not tests that assert correct behavior. It’s a coverage tool, not a correctness tool. For new code, pair it with the spec-first workflow in the AI-driven TDD post . Where this loop shines is the unglamorous job nobody volunteers for — backfilling coverage on a legacy module that has 30% coverage and a deadline.
The loop, conceptually
run pytest+coverage -> parse branch report
-> any uncovered branches AND budget left?
yes -> pick top-N branches, fetch source, generate tests, write file
no -> stop
Four moving parts: a runner, a coverage parser, a target picker, and a generator. Each iteration is one turn. State carried between turns is just the coverage delta and the spent budget.
Project setup
# pyproject.toml
[project]
name = "coverage-agent"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = ["anthropic==0.49.0"]
[project.optional-dependencies]
dev = ["pytest==8.3.5", "pytest-cov==6.1.1", "coverage==7.8.0"]
[tool.coverage.run]
branch = true # branch coverage, not just line coverage
source = ["app"]
[tool.coverage.report]
show_missing = true
skip_covered = true
branch = true is the whole point. Line coverage tells you a line ran; branch coverage tells you both sides of every if ran. An agent loop driven by line coverage stops too early, satisfied that a conditional’s line executed once.
Running the suite and emitting JSON
coverage.py can emit a structured JSON report. That’s far easier to parse reliably than scraping the terminal table.
# loop/runner.py
import json
import subprocess
import sys
from pathlib import Path
def run_coverage() -> dict:
"""Run the suite, return coverage.py's JSON report as a dict."""
subprocess.run(
[sys.executable, "-m", "pytest", "-q",
"--cov=app", "--cov-branch", "--cov-report="],
capture_output=True, text=True,
)
# Generate the JSON report from the just-written .coverage file.
subprocess.run(
[sys.executable, "-m", "coverage", "json", "-o", "coverage.json"],
capture_output=True, text=True, check=True,
)
data = json.loads(Path("coverage.json").read_text())
return data
def suite_is_green() -> bool:
"""A failing suite means generated tests broke — stop the loop."""
result = subprocess.run(
[sys.executable, "-m", "pytest", "-q", "--no-header"],
capture_output=True, text=True,
)
return result.returncode == 0
Parsing uncovered branches
The JSON report gives per-file missing_branches as pairs of line numbers. Turn those into concrete targets the model can act on.
# loop/coverage_parser.py
from dataclasses import dataclass
from pathlib import Path
@dataclass
class BranchTarget:
file: str
source_line: int # line where the branch decision lives
snippet: str # ~20 lines of context around it
pct: float # file-level branch coverage, for prioritising
def parse_targets(report: dict, context: int = 10) -> list[BranchTarget]:
targets: list[BranchTarget] = []
for path, file_data in report["files"].items():
summary = file_data["summary"]
pct = summary["percent_covered"]
if pct >= 100:
continue
src_lines = Path(path).read_text().splitlines()
seen: set[int] = set()
# missing_branches: list of [from_line, to_line] pairs.
for from_line, _to in file_data.get("missing_branches", []):
if from_line in seen:
continue
seen.add(from_line)
lo = max(0, from_line - context - 1)
hi = min(len(src_lines), from_line + context)
snippet = "\n".join(
f"{i + 1:4d}| {src_lines[i]}" for i in range(lo, hi)
)
targets.append(BranchTarget(path, from_line, snippet, pct))
# Lowest-coverage files first — biggest wins per iteration.
targets.sort(key=lambda t: t.pct)
return targets
def overall_branch_pct(report: dict) -> float:
return report["totals"]["percent_covered"]
Numbered source lines in the snippet matter. When the model sees 42| if retries > max_retries: it can write a test that drives retries past the limit and reference the exact line in a comment. Naked source without line numbers makes the model guess.
Generating tests for specific branches
The generator gets a batch of targets, not the whole file. Each prompt is bounded, which keeps cost predictable and the model focused.
# loop/generator.py
import os
import re
from anthropic import Anthropic, APIStatusError
from loop.coverage_parser import BranchTarget
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
SYSTEM = """You write pytest 8 tests that exercise specific uncovered branches.
For each target you are given numbered source and a branch line.
Rules:
- Write one test per uncovered branch. Name it test_<behavior>_line_<N>.
- Drive inputs so the branch at the given line is taken.
- Add assertions on the observable result. Never write `assert True`.
- Import the module under test by its real dotted path.
- Use fixtures and parametrize where it reduces duplication.
- Output ONLY Python, no prose, no markdown fences."""
def generate_tests(targets: list[BranchTarget], batch: int = 6) -> str:
chosen = targets[:batch]
blocks = []
for t in chosen:
blocks.append(
f"--- target: {t.file}, branch at line {t.source_line} ---\n"
f"{t.snippet}"
)
prompt = (
"Write tests covering these uncovered branches:\n\n"
+ "\n\n".join(blocks)
)
try:
resp = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=3000,
temperature=0,
system=SYSTEM,
messages=[{"role": "user", "content": prompt}],
)
except APIStatusError as exc:
raise RuntimeError(f"generation failed: {exc}") from exc
return _strip_fences(resp.content[0].text)
def _strip_fences(text: str) -> str:
m = re.search(r"```(?:python)?\n(.*)\n```", text, re.DOTALL)
return m.group(1) if m else text.strip()
The loop driver
This ties it together: run, parse, check termination, generate, write, repeat. The termination conditions are a coverage target, an iteration cap, and a no-progress guard.
# loop/driver.py
from pathlib import Path
from loop.runner import run_coverage, suite_is_green
from loop.coverage_parser import parse_targets, overall_branch_pct
from loop.generator import generate_tests
TARGET_PCT = 90.0
MAX_ITERATIONS = 8
GEN_DIR = Path("tests/generated")
def run_loop() -> None:
GEN_DIR.mkdir(parents=True, exist_ok=True)
previous_pct = -1.0
for iteration in range(1, MAX_ITERATIONS + 1):
report = run_coverage()
pct = overall_branch_pct(report)
print(f"iteration {iteration}: branch coverage {pct:.1f}%")
if pct >= TARGET_PCT:
print(f"target {TARGET_PCT}% reached")
return
# No-progress guard: a stuck loop just burns tokens.
if pct <= previous_pct + 0.5 and iteration > 1:
print("coverage plateaued — remaining branches need human input")
return
previous_pct = pct
targets = parse_targets(report)
if not targets:
print("no uncovered branches left to target")
return
code = generate_tests(targets)
out = GEN_DIR / f"test_gen_iter_{iteration}.py"
out.write_text(code + "\n")
# Reject the batch if it broke the suite.
if not suite_is_green():
print(f"generated batch failed pytest — discarding {out.name}")
out.unlink()
return
print(f"hit iteration cap ({MAX_ITERATIONS})")
if __name__ == "__main__":
run_loop()
The no-progress guard earns its place. Some branches are genuinely unreachable from tests — defensive assert statements, if TYPE_CHECKING blocks, error paths that need a corrupted database. The loop will spin on those forever. When coverage moves less than half a percent in an iteration, stop and tell a human which branches are left.
The green check after each batch is equally important. A generated test that fails — wrong import, bad assumption about a fixture — must not land. Discard the batch and bail rather than poisoning the suite.
Maintaining the suite over time
Generation is half the job. The other half is keeping generated tests honest as the code changes. Two practices:
First, segregate generated tests into tests/generated/ so a human glancing at the suite knows which tests had a person’s judgment behind them. Second, run a periodic re-validation in CI: if a generated test starts failing after a code change, it’s either caught a real regression or encoded a stale assumption. Surface it for review instead of auto-deleting.
# .github/workflows/coverage-maintenance.yml
name: Coverage Maintenance
on:
schedule:
- cron: "0 3 * * 1" # Monday 03:00, low-traffic window
workflow_dispatch:
jobs:
backfill:
runs-on: ubuntu-latest
timeout-minutes: 20
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: "3.12"
- run: pip install -e ".[dev]"
- name: Run coverage agent loop
env:
ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
run: python -m loop.driver
- name: Open PR with new tests
run: |
git config user.name "coverage-agent"
git config user.email "[email protected]"
git checkout -b coverage/backfill-$(date +%Y%m%d)
git add tests/generated/
git diff --cached --quiet || {
git commit -m "test, backfill branch coverage"
gh pr create --fill --label automated
}
env:
GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
Generated tests land in a PR, not on main directly. A human reads the assertions before they become part of the suite’s contract. The coverage.py documentation
has the full branch-coverage reference if you need to tune what counts.
Common Pitfalls
Line coverage instead of branch coverage. Line coverage at 100% can still miss half the conditionals. Set branch = true.
No no-progress guard. Unreachable branches make the loop spin until the iteration cap, burning tokens for zero gain. Bail when coverage plateaus.
Auto-merging generated tests. Tests that pass aren’t tests that assert the right thing. Route them through a PR with human review.
Letting assert True through. A test that takes a branch but asserts nothing inflates the number and protects nothing. The system prompt forbids it; spot-check anyway.
Whole-file prompts. Sending the entire file per iteration is slow and expensive. Send numbered snippets around the uncovered branch lines only.
Troubleshooting
Symptom: coverage.json not found. Cause: pytest ran with --cov-report= (empty) but the coverage json step didn’t run or .coverage wasn’t written. Fix: confirm pytest exited having collected tests, then run coverage json against the .coverage file.
Symptom: loop reports 100% but obvious branches are untested. Cause: branch coverage is off; you’re seeing line coverage. Fix: add --cov-branch to the pytest call and branch = true in config.
Symptom: every generated batch fails the green check. Cause: the model is importing the module by the wrong path, or assuming fixtures that don’t exist. Fix: include the real dotted import path in each target block and pass a conftest.py excerpt listing available fixtures.
Symptom: coverage climbs then drops between iterations. Cause: a later batch shadowed test names from an earlier file. Fix: the iteration-numbered filenames prevent module collisions; check for duplicate test function names within a file and tell the model to suffix with the line number.
Symptom: loop never terminates early on legacy code. Cause: many defensive branches are unreachable from tests. Fix: the no-progress guard handles it; consider # pragma: no cover on genuinely defensive lines so they leave the target list.
Wrapping Up
An agent loop turns test generation from a one-shot guess into a closed feedback system, with coverage.py supplying the signal that tells the model exactly where to aim. Keep the loop honest with a no-progress guard, a green check, and a human-reviewed PR at the end. Next, layer in mutation testing so the loop optimizes for assertions that actually catch bugs, not just branches that happen to execute.