Generating and Maintaining Test Suites with Agent Loops


February 24, 2026 · 9 min read · by Muhammad Amal programming

TL;DR: A single-shot prompt produces shallow tests. An agent loop reads the coverage report, targets uncovered branches, and iterates until a budget runs out. pytest 8 and coverage.py are the feedback signal that closes the loop.

Ask a model once for tests and you get tests for the obvious path. The happy case is covered, maybe one error. The branch where a retry exhausts its budget, the elif that handles a malformed header, the early return when the cache is cold — those stay dark. Not because the model can’t write them, but because nothing told it they exist.

AI-generated test suites become genuinely useful when you wrap the generation in a loop with a feedback signal. The signal is coverage.py’s branch report. Each iteration: run the suite, parse which branches are still uncovered, hand the model exactly those branches with their surrounding source, ask for tests that hit them. The loop terminates when coverage clears a threshold or a budget runs out. The model stops guessing what to test because coverage tells it precisely where the holes are.

I’ll be blunt about the limitation up front: this generates tests that exercise branches, not tests that assert correct behavior. It’s a coverage tool, not a correctness tool. For new code, pair it with the spec-first workflow in the AI-driven TDD post. Where this loop shines is the unglamorous job nobody volunteers for: backfilling coverage on a legacy module that sits at 30% coverage with a deadline looming.

The loop, conceptually

run pytest+coverage -> parse branch report
   -> any uncovered branches AND budget left?
        yes -> pick top-N branches, fetch source, generate tests, write file
        no  -> stop

Four moving parts: a runner, a coverage parser, a target picker, and a generator. Each iteration is one turn. State carried between turns is just the coverage delta and the spent budget.

Project setup

# pyproject.toml
[project]
name = "coverage-agent"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = ["anthropic==0.49.0"]

[project.optional-dependencies]
dev = ["pytest==8.3.5", "pytest-cov==6.1.1", "coverage==7.8.0"]

[tool.coverage.run]
branch = true            # branch coverage, not just line coverage
source = ["app"]

[tool.coverage.report]
show_missing = true
skip_covered = true

branch = true is the whole point. Line coverage tells you a line ran; branch coverage tells you both sides of every if ran. An agent loop driven by line coverage stops too early, satisfied that a conditional’s line executed once.
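To make the difference concrete, here is a minimal sketch. The function and test names are invented for illustration and are not part of the project above. The single test executes every line of the function, so line coverage reports 100%, yet the path that skips the if body is never taken. Only branch coverage flags that.

```python
def apply_discount(price_cents: int, is_member: bool) -> int:
    """Toy function (hypothetical): one `if` with no `else`."""
    if is_member:
        price_cents = price_cents - price_cents // 10
    return price_cents


def test_member_gets_discount():
    # Every line of apply_discount runs here, so line coverage is satisfied.
    # The "condition is False, skip the body" branch is never exercised;
    # with branch = true that jump shows up as a missing branch.
    assert apply_discount(1000, True) == 900
```

A second test with is_member=False is what closes the gap, and it is exactly the kind of test the loop is meant to request.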

Running the suite and emitting JSON

coverage.py can emit a structured JSON report. That’s far easier to parse reliably than scraping the terminal table.

# loop/runner.py
import json
import subprocess
import sys
from pathlib import Path


def run_coverage() -> dict:
    """Run the suite, return coverage.py's JSON report as a dict."""
    subprocess.run(
        [sys.executable, "-m", "pytest", "-q",
         "--cov=app", "--cov-branch", "--cov-report="],
        capture_output=True, text=True,
    )
    # Generate the JSON report from the just-written .coverage file.
    subprocess.run(
        [sys.executable, "-m", "coverage", "json", "-o", "coverage.json"],
        capture_output=True, text=True, check=True,
    )
    data = json.loads(Path("coverage.json").read_text())
    return data


def suite_is_green() -> bool:
    """A failing suite means generated tests broke — stop the loop."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "-q", "--no-header"],
        capture_output=True, text=True,
    )
    return result.returncode == 0

Parsing uncovered branches

The JSON report gives per-file missing_branches as pairs of line numbers. Turn those into concrete targets the model can act on.

# loop/coverage_parser.py
from dataclasses import dataclass
from pathlib import Path


@dataclass
class BranchTarget:
    file: str
    source_line: int       # line where the branch decision lives
    snippet: str           # ~20 lines of context around it
    pct: float             # file-level percent_covered (lines and branches combined), for prioritising


def parse_targets(report: dict, context: int = 10) -> list[BranchTarget]:
    targets: list[BranchTarget] = []
    for path, file_data in report["files"].items():
        summary = file_data["summary"]
        pct = summary["percent_covered"]
        if pct >= 100:
            continue

        src_lines = Path(path).read_text().splitlines()
        seen: set[int] = set()
        # missing_branches: list of [from_line, to_line] pairs.
        for from_line, _to in file_data.get("missing_branches", []):
            if from_line in seen:
                continue
            seen.add(from_line)
            lo = max(0, from_line - context - 1)
            hi = min(len(src_lines), from_line + context)
            snippet = "\n".join(
                f"{i + 1:4d}| {src_lines[i]}" for i in range(lo, hi)
            )
            targets.append(BranchTarget(path, from_line, snippet, pct))

    # Lowest-coverage files first — biggest wins per iteration.
    targets.sort(key=lambda t: t.pct)
    return targets


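def overall_branch_pct(report: dict) -> float:
    # totals["percent_covered"] blends line and branch coverage, so compute
    # the branch-only figure from the branch counters when they are present.
    totals = report["totals"]
    num_branches = totals.get("num_branches", 0)
    if not num_branches:
        return totals["percent_covered"]
    return 100.0 * totals.get("covered_branches", 0) / num_branches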

Numbered source lines in the snippet matter. When the model sees 42| if retries > max_retries: it can write a test that drives retries past the limit and reference the exact line in a comment. Naked source without line numbers makes the model guess.
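The parser above keeps only the source line of each pair and drops the destination. If you want the prompt to say which side of the decision is missing, the destination is worth keeping. The sketch below assumes coverage.py's arc convention, where a negative destination means an exit from the code object that starts at that line; the sample pairs are hand-written, not taken from a real report, and describe_branch and group_by_source are hypothetical helper names.

```python
from collections import defaultdict


def describe_branch(from_line: int, to_line: int) -> str:
    """Render one [from_line, to_line] pair as text the model can act on."""
    if to_line < 0:
        # Assumption: a negative destination is an exit from the code
        # object (function or module) that begins at line abs(to_line).
        return f"line {from_line} -> exit of the block starting at line {-to_line}"
    return f"line {from_line} -> line {to_line}"


def group_by_source(pairs: list[list[int]]) -> dict[int, list[str]]:
    """Group missing branches by the line where the decision lives."""
    grouped: dict[int, list[str]] = defaultdict(list)
    for from_line, to_line in pairs:
        grouped[from_line].append(describe_branch(from_line, to_line))
    return dict(grouped)


# Hand-written sample shaped like a file's "missing_branches" entry.
sample = [[12, 15], [12, -8], [30, 31]]
for line, descriptions in group_by_source(sample).items():
    print(line, descriptions)
```

One entry per decision line, with every missing destination listed, tells the model whether it needs to take the branch, skip it, or both.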

Generating tests for specific branches

The generator gets a batch of targets, not the whole file. Each prompt is bounded, which keeps cost predictable and the model focused.

# loop/generator.py
import os
import re
from anthropic import Anthropic, APIStatusError
from loop.coverage_parser import BranchTarget

client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

SYSTEM = """You write pytest 8 tests that exercise specific uncovered branches.
For each target you are given numbered source and a branch line.
Rules:
- Write one test per uncovered branch. Name it test_<behavior>_line_<N>.
- Drive inputs so the branch at the given line is taken.
- Add assertions on the observable result. Never write `assert True`.
- Import the module under test by its real dotted path.
- Use fixtures and parametrize where it reduces duplication.
- Output ONLY Python, no prose, no markdown fences."""


def generate_tests(targets: list[BranchTarget], batch: int = 6) -> str:
    chosen = targets[:batch]
    blocks = []
    for t in chosen:
        blocks.append(
            f"--- target: {t.file}, branch at line {t.source_line} ---\n"
            f"{t.snippet}"
        )
    prompt = (
        "Write tests covering these uncovered branches:\n\n"
        + "\n\n".join(blocks)
    )
    try:
        resp = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=3000,
            temperature=0,
            system=SYSTEM,
            messages=[{"role": "user", "content": prompt}],
        )
    except APIStatusError as exc:
        raise RuntimeError(f"generation failed: {exc}") from exc
    return _strip_fences(resp.content[0].text)


def _strip_fences(text: str) -> str:
    m = re.search(r"```(?:python)?\n(.*)\n```", text, re.DOTALL)
    return m.group(1) if m else text.strip()
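A fixed batch of six targets bounds the prompt only if the snippets are similar in size. One way to make the bound explicit is to cap the batch by total characters as well as by count. This is a sketch under my own assumptions: take_batch and the 8000-character default are illustrative choices, not something generator.py does.

```python
def take_batch(snippets: list[str], max_items: int = 6,
               max_chars: int = 8000) -> list[str]:
    """Pick snippets in order until either cap would be exceeded.

    Always returns at least one snippet when any exist, so a single
    oversized target cannot stall the loop.
    """
    batch: list[str] = []
    used = 0
    for snippet in snippets:
        if len(batch) >= max_items:
            break
        if batch and used + len(snippet) > max_chars:
            break
        batch.append(snippet)
        used += len(snippet)
    return batch


# Ten 10-character snippets with a 35-character cap: three fit.
print(len(take_batch(["x" * 10] * 10, max_items=6, max_chars=35)))
```

Characters are a rough proxy for tokens, which is enough for keeping a batch from ballooning.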

The loop driver

This ties it together: run, parse, check termination, generate, write, repeat. The termination conditions are a coverage target, an iteration cap, and a no-progress guard.

# loop/driver.py
from pathlib import Path
from loop.runner import run_coverage, suite_is_green
from loop.coverage_parser import parse_targets, overall_branch_pct
from loop.generator import generate_tests

TARGET_PCT = 90.0
MAX_ITERATIONS = 8
GEN_DIR = Path("tests/generated")


def run_loop() -> None:
    GEN_DIR.mkdir(parents=True, exist_ok=True)
    previous_pct = -1.0

    for iteration in range(1, MAX_ITERATIONS + 1):
        report = run_coverage()
        pct = overall_branch_pct(report)
        print(f"iteration {iteration}: branch coverage {pct:.1f}%")

        if pct >= TARGET_PCT:
            print(f"target {TARGET_PCT}% reached")
            return

        # No-progress guard: a stuck loop just burns tokens.
        if pct <= previous_pct + 0.5 and iteration > 1:
            print("coverage plateaued — remaining branches need human input")
            return
        previous_pct = pct

        targets = parse_targets(report)
        if not targets:
            print("no uncovered branches left to target")
            return

        code = generate_tests(targets)
        out = GEN_DIR / f"test_gen_iter_{iteration}.py"
        out.write_text(code + "\n")

        # Reject the batch if it broke the suite.
        if not suite_is_green():
            print(f"generated batch failed pytest — discarding {out.name}")
            out.unlink()
            return

    print(f"hit iteration cap ({MAX_ITERATIONS})")


if __name__ == "__main__":
    run_loop()

The no-progress guard earns its place. Some branches are genuinely unreachable from tests: defensive assert statements, if TYPE_CHECKING blocks, error paths that need a corrupted database. Without the guard, the loop would keep spending tokens on those until it hit the iteration cap. When coverage moves by half a point or less in an iteration, stop and tell a human which branches are left.
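The termination checks are easier to reason about, and to unit test, when pulled out of the driver into a pure function. The sketch below mirrors the order of checks in run_loop; stop_reason and its string return values are hypothetical names of mine, not part of the driver as written.

```python
from typing import Optional


def stop_reason(pct: float, previous_pct: float, iteration: int,
                target: float = 90.0, min_gain: float = 0.5) -> Optional[str]:
    """Return why the loop should stop, or None to keep going."""
    if pct >= target:
        return "target"
    # The first iteration has no previous measurement to compare against.
    if iteration > 1 and pct <= previous_pct + min_gain:
        return "plateau"
    return None


print(stop_reason(60.2, 60.0, iteration=2))  # gain of 0.2 points
print(stop_reason(70.0, 60.0, iteration=2))  # gain of 10 points
```

With the decision isolated like this, changing the plateau threshold is a one-argument change and a one-line test.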

The green check after each batch is equally important. A generated test that fails — wrong import, bad assumption about a fixture — must not land. Discard the batch and bail rather than poisoning the suite.

Maintaining the suite over time

Generation is half the job. The other half is keeping generated tests honest as the code changes. Two practices:

First, segregate generated tests into tests/generated/ so a human glancing at the suite knows which tests had a person’s judgment behind them. Second, run a periodic re-validation in CI: if a generated test starts failing after a code change, it’s either caught a real regression or encoded a stale assumption. Surface it for review instead of auto-deleting.

# .github/workflows/coverage-maintenance.yml
name: Coverage Maintenance
on:
  schedule:
    - cron: "0 3 * * 1"   # Monday 03:00, low-traffic window
  workflow_dispatch:

jobs:
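  backfill:
    runs-on: ubuntu-latest
    timeout-minutes: 20
    permissions:
      contents: write        # push the backfill branch
      pull-requests: write   # open the PR with GITHUB_TOKEN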
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[dev]"
      - name: Run coverage agent loop
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: python -m loop.driver
      - name: Open PR with new tests
        run: |
          git config user.name "coverage-agent"
          git config user.email "[email protected]"
          git checkout -b coverage/backfill-$(date +%Y%m%d)
          git add tests/generated/
          git diff --cached --quiet || {
            git commit -m "test, backfill branch coverage"
            # gh pr create needs the branch to exist on the remote first.
            git push --set-upstream origin HEAD
            gh pr create --fill --label automated
          }
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}

Generated tests land in a PR, not on main directly. A human reads the assertions before they become part of the suite’s contract. The coverage.py documentation has the full branch-coverage reference if you need to tune what counts.
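For the re-validation half, one low-effort way to surface failing generated tests is to run pytest with --junitxml and list the failures for a reviewer. The sketch below parses that XML with the standard library. The sample document is hand-written to match the JUnit shape pytest emits, and failing_tests is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def failing_tests(junit_xml: str) -> list[str]:
    """Return 'classname::name' for every testcase with a failure or error."""
    root = ET.fromstring(junit_xml)
    failed = []
    for case in root.iter("testcase"):
        if case.find("failure") is not None or case.find("error") is not None:
            failed.append(f"{case.get('classname')}::{case.get('name')}")
    return failed


# Hand-written sample, not output from a real run.
SAMPLE = """<testsuites>
  <testsuite name="pytest" tests="2" failures="1">
    <testcase classname="tests.generated.test_gen_iter_1" name="test_retry_line_42"/>
    <testcase classname="tests.generated.test_gen_iter_1" name="test_header_line_57">
      <failure message="assert 400 == 422">traceback text</failure>
    </testcase>
  </testsuite>
</testsuites>"""

print(failing_tests(SAMPLE))
```

In CI this would follow something like python -m pytest tests/generated --junitxml=generated.xml, with the resulting list posted to the PR or an issue so a human decides whether each failure is a regression or a stale assumption.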

Common Pitfalls

Line coverage instead of branch coverage. Line coverage at 100% can still miss half the conditionals. Set branch = true.

No no-progress guard. Unreachable branches make the loop spin until the iteration cap, burning tokens for zero gain. Bail when coverage plateaus.

Auto-merging generated tests. Tests that pass aren’t tests that assert the right thing. Route them through a PR with human review.

Letting assert True through. A test that takes a branch but asserts nothing inflates the number and protects nothing. The system prompt forbids it; spot-check anyway.

Whole-file prompts. Sending the entire file per iteration is slow and expensive. Send numbered snippets around the uncovered branch lines only.

Troubleshooting

Symptom: coverage.json not found. Cause: pytest ran with --cov-report= (empty) but the coverage json step didn’t run or .coverage wasn’t written. Fix: confirm pytest exited having collected tests, then run coverage json against the .coverage file.

Symptom: loop reports 100% but obvious branches are untested. Cause: branch coverage is off; you’re seeing line coverage. Fix: add --cov-branch to the pytest call and branch = true in config.

Symptom: every generated batch fails the green check. Cause: the model is importing the module by the wrong path, or assuming fixtures that don’t exist. Fix: include the real dotted import path in each target block and pass a conftest.py excerpt listing available fixtures.

Symptom: coverage climbs then drops between iterations. Cause: duplicate test function names. Within one module a later definition silently replaces an earlier one, so a test that used to cover a branch stops running. Fix: the iteration-numbered filenames already prevent collisions between modules; check for duplicate test function names within a file and tell the model to suffix each name with the line number.

Symptom: loop never terminates early on legacy code. Cause: many defensive branches are unreachable from tests. Fix: the no-progress guard handles it; consider # pragma: no cover on genuinely defensive lines so they leave the target list.
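For the import-path fix above, the dotted path can usually be derived from the file path in the coverage report rather than guessed by the model. A minimal sketch, assuming the report's paths are relative to the project root and that the packages are importable from there; dotted_module_path is a hypothetical helper, not part of the loop as written.

```python
from pathlib import PurePosixPath


def dotted_module_path(file_path: str) -> str:
    """Turn 'app/services/retry.py' into 'app.services.retry'."""
    # Normalise Windows separators so the same logic works everywhere.
    path = PurePosixPath(file_path.replace("\\", "/")).with_suffix("")
    parts = list(path.parts)
    # A package's __init__.py is imported by the package name itself.
    if parts and parts[-1] == "__init__":
        parts.pop()
    return ".".join(parts)


print(dotted_module_path("app/services/retry.py"))
```

Adding that string to each target block gives the model the import line instead of leaving it to infer one from the snippet.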

Wrapping Up

An agent loop turns test generation from a one-shot guess into a closed feedback system, with coverage.py supplying the signal that tells the model exactly where to aim. Keep the loop honest with a no-progress guard, a green check, and a human-reviewed PR at the end. Next, layer in mutation testing so the loop optimizes for assertions that actually catch bugs, not just branches that happen to execute.

Related posts

AI-Driven Test-Driven Development, A Practical Workflow
