AI-Driven Test-Driven Development, A Practical Workflow
TL;DR — Use the model to write the failing test from a spec, then write the implementation yourself / pytest 8 is the contract both you and the model agree on / never let the same prompt produce the test and the code in one shot.
The fastest way to get a useless test suite is to ask a model to “write tests for this function.” It reads the implementation, mirrors it back as assertions, and produces tests that pass by construction. They prove the code does what the code does. They catch nothing.
Ai-driven test-driven development inverts that. The model never sees the implementation when it writes the test, because the implementation doesn’t exist yet. You hand it a behavioral spec — inputs, outputs, edge cases, error modes — and it produces a failing test. Then you, the engineer, write the code that turns the test green. The model is constrained by the same red-green-refactor discipline that makes classic TDD work, and you keep the part of the job that requires judgment.
I’ve run this workflow on a payments reconciliation service for the better part of a year. It’s not faster than writing tests by hand on trivial code. It’s dramatically faster on the gnarly stuff — currency rounding, retry semantics, partial failure — where enumerating edge cases is the hard part and the model is genuinely good at enumeration. This post is the concrete workflow with pytest 8 and the Anthropic API. If you want the broader CI picture, the AI code review pipeline post is a good companion.
The core rule, separate the two prompts
The single most important rule: the prompt that writes the test and the prompt that writes the implementation must be different conversations with no shared context. If the model sees its own implementation while writing tests, it tests the implementation. If it sees its own tests while writing the implementation, it overfits to them. Two firewalled steps. The spec is the only thing that crosses the boundary.
That spec is the artifact you actually maintain. It’s a structured description of behavior:
# specs/reconcile.yml
function: reconcile_batch
module: payments.reconcile
signature: "reconcile_batch(ledger: list[Entry], statement: list[Entry]) -> ReconResult"
behavior: >
Match ledger entries to bank statement entries by (amount, date_within_2_days).
Unmatched ledger entries become 'pending'. Unmatched statement entries
become 'unexplained'. Amounts compared at cent precision, never float ==.
edge_cases:
- empty ledger and empty statement -> empty result, no error
- duplicate amounts on the same day -> match greedily, oldest first
- statement entry dated before any ledger entry -> unexplained
- amount mismatch of 1 cent -> NOT a match
errors:
- ledger contains a None entry -> raise ValueError("null ledger entry")
- negative amount in statement -> raise ValueError("negative statement amount")
This file is human-authored. It’s the contract. Both prompts get it; neither prompt gets the other’s output.
Project setup
# pyproject.toml
[project]
name = "reconcile-service"
version = "0.2.0"
requires-python = ">=3.12"
dependencies = ["anthropic==0.49.0", "pyyaml==6.0.2"]
[project.optional-dependencies]
dev = ["pytest==8.3.5", "pytest-cov==6.1.1"]
[tool.pytest.ini_options]
addopts = "-ra --strict-markers --strict-config"
testpaths = ["tests"]
--strict-markers and --strict-config are not optional in this workflow. A generated test that uses an undeclared marker should fail loudly, not pass silently. You want pytest to be pedantic so a malformed generated test gets rejected at collection time.
Step one, generate the failing test
The test-generation prompt gets the spec and a hard instruction: do not write the implementation, do not import anything that doesn’t exist yet beyond the module under test, cover every edge case and error in the spec.
# tdd/generate_test.py
import os
import re
import yaml
from anthropic import Anthropic
client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
TEST_SYSTEM = """You write pytest 8 tests from a behavioral spec.
Rules, all mandatory:
- The implementation does NOT exist yet. Do not write it.
- Import only the module under test, pytest, and the stdlib.
- One test function per edge case and per error in the spec.
- Use pytest.raises(...) with match= for every error case.
- Use parametrize where it reduces duplication.
- Never assert on float equality; compare integer cents.
- Output ONLY a Python file, no prose, no markdown fences."""
def generate_test(spec_path: str) -> str:
spec = yaml.safe_load(open(spec_path))
prompt = (
f"Behavioral spec:\n\n{yaml.dump(spec, sort_keys=False)}\n\n"
f"Write the complete test module for `{spec['function']}`."
)
resp = client.messages.create(
model="claude-sonnet-4-5-20250929",
max_tokens=2500,
temperature=0,
system=TEST_SYSTEM,
messages=[{"role": "user", "content": prompt}],
)
return _strip_fences(resp.content[0].text)
def _strip_fences(text: str) -> str:
"""Models sometimes wrap output in ```python despite instructions."""
m = re.search(r"```(?:python)?\n(.*)\n```", text, re.DOTALL)
return m.group(1) if m else text.strip()
if __name__ == "__main__":
import sys
code = generate_test(sys.argv[1])
out = f"tests/test_{yaml.safe_load(open(sys.argv[1]))['function']}.py"
with open(out, "w") as fh:
fh.write(code + "\n")
print(f"wrote {out}")
Step two, verify the test actually fails
This is the step everyone skips and it’s the one that makes the workflow honest. A generated test must fail for the right reason before you trust it. If it errors at collection (syntax error, bad import beyond the missing module) that’s a broken test, not a red test. If it passes, the test is vacuous. The only acceptable state is a clean FAILED or a ModuleNotFoundError because the implementation genuinely doesn’t exist.
# tdd/verify_red.py
import subprocess
import sys
def verify_red(test_path: str) -> None:
result = subprocess.run(
[sys.executable, "-m", "pytest", test_path, "-q", "--no-header"],
capture_output=True,
text=True,
)
out = result.stdout + result.stderr
# Collection errors mean the generated test is malformed.
if "errors during collection" in out or "ERROR" in out.split("\n")[0]:
# ModuleNotFoundError for the unwritten module is acceptable.
if "ModuleNotFoundError" not in out:
raise SystemExit(f"test is broken, not red:\n{out}")
if result.returncode == 0:
raise SystemExit(
"generated test PASSED with no implementation — it is vacuous"
)
print("test is red for the right reason")
if __name__ == "__main__":
verify_red(sys.argv[1])
Wire this so a vacuous or broken test stops the pipeline. A test that passes before any code is written is worse than no test — it’s a false sense of coverage.
Step three, write the implementation yourself
Here’s where the discipline pays off. You read the failing test, you read the spec, and you write the implementation. Not the model — you. The test is your specification made executable. This is the part of TDD that builds your mental model of the problem, and outsourcing it hands away the understanding.
# payments/reconcile.py
from dataclasses import dataclass, field
from datetime import date, timedelta
@dataclass(frozen=True)
class Entry:
amount_cents: int # never float
when: date
@dataclass
class ReconResult:
matched: list[tuple[Entry, Entry]] = field(default_factory=list)
pending: list[Entry] = field(default_factory=list)
unexplained: list[Entry] = field(default_factory=list)
def reconcile_batch(ledger: list[Entry], statement: list[Entry]) -> ReconResult:
for e in ledger:
if e is None:
raise ValueError("null ledger entry")
for s in statement:
if s.amount_cents < 0:
raise ValueError("negative statement amount")
result = ReconResult()
remaining = sorted(statement, key=lambda s: s.when)
for entry in sorted(ledger, key=lambda e: e.when):
match = _find_match(entry, remaining)
if match is not None:
result.matched.append((entry, match))
remaining.remove(match)
else:
result.pending.append(entry)
result.unexplained = remaining
return result
def _find_match(entry: Entry, candidates: list[Entry]) -> Entry | None:
for c in candidates:
if c.amount_cents != entry.amount_cents:
continue
if abs((c.when - entry.when).days) <= 2:
return c
return None
Run pytest. If it’s green, the spec, the test, and the code all agree. If it’s red, one of the three is wrong — and figuring out which is the real engineering work.
Step four, the refactor pass with coverage as a gate
Once green, you can use the model for the refactor step, but only with coverage proving the test still constrains the behavior.
pytest tests/test_reconcile_batch.py --cov=payments.reconcile \
--cov-report=term-missing --cov-fail-under=95
If a refactor drops coverage below 95%, either the refactor introduced an untested branch or the model deleted a path. Either way, stop. The pytest 8 documentation on coverage integration covers the plugin wiring.
Common Pitfalls
One prompt for both test and code. The model will produce a consistent, self-confirming pair that proves nothing. Two firewalled prompts, always.
Skipping the red verification. A test that passes before the implementation exists is a vacuous test. verify_red exists precisely to catch this; don’t comment it out when it’s inconvenient.
Letting the model write the implementation. It can, technically. But then nobody on the team understands the code, and the next bug fix is a guessing game. Keep step three human.
Float assertions slipping in. Money, durations, percentages — if the spec doesn’t say “integer cents” the model will happily write assert result == 19.99. Put the precision rule in the spec.
Stale specs. The spec is the source of truth. If behavior changes and the spec doesn’t, your regenerated tests encode the old contract.
Troubleshooting
Symptom: generated test fails at collection with ModuleNotFoundError. Cause: the implementation module doesn’t exist yet. Fix: this is expected and verify_red treats it as a valid red state. Create the module skeleton, then proceed.
Symptom: verify_red reports the test PASSED. Cause: the test asserts something trivially true, or the function already exists with a stub return. Fix: inspect the test — usually a missing assertion or a too-loose pytest.raises without match=. Regenerate with a stricter spec.
Symptom: generated test uses an undeclared marker and errors. Cause: model invented a @pytest.mark.slow. Fix: --strict-markers catches this at collection; register the marker in pyproject.toml or regenerate without it.
Symptom: implementation is green but coverage is below threshold. Cause: the generated test missed an edge case from the spec. Fix: add the missing case to the spec, regenerate the test, confirm it goes red, then re-run.
Symptom: test output wrapped in markdown fences breaks the file. Cause: model ignored the no-fences instruction. Fix: _strip_fences handles it; if it still leaks, the model added prose before the fence — tighten the system prompt.
Wrapping Up
Ai-driven test-driven development works because it gives the model the job it’s good at, enumerating edge cases from a spec, and keeps the jobs that need judgment — writing the implementation and owning the spec — with the engineer. The red-verification step is the keystone; without it you’re just generating decorative tests. Next, fold this into CI so every spec change regenerates and re-verifies its test automatically.