Copilot for Tests, TDD or Anti-TDD?


December 14, 2022 · 5 min read · by Muhammad Amal ai

TL;DR — Copilot is great at generating happy-path test cases. Tends to miss edge cases. Compatible with TDD if you write the first failing test yourself; lets you scaffold more cases faster. NOT a replacement for thinking about what to test.

After AI pair programming, one specific use case: test generation. Where Copilot shines and where it consistently misses.

What Copilot does well for tests

Happy-path coverage. Given a function, Copilot generates “the obvious test.”

func TestParseEmail(t *testing.T) {
    // Copilot suggests (the address is a placeholder; the original was lost to email obfuscation):
    got, err := ParseEmail("user@example.com")
    require.NoError(t, err)
    require.Equal(t, "user@example.com", got.String())
}

Correct. Boring. Necessary.

Test scaffolding from function signature.

def test_calculate_tax():
    # Copilot generates:
    assert calculate_tax(100, 0.1) == 10
    assert calculate_tax(0, 0.1) == 0

Two cases auto-generated. The framework boilerplate is there.

Variable test data. “Generate 10 valid email addresses to test against” — Copilot lists 10.

Table-driven test scaffolds:

tests := []struct {
    name  string
    input string
    want  int
}{
    // Copilot generates several rows:
    {"empty", "", 0},
    {"single", "a", 1},
    {"multiple", "abc", 3},
}

The boilerplate that bores you is exactly Copilot’s strength.

What Copilot misses

Edge cases beyond the obvious.

def parse_phone_number(s: str) -> str:
    # Strip non-digits, validate length, format
    ...

def test_parse_phone():
    # Copilot suggests:
    assert parse_phone_number("1234567890") == "(123) 456-7890"
    assert parse_phone_number("123-456-7890") == "(123) 456-7890"
    # MISSING:
    # - leading + (international)
    # - very long input (DoS)
    # - empty string
    # - non-Latin digits
    # - extension (1234567890 ext 123)
    # - Unicode whitespace

Copilot pattern-matches against typical test cases. Cases that come from understanding YOUR code’s actual edge handling aren’t suggested.
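To make the missing cases concrete, here is a minimal sketch. The body of parse_phone_number is an assumption (ASCII digits only, exactly 10 digits, a 64-character input cap, ValueError on anything else), and rejects is a hypothetical helper. Your own rules for international prefixes and extensions may differ; the point is that each missing case gets an explicit assertion.

```python
ASCII_DIGITS = set("0123456789")
MAX_INPUT_LEN = 64  # assumed cap so very long inputs are rejected early


def parse_phone_number(s: str) -> str:
    """Assumed implementation: strip non-digits, validate length, format."""
    if len(s) > MAX_INPUT_LEN:
        raise ValueError("input too long")
    # Keep ASCII digits only; str.isdigit() would also accept non-Latin digits.
    digits = "".join(c for c in s if c in ASCII_DIGITS)
    if len(digits) != 10:
        raise ValueError(f"expected 10 digits, got {len(digits)}")
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"


def rejects(raw: str) -> bool:
    """Hypothetical helper: True if parse_phone_number raises ValueError."""
    try:
        parse_phone_number(raw)
    except ValueError:
        return True
    return False


def test_parse_phone_edge_cases():
    assert rejects("")                    # empty string
    assert rejects("+1 123 456 7890")     # leading + (11 digits under this policy)
    assert rejects("1234567890 ext 123")  # extension
    assert rejects("1" * 10_000)          # very long input
    # Arabic-Indic digits contain no ASCII digits, so they are rejected
    assert rejects("\u0661\u0662\u0663\u0664\u0665\u0666\u0667\u0668\u0669\u0660")
    # Unicode whitespace (no-break space, thin space) is stripped like any non-digit
    assert parse_phone_number("123\u00a0456\u20097890") == "(123) 456-7890"
```

Whether "+1 123 456 7890" should be rejected or normalized is a design decision, which is exactly why it has to come from you and not from a suggestion.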

Negative tests. “What should NOT happen?” Copilot generates positive assertions; you have to remember to add the negative ones.
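A sketch of what negative tests can look like. total_with_tax is a hypothetical helper invented for this example, and the rule that rate must lie in [0, 1] is an assumption. With pytest you would normally write the rejection checks with pytest.raises(ValueError); the try/except form below only keeps the example dependency-free.

```python
def total_with_tax(prices: list[float], rate: float) -> float:
    """Hypothetical helper: sum the prices and add tax at the given rate."""
    if not 0 <= rate <= 1:
        raise ValueError("rate must be between 0 and 1")
    return sum(prices) * (1 + rate)


def test_does_not_mutate_input():
    prices = [10.0, 20.0]
    total_with_tax(prices, 0.1)
    assert prices == [10.0, 20.0]  # the caller's list must be untouched


def test_rejects_rate_above_one():
    try:
        total_with_tax([10.0], 1.5)
    except ValueError:
        return  # rejected, as intended
    raise AssertionError("expected ValueError for rate > 1")


def test_rejects_negative_rate():
    try:
        total_with_tax([10.0], -0.1)
    except ValueError:
        return  # rejected, as intended
    raise AssertionError("expected ValueError for negative rate")
```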

Integration vs unit. Copilot tends toward unit tests with mocks. Integration tests against real DBs / external services need explicit prompting.
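As a rough illustration of the difference, the sketch below runs against a real (in-memory) SQLite database instead of a mock. The users table and the save_user / find_user helpers are hypothetical. The second test checks a constraint the database enforces, which a mocked connection would never exercise.

```python
import sqlite3


def create_schema(conn):
    conn.execute(
        "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT UNIQUE NOT NULL)"
    )


def save_user(conn, email):
    """Hypothetical data-access helper: insert a user and return its id."""
    cur = conn.execute("INSERT INTO users (email) VALUES (?)", (email,))
    conn.commit()
    return cur.lastrowid


def find_user(conn, user_id):
    """Hypothetical data-access helper: return the email or None."""
    row = conn.execute(
        "SELECT email FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


def test_save_and_find_user_roundtrip():
    conn = sqlite3.connect(":memory:")  # real SQL engine, no mocks
    try:
        create_schema(conn)
        user_id = save_user(conn, "user@example.com")
        assert find_user(conn, user_id) == "user@example.com"
        assert find_user(conn, user_id + 1) is None
    finally:
        conn.close()


def test_duplicate_email_is_rejected_by_the_database():
    conn = sqlite3.connect(":memory:")
    try:
        create_schema(conn)
        save_user(conn, "user@example.com")
        try:
            save_user(conn, "user@example.com")
        except sqlite3.IntegrityError:
            pass  # the UNIQUE constraint fired, as intended
        else:
            raise AssertionError("expected IntegrityError for duplicate email")
    finally:
        conn.close()
```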

Race conditions, timing. Concurrent test cases are largely absent unless heavily prompted.
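A minimal sketch of the kind of concurrent test that has to be asked for explicitly. Counter is a hypothetical class; the thread and iteration counts are arbitrary.

```python
import threading


class Counter:
    """Hypothetical thread-safe counter used as the code under test."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._value += 1

    @property
    def value(self):
        with self._lock:
            return self._value


def test_concurrent_increments_are_not_lost():
    counter = Counter()
    n_threads, per_thread = 8, 10_000

    def worker():
        for _ in range(per_thread):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    # Every increment from every thread must be reflected in the total.
    assert counter.value == n_threads * per_thread
```

Treat this as a smoke test, not a proof: a counter without the lock can still pass by luck on a given run.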

Property-based tests. Copilot doesn’t reach for hypothesis / quickcheck unless explicitly directed.

TDD compatibility

TDD red-green-refactor:

1. Write a failing test
2. Write minimal code to make it pass
3. Refactor

Where Copilot fits:

Step 1 (write test): human writes the FIRST test (the one that captures the new requirement). Copilot can extend with more cases after.
Step 2 (minimal code): Copilot suggests implementation; you accept the minimal-yet-sufficient subset.
Step 3 (refactor): Copilot less useful; refactoring is intent-driven.

If you let Copilot generate the test FIRST, you might write code for an obvious case and miss the intent. The first test sets the direction; protect it.

A workflow that works

For a new feature:

1. Write the first test by hand: the smallest failing case that captures intent.
2. Implement: Copilot helps with the boilerplate; I drive the logic.
3. Add more test cases: Copilot extends; I verify edge cases not yet covered.
4. Refactor: mostly manual.
5. Repeat for next bit of behavior.

The first test is the design decision; subsequent tests are mechanical coverage. Copilot accelerates step 3 (test extension), not step 1 (design).
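The workflow above, sketched on a made-up example. slugify and its expected outputs are hypothetical, chosen only to show the split between the hand-written first test and the extension table.

```python
import re


def slugify(title: str) -> str:
    """Hypothetical function under test: lowercase words joined by hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def test_slugify_joins_words_with_hyphens():
    # Written by hand first: the smallest case that captures the intent.
    assert slugify("Hello, World!") == "hello-world"


def test_slugify_more_cases():
    # Extension: the kind of table Copilot fills in once the first test exists.
    # Each expected value still has to be checked by a human.
    cases = [
        ("", ""),
        ("  spaces  ", "spaces"),
        ("Already-Slugged", "already-slugged"),
        ("C++ & Rust", "c-rust"),
    ]
    for title, want in cases:
        assert slugify(title) == want, title
```

Note the last row: whether "C++" should become "c" is a design question. A generated table will happily encode whatever the current implementation does.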

Property-based testing

For functions with simple inputs/outputs, property-based tests catch what example-based tests miss:

from hypothesis import given, strategies as st

@given(st.integers(min_value=0), st.floats(min_value=0, max_value=1))
def test_tax_non_negative(amount, rate):
    assert calculate_tax(amount, rate) >= 0

@given(st.integers(min_value=1, max_value=1000), st.floats(min_value=0, max_value=1))
def test_tax_increases_with_amount(amount, rate):
    if rate > 0:
        assert calculate_tax(amount + 1, rate) >= calculate_tax(amount, rate)

Hypothesis (Python) / QuickCheck (Haskell) / proptest (Rust) / fast-check (JavaScript) — property-based libraries exist for most languages.

Copilot understands the libraries but rarely suggests them unsolicited. If you start with the import or a couple of @given decorators, Copilot extends.
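When Copilot does extend a property test, it is worth checking that the property can actually fail. The sketch below contrasts a weak property with a stronger one. To stay dependency-free it uses the standard library's random module as a stand-in for Hypothesis, so there is no shrinking or smart input generation. The body of calculate_tax and the random_inputs helper are assumptions for illustration.

```python
import random


def calculate_tax(amount: int, rate: float) -> float:
    """Assumed implementation for illustration: tax is amount times rate."""
    return amount * rate


def random_inputs(n: int, seed: int = 0):
    """Hypothetical generator: n reproducible (amount, rate) pairs."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.randint(0, 1_000_000), rng.uniform(0.0, 1.0)


def test_weak_property():
    # Weak: holds for almost any implementation, so it catches almost nothing.
    for amount, rate in random_inputs(200):
        assert calculate_tax(amount, rate) is not None


def test_tax_is_bounded_by_amount():
    # Stronger: with 0 <= rate <= 1, tax must lie between 0 and the amount.
    for amount, rate in random_inputs(200):
        assert 0 <= calculate_tax(amount, rate) <= amount
```

An implementation that returned amount * (1 + rate) would pass the first test and fail the second.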

Generated tests as first draft

A useful workflow: generate many tests; pick the keepers.

def calculate_discount(price, percent_off):
    return price * (1 - percent_off / 100)

def test_calculate_discount():
    # Copilot generates 10 cases
    # I delete the duplicates and edit assertions
    ...

You end up with 4-5 actually useful tests in 30 seconds, versus writing 10 tests manually in 5 minutes. The cost: you have to verify each generated test before trusting it.
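A sketch of what the keepers might look like after pruning. The specific cases and the KEPT_CASES name are illustrative, not actual Copilot output; the tolerance-based comparison is one example of an assertion edit.

```python
import math


def calculate_discount(price, percent_off):
    return price * (1 - percent_off / 100)


# Kept after pruning: one case per distinct behavior (illustrative values).
KEPT_CASES = [
    (100, 0, 100),        # no discount
    (100, 100, 0),        # full discount
    (100, 25, 75),        # typical case
    (19.99, 10, 17.991),  # fractional price
    (0, 50, 0),           # zero price
]


def test_calculate_discount():
    for price, percent_off, want in KEPT_CASES:
        got = calculate_discount(price, percent_off)
        # Compare floats with a tolerance rather than ==.
        assert math.isclose(got, want, rel_tol=1e-9, abs_tol=1e-9), (
            price,
            percent_off,
            got,
        )
```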

What NOT to let Copilot generate

Tests for security-critical code. Auth, crypto, payment. Hand-write the cases that matter.

Tests for code with subtle semantics. Concurrency, time zones, money rounding. Tests need to encode the actual rules, not pattern-matched typical cases.

End-to-end / integration tests. Coordination across services. Setup / teardown is project-specific.

For these: human-written tests are the design document. Copilot can help with scaffolding boilerplate.

Coverage as a metric (cautious)

Copilot makes hitting “90% line coverage” easy. But coverage doesn’t measure quality.

90% coverage of generated trivial tests vs 70% coverage of carefully designed tests: the 70% catches more real bugs.
“Cover every line” leads to assertion-free tests (function runs without crashing). Useless.

Coverage is a floor, not a ceiling. AI makes the floor cheaper to reach; doesn’t make the ceiling higher.

Common Pitfalls

Accept all generated tests. They run; some don’t assert anything meaningful.

Skip thinking about edge cases. “Copilot wrote tests; we’re good.” Edge cases aren’t auto-generated.

Mistake covered for tested. Coverage = lines ran. Tested = behavior verified.

Generate tests as the spec. Tests should match intent. Generate after intent is captured, not as the way to capture intent.

Use Copilot for property-based tests without checking properties. Some “property tests” Copilot writes are weak (just check non-null).

Skip mutation testing. Mutation testing reveals which tests actually catch bugs. AI-generated tests often do poorly here: many mutants survive.
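Mutation testing in miniature, as a hand-rolled sketch. A real tool (mutmut for Python, for example) generates the mutants automatically; here a single mutant is written by hand, and weak_test, strong_test, and survives are hypothetical names.

```python
def calculate_discount(price, percent_off):
    return price * (1 - percent_off / 100)


def mutant_calculate_discount(price, percent_off):
    # Hand-made mutant: '-' flipped to '+', the kind of change a mutation tool makes.
    return price * (1 + percent_off / 100)


def weak_test(fn):
    # Runs every line of fn (full coverage) but asserts nothing about the result.
    fn(100, 25)


def strong_test(fn):
    # Verifies behavior: 25% off 100 is 75.
    assert fn(100, 25) == 75


def survives(test, fn) -> bool:
    """Return True if the test passes against fn, i.e. the mutant survives."""
    try:
        test(fn)
    except AssertionError:
        return False
    return True


print(survives(weak_test, mutant_calculate_discount))    # True: mutant survives
print(survives(strong_test, mutant_calculate_discount))  # False: mutant is killed
```

Both tests give the same line coverage; only one of them notices the bug.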

Wrapping Up

Copilot accelerates test scaffolding; it doesn’t replace test thinking. TDD still works; let Copilot help with the mechanical extension after you’ve defined the first test. Friday: Codespaces + Copilot.
