Generate tests that catch bugs instead of rubber-stamping them

Make an LLM write unit tests from your spec — not from your code — so the tests fail when the code is wrong instead of certifying the bug.

The problem

You ask an LLM to "write some unit tests for this function." It hands you a green wall of passing tests. You feel covered. You are not.

The trap is mechanical: the model reads your function, runs it in its head, sees that split_amount(100, 3) returns [33, 33, 33], and writes assert split_amount(100, 3) == [33, 33, 33]. The test passes. But the function is wrong — that's only 99 cents. The test didn't check your intent; it transcribed your bug and stamped it APPROVED. Now the bug is load-bearing: the next person who fixes the function will watch your test suite turn red and assume they broke something.

A test is only worth writing if it can fail when the code is wrong. To get those, the model has to compute the expected answer from the spec — not from the code in front of it. That's the whole game, and it's one paragraph of prompt away.

When to use this — and when not to

Use it when:

You have a function with a describable contract — pure logic, parsing, formatting, money/date/units math, sorting, pagination, validation. Anything where "correct" is a rule, not a vibe.
You're about to refactor something untested and want a net first.
A bug just bit you and you want the regression test plus the neighbors you didn't think of.

Skip it when:

There's no spec, only the code. If you can't say what correct means, the model can't either — it'll fall back to copying the implementation, and you're back to rubber-stamping. Write the contract first; that act alone finds half the bugs.
The hard part is the environment, not the logic (network flakiness, DB state, race conditions). LLMs write the assertions; they don't conjure the fixtures.
The function's output is genuinely subjective (prose quality, ranking relevance). Use a different tool — an eval, not a unit test.

The recipe

Paste your function and a spec (the prompt below calls it the contract). If you don't have a written spec, make the model propose one and approve it before it writes a single test. The load-bearing instruction is rule 1: derive every expected value from the spec, assuming the code may be wrong.

You are writing unit tests for the function below.

CONTRACT (the source of truth — the code may or may not satisfy it):
<paste your spec as a bullet list of guarantees. If you have none,
 STOP and instead write a contract first, then ask me to approve it.>

FUNCTION UNDER TEST:
<paste the function>

Rules:
1. Compute every expected value by hand FROM THE CONTRACT, never by
   running or simulating the function. Assume the implementation may
   be buggy — your job is to catch that, not to match it.
2. Cover, at minimum: a normal case; each boundary (0, 1, empty, max,
   exact division vs. remainder); and each individual guarantee in the
   contract as its own test.
3. For each test, add a one-line comment naming the specific guarantee
   or failure mode it protects. If a test can't fail on any plausible
   bug, delete it.
4. Prefer property checks (sum is preserved, length matches, output is
   sorted) over hard-coded outputs where a property captures the rule.
5. Plain asserts, no framework needed. Runnable as-is.

After the tests, list any guarantee you could NOT turn into a test and
why.

The difference between this and "write tests for X" is rule 1. Everything else is hygiene.
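
If you reuse the prompt across functions, assembling it in code keeps rule 1 from being edited out by accident. The following is a minimal sketch, not a prescribed tool: build_test_prompt and RULES are hypothetical names, the rules block is abbreviated here, and the helper refuses to build a prompt without a contract, mirroring the STOP instruction above.

```python
# Abbreviated on purpose; paste the full rules from the prompt above.
RULES = """Rules:
1. Compute every expected value by hand FROM THE CONTRACT, never by
   running or simulating the function. Assume the implementation may
   be buggy.
(rules 2-5 as in the prompt above)"""


def build_test_prompt(contract_lines, function_source):
    """Assemble the test-writing prompt (hypothetical helper)."""
    guarantees = [line.strip() for line in contract_lines if line.strip()]
    if not guarantees:
        # Mirrors the STOP instruction: no contract, no tests.
        raise ValueError("write and approve a contract before asking for tests")
    contract = "\n".join(f"- {g}" for g in guarantees)
    return (
        "You are writing unit tests for the function below.\n\n"
        "CONTRACT (the source of truth; the code may or may not satisfy it):\n"
        f"{contract}\n\n"
        "FUNCTION UNDER TEST:\n"
        f"{function_source}\n\n"
        f"{RULES}\n\n"
        "After the tests, list any guarantee you could NOT turn into a test and why.\n"
    )


prompt = build_test_prompt(
    ["len(parts) == n", "sum(parts) == total_cents"],
    "def split_amount(total_cents, n): ...",
)
print(prompt)
```

Passing an empty contract raises ValueError instead of producing a prompt, which is the point: no contract, no tests.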

A worked example

This example was generated by the autonomous pipeline against a deliberately buggy function. Until the execution harness attaches a captured transcript, it is labeled as composed, not proven; the badge says which. Both suites are shown, run against the kind of bug that passes a casual eyeball:

def split_amount(total_cents, n):
    """Split total_cents across n parts as evenly as possible.
    Guarantees: len == n; every part >= 0; parts differ by at most 1;
    sum(parts) == total_cents; larger parts come first."""
    base = round(total_cents / n)
    return [base] * n

The ugly first attempt: what "write some unit tests for this function" produced (two of its four tests are shown). Note test_three_way: the model ran the code, saw [33, 33, 33], and froze it in:

def test_three_way():
    # observed output of the function
    assert split_amount(100, 3) == [33, 33, 33]   # 99 cents. Wrong. Passes anyway.

def test_ten_three():
    assert split_amount(10, 3) == [3, 3, 3]        # 9 cents. Also wrong. Also green.

Running it:

=== LAZY TESTS (rubber-stamp the code) ===
PASS test_basic
PASS test_three_way
PASS test_ten_three
PASS test_single

Four for four. A perfectly green suite certifying a function that loses money.

The recipe's tests (three of the five are shown): same function, but expected values derived from the contract:

def test_sum_is_preserved():
    # Guarantee: sum(parts) == total. 100 split 3 ways must total 100.
    assert sum(split_amount(100, 3)) == 100

def test_remainder_not_dropped():
    # 10 / 3 -> parts must total 10, e.g. [4, 3, 3]. Catches dropped remainder.
    assert sum(split_amount(10, 3)) == 10

def test_rounding_up_case():
    # 100 / 6 = 16.67; round() pushes every part to 17 -> sum 102 > total.
    assert sum(split_amount(100, 6)) == 100

Running it against the same buggy code:

=== SPEC TESTS (the recipe) ===
FAIL test_sum_is_preserved
FAIL test_remainder_not_dropped
PASS test_parts_differ_by_at_most_one
PASS test_larger_parts_first
FAIL test_rounding_up_case

Three real failures, including one (round() overshooting to 102 cents) I hadn't consciously thought of — the boundary rule in the prompt surfaced it.
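
Rule 5 asks for plain asserts with no framework, so the PASS/FAIL listings need only a few lines of runner. The harness behind the transcripts above is not shown in this article; what follows is a minimal sketch under assumptions. The names run_suite, make_spec_tests, buggy_split, and fixed_split are hypothetical, and the tests are bound to an implementation through a factory so the same suite can be pointed at either version.

```python
def buggy_split(total_cents, n):
    # The deliberately buggy version from the example.
    base = round(total_cents / n)
    return [base] * n


def fixed_split(total_cents, n):
    # The divmod-based version from the example.
    base, rem = divmod(total_cents, n)
    return [base + 1] * rem + [base] * (n - rem)


def make_spec_tests(split_amount):
    """Bind the three spec tests shown above to one implementation."""
    def test_sum_is_preserved():
        assert sum(split_amount(100, 3)) == 100

    def test_remainder_not_dropped():
        assert sum(split_amount(10, 3)) == 10

    def test_rounding_up_case():
        assert sum(split_amount(100, 6)) == 100

    return [test_sum_is_preserved, test_remainder_not_dropped, test_rounding_up_case]


def run_suite(title, tests):
    """Run each test, print PASS/FAIL per test, return results keyed by name."""
    print(f"=== {title} ===")
    results = {}
    for test in tests:
        try:
            test()
            results[test.__name__] = "PASS"
        except AssertionError:
            results[test.__name__] = "FAIL"
        print(results[test.__name__], test.__name__)
    return results


results_buggy = run_suite("SPEC TESTS on buggy code", make_spec_tests(buggy_split))
results_fixed = run_suite("SPEC TESTS on fixed code", make_spec_tests(fixed_split))
```

Catching only AssertionError is a deliberate choice in this sketch: any other exception (say, a ZeroDivisionError) propagates, so a crash is not mistaken for an ordinary failed expectation.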

The proof that this isn't an accident: I wrote the correct implementation —

def split_amount(total_cents, n):
    # Every part gets base; the rem leftover cents go one each to the first rem parts.
    base, rem = divmod(total_cents, n)
    return [base + 1] * rem + [base] * (n - rem)

— and reran both suites against it:

=== SPEC TESTS on FIXED code ===         === LAZY TESTS on FIXED code ===
PASS test_sum_is_preserved               PASS test_basic
PASS test_remainder_not_dropped          FAIL test_three_way
PASS test_parts_differ_by_at_most_one    FAIL test_ten_three
PASS test_larger_parts_first             PASS test_single
PASS test_rounding_up_case

Look at the right column. The lazy tests fail on the correct code — they had baked the bug in as the definition of "right." That is the exact disaster you ship when you trust a green suite that was written from the implementation. The spec tests do what tests are for: red on broken, green on fixed.
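
The same red-on-broken, green-on-fixed behavior can be checked across many inputs with rule 4's property checks, with no hand-computed outputs at all. This is a minimal sketch under assumptions: contract_violations and first_failure are hypothetical helpers, the input grid is arbitrary, and the two implementations are the ones from the example, renamed so they can sit side by side.

```python
def buggy_split(total_cents, n):
    base = round(total_cents / n)
    return [base] * n


def fixed_split(total_cents, n):
    base, rem = divmod(total_cents, n)
    return [base + 1] * rem + [base] * (n - rem)


def contract_violations(parts, total_cents, n):
    """Return the docstring guarantees that `parts` breaks (empty list if none)."""
    broken = []
    if len(parts) != n:
        broken.append("len == n")
    if any(p < 0 for p in parts):
        broken.append("every part >= 0")
    if parts and max(parts) - min(parts) > 1:
        broken.append("parts differ by at most 1")
    if sum(parts) != total_cents:
        broken.append("sum(parts) == total_cents")
    if parts != sorted(parts, reverse=True):
        broken.append("larger parts come first")
    return broken


def first_failure(split, totals=range(0, 201), counts=range(1, 13)):
    """Scan a small grid of inputs; return the first (total, n, broken) found."""
    for total in totals:
        for n in counts:
            broken = contract_violations(split(total, n), total, n)
            if broken:
                return total, n, broken
    return None


print(first_failure(buggy_split))  # (1, 2, ['sum(parts) == total_cents'])
print(first_failure(fixed_split))  # None
```

The buggy version fails as early as one cent split two ways: Python's round(0.5) is 0, so both parts come back as 0 and the cent disappears.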

Where it breaks

The model copies the implementation anyway. Even with the prompt, a model will sometimes "simulate" the function and match it — especially when the logic is short and tempting. Patch: don't give it the function at all for the first pass. Ask it to write tests from the contract alone, then paste the code and ask "do any of these now look like they were reverse-engineered from an implementation?" Tests written before the code can't have copied it.

Confidently wrong expected values. The model does the by-hand math and gets it wrong — now you have a test that fails on correct code. Patch: this is actually the system working; a failing test makes you check. For the handful that matter, verify the expected value yourself or compute it a second, dumber way (rule 4's property checks are immune to this — sum == total needs no arithmetic). Never mass-accept failing tests by "fixing" them to match the code; that's the lazy trap with extra steps.
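
For this example, one "second, dumber way" is to hand out cents one at a time. A minimal sketch, assuming a hypothetical deal_one_cent_at_a_time oracle: it is slow on purpose and does no division or rounding, so agreement with a hand-computed expected value is meaningful. The claimed value [33, 33, 34] below is made up for illustration.

```python
def deal_one_cent_at_a_time(total_cents, n):
    """Slow reference: give each cent to the next part in turn, wrapping around."""
    parts = [0] * n
    slot = 0
    for _ in range(total_cents):
        parts[slot] += 1
        slot = slot + 1 if slot + 1 < n else 0
    return parts


def expected_value_checks_out(claimed, total_cents, n):
    """True if a by-hand expected value matches the slow reference."""
    return claimed == deal_one_cent_at_a_time(total_cents, n)


print(deal_one_cent_at_a_time(100, 3))               # [34, 33, 33]
print(expected_value_checks_out([34, 33, 33], 100, 3))  # True
print(expected_value_checks_out([33, 33, 34], 100, 3))  # False: larger parts must come first
```

Because the earliest slots receive the leftover cents, the oracle also satisfies the "larger parts come first" guarantee without sorting.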

Plausible-looking coverage that misses the one case. It tests n=1, n=3, n=4 and skips n > total_cents (more parts than cents) or total_cents = 0. Patch: end with a second prompt, "What input would make this function misbehave that none of these tests cover?" Run it twice; the second pass is where the nasty ones show up. Make it report what it couldn't turn into a test (the closing instruction after rule 5) so the gaps are written down, not silent.
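
Both gaps named above can be filled by hand from the contract. A minimal sketch, assuming the fixed implementation from the worked example and hypothetical test names; n = 0 and negative totals are left out because the stated guarantees say nothing about them.

```python
def split_amount(total_cents, n):
    # The fixed implementation from the worked example.
    base, rem = divmod(total_cents, n)
    return [base + 1] * rem + [base] * (n - rem)


def test_more_parts_than_cents():
    # Guarantees: len 5, sum 2, spread <= 1, larger first.
    # Only [1, 1, 0, 0, 0] satisfies all four at once.
    assert split_amount(2, 5) == [1, 1, 0, 0, 0]


def test_zero_total():
    # Guarantees: sum is 0 and every part >= 0, so every part is 0.
    assert split_amount(0, 3) == [0, 0, 0]


test_more_parts_than_cents()
test_zero_total()
print("edge cases pass on the fixed implementation")
```

Against the buggy version, round(2 / 5) is 0, so test_more_parts_than_cents would fail with five zero parts; test_zero_total passes on both versions, which is a reminder that a single edge case rarely settles anything.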

No real contract, so it invents one. If your "spec" is vague ("formats a date nicely"), the model picks a contract and tests against its guess, which may not be yours. Patch: treat the model's proposed contract as the deliverable of step one. Read it. The disagreements between its contract and your mental model are bugs in your spec — fix those before any test gets written.

Cost & time

One function with a clear contract: a single chat message in, 40–80 lines of tests out. $0.02–0.04 and under a minute of model time; call it 10 minutes total, counting the time to write the contract and eyeball the expected values. The optional "what's still uncovered?" follow-up roughly doubles the cost and is worth it every time.

Compared to the alternative — shipping a green suite that fails the day someone fixes your bug — it's the cheapest insurance you'll buy this week.