Reference Architecture · generation

Test Case Generation

Last updated: April 16, 2026

Quick answer

The production pipeline starts with coverage data to identify untested branches, generates tests with Claude Sonnet 4 or GPT-4o using the source file plus any existing tests as context, runs the new tests in a sandbox to verify they pass on correct code and fail on injected mutations, deduplicates against existing tests with AST matching, and rejects tests that survive every mutation (tautological tests) before merge. Expect $0.02 to $0.15 per generated test and a 70-85 percent mutation-kill rate.

The problem

You have functions, modules, or API endpoints with thin or missing test coverage. You want generated tests that actually catch regressions — not tests that assert the current implementation is whatever it happens to be, and not 400 redundant tests that inflate coverage numbers without catching bugs. The system must generate tests that compile, run, produce meaningful assertions, and survive a mutation testing audit.

Architecture

Coverage & Target Input (input) → Spec & Context Builder (data) → Test Generator (LLM) → Deduplication Filter (infra) → Sandboxed Test Runner (infra) → Mutation Tester (infra) → Property-Based Test Generator (optional, LLM) → PR Output (output). Tests advance only when they pass in the sandbox and kill mutations.

Coverage & Target Input

Accepts either a target file/module or a coverage report (lcov, coverage.json, go cover). Identifies untested lines, branches, and functions.

Alternatives: Full-file targeting, PR-diff targeting (test only new code), Risk-weighted selection (security-critical files first)
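A minimal sketch of the coverage-input step, assuming lcov-format reports. The parser reads the three record types that matter here (`SF:` file start, `DA:line,hits` line data, `end_of_record`) and returns lines with zero hits per file; the sample report and path are illustrative.

```python
# Minimal lcov parser: list uncovered lines per source file.
# lcov records: "SF:<path>" opens a file section, "DA:<line>,<hits>" is
# per-line hit data, "end_of_record" closes the section.
from collections import defaultdict

def uncovered_lines(lcov_text: str) -> dict[str, list[int]]:
    missing = defaultdict(list)
    current = None
    for raw in lcov_text.splitlines():
        line = raw.strip()
        if line.startswith("SF:"):
            current = line[3:]
        elif line.startswith("DA:") and current:
            lineno, hits = line[3:].split(",")[:2]
            if int(hits) == 0:
                missing[current].append(int(lineno))
        elif line == "end_of_record":
            current = None
    return dict(missing)

report = """SF:src/price.py
DA:1,5
DA:2,0
DA:3,0
end_of_record"""
print(uncovered_lines(report))  # {'src/price.py': [2, 3]}
```

Branch records (`BRDA:`) extend the same loop when branch-level targets are needed.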

Spec & Context Builder

For each target function, gathers the source, dependent types, existing tests (for style), related functions, and any doc comments or requirements.

Alternatives: Full-file dump, AST-extracted function + callees, Retrieval from spec docs or tickets
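The AST-extracted alternative can be sketched with the stdlib `ast` module, assuming Python targets; `SOURCE` and `clamp` are stand-ins for a real file and target function.

```python
# Sketch: pull a target function's exact source and docstring out of a file
# with the ast module, as generator context.
import ast

SOURCE = '''def clamp(value, lo, hi):
    """Clamp value into [lo, hi]."""
    if lo > hi:
        raise ValueError("lo > hi")
    return max(lo, min(hi, value))
'''

def function_context(source: str, name: str) -> dict:
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            return {
                "source": ast.get_source_segment(source, node),
                "doc": ast.get_docstring(node),
            }
    raise KeyError(name)

ctx = function_context(SOURCE, "clamp")
print(ctx["doc"])  # Clamp value into [lo, hi].
```

Walking `node` callees with the same pattern yields the "function + callees" context bundle.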

Test Generator

Emits test cases covering happy path, edge cases, error paths, and branch coverage for the target. Produces tests in the codebase's existing testing framework (Jest, pytest, Go test, RSpec).

Alternatives: GPT-4o, DeepSeek R1 (strong for reasoning about edge cases), Claude Haiku 4 for high-volume simple functions
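One way the generator prompt can be assembled; the checklist items, `build_prompt` helper, and wording are illustrative, and the LLM call itself is omitted.

```python
# Hypothetical prompt assembly for the generator step. The explicit edge-case
# checklist is what the model is asked to cover; the public-API instruction
# guards against implementation-detail tests.
EDGE_CASE_CHECKLIST = [
    "happy path with typical inputs",
    "boundary values (empty, zero, max)",
    "error paths (invalid input raises or returns an error)",
    "each uncovered branch from the coverage report",
]

def build_prompt(source: str, existing_tests: str, framework: str = "pytest") -> str:
    checklist = "\n".join(f"- {item}" for item in EDGE_CASE_CHECKLIST)
    return (
        f"Write {framework} tests for the function below.\n"
        f"Cover every item on this checklist:\n{checklist}\n"
        "Test observable behavior through the public API only; "
        "never assert on private attributes or internal state.\n\n"
        f"Function under test:\n{source}\n\n"
        f"Existing tests (match their style, do not duplicate them):\n{existing_tests}\n"
    )

prompt = build_prompt("def add(a, b): return a + b",
                      "def test_add(): assert add(1, 2) == 3")
```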

Deduplication Filter

AST-matches new tests against existing tests; drops tests whose assertions and setup closely match what is already tested. Prevents test-suite bloat.

Alternatives: Embedding similarity on test body, Line-level diff, Manual review for high-trust code

Sandboxed Test Runner

Runs the generated tests against the current (presumed correct) implementation in isolation. Any failing test is either a real bug or a bad generation; both cases block merge.

Alternatives: Vercel Sandbox, E2B, Modal, Docker-in-Docker, GitHub Actions matrix
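A minimal local-subprocess variant of the runner, standing in for a microVM sandbox; the hard timeout is what catches runaway generated tests, and the throwaway script is a stand-in for a generated test file.

```python
# Local fallback sketch: run a generated test script in a subprocess with a
# hard timeout. Production replaces this with a Firecracker-based sandbox;
# the demo file below stands in for a generated test.
import os
import subprocess
import sys
import tempfile

def run_tests(test_file: str, timeout_s: int = 60) -> bool:
    """Nonzero exit or timeout counts as failure."""
    try:
        result = subprocess.run(
            [sys.executable, test_file],
            capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # runaway generated test (e.g. infinite loop)
    return result.returncode == 0

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write("assert 1 + 1 == 2\n")
    path = f.name
print(run_tests(path))  # True
os.unlink(path)
```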

Mutation Tester

Introduces small semantic changes (flip conditionals, remove null checks, alter constants) in the target code and reruns the generated tests. A test that passes every mutation is tautological and is rejected.

Alternatives: Stryker (JS/TS), mutmut (Python), go-mutesting (Go), PIT (Java)
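The mutation check can be illustrated end to end in miniature; real pipelines use the tools above, but the principle fits in one file. `is_adult`, the mutation operator, and both tests are toy examples.

```python
# Toy mutation test: flip a comparison operator in the target, re-run a
# generated test, and see whether the test kills the mutant.
import ast

TARGET = "def is_adult(age):\n    return age >= 18\n"

class FlipGtE(ast.NodeTransformer):
    """One mutation operator: >= becomes >."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        node.ops = [ast.Gt() if isinstance(op, ast.GtE) else op for op in node.ops]
        return node

def survives(source: str, test) -> bool:
    """True when the test still passes against the (possibly mutated) source."""
    ns = {}
    exec(compile(source, "<mutant>", "exec"), ns)
    try:
        test(ns["is_adult"])
        return True
    except AssertionError:
        return False

def boundary_test(is_adult):       # good generated test: probes the boundary
    assert is_adult(18) is True

def tautological_test(is_adult):   # bad one: never pins down behavior
    assert is_adult(30) in (True, False)

mutant = ast.unparse(FlipGtE().visit(ast.parse(TARGET)))
print(survives(mutant, boundary_test))      # False: mutant killed, test is real
print(survives(mutant, tautological_test))  # True: survives, reject the test
```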

Property-Based Test Generator (optional)

For functions with clean inputs/outputs, generates property-based tests (Hypothesis, fast-check, quickcheck-style) that explore the input space randomly.

Alternatives: Skip for UI-heavy code, Pair with example-based tests only for pure functions
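The shape of a property-based test, sketched quickcheck-style with only the stdlib (a real pipeline would emit Hypothesis or fast-check code); `clamp` and the in-range property are illustrative, and the seeded RNG keeps the generated test deterministic.

```python
# Quickcheck-style sketch: assert a property over many random inputs rather
# than a handful of examples. Property: clamp's result always lies in
# [lo, hi] whenever lo <= hi.
import random

def clamp(value, lo, hi):
    return max(lo, min(hi, value))

def check_property(prop, n_cases: int = 500, seed: int = 0) -> None:
    rng = random.Random(seed)  # seeded: generated tests must be deterministic
    for _ in range(n_cases):
        value = rng.randint(-1000, 1000)
        lo, hi = sorted((rng.randint(-1000, 1000), rng.randint(-1000, 1000)))
        prop(value, lo, hi)

def in_range(value, lo, hi):
    result = clamp(value, lo, hi)
    assert lo <= result <= hi, f"clamp({value}, {lo}, {hi}) = {result}"

check_property(in_range)
print("property held for 500 random cases")
```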

PR Output

Opens a PR with the generated tests, a coverage delta, a mutation-kill rate summary, and any tests flagged as tautological that the developer should review.

Alternatives: Direct commit, Inline editor suggestion, Nightly auto-PR batch

The stack

Test generation model: Claude Sonnet 4

Sonnet 4 produces the most thorough and least tautological tests in 2026 benchmarks when given an explicit edge-case checklist. Haiku 4 is 5x cheaper and adequate for simple utility functions. DeepSeek R1 shines when the function has subtle invariants that require step-by-step reasoning about edge cases.

Alternatives: GPT-4o, DeepSeek R1 for edge-case reasoning, Claude Haiku 4 for simple functions

Mutation testing: Stryker (JS/TS), mutmut (Python), go-mutesting (Go)

Coverage alone is a weak signal — a test can execute every line without catching any bug. Mutation testing is the ground truth for whether tests catch regressions. Expect 10-30 minutes added per module; run it in CI for generated tests only, not the full suite.

Alternatives: PIT for Java/JVM, Skip for prototypes, Lightweight manual mutation checks

Sandboxed execution: Vercel Sandbox or E2B

Generated tests can contain infinite loops, resource leaks, or network calls to anywhere. Firecracker-based sandboxes isolate each test run. GitHub Actions works but is slow (90-180s overhead per run); microVM sandboxes run in under 2 seconds.

Alternatives: GitHub Actions matrix, Docker-in-Docker, Local subprocess with ulimit

Deduplication: AST matching on assertion patterns + embedding similarity fallback

LLMs love generating the same test three ways. AST matching on the assertion + setup combination catches exact duplicates cheaply. Embedding similarity on the test body catches semantic duplicates. Both together prevent 95+ percent of test-suite bloat.

Alternatives: Full-text similarity, No dedup (accept bloat), Manual review
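The embedding-similarity fallback reduces to a cosine comparison against existing test vectors. The toy vectors below are illustrative; production vectors would come from an embedding model such as Voyage-3, with the 0.85 threshold used as the duplicate cutoff.

```python
# Second-pass semantic dedup sketch: a new test is dropped when its embedding
# is within cosine-similarity 0.85 of any existing test's embedding.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def is_duplicate(new_vec, existing_vecs, threshold: float = 0.85) -> bool:
    return any(cosine(new_vec, v) >= threshold for v in existing_vecs)

existing = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2]]
print(is_duplicate([0.88, 0.12, 0.01], existing))  # True: near-copy of the first
print(is_duplicate([0.0, 0.05, 1.0], existing))    # False: genuinely new test
```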

Coverage tooling: project default (Istanbul/nyc for JS, coverage.py for Python, go cover for Go)

Use whatever the project already runs; adding a new coverage tool for test generation is friction. Branch coverage is more useful than line coverage — aim for branch-level targets in the generator prompt.

Alternatives: Codecov integration, Branch coverage via llvm-cov for Rust, Custom coverage aggregator

Evaluation: mutation-kill rate + coverage delta + human review sample

Mutation-kill rate is the best automated measure of generated test quality. Pair with coverage delta to ensure you are targeting the right code. Sample 5-10 percent of generated tests for human review to catch systematic quality issues the automated metrics miss.

Alternatives: Coverage only (misleading), Braintrust eval harness, Custom benchmark on held-out modules
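The three evaluation signals above roll up to a small report; field names, the seeded sampler, and the numbers are illustrative.

```python
# Sketch of the evaluation roll-up: mutation-kill rate, branch-coverage
# delta, and a deterministic 10 percent sample for human review.
import random

def mutation_kill_rate(killed: int, total_mutants: int) -> float:
    return killed / total_mutants if total_mutants else 0.0

def review_sample(test_ids: list[str], fraction: float = 0.10, seed: int = 1) -> list[str]:
    k = max(1, round(len(test_ids) * fraction))
    return random.Random(seed).sample(test_ids, k)

kill = mutation_kill_rate(killed=34, total_mutants=40)
coverage_delta = 0.81 - 0.67   # branch coverage after minus before
print(f"kill rate {kill:.0%}, coverage +{coverage_delta:.0%}")
print(review_sample([f"t{i}" for i in range(40)]))  # 4 tests for human review
```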

Cost at each scale

Prototype

500 tests/mo

$40/mo

Test generation (Sonnet 4): $18
Sandbox execution (E2B): $6
Mutation testing (CI overhead): $8
Hosting + observability: $8

Startup

20k tests/mo

$850/mo

Test generation (Sonnet 4, cached context): $380
Sandbox execution (E2B / Vercel Sandbox): $150
Mutation testing: $130
Deduplication (embeddings): $40
Infra + CI: $50
Observability: $100

Scale

500k tests/mo

$13,500/mo

Test generation (mixed Sonnet 4 / Haiku 4): $5,500
Sandbox execution: $1,800
Mutation testing (dedicated runners): $2,200
Deduplication (vector DB): $600
Coverage aggregation + PR tooling: $800
Infra (CI at scale): $1,400
Observability + evals: $1,200

Latency budget

Total P50: 22,550ms
Total P95: 71,100ms
Coverage diff + target selection: 400ms median · 1,200ms p95
Context build (AST + existing tests): 300ms median · 800ms p95
Test generation (streamed): 2,500ms median · 5,200ms p95
Deduplication (AST + embedding): 150ms median · 400ms p95
Sandbox execution: 1,200ms median · 3,500ms p95
Mutation testing run: 18,000ms median · 60,000ms p95

Tradeoffs

Coverage targets vs business-logic targets

Generating tests to hit coverage targets is fast but produces lots of tautological tests on untested code that was probably dead anyway. Targeting business-critical modules (billing, auth, permissions) catches real bugs but requires human judgment to pick modules. A hybrid — coverage for the long tail, human-prioritized for hot paths — beats either alone.

Example-based vs property-based tests

Example-based tests are easy to read and debug; property-based tests cover input space much more thoroughly but fail in harder-to-interpret ways. For pure functions with clean inputs, generate both. For UI components, side-effect-heavy code, or tests with complex setup, stick to example-based.

Mutation testing cost vs coverage confidence

Mutation testing adds 10-30 minutes per module, which is painful in an interactive workflow. Running it only on generated tests (not the whole suite) and only on merge (not on every save) keeps cost bounded. Skipping it entirely means you cannot tell real tests from tautologies — pay the cost.

Failure modes & guardrails

Generated tests are tautological (assert the code does what it does)

Mitigation: Run mutation testing on every merge. Any test that passes under all mutations is flagged and rejected. Track mutation-kill rate per module; a sudden drop usually means a regression in the generation prompt or model version.
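A negative-example pair of the kind worth embedding in the generation prompt; `apply_discount` and both tests are illustrative.

```python
# Tautological vs behavioral: the first test compares the function to itself
# and passes no matter what the code does (it survives every mutation); the
# second pins the output to a value known independently from the spec.
def apply_discount(price: float, pct: float) -> float:
    return round(price * (1 - pct / 100), 2)

def test_tautological():
    # Bad: restates the implementation; catches nothing.
    assert apply_discount(100, 10) == apply_discount(100, 10)

def test_behavioral():
    # Good: expected values come from the spec, not from running the code.
    assert apply_discount(100, 10) == 90.00
    assert apply_discount(19.99, 0) == 19.99

test_tautological()
test_behavioral()
print("both pass on correct code; only test_behavioral fails under mutation")
```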

Tests depend on implementation details, break on refactors

Mitigation: Instruct the generator to test observable behavior through the public API, not internal state. After generation, scan tests for references to private methods or internal module structure and downgrade-score or reject those. Pair with a style guide snippet in the system prompt showing good and bad examples.

Test-suite bloat from duplicate or near-duplicate tests

Mitigation: Run AST-matching deduplication against existing tests before committing. Use embedding similarity (Voyage-3) with a 0.85 threshold as a second pass for semantic duplicates. Track test-suite growth rate; an explosion in test count without a matching mutation-kill improvement is a red flag.

Flaky tests from non-deterministic inputs (timestamps, random, network)

Mitigation: Prompt explicitly against Date.now, Math.random, and network calls; require test fixtures for dates and seeded randomness. Post-generation, scan for these patterns with regex and retry if found. Add a CI job that runs new tests 20 times; any test failing 1+ run out of 20 is flaky and rejected.
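The post-generation pattern scan can be a handful of regexes run before the sandbox step; the pattern set and the sample generated test are illustrative, shown here against JS/TS test source.

```python
# Regex scan for nondeterminism in generated JS/TS test code. Any finding
# triggers a retry of generation with a fixture hint in the prompt.
import re

FLAKY_PATTERNS = {
    "wall clock": re.compile(r"\bDate\.now\s*\(|\bnew Date\s*\("),
    "randomness": re.compile(r"\bMath\.random\s*\("),
    "network": re.compile(r"\bfetch\s*\(|\baxios\b"),
}

def flaky_findings(test_source: str) -> list[str]:
    return [name for name, pat in FLAKY_PATTERNS.items() if pat.search(test_source)]

generated = """
test('expires tomorrow', () => {
  const token = makeToken(Date.now() + 86400000);
  expect(isExpired(token)).toBe(false);
});
"""
print(flaky_findings(generated))  # ['wall clock']
```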

Tests passing locally but failing in CI due to environment drift

Mitigation: Generate and run tests in the same sandbox image used by CI. Pin test framework versions and node/python/go versions via the sandbox config. Any test that passes in the sandbox but fails in CI is an environment-drift bug in the sandbox config — track and alert on the gap.

Frequently asked questions

Which LLM is best for test case generation in 2026?

Claude Sonnet 4 produces the most thorough and least tautological tests when given an explicit edge-case checklist. GPT-4o is a close second. DeepSeek R1 shines on subtle invariants. Haiku 4 is the value play for high-volume simple functions. Always pair with mutation testing — model rankings on raw generation do not always track mutation-kill rates.

How is mutation testing different from coverage?

Coverage measures which lines the tests execute; mutation testing measures which bugs the tests catch. A test that imports a function and asserts it does not throw can hit 100 percent line coverage while catching zero bugs. Mutation testing introduces small semantic changes and checks that at least one test fails — this is the ground truth for test quality.

How do I stop the LLM from generating tautological tests?

Three layers: prompt the model to test observable behavior through the public API (not internal state), run mutation testing and reject tests that pass all mutations, and include 2-3 examples of bad tautological tests in the system prompt as negative examples. Mutation testing is the only reliable automated filter.

Should I generate property-based tests or example-based tests?

Both, when possible. For pure functions with clean types, property-based (Hypothesis, fast-check, quickcheck) explore input space far more thoroughly. For UI, I/O-heavy, or complex-setup code, example-based is easier to maintain and debug. A good generator produces a handful of example tests plus a property test when the function admits one.

How much does test generation cost per test?

At scale, $0.02 to $0.15 per committed test depending on complexity. Budget drivers are: generation cost (cheap), sandbox execution (cheap), and mutation testing (expensive — often 50-70 percent of total cost). Running mutation testing only on merge, not on every generation, keeps cost bounded.

Can generated tests replace human-written tests?

For coverage of the long tail — utility functions, serializers, parsers, conversion helpers — yes, and they should. For core business logic, security-sensitive code, and complex workflows, human-written tests are still stronger because humans encode domain knowledge the model lacks. Use generation to get to 70 percent coverage fast, then have humans write the critical 20 percent.

How do I prevent flaky generated tests?

Prompt explicitly against Date.now, Math.random, network calls, and real filesystem access. Require test fixtures for dates and seeded randomness. Scan generated code for these patterns and retry if found. Run each new test 20 times in CI; any test that fails 1+ time out of 20 is rejected as flaky.

When should I use test generation vs Copilot-style inline suggestions?

Copilot is best for interactive tests-while-coding. Batch test generation is best for backfilling coverage on legacy modules, generating tests from a spec before implementation, or nightly coverage-guided runs. The pipeline pattern in this architecture is for batch; the interactive case needs lower latency and skips mutation testing.
