Reference Architecture · agent

QA Testing Agent

Last updated: April 16, 2026

Quick answer

The production stack uses Claude Sonnet 4 for test generation, an ephemeral sandbox (Vercel Sandbox, Daytona, or Docker) for execution, and Opus 4 for post-mortem analysis on failures. Expect $0.20–$1.00 per file tested and 30s–3min per test run. Key pattern: separate 'write tests' from 'fix tests' prompts — conflating them produces tests that just pass trivially.

The problem

Most codebases have 30–60% test coverage, and the gaps are where bugs hide. You need an agent that reads source + requirements, proposes high-value test cases (edge cases, error paths, not just happy path), writes them in your test framework, runs them, and diagnoses failures. The hard parts are generating tests that catch real bugs instead of tautological assertions, running them safely in an isolated environment, and distinguishing genuine failures from flakes.

Architecture

Pipeline: Trigger (input) → Code + Requirements Retrieval (data) → Test Case Planner (LLM) → Test Code Generator (LLM) → Ephemeral Execution Sandbox (infra). If tests pass, results go straight to the PR Comment / Report (output). On fail, the Flake Detector (infra) re-runs; if the failure is real, the Failure Diagnoser (LLM) analyzes it before reporting.

Trigger

PR opened, /test command in CI, nightly coverage scan, or file change event from a watch mode.

Alternatives: GitHub Actions, Pre-commit hook, CLI

Code + Requirements Retrieval

Pulls the target source file, related types, existing tests in the same module, and any linked requirements (PR description, Linear issue).

Alternatives: pgvector, Sourcegraph, Simple filesystem search

Test Case Planner

Enumerates test cases: happy path, edge cases (empty, null, max, min, concurrency), error paths. Prioritizes by estimated defect-finding value.

Alternatives: Claude Opus 4 for critical paths, GPT-4o
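The prioritization step can be sketched with a simple rubric. This is a minimal illustration only; the `kind` labels, weights, and the `budget` default are assumptions, not part of the source architecture:

```python
# Hypothetical sketch: rank planned test cases by a defect-finding
# rubric (edge case > error path > happy path) and cap the count.
RUBRIC = {"edge": 3, "error": 2, "happy": 1}

def prioritize(cases, budget=8):
    """Sort planned cases by rubric score (descending), keep top `budget`."""
    ranked = sorted(cases, key=lambda c: RUBRIC.get(c["kind"], 0), reverse=True)
    return ranked[:budget]

cases = [
    {"name": "returns sum", "kind": "happy"},
    {"name": "empty input", "kind": "edge"},
    {"name": "raises on None", "kind": "error"},
]
print([c["name"] for c in prioritize(cases)])
# ['empty input', 'raises on None', 'returns sum']
```

In practice the `kind` label would come from the planner LLM's structured output rather than being hand-assigned.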

Test Code Generator

Generates test code in the project's framework (Jest/Vitest/Pytest/Go test). Matches existing test style and uses project fixtures.

Alternatives: GPT-4o, DeepSeek R1

Ephemeral Execution Sandbox

Spins up isolated container with the project's runtime, runs tests, captures output + coverage delta. Times out at 5 min.

Alternatives: Daytona, Docker-in-Docker, E2B, Modal
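A rough local sketch of the execution step, using `subprocess` as a stand-in for the real microVM (the 5-minute timeout matches the spec above; a production setup would exec inside the ephemeral container, not the host):

```python
import subprocess
import sys

def run_in_sandbox(cmd, timeout=300):
    """Run a test command with a hard 5-minute timeout, capturing
    combined stdout/stderr for the diagnoser. Sketch only: real
    isolation comes from the microVM, not this wrapper."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return {"status": "pass" if proc.returncode == 0 else "fail",
                "output": proc.stdout + proc.stderr}
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "output": ""}

result = run_in_sandbox([sys.executable, "-c", "print('1 passed')"])
print(result["status"])  # pass
```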

Flake Detector

Any failing test is re-run 3 times. If the results are mixed (some passes, some failures), flag it as a flake and do not block the PR.

Alternatives: BuildKite Test Engine, CircleCI Flaky Test Detection, Custom
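The re-run classification can be sketched as follows (the `run_test` callable is a hypothetical stand-in for re-executing one test in the sandbox; the retry count matches the 3x rule above):

```python
def classify_failure(run_test, retries=3):
    """Re-run a failing test `retries` times and classify the outcome.
    run_test() returns True on pass, False on fail."""
    outcomes = [run_test() for _ in range(retries)]
    if all(outcomes):
        return "pass"   # initial failure didn't reproduce
    if any(outcomes):
        return "flake"  # non-deterministic: quarantine, don't block the PR
    return "fail"       # deterministic: hand off to the diagnoser

results = iter([True, False, True])
print(classify_failure(lambda: next(results)))  # flake
```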

Failure Diagnoser

On real failure, reads error + test + source code, writes a root-cause analysis, optionally proposes a code or test fix.

Alternatives: Claude Sonnet 4, GPT-4o reasoning

PR Comment / Report

Posts new tests, failures with diagnoses, coverage delta, and suggested fixes back to the PR or Slack.

Alternatives: GitHub PR comment, Slack thread, Dashboard

The stack

Test generator LLM: Claude Sonnet 4

Sonnet 4 produces tests that match existing project style better than GPT-4o and generates more meaningful edge cases. DeepSeek R1 is a strong cheap alternative but more verbose.

Alternatives: GPT-4o, DeepSeek R1

Failure diagnoser: Claude Opus 4

Root-cause analysis on a failed test is a reasoning task. Opus 4 with extended thinking finds real bugs that Sonnet misses, worth the cost for the ~20% of runs that fail.

Alternatives: Claude Sonnet 4, GPT-4o reasoning

Execution sandbox: Vercel Sandbox

Ephemeral Firecracker microVMs start in <1s and isolate untrusted AI-generated code. Critical if the agent is writing tests that exec, hit the network, or touch the filesystem.

Alternatives: Daytona, E2B, Modal, Docker-in-Docker on K8s

Code retrieval: Qdrant + code embeddings

Tests need related source context. Embedding search returns the caller, the types, and similar existing tests — far better than file-path heuristics.

Alternatives: pgvector, Sourcegraph

Flake detection: BuildKite Test Engine or custom

AI-generated tests will be flaky; that's inherent to LLM variance. Auto-detect and quarantine flakes or you'll train the team to ignore the bot.

Alternatives: CircleCI, Custom SQL on test history

Coverage tooling: Language-native (istanbul, coverage.py, go test -cover)

Coverage delta is the key signal for 'did this test add value'. Use the language-native tool and parse the output — avoid coverage-as-a-service latency.

Alternatives: Codecov, Coveralls
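A minimal sketch of computing the coverage delta from two parsed reports. The `{path: percent}` dict shape is an assumption for illustration; in practice the numbers would be parsed from istanbul's json-summary or coverage.py's JSON output:

```python
def coverage_delta(before, after):
    """Per-file coverage delta between the baseline run and the run
    that includes the newly generated tests."""
    return {path: round(after[path] - before.get(path, 0.0), 1)
            for path in after}

before = {"src/cart.ts": 62.0}
after = {"src/cart.ts": 78.5, "src/tax.ts": 40.0}
print(coverage_delta(before, after))
# {'src/cart.ts': 16.5, 'src/tax.ts': 40.0}
```

A zero or near-zero delta is the signal that a generated test is probably restating an existing one.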

Orchestration: Custom pipeline (GitHub Actions or Inngest)

This is CI-shaped work. GitHub Actions gives you free compute for open-source and matches the existing workflow. LangGraph overcomplicates a linear DAG.

Alternatives: Temporal, LangGraph

Cost at each scale

Prototype

100 files tested/mo

$55/mo

Sonnet 4 planner + writer: $25
Opus 4 diagnoser (~15% of runs): $12
Vercel Sandbox minutes: $10
Embeddings: $3
Infra: $5

Startup

8,000 files tested/mo

$3,600/mo

Sonnet 4 planner + writer (cached): $1,450
Opus 4 diagnoser: $900
Vercel Sandbox minutes: $600
Qdrant + embeddings: $250
BuildKite Test Engine: $200
Infra + observability: $200

Scale

250,000 files tested/mo (enterprise monorepo)

$72,000/mo

Sonnet 4 (heavy caching): $28,000
Opus 4 diagnoser: $18,000
Sandbox compute (dedicated): $12,000
Qdrant Enterprise + embeddings: $5,500
BuildKite Enterprise: $4,500
Infra + observability: $4,000

Latency budget

Total P50: 44,100ms · Total P95: 138,500ms

Context retrieval: 600ms median · 1,500ms p95
Test case planning: 2,500ms median · 5,000ms p95
Test code generation (5–10 tests): 6,000ms median · 12,000ms p95
Sandbox startup + install: 8,000ms median · 25,000ms p95
Test execution: 12,000ms median · 60,000ms p95
Opus 4 diagnosis (on failure): 15,000ms median · 35,000ms p95

Tradeoffs

Generate-then-run vs iterate-until-passes

Letting the agent loop 'write test, run, fix until passes' is tempting but usually produces tautological tests that just assert whatever the code currently does. Better: generate tests with an explicit oracle (requirements doc, spec, or behavioral description), run once, and diagnose real failures — don't let the agent rewrite the test to make it pass.

Sonnet 4 vs Opus 4 for generation

Sonnet 4 is fine for standard unit tests on well-specified code. Opus 4 is worth the extra cost for complex state machines, concurrency, and security-sensitive paths. Route by file tag — mark auth, payment, and state-machine files as 'opus-tier' in a config.
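A minimal routing sketch, assuming a glob-pattern config file; the `OPUS_TIER` patterns and model ID strings here are illustrative, not prescribed by the source:

```python
import fnmatch

# Hypothetical config: glob patterns whose files warrant Opus 4.
OPUS_TIER = ["src/auth/*", "src/payments/*", "*state_machine*"]

def pick_model(path):
    """Route test generation: Opus 4 for critical paths, Sonnet 4 otherwise."""
    if any(fnmatch.fnmatch(path, pat) for pat in OPUS_TIER):
        return "claude-opus-4"
    return "claude-sonnet-4"

print(pick_model("src/auth/token.py"))   # claude-opus-4
print(pick_model("src/utils/slug.py"))   # claude-sonnet-4
```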

Sandbox per test run vs shared environment

Per-run sandbox is isolated and repeatable but adds 5–15s cold start. Shared environment is fast but risks state leakage between runs. Use per-run for PR checks, shared (with reset-between) for fast local dev loops.

Failure modes & guardrails

Agent writes tautological tests (assert that code does what it does)

Mitigation: Require tests to reference an external oracle — the PR description, a requirements doc, or a previous stable test. Reject tests whose assertions only restate the code under test. Run an LLM-as-judge critique pass that scores test 'meaningfulness'.

Flaky tests blocked PRs and team loses trust

Mitigation: Every failing test runs 3 times before being reported as failed. Tests that pass ≥1 and fail ≥1 out of 3 get quarantined automatically. Track flake rate per test file and alert when it exceeds 2%.

AI-generated test code runs arbitrary commands or exfiltrates data

Mitigation: Sandbox every run in an ephemeral microVM with no network access except to npm/PyPI mirrors. No env vars from the main project. Static-analyze generated test code for network calls, subprocess spawns, and filesystem writes outside /tmp before executing.

Tests hit real external services (databases, APIs)

Mitigation: The sandbox has no network access by default. Generated tests must use the project's mocking fixtures; reject tests that import `requests`, `fetch`, or DB clients without a mock wrapper. Maintain an allowlist of mock-friendly modules.

Agent generates 50 low-value tests and clutters the suite

Mitigation: Enforce per-file test budget (max 8 tests per file per run). Planner prioritizes by estimated defect-finding value using a rubric (edge case > error path > happy path). Deduplicate against existing tests before writing — embed existing tests and check cosine similarity > 0.9.
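The deduplication check might look like this (assumes test embeddings are already computed; plain-Python cosine similarity for illustration, with the 0.9 threshold from the mitigation above):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def is_duplicate(new_vec, existing_vecs, threshold=0.9):
    """Drop a proposed test if its embedding is more than 0.9
    cosine-similar to any existing test's embedding."""
    return any(cosine(new_vec, v) > threshold for v in existing_vecs)

print(is_duplicate([1.0, 0.0], [[0.99, 0.1]]))  # True
print(is_duplicate([1.0, 0.0], [[0.0, 1.0]]))   # False
```

At scale this check would run against the Qdrant index rather than a Python list.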

Frequently asked questions

Does this replace Copilot's 'generate tests'?

It goes a step further. Copilot generates tests in the IDE but never runs them. This architecture generates, runs, and diagnoses failures, which is where most of the value is. Copilot is a good fallback for a quick local test but not a full QA loop.

Which LLM generates the best tests in 2026?

Claude Sonnet 4 leads on test quality benchmarks (SWE-bench Verified Tests, HumanEval Plus). GPT-4o is close. DeepSeek R1 is a very strong price-performance pick at ~40% of the cost. Opus 4 shines specifically on diagnosis, not generation.

Why do I need a sandbox?

LLM-generated code will sometimes do something you don't expect — open a socket, read env vars, write to ~/.ssh. The sandbox makes that safe. Vercel Sandbox, Daytona, and E2B all give you <1s microVM starts that isolate execution. Don't run AI-generated test code in your main CI runner without isolation.

Does the agent fix bugs it finds, or just report them?

Report by default. Proposing a code fix is fine if it's clearly isolated (typo, off-by-one, null check). For anything ambiguous, let the human decide — an auto-fix that changes semantics is worse than a clear failure report.

How do I keep costs down at scale?

Prompt caching on the source file + project test style examples saves 40–60% on large monorepos. Only invoke Opus 4 on failure diagnosis, not generation. Skip files with recent test additions (last 7 days) — don't re-test what humans just touched.
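The skip rule for recently tested files can be sketched as follows (`last_test_added_at` is a hypothetical map of file path to the unix timestamp of its most recent human-authored test change):

```python
import time

SEVEN_DAYS = 7 * 24 * 3600

def should_test(path, last_test_added_at, now=None):
    """Skip files whose tests a human touched in the last 7 days."""
    now = now if now is not None else time.time()
    ts = last_test_added_at.get(path)
    return ts is None or (now - ts) > SEVEN_DAYS

now = 1_700_000_000
recent = {"src/cart.py": now - 3600}  # tests updated an hour ago
print(should_test("src/cart.py", recent, now))  # False
print(should_test("src/tax.py", recent, now))   # True
```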

Can it generate end-to-end tests too?

Yes, with Playwright or Cypress adapters. E2E is harder because the oracle is fuzzier (what does 'the page works' mean) and flake is higher. Start with unit and integration; add E2E once unit coverage is solid and you have good page-object abstractions.

What's the right coverage target?

80% line coverage on business logic, 60% on glue code. Don't chase 100% — the last 20% is usually trivial code (getters, config loading) and the tests add more noise than value. Track mutation-testing scores too — coverage without mutants killed is an illusion.
