
Function-Level Code Generation

Last updated: April 16, 2026

Quick answer

The production pattern generates tests from the spec first (Claude Sonnet 4), then generates the implementation (Claude Sonnet 4 or GPT-4o), runs the tests in a sandbox (Vercel Sandbox, E2B, or Modal), feeds failures back for up to 3 repair rounds, and blocks merge on type errors or lint failures. Expect $0.05 to $0.25 per function with 80-90 percent first-try pass rates on straightforward specs.

The problem

You want to generate a working function (in TypeScript, Python, Go, etc.) from a natural-language spec plus a signature, where 'working' means passes tests, passes the type checker, and conforms to the codebase style. Shipping raw LLM output into a codebase produces confident-looking broken code; the system needs a test-driven loop with sandboxed execution, static analysis, and bounded repair attempts.

Architecture

Pipeline (diagram summary): Spec & Signature Input (input) → Codebase Context Builder (data) → Test Generator (LLM) → Implementation Generator (LLM) → Sandboxed Execution (infra) → Static Analysis Gate (infra). On test or analysis failure, the Iterative Repair Loop (LLM) feeds back into the Implementation Generator; on tests pass, the pipeline emits PR / Diff Output.

Spec & Signature Input

Accepts the target function signature, a natural-language description of behavior, acceptance criteria, and optional sample inputs/outputs. Also pulls in surrounding module context (imports, types, related functions).

Alternatives: GitHub issue parser, Inline comment trigger (like Copilot), Slack command
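The input stage can be modeled as a small structured record. A minimal sketch — the field names here are illustrative, not a fixed API:

```python
from dataclasses import dataclass, field

# Hypothetical request shape for the input stage; adjust fields to your workflow.
@dataclass
class FunctionSpec:
    signature: str                     # e.g. "def slugify(title: str) -> str:"
    description: str                   # natural-language behavior spec
    acceptance_criteria: list[str]     # testable statements
    examples: list[tuple] = field(default_factory=list)  # optional (input, output) pairs
    module_context: str = ""           # surrounding imports/types pulled from the codebase

spec = FunctionSpec(
    signature="def slugify(title: str) -> str:",
    description="Lowercase, replace spaces with hyphens, strip non-alphanumerics.",
    acceptance_criteria=["handles empty string", "collapses repeated spaces"],
    examples=[("Hello World", "hello-world")],
)
```

Keeping the spec as a typed record (rather than a free-form prompt string) makes it easy to validate upstream and to cache by content hash downstream.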

Codebase Context Builder

Retrieves type definitions, related functions, and style conventions from the codebase via AST parsing plus embedding search. Keeps context under a token budget.

Alternatives: Full-file include, Ripgrep + heuristic selection, tree-sitter based symbol graph, SCIP indexing
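The token-budget constraint reduces to a packing problem over retrieved snippets. A greedy sketch, assuming snippets arrive with relevance scores; the whitespace-split token count is a crude stand-in for the model's real tokenizer:

```python
# Greedy packer: keep the highest-scoring snippets that fit the token budget.
def build_context(snippets: list[tuple[float, str]], budget_tokens: int) -> str:
    chosen, used = [], 0
    for score, text in sorted(snippets, key=lambda s: -s[0]):
        cost = len(text.split())   # stand-in for a proper tokenizer count
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return "\n\n".join(chosen)

ctx = build_context(
    [(0.9, "type User = { id: string }"), (0.4, "helper fn " * 50)],
    budget_tokens=20,
)
```

Here the high-scoring type definition fits the budget and the long low-scoring snippet is dropped, which is the behavior you want when context is scarce.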

Test Generator

Given the spec and signature, emits unit tests covering the acceptance criteria plus edge cases (nulls, empty inputs, boundary values, error paths) before any implementation exists.

Alternatives: GPT-4o, DeepSeek R1 (strong reasoning about edge cases), Skip if spec already includes tests

Implementation Generator

Generates the function body given the signature, spec, codebase context, and the tests it needs to pass. Uses structured output with a code field and a rationale field.

Alternatives: GPT-4o, Claude Opus 4 for hard algorithmic problems, DeepSeek V3 for cost-sensitive volume
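The structured output described above can be validated deterministically before anything runs. A minimal parser, assuming the model is asked for a JSON object with `code` and `rationale` fields (the exact schema is an assumption):

```python
import json

def parse_implementation(raw: str) -> tuple[str, str]:
    # Fail loudly if the model's structured output is malformed or incomplete.
    obj = json.loads(raw)
    if not isinstance(obj.get("code"), str) or not isinstance(obj.get("rationale"), str):
        raise ValueError("model output missing required 'code'/'rationale' fields")
    return obj["code"], obj["rationale"]

code, why = parse_implementation(
    '{"code": "def add(a, b):\\n    return a + b", "rationale": "direct sum"}'
)
```

Validating the envelope here keeps schema failures out of the sandbox, where they would be mistaken for runtime failures.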

Sandboxed Execution

Runs the generated function against the tests in an isolated environment. Captures stdout, stderr, test results, and runtime traces. Hard timeout and memory limits enforced.

Alternatives: Vercel Sandbox (Firecracker), E2B, Modal, Docker container per request, AWS Lambda sandbox
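For a prototype, the execution step can be sketched as a subprocess with a hard wall-clock timeout. As the stack section notes, this does not isolate network or filesystem access — it is a stand-in for a Firecracker-based sandbox, not a replacement:

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(source: str, timeout_s: float = 5.0) -> dict:
    # Write the candidate to a temp file and run it in a child interpreter.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path], capture_output=True,
                              text=True, timeout=timeout_s)
        return {"passed": proc.returncode == 0,
                "stdout": proc.stdout, "stderr": proc.stderr}
    except subprocess.TimeoutExpired:
        # Infinite loops in generated code hit the hard timeout instead of hanging us.
        return {"passed": False, "stdout": "", "stderr": "timeout"}
    finally:
        os.unlink(path)

result = run_candidate("assert 1 + 1 == 2")
```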

Static Analysis Gate

Runs the type checker (tsc, mypy, go vet), linter (eslint, ruff, golangci-lint), and formatter (prettier, black, gofmt). Any error blocks progression.

Alternatives: ESLint + Prettier only (JS/TS), Language-specific toolchain, Skip for prototype workflows
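The gate itself is just "run each tool, block on the first nonzero exit." A sketch — the commands you pass in would be your project's toolchain (tsc/eslint, mypy/ruff, go vet, etc.):

```python
import subprocess

def static_gate(commands: list[list[str]]) -> tuple[bool, str]:
    # Run each static tool in order; any failure blocks progression and
    # returns the tool's output for the repair loop.
    for cmd in commands:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            return False, f"{' '.join(cmd)} failed:\n{proc.stdout}{proc.stderr}"
    return True, ""
```

Returning the failing tool's raw output matters: it is exactly the actionable feedback the repair loop needs.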

Iterative Repair Loop

On test or static analysis failure, feeds the failure output back to the implementation model with an explicit instruction to fix. Capped at 3 rounds.

Alternatives: Self-consistency (N samples, vote), Use a stronger repair model (Opus) after first failure, Abort after 1 failure for interactive workflows
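The bounded loop with escalation can be sketched as follows. `generate` and `evaluate` are hypothetical callables standing in for the LLM call and the sandbox-plus-static-analysis gate:

```python
MAX_ROUNDS = 3  # cap from the repair strategy above

def generate_with_repair(generate, evaluate, prompt: str):
    attempt = generate(prompt, model="sonnet-4")
    for round_no in range(1, MAX_ROUNDS + 1):
        ok, failure = evaluate(attempt)
        if ok:
            return attempt
        # Escalate to the stronger model from round 2 onward.
        model = "opus-4" if round_no >= 2 else "sonnet-4"
        attempt = generate(
            f"{prompt}\n\nPrevious attempt failed:\n{failure}\nFix it.",
            model=model,
        )
    return None  # budget exhausted -> hand off to human review
```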

PR / Diff Output

Opens a pull request (or returns a diff) with the function, its tests, and a summary of what was generated, tests passed, and any caveats the model flagged.

Alternatives: GitHub PR, Direct commit to branch, Inline editor diff, Return JSON for downstream tooling

The stack

Implementation model: Claude Sonnet 4 as default, Opus 4 for hard problems

Sonnet 4 is the best balance of code quality and latency in 2026 for function-level generation. Opus 4 wins on tasks involving complex invariants (crypto, distributed systems, numerical stability) and is worth the 5x cost for those. Route by task complexity rather than defaulting to Opus.

Alternatives: GPT-4o, DeepSeek V3 for cost, o3-mini for reasoning-heavy algorithmic tasks

Test generation model: Claude Sonnet 4 with explicit edge-case checklist in prompt

Test quality caps the whole pipeline — bad tests either pass broken code or force infinite repair loops. Sonnet 4 produces the most exhaustive tests when prompted with an explicit checklist (happy path, empty, null, boundary, error, concurrency). DeepSeek R1 is a strong choice when the function has subtle invariants.

Alternatives: GPT-4o, DeepSeek R1 for reasoning-heavy edge cases

Sandboxed execution: Vercel Sandbox (Firecracker microVMs) or E2B

Running generated code with any network access or unbounded resources on your own infrastructure is dangerous. Firecracker-based sandboxes (Vercel Sandbox, E2B) give you isolated VMs with boot times under 200ms at Lambda-like cost. Local subprocess with ulimits works for prototypes but does not isolate network or filesystem access sufficiently.

Alternatives: Modal, Docker-in-Docker, AWS Lambda, Local subprocess with strict ulimits

Static analysis: Full project toolchain (tsc, eslint, prettier / mypy, ruff / go vet, golangci-lint)

Type errors and lint failures are cheap to detect and expensive to debug at runtime. Running the full static toolchain before sandbox execution catches ~30 percent of broken generations without spending a sandbox invocation, and gives the repair loop much more actionable feedback.

Alternatives: Linter only, Language server diagnostics via LSP

Repair strategy: Bounded repair (3 rounds max), escalate to stronger model on round 2

Most fixable generations are fixed in 1-2 rounds; beyond 3 rounds, the model is usually stuck in a local minimum and a fresh sample or human review is more likely to help. Escalating to Opus 4 on round 2 lifts success rate ~15 percent on hard problems at roughly 2x the overall cost for those attempts.

Alternatives: Unbounded retry (budget risk), Self-consistency (N samples, majority vote), One-shot only

Evaluation: HumanEval + MBPP + internal held-out set of your own function specs

Published benchmarks measure general capability; your internal held-out set measures fit to your codebase conventions. Track first-try pass rate, after-repair pass rate, and rejection rate separately. A model that wins on HumanEval can lose on your codebase if style conventions are idiosyncratic.

Alternatives: SWE-bench (repo-level, different pattern), LiveCodeBench, Custom internal benchmarks

Cost at each scale

Prototype

200 functions/mo

$45/mo

Test generation (Sonnet 4): $10
Implementation (Sonnet 4): $15
Repair rounds (avg 0.8 extra calls per function): $8
Sandbox execution (E2B): $7
Hosting + CI: $5

Startup

5k functions/mo

$850/mo

Test generation (Sonnet 4): $220
Implementation (Sonnet 4, cached context): $280
Repair (cached spec): $120
Sandbox execution: $90
Static analysis + CI: $40
Observability (Braintrust): $100

Scale

150k functions/mo

$18,500/mo

Test generation (Sonnet 4, cached context): $4,500
Implementation (mixed Sonnet 4 / Opus 4 by complexity): $6,500
Repair rounds: $2,200
Sandbox execution (Vercel Sandbox / E2B): $1,800
Codebase indexing (embeddings + AST): $900
Infra + CI: $1,000
Observability + evals: $1,600

Latency budget

Total P50: 11,350ms · Total P95: 24,600ms

Context build (AST + embedding): 350ms median · 900ms p95
Test generation (streamed): 2,800ms median · 5,500ms p95
Implementation generation (streamed): 3,200ms median · 6,500ms p95
Static analysis (tsc + eslint): 800ms median · 2,200ms p95
Sandbox execution: 1,200ms median · 3,500ms p95
Repair round (when triggered): 3,000ms median · 6,000ms p95

Tradeoffs

Tests-first vs implementation-first

Generating tests before implementation locks the model into a verifiable spec and gives the repair loop concrete failure signals. Generating implementation first is faster (skip the test-generation call) but produces code that might satisfy the model's interpretation of the spec rather than the human's. For anything destined for production, tests-first is strictly better despite the extra call.

Repair vs re-sample

On failure, you can either feed the error back to the same model (repair) or sample a fresh generation with higher temperature (diversity). Repair is faster and cheaper but can get stuck in a local minimum where the model keeps making the same structural mistake. A hybrid — repair once, then re-sample with fresh context and higher temperature — lifts success rates by 10-15 percent on hard cases.

Sandbox everything vs trust the type checker

Running every generation in a sandbox catches runtime bugs the type checker misses but adds ~1-3 seconds per attempt. For pure functions with strong types (TypeScript, Rust) the type checker alone catches 70 percent of broken generations. For anything with I/O, concurrency, or loose types (Python without strict mypy), always sandbox.

Failure modes & guardrails

Generated code passes the tests but is subtly wrong (overfit)

Mitigation: Generate tests before the implementation and lock them (the implementation model never sees them as editable). Use property-based tests (Hypothesis, fast-check) for a second layer. Maintain a held-out eval set of 200+ functions with hidden tests to catch overfitting trends per model version.
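The property-based second layer can be illustrated with a hand-rolled randomized check — libraries like Hypothesis or fast-check do this far more thoroughly (input strategies, shrinking). The `slugify` here is a toy reference implementation, and the properties (no spaces in output, idempotence) are examples:

```python
import random

def slugify(title: str) -> str:
    # Toy implementation used only to demonstrate the property check.
    return "-".join(part for part in title.lower().split() if part)

def check_property(fn, trials: int = 200) -> bool:
    # Random inputs over a small alphabet that includes spaces; the properties
    # must hold for every sample, not just the hand-written examples.
    rng = random.Random(0)
    for _ in range(trials):
        s = "".join(rng.choice("ab c") for _ in range(rng.randrange(0, 12)))
        out = fn(s)
        if " " in out or fn(out) != out:   # no spaces; idempotent
            return False
    return True
```

An implementation overfit to a few example-based tests can still fail a property like idempotence on random inputs, which is exactly the overfitting signal you want.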

Repair loop oscillates (fix A breaks B, fix B breaks A)

Mitigation: Detect oscillation by hashing the set of failing tests across repair rounds; if the hash repeats, abort the loop and escalate. On round 2, switch to a stronger model (Opus 4). On round 3 failure, mark for human review rather than continue burning budget.
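The hashing trick above is a few lines: hash the sorted set of failing test names each round and abort if the hash has been seen before. A minimal sketch:

```python
import hashlib

def failing_set_hash(failing_tests: list[str]) -> str:
    # Sort first so the hash depends only on the set, not the report order.
    return hashlib.sha256("\n".join(sorted(failing_tests)).encode()).hexdigest()

def is_oscillating(history: list[str], current_failures: list[str]) -> bool:
    # A repeated hash means the loop has cycled back to a previous failure state.
    return failing_set_hash(current_failures) in history
```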

Hallucinated imports or API calls against libraries not in the codebase

Mitigation: Parse the final import list with AST, verify every imported module exists in package.json / pyproject.toml / go.mod. Maintain an allowlist of known-good libraries for the project. Fail closed: a function with a hallucinated import is rejected, not merged.
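For Python, the import check is direct with the stdlib `ast` module; the declared-dependency set would come from parsing pyproject.toml (that parsing step is omitted here):

```python
import ast

def undeclared_imports(source: str, declared: set[str]) -> set[str]:
    # Collect every top-level module the candidate imports, then fail closed
    # on anything not in the declared dependency set.
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            found.add(node.module.split(".")[0])
    return found - declared

bad = undeclared_imports("import numpy\nfrom os import path", declared={"os"})
```

A nonempty result means the candidate is rejected outright, per the fail-closed policy above.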

Codebase context retrieval misses a required helper or type

Mitigation: On static-analysis failure with 'type not found' or 'undefined function' errors, automatically run a second retrieval pass seeded by the missing symbol names and retry. Track the symbols that are repeatedly missed; they are usually indexing bugs in the context builder.

Sandbox escape or resource abuse from generated code

Mitigation: Use Firecracker-based sandboxes with hard CPU, memory, network, and filesystem limits. Never run generated code on your main infrastructure. For any code that touches secrets or production data, require human review before execution outside the sandbox.

Frequently asked questions

Which LLM is best for function-level code generation in 2026?

Claude Sonnet 4 is the default choice — best balance of code quality and latency. Opus 4 wins on hard algorithmic problems (crypto, numerical stability, complex invariants) at about 5x the cost. GPT-4o is close to Sonnet. DeepSeek V3 is the value play for high volumes. Always benchmark against your codebase; model rankings flip by stack.

Should I generate tests before or after the implementation?

Before. Tests generated before the implementation act as an executable spec and give the repair loop concrete failure signals. Tests generated after tend to overfit to whatever the implementation happens to do, which masks bugs. Yes, this is an extra LLM call — and yes, it is worth it for anything going to production.

How many repair rounds are worth the cost?

Three rounds is the sweet spot in 2026 benchmarks. First round fixes ~45 percent of failures, round 2 adds ~25 percent, round 3 adds ~10 percent, and beyond that the marginal success rate is under 3 percent. After 3 rounds, a fresh sample or human review is more likely to help than more repair.

Do I need a sandbox, or can I run generated code locally?

You need a sandbox. Generated code can have infinite loops, high memory use, or network calls that you did not anticipate. Firecracker-based sandboxes (Vercel Sandbox, E2B) give you isolated VMs with sub-200ms boot times at low cost. Running generated code on your main infrastructure is not worth the risk.

How much codebase context should I include in the prompt?

Enough to answer the specific question, not more. Typical budget is 5-20k tokens: the target file's types and imports, 3-5 most-related functions retrieved by embedding similarity, and any shared utility types. Dumping the whole file or repo dilutes the signal and increases cost without improving quality past a point.

How do I evaluate code generation quality?

Track four metrics: first-try test pass rate, after-repair pass rate, static-analysis pass rate, and held-out benchmark (HumanEval, MBPP, plus your own internal set). First-try rate shows raw model quality; after-repair shows pipeline quality. Track these per model version so you can detect regressions.

Should I fine-tune on my codebase?

Usually no. Modern models with good context retrieval match fine-tuned quality for most codebases. Fine-tuning pays off when your codebase has strong idiosyncratic conventions (custom framework, unusual style) and volume is above ~10k generations/month. Start with prompt engineering plus retrieval; fine-tune only when plateauing.

What prevents the model from cheating by hallucinating imports or APIs?

Deterministic post-generation validation. Parse the AST, extract imports, and verify each against package.json / pyproject.toml / go.mod. Fail closed — reject any function that imports something not in the declared dependencies. The model can pretend it knows a library exists; the package manifest cannot.
