Function-Level Code Generation
Last updated: April 16, 2026
Quick answer
The production pattern generates tests from the spec first (Claude Sonnet 4), then generates the implementation (Claude Sonnet 4 or GPT-4o), runs the tests in a sandbox (Vercel Sandbox, E2B, or Modal), feeds failures back for up to 3 repair rounds, and blocks merge on type errors or lint failures. Expect $0.05 to $0.25 per function with 80-90 percent first-try pass rates on straightforward specs.
The problem
You want to generate a working function (in TypeScript, Python, Go, etc.) from a natural-language spec plus a signature, where 'working' means it passes the tests, passes the type checker, and conforms to codebase style. Shipping raw LLM output into a codebase produces confident-looking broken code; the system needs a test-driven loop with sandboxed execution, static analysis, and bounded repair attempts.
Architecture
Spec & Signature Input
Accepts the target function signature, a natural-language description of behavior, acceptance criteria, and optional sample inputs/outputs. Also pulls in surrounding module context (imports, types, related functions).
Alternatives: GitHub issue parser, Inline comment trigger (like Copilot), Slack command
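The input stage can be sketched as a small schema. This is illustrative only — field names like `acceptance_criteria` and `module_context` are hypothetical, not a published API:

```python
from dataclasses import dataclass, field

@dataclass
class FunctionSpec:
    signature: str                  # e.g. "def slugify(title: str) -> str"
    description: str                # natural-language behavior spec
    acceptance_criteria: list[str]  # testable statements about behavior
    examples: list[tuple] = field(default_factory=list)  # optional (input, output) pairs
    module_context: str = ""        # surrounding imports, types, related functions

spec = FunctionSpec(
    signature="def slugify(title: str) -> str",
    description="Lowercase; replace runs of non-alphanumerics with single hyphens.",
    acceptance_criteria=[
        "empty string returns empty string",
        "no leading or trailing hyphens",
    ],
    examples=[("Hello, World!", "hello-world")],
)
```

Whatever the trigger (issue parser, inline comment, Slack command), normalizing into one structure like this keeps the downstream stages identical.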
Codebase Context Builder
Retrieves type definitions, related functions, and style conventions from the codebase via AST parsing plus embedding search. Keeps context under a token budget.
Alternatives: Full-file include, Ripgrep + heuristic selection, tree-sitter based symbol graph, SCIP indexing
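The budget-keeping step reduces to a greedy packing problem: take the highest-relevance snippets until the token budget is exhausted. A minimal sketch, using whitespace-split word count as a crude stand-in for a real tokenizer:

```python
def pack_context(snippets: list[tuple[float, str]], budget_tokens: int) -> list[str]:
    """snippets: (relevance_score, text) pairs from AST parsing + embedding search.
    Greedily keep the highest-scoring snippets that fit under the budget."""
    chosen, used = [], 0
    for score, text in sorted(snippets, key=lambda s: -s[0]):
        cost = len(text.split())  # crude token estimate; use a real tokenizer in production
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen
```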
Test Generator
Given the spec and signature, emits unit tests covering the acceptance criteria plus edge cases (nulls, empty inputs, boundary values, error paths) before any implementation exists.
Alternatives: GPT-4o, DeepSeek R1 (strong reasoning about edge cases), Skip if spec already includes tests
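The edge-case checklist works best when it is explicit in the prompt rather than implied. A sketch of a prompt builder — the wording and checklist items are illustrative, not a canonical prompt:

```python
EDGE_CASE_CHECKLIST = [
    "happy path",
    "empty input",
    "null/None input",
    "boundary values",
    "error paths",
    "concurrency (if applicable)",
]

def build_test_prompt(signature: str, spec: str, criteria: list[str]) -> str:
    checklist = "\n".join(f"- {item}" for item in EDGE_CASE_CHECKLIST)
    cases = "\n".join(f"- {c}" for c in criteria)
    return (
        "Write unit tests for this function BEFORE any implementation exists.\n"
        f"Signature: {signature}\n"
        f"Spec: {spec}\n"
        f"Acceptance criteria:\n{cases}\n"
        f"Cover every item on this checklist:\n{checklist}\n"
        "Output only the test file."
    )
```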
Implementation Generator
Generates the function body given the signature, spec, codebase context, and the tests it needs to pass. Uses structured output with a code field and a rationale field.
Alternatives: GPT-4o, Claude Opus 4 for hard algorithmic problems, DeepSeek V3 for cost-sensitive volume
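The structured output should be validated deterministically before anything else runs. A minimal sketch, assuming the model is asked for a JSON object with `code` and `rationale` fields:

```python
import json

def parse_generation(raw: str) -> tuple[str, str]:
    """Validate the model's structured output: a JSON object with a string
    'code' field and an optional 'rationale'. Raises ValueError otherwise."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}") from e
    if not isinstance(obj, dict) or not isinstance(obj.get("code"), str):
        raise ValueError("missing string 'code' field")
    return obj["code"], obj.get("rationale", "")
```

Failing fast here means a malformed generation costs one cheap parse, not a sandbox run.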
Sandboxed Execution
Runs the generated function against the tests in an isolated environment. Captures stdout, stderr, test results, and runtime traces. Hard timeout and memory limits enforced.
Alternatives: Vercel Sandbox (Firecracker), E2B, Modal, Docker container per request, AWS Lambda sandbox
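For a prototype, the execution step can be sketched as a subprocess with a hard wall-clock timeout. To be clear: this is not a sandbox — there is no network or filesystem isolation, which is exactly why Firecracker-based services are recommended for production:

```python
import os
import subprocess
import sys
import tempfile

def run_candidate(code: str, timeout_s: float = 5.0) -> tuple[bool, str, str]:
    """Run a candidate script in a subprocess; capture stdout/stderr and
    enforce a hard timeout. Prototype-grade only: no real isolation."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return proc.returncode == 0, proc.stdout, proc.stderr
    except subprocess.TimeoutExpired:
        return False, "", f"timed out after {timeout_s}s"
    finally:
        os.unlink(path)
```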
Static Analysis Gate
Runs the type checker (tsc, mypy, go vet), linter (eslint, ruff, golangci-lint), and formatter (prettier, black, gofmt). Any error blocks progression.
Alternatives: ESLint + Prettier only (JS/TS), Language-specific toolchain, Skip for prototype workflows
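The gate's shape is simple: run checks, collect errors, block on any. A minimal stand-in using Python's own `ast` parse for syntax errors plus one example lint rule — a real gate shells out to the full toolchain (mypy/ruff, tsc/eslint) and aggregates their diagnostics the same way:

```python
import ast

def static_gate(code: str) -> list[str]:
    """Return a list of blocking errors; empty list means the gate passes.
    Stand-in checks: syntax errors via ast.parse, bare 'except:' as lint."""
    try:
        tree = ast.parse(code)
    except SyntaxError as e:
        return [f"syntax error: line {e.lineno}: {e.msg}"]
    errors = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            errors.append(f"lint: bare 'except:' at line {node.lineno}")
    return errors
```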
Iterative Repair Loop
On test or static analysis failure, feeds the failure output back to the implementation model with an explicit instruction to fix. Capped at 3 rounds.
Alternatives: Self-consistency (N samples, vote), Use a stronger repair model (Opus) after first failure, Abort after 1 failure for interactive workflows
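The loop's control flow is model-agnostic. A skeleton with injected callables (`generate` and `run_tests` have hypothetical signatures; `run_tests` returns a pass flag plus failure output to feed back):

```python
def repair_loop(generate, run_tests, spec, max_rounds: int = 3) -> str:
    """Generate, test, and repair up to max_rounds times; raise if still failing."""
    code = generate(spec, feedback=None)           # initial generation
    for _ in range(max_rounds):
        passed, failures = run_tests(code)
        if passed:
            return code
        code = generate(spec, feedback=failures)   # one repair round
    passed, _ = run_tests(code)                    # test the final repair
    if passed:
        return code
    raise RuntimeError(f"still failing after {max_rounds} repair rounds; escalate to human review")
```

Swapping the `generate` callable per round is how the "stronger model on round 2" escalation plugs in without changing the loop.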
PR / Diff Output
Opens a pull request (or returns a diff) with the function, its tests, and a summary of what was generated, tests passed, and any caveats the model flagged.
Alternatives: GitHub PR, Direct commit to branch, Inline editor diff, Return JSON for downstream tooling
The stack
Implementation model: Claude Sonnet 4
Sonnet 4 is the best balance of code quality and latency in 2026 for function-level generation. Opus 4 wins on tasks involving complex invariants (crypto, distributed systems, numerical stability) and is worth the 5x cost for those. Route by task complexity, not by default to Opus.
Alternatives: GPT-4o, DeepSeek V3 for cost, o3-mini for reasoning-heavy algorithmic tasks
Test generation: Claude Sonnet 4
Test quality caps the whole pipeline — bad tests either pass broken code or force infinite repair loops. Sonnet 4 produces the most exhaustive tests when prompted with an explicit checklist (happy path, empty, null, boundary, error, concurrency). DeepSeek R1 is a strong choice when the function has subtle invariants.
Alternatives: GPT-4o, DeepSeek R1 for reasoning-heavy edge cases
Sandboxing: Firecracker-based (Vercel Sandbox or E2B)
Running generated code with any network access or unbounded resources on your own infrastructure is dangerous. Firecracker-based sandboxes (Vercel Sandbox, E2B) give you isolated VMs with boot times under 200ms at Lambda-like cost. Local subprocess with ulimits works for prototypes but does not isolate network or filesystem access sufficiently.
Alternatives: Modal, Docker-in-Docker, AWS Lambda, Local subprocess with strict ulimits
Static analysis: full toolchain, before the sandbox
Type errors and lint failures are cheap to detect and expensive to debug at runtime. Running the full static toolchain before sandbox execution catches ~30 percent of broken generations without spending a sandbox invocation, and gives the repair loop much more actionable feedback.
Alternatives: Linter only, Language server diagnostics via LSP
Repair budget: 3 rounds, escalate on round 2
Most fixable generations are fixed in 1-2 rounds; beyond 3 rounds, the model is usually stuck in a local minimum and a fresh sample or human review is more likely to help. Escalating to Opus 4 on round 2 lifts success rate ~15 percent on hard problems at roughly 2x the overall cost for those attempts.
Alternatives: Unbounded retry (budget risk), Self-consistency (N samples, majority vote), One-shot only
Evaluation: internal held-out benchmark
Published benchmarks measure general capability; your internal held-out set measures fit to your codebase conventions. Track first-try pass rate, after-repair pass rate, and rejection rate separately. A model that wins on HumanEval can lose on your codebase if style conventions are idiosyncratic.
Alternatives: SWE-bench (repo-level, different pattern), LiveCodeBench, Custom internal benchmarks
Cost at each scale
Prototype: 200 functions/mo, $45/mo
Startup: 5k functions/mo, $850/mo
Scale: 150k functions/mo, $18,500/mo
Latency budget
Tradeoffs
Tests-first vs implementation-first
Generating tests before implementation locks the model into a verifiable spec and gives the repair loop concrete failure signals. Generating implementation first is faster (skip the test-generation call) but produces code that might satisfy the model's interpretation of the spec rather than the human's. For anything destined for production, tests-first is strictly better despite the extra call.
Repair vs re-sample
On failure, you can either feed the error back to the same model (repair) or sample a fresh generation with higher temperature (diversity). Repair is faster and cheaper but can get stuck in a local minimum where the model keeps making the same structural mistake. A hybrid — repair once, then re-sample with fresh context and higher temperature — lifts success rates by 10-15 percent on hard cases.
Sandbox everything vs trust the type checker
Running every generation in a sandbox catches runtime bugs the type checker misses but adds ~1-3 seconds per attempt. For pure functions with strong types (TypeScript, Rust), the type checker alone catches ~70 percent of broken generations. For anything with I/O, concurrency, or loose types (Python without strict mypy), always sandbox.
Failure modes & guardrails
Generated code passes the tests but is subtly wrong (overfit)
Mitigation: Generate tests before the implementation and lock them (the implementation model never sees them as editable). Use property-based tests (Hypothesis, fast-check) for a second layer. Maintain a held-out eval set of 200+ functions with hidden tests to catch overfitting trends per model version.
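A property-based layer catches overfit implementations that example tests miss. A lightweight stdlib stand-in for Hypothesis/fast-check, checking two properties of a hypothetical `slugify` (no leading/trailing hyphens, idempotence) over random inputs:

```python
import random
import re
import string

def slugify(title: str) -> str:
    """Candidate under test (illustrative implementation)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def check_property(fn, trials: int = 200) -> None:
    """Assert the slug properties hold on random inputs; seeded for reproducibility."""
    rng = random.Random(0)
    alphabet = string.ascii_letters + string.digits + " .,!-_"
    for _ in range(trials):
        s = "".join(rng.choice(alphabet) for _ in range(rng.randint(0, 20)))
        out = fn(s)
        assert not out.startswith("-") and not out.endswith("-"), (s, out)
        assert fn(out) == out, f"not idempotent on {s!r}"

check_property(slugify)
```

In production, prefer a real property-testing library; shrinking failing inputs to minimal counterexamples is most of the value.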
Repair loop oscillates (fix A breaks B, fix B breaks A)
Mitigation: Detect oscillation by hashing the set of failing tests across repair rounds; if the hash repeats, abort the loop and escalate. On round 2, switch to a stronger model (Opus 4). On round 3 failure, mark for human review rather than continue burning budget.
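The oscillation check itself is a few lines: hash the set of failing test names each round and abort on a repeat. A sketch:

```python
def is_oscillating(history: list[frozenset[str]], failing: set[str]) -> bool:
    """Record this round's failing-test signature; True if it was seen before,
    meaning the repair loop is cycling rather than converging."""
    sig = frozenset(failing)
    if sig in history:
        return True
    history.append(sig)
    return False
```

Using the set of failing tests (rather than the code text) catches the fix-A-breaks-B / fix-B-breaks-A cycle even when each round's code differs.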
Hallucinated imports or API calls against libraries not in the codebase
Mitigation: Parse the final import list with AST, verify every imported module exists in package.json / pyproject.toml / go.mod. Maintain an allowlist of known-good libraries for the project. Fail closed: a function with a hallucinated import is rejected, not merged.
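For Python, the import check is a short AST walk. A sketch that returns any top-level imported module not in the declared dependency set (assumed to be parsed from pyproject.toml elsewhere); relative imports are skipped since they resolve within the repo:

```python
import ast

def undeclared_imports(code: str, declared: set[str]) -> set[str]:
    """Top-level module names imported by `code` but absent from `declared`.
    A non-empty result means the generation is rejected (fail closed)."""
    tree = ast.parse(code)
    imported = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            imported.add(node.module.split(".")[0])  # skip relative imports
    return imported - declared
```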
Codebase context retrieval misses a required helper or type
Mitigation: On static-analysis failure with 'type not found' or 'undefined function' errors, automatically run a second retrieval pass seeded by the missing symbol names and retry. Track the symbols that are repeatedly missed; they are usually indexing bugs in the context builder.
Sandbox escape or resource abuse from generated code
Mitigation: Use Firecracker-based sandboxes with hard CPU, memory, network, and filesystem limits. Never run generated code on your main infrastructure. For any code that touches secrets or production data, require human review before execution outside the sandbox.
Frequently asked questions
Which LLM is best for function-level code generation in 2026?
Claude Sonnet 4 is the default choice — best balance of code quality and latency. Opus 4 wins on hard algorithmic problems (crypto, numerical stability, complex invariants) at about 5x the cost. GPT-4o is close to Sonnet. DeepSeek V3 is the value play for high volumes. Always benchmark against your codebase; model rankings flip by stack.
Should I generate tests before or after the implementation?
Before. Tests generated before the implementation act as an executable spec and give the repair loop concrete failure signals. Tests generated after tend to overfit to whatever the implementation happens to do, which masks bugs. Yes, this is an extra LLM call — and yes, it is worth it for anything going to production.
How many repair rounds are worth the cost?
Three rounds is the sweet spot in 2026 benchmarks. First round fixes ~45 percent of failures, round 2 adds ~25 percent, round 3 adds ~10 percent, and beyond that the marginal success rate is under 3 percent. After 3 rounds, a fresh sample or human review is more likely to help than more repair.
Do I need a sandbox, or can I run generated code locally?
You need a sandbox. Generated code can have infinite loops, high memory use, or network calls that you did not anticipate. Firecracker-based sandboxes (Vercel Sandbox, E2B) give you isolated VMs with sub-200ms boot times at low cost. Running generated code on your main infrastructure is not worth the risk.
How much codebase context should I include in the prompt?
Enough to answer the specific question, not more. Typical budget is 5-20k tokens: the target file's types and imports, 3-5 most-related functions retrieved by embedding similarity, and any shared utility types. Dumping the whole file or repo dilutes the signal and increases cost without improving quality past a point.
How do I evaluate code generation quality?
Track four metrics: first-try test pass rate, after-repair pass rate, static-analysis pass rate, and held-out benchmark (HumanEval, MBPP, plus your own internal set). First-try rate shows raw model quality; after-repair shows pipeline quality. Track these per model version so you can detect regressions.
Should I fine-tune on my codebase?
Usually no. Modern models with good context retrieval match fine-tuned quality for most codebases. Fine-tuning pays off when your codebase has strong idiosyncratic conventions (custom framework, unusual style) and volume is above ~10k generations/month. Start with prompt engineering plus retrieval; fine-tune only when plateauing.
What prevents the model from cheating by hallucinating imports or APIs?
Deterministic post-generation validation. Parse the AST, extract imports, and verify each against package.json / pyproject.toml / go.mod. Fail closed — reject any function that imports something not in the declared dependencies. The model can pretend it knows a library exists; the package manifest cannot.