
Code Review Agent

Last updated: April 16, 2026

Quick answer

Production stacks use Claude Sonnet 4 as the primary reviewer with Claude Opus 4 escalation for complex diffs, paired with a GitHub App that posts line comments and a retrieval layer over repo context. Expect $0.10–$0.40 per PR and 25–90s review latency. Sonnet 4's tool-use reliability lets the agent actually fetch related files instead of hallucinating cross-file behavior.

The problem

Engineering teams ship hundreds of PRs a week and humans cannot review every line carefully. You need an agent that reads a diff plus relevant project context, flags real bugs and security issues, and leaves line-level comments on the PR — without spamming the thread with opinionated nitpicks. The challenge is precision (false positives train devs to ignore the bot) and covering cross-file reasoning that static analyzers miss.

Architecture

The pipeline: GitHub PR Webhook (input) → Diff Preparer (infra) → Repo Context Retrieval (data) → Diff Triage Model (LLM) → Reviewer Model (LLM) → Deep Review Escalation (LLM, invoked only if triage flags the diff as high-risk) → Comment Filter (infra) → GitHub PR Comments (output).

GitHub PR Webhook

Receives pull_request opened/synchronize events. Ignores draft PRs and bot commits.

Alternatives: GitLab webhook, Bitbucket webhook, Manual /review command
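A minimal sketch of the webhook gate, assuming GitHub's standard pull_request event payload shape (`action`, `pull_request.draft`, `sender.type`):

```python
def should_review(event: dict) -> bool:
    """Return True only for opened/synchronize events on real, non-draft PRs."""
    if event.get("action") not in ("opened", "synchronize"):
        return False
    pr = event.get("pull_request", {})
    if pr.get("draft"):
        return False  # skip draft PRs
    if event.get("sender", {}).get("type") == "Bot":
        return False  # skip bot commits (e.g. dependabot pushes)
    return True
```

Everything downstream hangs off this gate, so keep it boring and deterministic.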

Diff Preparer

Pulls the diff, expands hunks with ±20 lines of context, strips lockfiles and generated files, chunks by file.

Alternatives: Custom GitHub App, Danger.js pipeline
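The stripping-and-chunking step can be sketched as below. The input shape (`path`, `patch` dicts) is a simplification, not the exact GitHub API response, and the ±20-line context expansion is omitted since it needs the full file contents:

```python
# Hypothetical deny-list; tune to your repo's lockfiles and codegen output.
GENERATED = ("package-lock.json", "yarn.lock", "Cargo.lock", ".min.js")

def prepare_diff(files: list[dict]) -> list[dict]:
    """Drop lockfiles/generated files, then chunk the diff one file per chunk."""
    chunks = []
    for f in files:
        p = f["path"]
        if p.endswith(GENERATED) or "/generated/" in p:
            continue  # lockfiles and codegen add tokens, not signal
        chunks.append({"path": p, "patch": f["patch"]})
    return chunks
```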

Repo Context Retrieval

Embedding search over the repo so the agent can pull related files (callers, types, tests) before reviewing.

Alternatives: pgvector, Pinecone, Sourcegraph Cody context
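The retrieval step is just nearest-neighbor search over embedded files. A toy in-memory version (in production this is a Qdrant query with path/language filters, and the vectors come from Voyage-code-3):

```python
import math

def top_k(query_vec: list[float], index: list[tuple], k: int = 5) -> list[tuple]:
    """index: list of (path, vector). Return the k most cosine-similar files."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    return sorted(index, key=lambda pv: -cos(query_vec, pv[1]))[:k]
```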

Diff Triage Model

Cheap pass that skips trivial diffs (whitespace, config bumps) and classifies risk level to decide whether to invoke Opus.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash
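A heuristic pre-filter worth running before even the cheap model, sketched under the assumption that risk lives in path names (the Haiku pass still makes the final call on anything not skipped):

```python
# Hypothetical risk markers; adapt to your codebase's layout.
HIGH_RISK = ("auth", "payments", "sql", "migrations")

def triage(changed_paths: list[str], additions: int, deletions: int) -> str:
    """Return 'skip', 'standard', or 'deep' (deep = escalate to Opus)."""
    if additions + deletions == 0:
        return "skip"  # whitespace-only / no-op diff
    if all(p.endswith((".md", ".txt", ".lock")) for p in changed_paths):
        return "skip"  # docs and lockfile bumps
    if any(seg in p.lower() for p in changed_paths for seg in HIGH_RISK):
        return "deep"
    return "standard"
```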

Reviewer Model

Reads diff + retrieved context + coding standards, produces structured line comments with severity.

Alternatives: GPT-4o, Gemini 2.5 Pro
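"Structured line comments with severity" implies a schema you validate before anything downstream touches it. A sketch, with an assumed comment shape (`path`, `line`, `severity`, `message`):

```python
import json

REQUIRED = {"path", "line", "severity", "message"}
SEVERITIES = {"bug", "security", "warning", "style"}

def parse_review(raw: str) -> list[dict]:
    """Parse the reviewer model's JSON output; silently drop malformed
    comments rather than posting them."""
    try:
        comments = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return [c for c in comments
            if isinstance(c, dict)
            and REQUIRED <= c.keys()
            and c["severity"] in SEVERITIES]
```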

Deep Review Escalation

Invoked only on high-risk files (auth, payments, SQL). Runs Claude Opus 4 with chain-of-thought for subtle bugs.

Alternatives: GPT-4o with reasoning, Gemini 2.5 Pro Thinking

Comment Filter

Dedupes, drops low-confidence nits, caps total comments per PR, checks against a suppressions file.

Alternatives: Rule-based severity threshold, LLM-as-judge filter
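The rule-based variant of this filter can be sketched as follows, assuming the comment shape from the reviewer step and an optional `rule` field for suppressions:

```python
def filter_comments(comments: list[dict], suppressions=frozenset(),
                    max_comments: int = 10) -> list[dict]:
    """Severity floor, suppressions, dedupe by (path, line, message), hard cap.
    Highest-severity findings survive the cap first."""
    order = {"security": 0, "bug": 1, "warning": 2}
    seen, kept = set(), []
    for c in sorted(comments, key=lambda c: order.get(c["severity"], 99)):
        if c["severity"] not in order:
            continue  # severity floor: never post 'style' nits
        if c.get("rule") in suppressions:
            continue
        key = (c["path"], c["line"], c["message"])
        if key in seen:
            continue
        seen.add(key)
        kept.append(c)
    return kept[:max_comments]
```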

GitHub PR Comments

Posts line-level review comments and a summary review with request-changes/comment/approve decision.

Alternatives: GitLab MR notes, Slack summary, Linear comment
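The final step targets GitHub's create-a-review endpoint (POST /repos/{owner}/{repo}/pulls/{number}/reviews). A sketch of the payload builder, comment-only per the tradeoff discussed later:

```python
def build_review_payload(comments: list[dict], summary: str) -> dict:
    """Build the request body for GitHub's pull-request review API.
    event=COMMENT: merge decisions stay with humans."""
    return {
        "event": "COMMENT",
        "body": summary,
        "comments": [
            {"path": c["path"], "line": c["line"], "side": "RIGHT",
             "body": f"**{c['severity']}**: {c['message']}"}
            for c in comments
        ],
    }
```

Posting is then one authenticated Octokit (or REST) call with this body.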

The stack

Primary reviewer LLM: Claude Sonnet 4

Sonnet 4 has the best tool-use reliability for fetching repo context, and its false-positive rate on code review evals is ~30% lower than GPT-4o in 2026 benchmarks.

Alternatives: GPT-4o, Gemini 2.5 Pro

Deep review escalation: Claude Opus 4

Reserved for security-sensitive files. Opus catches subtle logic bugs Sonnet misses but costs 5x more — only invoke on risk signals.

Alternatives: GPT-4o reasoning, Gemini 2.5 Pro Thinking

Triage model: Claude Haiku 4

Filters out trivial diffs before spending on Sonnet. Also used for the comment filter pass.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash

Code embeddings: Voyage-code-3

Voyage-code-3 is tuned for code and outperforms general-purpose embeddings by ~15% on code retrieval evals.

Alternatives: OpenAI text-embedding-3-large, Jina Code v2

Vector DB: Qdrant

Self-hostable next to your code, supports filtering by path/language, and cheap at repo scale.

Alternatives: pgvector, Turbopuffer

GitHub integration: Octokit + GitHub App

GitHub Apps get proper review API access (line comments, suggestions, request-changes) without a PAT.

Alternatives: Probot, Danger.js

Evals: Braintrust

Essential — run every prompt change against a golden set of historical PRs with known bugs before shipping.

Alternatives: Langfuse, Self-hosted + SWE-bench

Cost at each scale

Prototype

200 PRs/mo (one small team)

$45/mo

Sonnet 4 reviews: $28
Haiku 4 triage + filter: $3
Embeddings (one-time reindex): $4
Qdrant (self-host on Fly.io): $10
GitHub App hosting (Vercel Hobby): $0

Startup

5,000 PRs/mo (50-eng org)

$1,180/mo

Sonnet 4 reviews (with caching): $680
Opus 4 deep checks (~8% of PRs): $260
Haiku 4 triage + filter: $55
Voyage embeddings + re-index: $60
Qdrant Cloud: $75
Braintrust evals: $50

Scale

80,000 PRs/mo (enterprise, 800+ eng)

$14,200/mo

Sonnet 4 reviews (heavy prompt caching): $8,400
Opus 4 deep checks: $2,800
Haiku 4 triage + filter: $400
Voyage embeddings: $600
Qdrant Enterprise: $1,200
Infra + observability: $800

Latency budget

Total P50: 11,680ms
Total P95: 23,320ms

Diff fetch + chunking: 900ms median · 2,500ms p95
Haiku triage pass: 600ms median · 1,200ms p95
Repo context retrieval: 180ms median · 420ms p95
Sonnet 4 review (per file): 2,400ms median · 4,200ms p95
Opus 4 deep check (when invoked): 6,500ms median · 12,000ms p95
Post comments to GitHub: 1,100ms median · 3,000ms p95

Tradeoffs

Coverage vs noise

Reviewing every line catches more bugs but drowns the signal. The better pattern is to only comment when confidence is high and severity is ≥ warning — shipping 3 real findings beats 30 nits every time. Tune the severity threshold against your team's tolerance.

Per-file vs whole-PR prompts

Per-file prompts are cheaper and parallelizable but miss cross-file bugs (a caller passing the wrong type). A hybrid works best: per-file Sonnet pass, then a whole-PR synthesis prompt only when files touch the same module.
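The "same module" trigger for the synthesis pass can be as simple as grouping changed paths by top-level directory (a proxy for module; adjust the grouping depth for your layout):

```python
from collections import defaultdict

def needs_synthesis(changed_paths: list[str]) -> bool:
    """Run the whole-PR synthesis prompt only when multiple changed files
    share a top-level module directory."""
    by_module = defaultdict(int)
    for p in changed_paths:
        by_module[p.split("/")[0]] += 1
    return any(n > 1 for n in by_module.values())
```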

Agentic retrieval vs static context

Letting the model call a 'read file' tool mid-review catches cross-file issues but adds 2–5s and sometimes loops. Pre-fetching related files via embedding retrieval is faster and deterministic, at the cost of occasionally missing obscure references.

Failure modes & guardrails

False-positive nitpicks train devs to ignore the bot

Mitigation: Ship with a strict severity floor (only 'bug' and 'security', never 'style'). Track comment acceptance rate per rule in Braintrust — auto-disable any rule under 40% acceptance after 100 impressions.
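The auto-disable check is a one-liner worth getting exactly right, since disabling too early starves rules of data. A sketch using the thresholds above:

```python
def rule_enabled(accepted: int, impressions: int,
                 floor: float = 0.40, min_n: int = 100) -> bool:
    """Auto-disable any rule whose acceptance rate is under 40%
    after at least 100 impressions."""
    if impressions < min_n:
        return True  # not enough data to judge yet
    return accepted / impressions >= floor
```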

Agent hallucinates APIs or imports

Mitigation: Ground every code suggestion by requiring the model to cite the exact file and line it read. Reject any suggestion whose cited path is not in the repo (validate against a file manifest before posting).
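The manifest check before posting, sketched with an assumed `cited_path` field on each suggestion:

```python
def grounded(suggestion: dict, manifest: set[str]) -> bool:
    """Reject any suggestion whose cited path is not in the repo's file manifest.
    No citation at all also fails the check."""
    cite = suggestion.get("cited_path")
    return cite is not None and cite in manifest
```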

Repo context retrieval returns wrong files

Mitigation: Rerank top-20 results with Cohere Rerank before passing to the reviewer. Filter retrieval by the PR's changed-language and ±2 directory levels from touched files.

Agent loops re-fetching files on large PRs

Mitigation: Hard cap at 6 tool-call rounds per file and 20 rounds per PR. If hit, fall back to static pre-fetched context and emit a metric — usually a sign the diff is too big and should be split.
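A minimal budget tracker for those caps (the emit-metric and fallback-to-static-context parts are left to the caller):

```python
class ToolBudget:
    """Hard cap on tool-call rounds: per-file and per-PR."""
    def __init__(self, per_file: int = 6, per_pr: int = 20):
        self.per_file, self.per_pr = per_file, per_pr
        self.pr_used = 0
        self.file_used: dict[str, int] = {}

    def allow(self, path: str) -> bool:
        """Charge one round against both budgets; False means fall back
        to static pre-fetched context."""
        if (self.pr_used >= self.per_pr
                or self.file_used.get(path, 0) >= self.per_file):
            return False
        self.pr_used += 1
        self.file_used[path] = self.file_used.get(path, 0) + 1
        return True
```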

Secrets or proprietary code leak to the LLM provider

Mitigation: Run a secret scanner (gitleaks) on the diff before sending. Route repos marked 'sensitive' through a self-hosted Llama 3.3 70B via Groq or an enterprise-tier AWS Bedrock endpoint with zero-retention.
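Gitleaks should remain the real gate; a regex pre-scan like the sketch below only catches the obvious patterns (the three shown follow well-known credential formats) and is a belt-and-suspenders complement, not a replacement:

```python
import re

PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                         # AWS access key ID
    re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),   # PEM private keys
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                      # GitHub PAT
]

def has_secret(diff_text: str) -> bool:
    """True if any obvious credential pattern appears in the diff."""
    return any(p.search(diff_text) for p in PATTERNS)
```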

Frequently asked questions

How much does an LLM code review agent cost per PR?

Expect $0.10–$0.40 per PR at startup scale with Sonnet 4 and smart triage. Enterprise volumes (80k PRs/month) land around $0.18 per PR with aggressive prompt caching and Haiku triage filtering ~30% of diffs out entirely.
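Those per-PR figures follow directly from the cost tables above; a back-of-envelope check:

```python
# Monthly totals and PR volumes from the "Cost at each scale" section.
per_pr_startup = 1180 / 5000     # startup tier
per_pr_scale = 14200 / 80000     # enterprise tier
# startup lands near $0.24/PR, enterprise near $0.18/PR
```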

Is Claude Sonnet 4 or GPT-4o better for code review in 2026?

Sonnet 4 leads on tool-use reliability and false-positive rate on code review evals. GPT-4o is competitive on single-file reasoning but produces more stylistic nits. Gemini 2.5 Pro is cheaper but weaker on multi-file refactors.

Should the agent request changes or only comment?

Only comment and leave merge decisions to humans, unless you have a specific high-confidence ruleset (e.g. 'never approve a PR that touches auth without a test'). Auto-blocking creates friction and erodes trust faster than any gain in rigor.

Do I need embeddings or can I just send the diff?

Diff-only works for self-contained PRs (~60% of them). For anything touching shared types, API contracts, or legacy code, you need retrieval — otherwise the agent hallucinates cross-file behavior. Start diff-only, and add retrieval when your false-positive rate hits 20%.

How do I prevent the agent from leaking proprietary code?

Use Anthropic's zero-retention enterprise tier or AWS Bedrock with a signed data processing agreement. For highly sensitive repos (crypto, healthcare), run Llama 3.3 70B on Groq or in-VPC via Together AI — expect ~15% quality drop vs Sonnet 4.

Can the agent replace human review?

No, and saying so will get your PR bot banned from the org. The working pattern is: agent catches the obvious stuff in 60s so the human reviewer focuses on design and intent. Merge decisions stay with humans.

What's the most common failure mode in production?

False-positive nits that tank signal-to-noise. Second: agent hallucinating a cross-file change that would break compilation. Both are mitigated by grounding (require file/line citations) and a strict severity floor — do not skip this.
