Code Review Agent
Last updated: April 16, 2026
Quick answer
Production stacks use Claude Sonnet 4 as the primary reviewer with Claude Opus 4 escalation for complex diffs, paired with a GitHub App that posts line comments and a retrieval layer over repo context. Expect $0.10–$0.40 per PR and 25–90s review latency. Sonnet 4's tool-use reliability lets the agent actually fetch related files instead of hallucinating cross-file behavior.
The problem
Engineering teams ship hundreds of PRs a week and humans cannot review every line carefully. You need an agent that reads a diff plus relevant project context, flags real bugs and security issues, and leaves line-level comments on the PR — without spamming the thread with opinionated nitpicks. The challenge is precision (false positives train devs to ignore the bot) and covering cross-file reasoning that static analyzers miss.
Architecture
GitHub PR Webhook
Receives pull_request opened/synchronize events. Ignores draft PRs and bot commits.
Alternatives: GitLab webhook, Bitbucket webhook, Manual /review command
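The gate above can be sketched as a small predicate over the webhook payload; field names follow GitHub's pull_request event (action, pull_request.draft, sender.type):

```python
def should_review(payload: dict) -> bool:
    """Return True only for webhook events worth a review run."""
    if payload.get("action") not in ("opened", "synchronize"):
        return False                           # ignore labels, closes, etc.
    if payload["pull_request"].get("draft"):
        return False                           # skip draft PRs
    if payload.get("sender", {}).get("type") == "Bot":
        return False                           # skip bot-authored pushes
    return True
```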
Diff Preparer
Pulls the diff, expands hunks with ±20 lines of context, strips lockfiles and generated files, chunks by file.
Alternatives: Custom GitHub App, Danger.js pipeline
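A minimal sketch of the lockfile/generated-file strip; the ignore patterns are illustrative and should be tuned per repo:

```python
import fnmatch

# Illustrative ignore patterns for lockfiles and generated artifacts.
IGNORED = ("*.lock", "package-lock.json", "yarn.lock", "*.min.js",
           "dist/*", "*_pb2.py", "*.generated.*")

def reviewable_files(changed_paths: list[str]) -> list[str]:
    """Drop lockfiles and generated files before chunking the diff by file."""
    return [p for p in changed_paths
            if not any(fnmatch.fnmatch(p, pat) for pat in IGNORED)]
```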
Repo Context Retrieval
Embedding search over the repo so the agent can pull related files (callers, types, tests) before reviewing.
Alternatives: pgvector, Pinecone, Sourcegraph Cody context
Diff Triage Model
Cheap pass that skips trivial diffs (whitespace, config bumps) and classifies risk level to decide whether to invoke Opus.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
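Part of the triage decision never needs a model call at all. A rule-based pre-filter, sketched below with illustrative path patterns (the real deployment layers the cheap model on top of this):

```python
import re

# Illustrative risk signal; tune the pattern per codebase.
HIGH_RISK = re.compile(r"auth|payment|billing|\.sql$|migrations", re.I)

def triage(files: list[str], lines_changed: int) -> str:
    """Pick a review tier: 'skip' trivial diffs entirely, escalate
    risky paths to 'deep' (Opus), default to the 'standard' pass."""
    trivial_exts = (".md", ".txt", ".yml", ".yaml")
    if lines_changed <= 3 and all(f.endswith(trivial_exts) for f in files):
        return "skip"
    if any(HIGH_RISK.search(f) for f in files):
        return "deep"
    return "standard"
```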
Reviewer Model
Reads diff + retrieved context + coding standards, produces structured line comments with severity.
Alternatives: GPT-4o, Gemini 2.5 Pro
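The structured-comment contract can be pinned down as a small schema that every model response is validated against before it reaches the filter. Field names here are assumptions, not a fixed spec:

```python
from dataclasses import dataclass

SEVERITIES = ("security", "bug", "warning", "style")

@dataclass
class Finding:
    path: str          # file the comment attaches to
    line: int          # line in the new version of the file
    severity: str      # one of SEVERITIES
    body: str          # comment text, with a suggested fix if any
    cited_path: str    # file the model actually read (for grounding checks)

def parse_finding(raw: dict) -> Finding:
    """Validate one model-emitted finding before it reaches the filter."""
    if raw["severity"] not in SEVERITIES:
        raise ValueError(f"unknown severity: {raw['severity']!r}")
    return Finding(**raw)
```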
Deep Review Escalation
Invoked only on high-risk files (auth, payments, SQL). Runs Claude Opus 4 with chain-of-thought for subtle bugs.
Alternatives: GPT-4o with reasoning, Gemini 2.5 Pro Thinking
Comment Filter
Dedupes, drops low-confidence nits, caps total comments per PR, checks against a suppressions file.
Alternatives: Rule-based severity threshold, LLM-as-judge filter
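A sketch of that filter pass, keeping the most severe findings when the cap bites; the cap of 10 and the two-level floor are assumed defaults:

```python
def filter_comments(comments: list[dict], floor=("security", "bug"),
                    max_per_pr: int = 10, suppressed=frozenset()) -> list[dict]:
    """Dedupe by (path, line), enforce the severity floor, honour the
    suppressions set, and keep at most max_per_pr comments,
    most severe first."""
    rank = {s: i for i, s in enumerate(floor)}
    seen, kept = set(), []
    for c in sorted(comments, key=lambda x: rank.get(x["severity"], len(floor))):
        key = (c["path"], c["line"])
        if c["severity"] not in rank or key in seen or c.get("rule") in suppressed:
            continue
        seen.add(key)
        kept.append(c)
        if len(kept) == max_per_pr:
            break
    return kept
```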
GitHub PR Comments
Posts line-level review comments and a summary review with request-changes/comment/approve decision.
Alternatives: GitLab MR notes, Slack summary, Linear comment
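The review lands via GitHub's create-review endpoint (POST /repos/{owner}/{repo}/pulls/{number}/reviews). Only the pure payload builder is sketched here, since the actual post needs an authenticated App client:

```python
def build_review_payload(summary: str, comments: list[dict]) -> dict:
    """Shape findings into the body GitHub's create-review endpoint
    expects. event stays COMMENT so merge decisions remain human."""
    return {
        "event": "COMMENT",
        "body": summary,
        "comments": [
            {"path": c["path"], "line": c["line"], "side": "RIGHT",
             "body": c["body"]}
            for c in comments
        ],
    }
```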
The stack
Sonnet 4 has the best tool-use reliability for fetching repo context, and its false-positive rate on code review evals is ~30% lower than GPT-4o in 2026 benchmarks.
Alternatives: GPT-4o, Gemini 2.5 Pro
Reserved for security-sensitive files. Opus catches subtle logic bugs Sonnet misses but costs 5x more — only invoke on risk signals.
Alternatives: GPT-4o reasoning, Gemini 2.5 Pro Thinking
Claude Haiku filters out trivial diffs before spending on Sonnet. It is also used for the comment filter pass.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
Voyage-code-3 is tuned for code and outperforms general-purpose embeddings by ~15% on code retrieval evals.
Alternatives: OpenAI text-embedding-3-large, Jina Code v2
A vector store that is self-hostable next to your code, supports filtering by path/language, and stays cheap at repo scale.
Alternatives: pgvector, Turbopuffer
GitHub Apps get proper review API access (line comments, suggestions, request-changes) without a PAT.
Alternatives: Probot, Danger.js
An eval harness is essential: run every prompt change against a golden set of historical PRs with known bugs before shipping.
Alternatives: Langfuse, Self-hosted + SWE-bench
Cost at each scale
Prototype: 200 PRs/mo (one small team), $45/mo
Startup: 5,000 PRs/mo (50-eng org), $1,180/mo
Scale: 80,000 PRs/mo (enterprise, 800+ eng), $14,200/mo
Latency budget
End-to-end reviews land in 25–90s: webhook receipt, diff prep, triage, retrieval, the Sonnet review pass, and comment posting all fit inside that window, with Opus escalation pushing toward the top of the range.
Tradeoffs
Coverage vs noise
Reviewing every line catches more bugs but drowns the signal. The better pattern is to only comment when confidence is high and severity is ≥ warning — shipping 3 real findings beats 30 nits every time. Tune the severity threshold against your team's tolerance.
Per-file vs whole-PR prompts
Per-file prompts are cheaper and parallelizable but miss cross-file bugs (a caller passing the wrong type). A hybrid works best: per-file Sonnet pass, then a whole-PR synthesis prompt only when files touch the same module.
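The "touch the same module" trigger can be sketched by grouping changed files by parent directory (treating a directory as a module is an assumption; monorepos may want package boundaries instead):

```python
from collections import defaultdict
from pathlib import PurePosixPath

def modules_needing_synthesis(changed_paths: list[str]) -> dict[str, list[str]]:
    """Group changed files by parent directory; a whole-PR synthesis
    pass runs only for directories with two or more touched files."""
    groups: dict[str, list[str]] = defaultdict(list)
    for p in changed_paths:
        groups[str(PurePosixPath(p).parent)].append(p)
    return {d: fs for d, fs in groups.items() if len(fs) >= 2}
```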
Agentic retrieval vs static context
Letting the model call a 'read file' tool mid-review catches cross-file issues but adds 2–5s and sometimes loops. Pre-fetching related files via embedding retrieval is faster and deterministic, at the cost of occasionally missing obscure references.
Failure modes & guardrails
False-positive nitpicks train devs to ignore the bot
Mitigation: Ship with a strict severity floor (only 'bug' and 'security', never 'style'). Track comment acceptance rate per rule in Braintrust — auto-disable any rule under 40% acceptance after 100 impressions.
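The 40%-acceptance kill switch reduces to a pure function over the per-rule counters:

```python
def rule_enabled(accepted: int, impressions: int,
                 min_impressions: int = 100, min_rate: float = 0.40) -> bool:
    """Auto-disable a review rule once it has 100+ impressions and its
    comment acceptance rate sits below 40%."""
    if impressions < min_impressions:
        return True                 # not enough data to judge yet
    return accepted / impressions >= min_rate
```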
Agent hallucinates APIs or imports
Mitigation: Ground every code suggestion by requiring the model to cite the exact file and line it read. Reject any suggestion whose cited path is not in the repo (validate against a file manifest before posting).
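The manifest check is the cheapest grounding gate in the pipeline, one set lookup per finding:

```python
def grounded_findings(findings: list[dict], manifest: frozenset) -> list[dict]:
    """Drop any suggestion whose cited file path is not in the repo's
    file manifest, rejecting hallucinated files before they post."""
    return [f for f in findings if f.get("cited_path") in manifest]
```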
Repo context retrieval returns wrong files
Mitigation: Rerank top-20 results with Cohere Rerank before passing to the reviewer. Filter retrieval by the PR's changed-language and ±2 directory levels from touched files.
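The directory-proximity filter can be sketched as a tree-distance check; measuring distance as total edges between the two directories is one reasonable reading of the ±2-level rule:

```python
from pathlib import PurePosixPath

def within_scope(candidate: str, touched: list[str], max_levels: int = 2) -> bool:
    """Keep a retrieval hit only if its directory is within max_levels
    tree edges of some touched file's directory."""
    cdir = PurePosixPath(candidate).parent.parts
    for t in touched:
        tdir = PurePosixPath(t).parent.parts
        common = 0                          # length of shared path prefix
        for a, b in zip(cdir, tdir):
            if a != b:
                break
            common += 1
        if (len(cdir) - common) + (len(tdir) - common) <= max_levels:
            return True
    return False
```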
Agent loops re-fetching files on large PRs
Mitigation: Hard cap at 6 tool-call rounds per file and 20 rounds per PR. If hit, fall back to static pre-fetched context and emit a metric — usually a sign the diff is too big and should be split.
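The caps can live in a small budget object the tool-call loop consults before every fetch:

```python
class ToolBudget:
    """Hard caps on agentic file-fetch rounds: 6 per file, 20 per PR.
    When allow() returns False the caller falls back to static
    pre-fetched context and emits a metric."""

    def __init__(self, per_file: int = 6, per_pr: int = 20):
        self.per_file, self.per_pr = per_file, per_pr
        self.file_counts: dict[str, int] = {}
        self.total = 0

    def allow(self, path: str) -> bool:
        if self.total >= self.per_pr or self.file_counts.get(path, 0) >= self.per_file:
            return False
        self.file_counts[path] = self.file_counts.get(path, 0) + 1
        self.total += 1
        return True
```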
Secrets or proprietary code leak to the LLM provider
Mitigation: Run a secret scanner (gitleaks) on the diff before sending. Route repos marked 'sensitive' through a self-hosted Llama 3.3 70B via Groq or an enterprise-tier AWS Bedrock endpoint with zero-retention.
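A lightweight stand-in for the scanner gate is shown below; the patterns are illustrative and production should run gitleaks itself, which covers hundreds of secret formats:

```python
import re

# Illustrative patterns only; a real scanner like gitleaks has far more.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),                          # AWS access key id
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),  # PEM private key
    re.compile(r"ghp_[A-Za-z0-9]{36}"),                       # GitHub PAT
]

def diff_has_secrets(diff_text: str) -> bool:
    """Block the LLM call if the raw diff matches any secret pattern."""
    return any(p.search(diff_text) for p in SECRET_PATTERNS)
```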
Frequently asked questions
How much does an LLM code review agent cost per PR?
Expect $0.10–$0.40 per PR at startup scale with Sonnet 4 and smart triage. Enterprise volumes (80k PRs/month) land around $0.18 per PR with aggressive prompt caching and Haiku triage filtering ~30% of diffs out entirely.
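The blended number falls out of simple arithmetic; the per-pass costs below are illustrative assumptions, not measured rates:

```python
def blended_cost_per_pr(skip_rate: float, triage_cost: float,
                        review_cost: float) -> float:
    """Blend the cheap triage pass (paid on every PR) with the full
    review (paid only on PRs that survive triage)."""
    return triage_cost + (1 - skip_rate) * review_cost
```

With a hypothetical ~$0.005 triage pass, a ~$0.25 full review, and 30% of diffs skipped, the blend lands near $0.18 per PR.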
Is Claude Sonnet 4 or GPT-4o better for code review in 2026?
Sonnet 4 leads on tool-use reliability and false-positive rate on code review evals. GPT-4o is competitive on single-file reasoning but produces more stylistic nits. Gemini 2.5 Pro is cheaper but weaker on multi-file refactors.
Should the agent request changes or only comment?
Only comment and leave merge decisions to humans, unless you have a specific high-confidence ruleset (e.g. 'never approve a PR that touches auth without a test'). Auto-blocking creates friction and erodes trust faster than any gain in rigor.
Do I need embeddings or can I just send the diff?
Diff-only works for self-contained PRs (~60% of them). For anything touching shared types, API contracts, or legacy code, you need retrieval — otherwise the agent hallucinates cross-file behavior. Start diff-only, add retrieval when the false-positive rate hits 20%.
How do I prevent the agent from leaking proprietary code?
Use Anthropic's zero-retention enterprise tier or AWS Bedrock with a signed data processing agreement. For highly sensitive repos (crypto, healthcare), run Llama 3.3 70B on Groq or in-VPC via Together AI — expect ~15% quality drop vs Sonnet 4.
Can the agent replace human review?
No, and saying so will get your PR bot banned from the org. The working pattern is: agent catches the obvious stuff in 60s so the human reviewer focuses on design and intent. Merge decisions stay with humans.
What's the most common failure mode in production?
False-positive nits that tank signal-to-noise. Second: agent hallucinating a cross-file change that would break compilation. Both are mitigated by grounding (require file/line citations) and a strict severity floor — do not skip this.
Related Architectures
QA Testing Agent
Reference architecture for an agent that generates test cases from code and requirements, runs them, and diagn...
RAG for Codebase Search
Reference architecture for natural-language Q&A over a 1M+ line codebase. Code-aware embeddings, tree-sitter A...
Enterprise Document Search
Reference architecture for semantic search across 1M+ enterprise documents (PDFs, Confluence, Notion, Google D...