Reference Architecture · agent
Research Agent
Last updated: April 16, 2026
Quick answer
The production stack uses Claude Opus 4 or Sonnet 4 as the planner/synthesizer, Exa + Firecrawl for search + content extraction, and a strict citation-grounding pass that rejects any claim not traceable to a fetched source. Expect $0.30–$2.50 per research task and 30s–3min latency depending on depth. Opus 4 with tool use and extended thinking produces research quality that matches a junior analyst on most topics.
The problem
Users want deep answers that a chatbot can't give — 'compare the top 5 open-source vector databases on real 2026 benchmarks' or 'what's the actual regulatory status of AI in the EU as of this week'. You need an agent that plans a research path, searches the web, reads primary sources, synthesizes, and cites. The hard parts are staying grounded (no fabricated stats), handling contradictions between sources, and knowing when to stop (agents happily research forever).
Architecture
User Query
Natural language research question, optionally with depth preference (quick / standard / deep).
Alternatives: Structured form, API endpoint
Research Planner
Decomposes query into sub-questions, decides search strategy (broad → narrow), sets budget (max searches, max time).
Alternatives: Claude Sonnet 4, GPT-4o reasoning, Gemini 2.5 Pro Thinking
Web Search
Neural search via Exa for semantic queries, supplemented by Tavily or Serper for recency-critical queries.
Alternatives: Tavily, Serper, Brave Search API, You.com
Content Fetcher
Fetches and extracts clean markdown from URLs. Handles JS rendering, PDFs, paywalls (partial).
Alternatives: Jina Reader, ScrapingBee, Playwright + Readability
Source Reranker
Reranks fetched content by relevance to the sub-question before passing to synthesizer. Prevents context dilution.
Alternatives: Voyage Rerank-2, LLM-as-reranker
Research Worker (parallel)
For each sub-question, reads top sources, extracts facts with citations, flags contradictions.
Alternatives: GPT-4o, Gemini 2.5 Pro
Synthesizer
Merges sub-question findings into a coherent answer, resolves contradictions, adds inline citations.
Alternatives: Claude Sonnet 4, GPT-4o, Gemini 2.5 Pro
Citation Verifier
Deterministically checks that every factual claim in the output maps to a cited source URL and that the quote substring exists in the fetched content.
Alternatives: LLM-as-judge, Human review queue
The stack
Research quality scales with planner intelligence — Opus 4 decomposes ambiguous queries and resolves contradictions noticeably better than Sonnet on complex topics. The extra cost is justified only for the planning and synthesis steps.
Alternatives: Claude Sonnet 4, GPT-4o reasoning, Gemini 2.5 Pro Thinking
Workers run in parallel across sub-questions. Sonnet 4's cost profile fits the parallelism, and tool-use reliability on reading long PDFs is strong.
Alternatives: GPT-4o, Gemini 2.5 Pro
Exa's semantic search surfaces the right long-tail sources ('papers about X that cite Y'). Tavily is strong for recency. Use both — they miss different things.
Alternatives: Tavily, You.com, Brave Search API
Firecrawl's clean markdown extraction and JS rendering handle the majority of the web. For paywalled sources, combine with a user-supplied subscription cookie or skip.
Alternatives: Jina Reader, ScrapingBee, Crawl4AI
Without a rerank step, the synthesizer reads irrelevant content and hallucinates more. Rerank is the single biggest lever on research quality for <$0.001/query.
Alternatives: Voyage Rerank-2, Jina Reranker
LangGraph helps with parallel sub-question fanout and is useful for prototyping the DAG, but it adds latency; in production, a 200-line custom orchestration beats it for reliability.
Alternatives: Mastra, Plain async Python/TS
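The plain-async orchestration favored above can be sketched in a few lines. `run_worker` is a hypothetical stand-in for the real search → fetch → rerank → extract pipeline for one sub-question:

```python
import asyncio

async def run_worker(sub_question: str) -> dict:
    # Placeholder: a real worker would search, fetch, rerank, and
    # extract cited facts for this one sub-question.
    await asyncio.sleep(0)  # stands in for the actual I/O
    return {"sub_question": sub_question, "findings": []}

async def research(sub_questions: list[str]) -> list[dict]:
    # Fan out one worker per sub-question; gather preserves input order,
    # so findings line up with the plan's sub-questions.
    tasks = [run_worker(q) for q in sub_questions]
    return await asyncio.gather(*tasks)

results = asyncio.run(research(["q1", "q2", "q3"]))
```

The synthesizer then consumes `results` in plan order; adding per-worker timeouts and retries keeps this well under the 200-line mark.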
Store every fetched source + rerank score + cited passages. Enables 'why did you cite this' debugging and re-runs without re-fetching.
Alternatives: SQLite, MongoDB
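A minimal sketch of that research store in SQLite (the schema and column names are illustrative, not prescribed by the architecture):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sources (
    task_id      TEXT,
    url          TEXT,
    fetched_at   TEXT,
    content_md   TEXT,   -- clean markdown from the fetcher
    rerank_score REAL    -- score assigned by the reranker
);
CREATE TABLE citations (
    task_id  TEXT,
    claim    TEXT,       -- the factual claim in the final answer
    url      TEXT,       -- which fetched source backs it
    passage  TEXT        -- cited passage, kept for verification
);
""")
conn.execute("INSERT INTO sources VALUES (?,?,?,?,?)",
             ("t1", "https://example.com", "2026-04-16", "# Doc", 0.91))
row = conn.execute("SELECT rerank_score FROM sources").fetchone()
```

Joining `citations` to `sources` on `(task_id, url)` answers the "why did you cite this" question directly, and `content_md` makes re-runs fetch-free.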
Cost at each scale
Prototype: 200 research tasks/mo, $80/mo
Startup: 15,000 tasks/mo, $6,200/mo
Scale: 400,000 tasks/mo, $142,000/mo
Tradeoffs
Depth vs latency
Going from 3 to 10 sources per sub-question marginally improves answer quality but triples latency and cost. For user-facing UX, default to 'quick' mode (3 sources, ~30s) and expose 'deep' mode (10+ sources, 2–3min) as an explicit choice. Don't default to deep — the latency kills engagement.
Opus 4 vs Sonnet 4 for synthesis
Opus 4 produces measurably better synthesis on contested topics or where sources contradict each other. On clear-cut factual research, Sonnet 4 delivers 80% of the quality at 20% of the cost. Route by query complexity: use a classifier to pick the synthesis model.
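A crude heuristic stand-in for that complexity router — in production the classifier would be a small-model call, and the model-name strings here are assumptions:

```python
def pick_synth_model(query: str) -> str:
    # Treat compound, comparative, or contested queries as complex;
    # everything else gets the cheaper synthesis model.
    complex_markers = ("compare", "versus", " vs ", "tradeoff",
                       "controvers", "conflicting", "best")
    q = query.lower()
    if any(m in q for m in complex_markers) or len(q.split()) > 25:
        return "claude-opus-4"    # contested / multi-source synthesis
    return "claude-sonnet-4"      # clear-cut factual research
```

Swapping the keyword check for a fast classifier call keeps the routing logic identical while improving precision on ambiguous queries.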
Real-time vs cached research
For 'what is X' style queries, cached research from the last week is often fresh enough and effectively free. For 'what happened this week' queries, always re-fetch. Maintain a staleness-by-topic table — news is stale in hours, academic topics are fresh for months.
Failure modes & guardrails
Hallucinated statistics and made-up sources
Mitigation: Hard-require: every numeric claim, quote, or specific fact must include a citation URL. Deterministically verify that the cited quote appears as a substring (fuzzy match >0.85) in the fetched source content. Reject and retry if not.
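The substring check with a fuzzy fallback can be sketched with the standard library's `difflib`; the 0.85 threshold matches the mitigation above, and the windowed scan is one possible implementation, not the only one:

```python
from difflib import SequenceMatcher

def quote_supported(quote: str, source_text: str,
                    threshold: float = 0.85) -> bool:
    # Exact substring first -- the common case after clean extraction.
    if quote in source_text:
        return True
    # Fuzzy fallback: slide a quote-sized window over the source and
    # keep the best similarity ratio, tolerating minor extraction noise.
    n = len(quote)
    step = max(1, n // 4)
    for i in range(0, max(1, len(source_text) - n + 1), step):
        window = source_text[i:i + n]
        if SequenceMatcher(None, quote, window).ratio() >= threshold:
            return True
    return False
```

Claims whose quote fails this check are rejected and the worker is retried, which is what drives fabrication below 1%.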
Agent searches forever / blows the budget
Mitigation: Enforce hard caps: max 6 search rounds, max 25 URLs fetched, max $0.50 per task (configurable). Emit a metric when budget is hit — usually indicates a badly decomposed plan. Return partial answer with explicit 'insufficient data' flag.
Contradictory sources, agent picks one silently
Mitigation: Worker prompts must tag contradictions explicitly ('Source A claims X; Source B claims Y'). Synthesizer either surfaces the contradiction to the user or picks by source authority (peer-reviewed > news > blog) and labels the confidence.
Paywalled or low-quality sources dominate results
Mitigation: Maintain a domain quality list (allow-list academic/gov, deprioritize content farms). Filter at search time — Exa supports domain filters. Skip paywalls unless the user has provided credentials; do not guess at the content behind them.
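One way to encode that domain quality list as a rerank boost — the domains listed are placeholders, and real lists would be curated per deployment:

```python
from urllib.parse import urlparse

# Hypothetical allow/deprioritize lists; curate these per domain.
ALLOW = {"arxiv.org", "nature.com", "europa.eu"}
DEPRIORITIZE = {"examplecontentfarm.com"}

def source_rank_boost(url: str) -> float:
    # Boost academic/gov sources, penalize known content farms,
    # leave everything else neutral for the reranker to score.
    host = urlparse(url).netloc.removeprefix("www.")
    if host in ALLOW or host.endswith(".gov") or host.endswith(".edu"):
        return 1.0
    if host in DEPRIORITIZE:
        return -1.0
    return 0.0
```

Applying the same lists at search time (via Exa's domain filters) is cheaper than filtering after the fetch.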
Outdated sources fed into a 'current state' question
Mitigation: Detect temporal language in the query ('current', 'latest', '2026') and filter search results to last 90 days. Rerank with date boost. Surface the freshest source's date explicitly in the answer so the user knows the recency bound.
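The temporal-language detector can start as a regex before graduating to a classifier; the pattern below is a sketch covering the markers named above:

```python
import re

# Matches explicit temporal markers, including recent year mentions.
TEMPORAL = re.compile(
    r"\b(current|latest|today|this (week|month|year)|recent"
    r"|20(2[4-9]|3\d))\b",
    re.IGNORECASE,
)

def wants_fresh_sources(query: str) -> bool:
    # True when the query uses temporal language, so search results
    # should be filtered to the last 90 days and date-boosted.
    return bool(TEMPORAL.search(query))
```

Queries that trip this check also get the freshest source's date surfaced in the answer, so the user knows the recency bound.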
Frequently asked questions
Is this the same as Perplexity?
Architecturally yes — Perplexity is essentially a hosted research agent with this shape. The difference is control: building your own lets you tune for your domain, use your own allow-listed sources, integrate with internal data, and avoid the $20/mo/user pricing at scale.
Opus 4 or Sonnet 4 for the planner?
Opus 4 decomposes harder queries meaningfully better — particularly compound or contested ones. For well-scoped factual research, Sonnet 4 is fine. Route by query complexity: use Haiku 4 as a quick 'is this simple or complex' classifier and pick accordingly.
Do I need both Exa and Tavily?
Exa is best for semantic / research queries ('papers about X'). Tavily is best for recency-sensitive queries ('news today'). They miss different things. You can start with just Exa and add Tavily once you see recency gaps.
How do I keep research fresh without re-fetching every time?
Cache the final research answer keyed on normalized query + date bucket. Expire news-ish queries in hours, evergreen queries in weeks. On cache miss, re-fetch only the URLs whose Last-Modified > cached timestamp.
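A sketch of that cache key, bucketed by TTL so the key rolls over when the entry should expire (function and field names are illustrative):

```python
import hashlib
from datetime import date, timedelta

def cache_key(query: str, ttl: timedelta, today: date) -> str:
    # Normalize whitespace and case so trivially different queries
    # share a key, then bucket by TTL: news-ish queries pass a short
    # timedelta, evergreen queries a long one.
    normalized = " ".join(query.lower().split())
    bucket = today.toordinal() // max(1, ttl.days)
    raw = f"{normalized}|{bucket}"
    return hashlib.sha256(raw.encode()).hexdigest()

k1 = cache_key("What is RAG?", timedelta(days=14), date(2026, 4, 16))
k2 = cache_key("what  is rag?", timedelta(days=14), date(2026, 4, 16))
```

On a cache miss, only URLs whose `Last-Modified` postdates the cached timestamp need re-fetching, as described above.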
What's the typical latency for a research task?
Quick mode (3 sources, simple query): 15–30s. Standard (5–7 sources): 30–60s. Deep (10+ sources, multi-step decomposition with Opus 4): 90s–3min. Stream partial results to keep users engaged during deep research.
Can I trust the citations?
Only with the deterministic citation verifier in place. Without it, even grounded LLMs hallucinate ~5–15% of citations (wrong URL, fabricated quote, misattributed). With substring verification, fabrication drops to <1%, and the remaining errors are attribution mistakes that are easier to catch in eval.
How does it handle contradictory sources?
Two patterns: (1) surface the contradiction to the user ('Source A says X; Source B says Y') — often the honest answer, and (2) weight by source authority (peer-reviewed papers > reputable news > blogs). Never silently pick one side without disclosure.