Reference Architecture · agent

Research Agent

Last updated: April 16, 2026

Quick answer

The production stack uses Claude Opus 4 or Sonnet 4 as the planner/synthesizer, Exa + Firecrawl for search + content extraction, and a strict citation-grounding pass that rejects any claim not traceable to a fetched source. Expect $0.30–$2.50 per research task and 30s–3min latency depending on depth. Opus 4 with tool use and extended thinking produces research quality that matches a junior analyst on most topics.

The problem

Users want deep answers that a chatbot can't give — 'compare the top 5 open-source vector databases on real 2026 benchmarks' or 'what's the actual regulatory status of AI in the EU as of this week'. You need an agent that plans a research path, searches the web, reads primary sources, synthesizes, and cites. The hard parts are staying grounded (no fabricated stats), handling contradictions between sources, and knowing when to stop (agents happily research forever).

Architecture

User Query (input) → Research Planner (LLM) → Web Search (infra) → Content Fetcher (infra) → Source Reranker (infra) → Research Worker, parallel (LLM) → Synthesizer (LLM) → Citation Verifier (infra)

User Query

Natural language research question, optionally with depth preference (quick / standard / deep).

Alternatives: Structured form, API endpoint

Research Planner

Decomposes query into sub-questions, decides search strategy (broad → narrow), sets budget (max searches, max time).

Alternatives: Claude Sonnet 4, GPT-4o reasoning, Gemini 2.5 Pro Thinking
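The planner's output is easiest to enforce if it is parsed into a small structured plan before any budget is spent. A minimal sketch, assuming hypothetical field names for the planner's JSON:

```python
from dataclasses import dataclass

@dataclass
class ResearchPlan:
    """Structured planner output (field names are illustrative)."""
    sub_questions: list[str]
    strategy: str              # e.g. "broad_then_narrow"
    max_searches: int = 6      # hard budget caps, enforced downstream
    max_seconds: int = 120

def parse_plan(raw: dict) -> ResearchPlan:
    # Validate the planner's JSON before spending any search budget.
    subs = [q.strip() for q in raw.get("sub_questions", []) if q.strip()]
    if not subs:
        raise ValueError("planner returned no sub-questions")
    return ResearchPlan(
        sub_questions=subs,
        strategy=raw.get("strategy", "broad_then_narrow"),
        # Clamp the planner's self-declared budget to system-wide ceilings.
        max_searches=min(int(raw.get("max_searches", 6)), 6),
        max_seconds=min(int(raw.get("max_seconds", 120)), 180),
    )
```

Clamping here means a runaway plan can never override the system's hard caps, no matter what the model emits.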

Web Search

Neural search via Exa for semantic queries, supplemented by Tavily or Serper for recency-critical queries.

Alternatives: Tavily, Serper, Brave Search API, You.com

Content Fetcher

Fetches and extracts clean markdown from URLs. Handles JS rendering, PDFs, paywalls (partial).

Alternatives: Jina Reader, ScrapingBee, Playwright + Readability

Source Reranker

Reranks fetched content by relevance to the sub-question before passing to synthesizer. Prevents context dilution.

Alternatives: Voyage Rerank-2, LLM-as-reranker

Research Worker (parallel)

For each sub-question, reads top sources, extracts facts with citations, flags contradictions.

Alternatives: GPT-4o, Gemini 2.5 Pro

Synthesizer

Merges sub-question findings into a coherent answer, resolves contradictions, adds inline citations.

Alternatives: Claude Sonnet 4, GPT-4o, Gemini 2.5 Pro

Citation Verifier

Deterministically checks that every factual claim in the output maps to a cited source URL and that the quote substring exists in the fetched content.

Alternatives: LLM-as-judge, Human review queue

The stack

Planner / synthesizer · Claude Opus 4

Research quality scales with planner intelligence — Opus 4 decomposes ambiguous queries and resolves contradictions noticeably better than Sonnet on complex topics. Cost is justified for the planning + synthesis endpoints only.

Alternatives: Claude Sonnet 4, GPT-4o reasoning, Gemini 2.5 Pro Thinking

Research worker · Claude Sonnet 4

Workers run in parallel across sub-questions. Sonnet 4's cost profile fits the parallelism, and tool-use reliability on reading long PDFs is strong.

Alternatives: GPT-4o, Gemini 2.5 Pro

Neural search · Exa

Exa's semantic search surfaces the right long-tail sources ('papers about X that cite Y'). Tavily is strong for recency. Use both — they miss different things.

Alternatives: Tavily, You.com, Brave Search API

Content fetcher · Firecrawl

Firecrawl's clean markdown extraction and JS rendering handle the majority of the web. For paywalled sources, combine with a user-supplied subscription cookie or skip.

Alternatives: Jina Reader, ScrapingBee, Crawl4AI

Reranker · Cohere Rerank 3.5

Without a rerank step, the synthesizer reads irrelevant content and hallucinates more. Rerank is the single biggest lever on research quality for <$0.001/query.

Alternatives: Voyage Rerank-2, Jina Reranker

Orchestration · LangGraph (with critique) or custom

LangGraph helps with parallel sub-question fanout but adds latency. A 200-line custom orchestration beats it for production reliability. LangGraph is useful for prototyping the DAG.

Alternatives: Mastra, Plain async Python/TS
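The core of a custom orchestrator is a bounded parallel fanout over sub-questions. A minimal sketch, where `research_worker` is a hypothetical stand-in for the real Sonnet 4 worker call:

```python
import asyncio

async def research_worker(sub_q: str) -> dict:
    # Placeholder: the real worker calls Sonnet 4 with reranked sources.
    await asyncio.sleep(0)  # stands in for LLM + tool-use latency
    return {"sub_question": sub_q, "findings": [], "contradictions": []}

async def run_workers(sub_questions: list[str], max_parallel: int = 5) -> list[dict]:
    # Bound parallelism so a wide plan doesn't hit provider rate limits.
    sem = asyncio.Semaphore(max_parallel)

    async def bounded(q: str) -> dict:
        async with sem:
            return await research_worker(q)

    # gather preserves input order, so findings line up with the plan.
    return await asyncio.gather(*(bounded(q) for q in sub_questions))
```

This is roughly the shape of the "200-line custom orchestration": the remaining lines are retries, timeouts, and budget accounting around this loop.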

Citation store · Postgres JSONB

Store every fetched source + rerank score + cited passages. Enables 'why did you cite this' debugging and re-runs without re-fetching.

Alternatives: SQLite, MongoDB

Cost at each scale

Prototype

200 research tasks/mo

$80/mo

Opus 4 planner + synth · $30
Sonnet 4 workers · $20
Exa + Tavily · $12
Firecrawl · $8
Cohere Rerank · $4
Infra + observability · $6

Startup

15,000 tasks/mo

$6,200/mo

Opus 4 planner + synth · $2,100
Sonnet 4 workers (cached) · $1,500
Exa + Tavily · $950
Firecrawl · $700
Cohere Rerank · $350
Postgres + infra · $280
Observability + evals · $320

Scale

400,000 tasks/mo

$142,000/mo

Opus 4 planner + synth (cached) · $52,000
Sonnet 4 workers (heavy caching) · $38,000
Exa + Tavily enterprise · $18,000
Firecrawl enterprise · $14,000
Cohere Rerank volume · $6,500
Postgres + infra · $7,500
Observability + evals · $6,000

Latency budget

Total P50: 27,250ms · Total P95: 57,600ms

Planner decomposition · 3,500ms median · 7,000ms p95
Parallel search (3–5 queries) · 1,800ms median · 3,500ms p95
Fetch + extract (top-20 URLs) · 6,000ms median · 15,000ms p95
Rerank pass · 450ms median · 1,100ms p95
Worker synthesis (parallel per sub-Q) · 4,500ms median · 9,000ms p95
Opus 4 final synthesis · 11,000ms median · 22,000ms p95

Tradeoffs

Depth vs latency

Going from 3 to 10 sources per sub-question marginally improves answer quality but triples latency and cost. For user-facing UX, default to 'quick' mode (3 sources, ~30s) and expose 'deep' mode (10+ sources, 2–3min) as an explicit choice. Don't default to deep — the latency kills engagement.

Opus 4 vs Sonnet 4 for synthesis

Opus 4 produces measurably better synthesis on contested topics where sources contradict each other. On clear-cut factual research, Sonnet 4 delivers roughly 80% of the quality at 20% of the cost. Route by query complexity: use a cheap classifier to pick the synthesis model.

Real-time vs cached research

For 'what is X' style queries, cached research from the last week is often fresh enough and effectively free. For 'what happened this week' queries, always re-fetch. Maintain a staleness-by-topic table — news is stale in hours, academic topics are fresh for months.
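The staleness-by-topic table is just a TTL lookup. A minimal sketch with illustrative topic names and TTLs:

```python
from datetime import timedelta

# Illustrative staleness-by-topic table; tune TTLs from real re-fetch outcomes.
STALENESS: dict[str, timedelta] = {
    "news":       timedelta(hours=4),
    "pricing":    timedelta(days=2),
    "regulation": timedelta(days=14),
    "academic":   timedelta(days=90),
}

def is_fresh(topic: str, age: timedelta) -> bool:
    # Unknown topics fall back to a conservative one-week TTL.
    return age <= STALENESS.get(topic, timedelta(days=7))
```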

Failure modes & guardrails

Hallucinated statistics and made-up sources

Mitigation: Hard requirement: every numeric claim, quote, or specific fact must include a citation URL. Deterministically verify that the cited quote appears as a substring (fuzzy match > 0.85) in the fetched source content; reject and retry if it does not.

Agent searches forever / blows the budget

Mitigation: Enforce hard caps: max 6 search rounds, max 25 URLs fetched, max $0.50 per task (configurable). Emit a metric when budget is hit — usually indicates a badly decomposed plan. Return partial answer with explicit 'insufficient data' flag.
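The hard caps are simplest as one mutable budget object threaded through the run. A sketch with the defaults named above:

```python
from dataclasses import dataclass

@dataclass
class Budget:
    # Hard caps from the mitigation above; defaults are configurable.
    max_search_rounds: int = 6
    max_urls: int = 25
    max_cost_usd: float = 0.50
    rounds: int = 0
    urls: int = 0
    cost_usd: float = 0.0

    def charge(self, rounds: int = 0, urls: int = 0, cost_usd: float = 0.0) -> bool:
        """Record spend; return False once any cap is hit, at which point the
        caller emits a metric and returns a partial answer flagged
        'insufficient data'."""
        self.rounds += rounds
        self.urls += urls
        self.cost_usd += cost_usd
        return (self.rounds <= self.max_search_rounds
                and self.urls <= self.max_urls
                and self.cost_usd <= self.max_cost_usd)
```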

Contradictory sources, agent picks one silently

Mitigation: Worker prompts must tag contradictions explicitly ('Source A claims X; Source B claims Y'). Synthesizer either surfaces the contradiction to the user or picks by source authority (peer-reviewed > news > blog) and labels the confidence.

Paywalled or low-quality sources dominate results

Mitigation: Maintain a domain quality list (allow-list academic/gov, deprioritize content farms). Filter at search time — Exa supports domain filters. Skip paywalls unless the user has provided credentials; do not guess at the content behind them.

Outdated sources fed into a 'current state' question

Mitigation: Detect temporal language in the query ('current', 'latest', '2026') and filter search results to last 90 days. Rerank with date boost. Surface the freshest source's date explicitly in the answer so the user knows the recency bound.
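Temporal-language detection can start as a regex gate that returns a date floor for the search API. A sketch with illustrative patterns:

```python
import re
from datetime import datetime, timedelta, timezone

# Illustrative patterns; expand the list from logs of real user queries.
TEMPORAL = re.compile(
    r"\b(current|latest|today|this (week|month|year)|20\d\d)\b", re.IGNORECASE
)

def recency_floor(query: str, window_days: int = 90):
    """Return a start date to pass as the search API's date filter when the
    query is time-sensitive, else None (no filter)."""
    if TEMPORAL.search(query):
        return datetime.now(timezone.utc) - timedelta(days=window_days)
    return None
```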

Frequently asked questions

Is this the same as Perplexity?

Architecturally yes — Perplexity is essentially a hosted research agent with this shape. The difference is control: building your own lets you tune for your domain, use your own allow-listed sources, integrate with internal data, and avoid the $20/mo/user pricing at scale.

Opus 4 or Sonnet 4 for the planner?

Opus 4 decomposes harder queries meaningfully better — particularly compound or contested ones. For well-scoped factual research, Sonnet 4 is fine. Route by query complexity: use Haiku 4 as a quick 'is this simple or complex' classifier and pick accordingly.
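The routing itself can be a cheap heuristic prefilter plus an optional classifier call. A sketch where `classify` is a hypothetical callable wrapping the small-model classifier, and the model names are placeholders:

```python
import re

# Obviously compound queries skip the classifier call entirely.
COMPLEX_HINTS = re.compile(r"\b(compare|versus|vs\.?|trade-?offs?|pros and cons)\b", re.I)

def pick_planner_model(query: str, classify=None) -> str:
    """Route planning/synthesis by query complexity. `classify` wraps a cheap
    classifier (e.g. a small model prompted to answer 'simple' or 'complex')."""
    if COMPLEX_HINTS.search(query) or query.count("?") > 1:
        return "opus-4"
    if classify is not None and classify(query) == "complex":
        return "opus-4"
    return "sonnet-4"
```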

Do I need both Exa and Tavily?

Exa is best for semantic / research queries ('papers about X'). Tavily is best for recency-sensitive queries ('news today'). They miss different things. You can start with just Exa and add Tavily once you see recency gaps.

How do I keep research fresh without re-fetching every time?

Cache the final research answer keyed on normalized query + date bucket. Expire news-ish queries in hours, evergreen queries in weeks. On cache miss, re-fetch only the URLs whose Last-Modified > cached timestamp.
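The cache key described above is a hash over the normalized query plus a date bucket, with bucket size set per topic class. A minimal sketch:

```python
import hashlib
from datetime import date

def cache_key(query: str, bucket_days: int = 1) -> str:
    """Key cached answers on normalized query + date bucket. The same question
    asked within one bucket hits the cache; size the bucket per topic class
    (hours-equivalent for news, weeks for evergreen)."""
    norm = " ".join(query.lower().split())          # normalize case + whitespace
    bucket = date.today().toordinal() // bucket_days
    return hashlib.sha256(f"{norm}|{bucket}".encode()).hexdigest()
```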

What's the typical latency for a research task?

Quick mode (3 sources, simple query): 15–30s. Standard (5–7 sources): 30–60s. Deep (10+ sources, multi-step decomposition with Opus 4): 90s–3min. Stream partial results to keep users engaged during deep research.

Can I trust the citations?

Only with the deterministic citation verifier in place. Without it, even grounded LLMs hallucinate ~5–15% of citations (wrong URL, fabricated quote, misattributed). With substring verification, fabrication drops to <1%, and the remaining errors are attribution mistakes that are easier to catch in eval.

How does it handle contradictory sources?

Two patterns: (1) surface the contradiction to the user ('Source A says X; Source B says Y') — often the honest answer, and (2) weight by source authority (peer-reviewed papers > reputable news > blogs). Never silently pick one side without disclosure.
