Reference Architecture · agent

Data Analyst Agent

Last updated: April 16, 2026

Quick answer

The production stack uses Claude Sonnet 4 for SQL generation grounded in a curated semantic layer, a sandboxed DuckDB/Snowflake executor, and a self-critique pass where the model reviews query + result before presenting. Expect $0.05–$0.30 per question, 4–15s end-to-end latency, and 85–95% accuracy on well-curated schemas. Never let the agent query raw warehouse tables without a semantic layer — accuracy drops below 60%.

The problem

Every team has data in warehouses that only a handful of people can query. You need an agent that takes 'how did revenue trend in Europe last quarter by product' and returns a correct SQL query, a chart, and a written insight, with enough schema knowledge and business context to avoid confident nonsense. The hard parts are ambiguous column semantics (what counts as 'active user'?), joining across messy fact/dim tables, and catching subtle errors like wrong date filters.

Architecture

User Question (input) → Intent Router (LLM) → Semantic Layer (data) → Schema Retrieval (data) → SQL Generator (LLM) → Query Executor (infra) → Result Critique (LLM) → if ok → Chart + Narrative Generator (LLM) → Chart + Answer (output)

User Question

Natural language: 'top 10 customers by revenue in EMEA last quarter'. May reference a dashboard or chart.

Alternatives: Slack, Dashboard chat, API

Intent Router

Classifies as metric lookup / trend / anomaly / freeform. Routes to templated query path for common patterns.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash
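
A minimal sketch of the rule-first side of the router: cheap regex patterns catch the common question shapes before any LLM call, and everything else falls through to the freeform path. The patterns and intent names here are illustrative, not a production taxonomy.

```python
import re

# Hypothetical rule-first router: cheap patterns classify common question
# shapes; anything unmatched falls through to the LLM "freeform" path.
ROUTES = [
    ("trend",   re.compile(r"\b(trend|over time|by (week|month|quarter))", re.I)),
    ("anomaly", re.compile(r"\b(spike|drop|unusual|anomal)", re.I)),
    ("metric",  re.compile(r"\b(top \d+|total|how (much|many)|what was)\b", re.I)),
]

def route(question: str) -> str:
    """Return the first matching intent, or 'freeform' for the LLM path."""
    for intent, pattern in ROUTES:
        if pattern.search(question):
            return intent
    return "freeform"
```

In practice the regex layer only short-circuits the obvious cases; the LLM classifier handles the long tail.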

Semantic Layer

Curated metrics, dimensions, join paths. Agent queries this, not raw tables. Cube.js, dbt Semantic Layer, or MetricFlow.

Alternatives: dbt Semantic Layer, LookML, Custom YAML

Schema Retrieval

Embedding search over table/column descriptions, sample values, and past successful queries to inject relevant context.

Alternatives: Pinecone, Qdrant
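
The retrieval step reduces to nearest-neighbor search over pre-computed embeddings. A dependency-free sketch (in production pgvector does the same cosine ranking server-side; the embeddings here are assumed to be built offline from column docs and past queries):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k_columns(question_vec, column_index, k=5):
    """column_index: list of (description, embedding) pairs, e.g. built
    offline from table/column descriptions and past successful queries.
    Returns the k most relevant descriptions to inject into the prompt."""
    scored = sorted(column_index, key=lambda item: -cosine(question_vec, item[1]))
    return [desc for desc, _ in scored[:k]]
```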

SQL Generator

Generates SQL (or semantic-layer query) with retrieved schema and past examples. Uses structured output with named parameters.

Alternatives: GPT-4o, Gemini 2.5 Pro
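
How the generation prompt is assembled from the retrieved pieces — a sketch with a hypothetical layout; the exact instructions and JSON shape are assumptions, but the structure (schema context, few-shot pairs, structured-output request) follows the design above:

```python
def build_sql_prompt(question, schema_snippets, examples):
    """Assemble the generation prompt: retrieved schema context plus past
    successful (question, SQL) pairs as few-shot examples. The model is
    asked for JSON with named parameters so values are bound, not inlined."""
    parts = ["You write one SQL query against the semantic layer.",
             "Relevant schema:"]
    parts += [f"- {snippet}" for snippet in schema_snippets]
    parts.append("Examples of past successful queries:")
    for past_q, past_sql in examples:
        parts.append(f"Q: {past_q}\nSQL: {past_sql}")
    parts.append(f"Q: {question}")
    parts.append('Respond as JSON: {"sql": "...", "params": {...}}')
    return "\n".join(parts)
```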

Query Executor

Read-only sandbox against the warehouse. Hard row limits, timeout, no write/drop permissions.

Alternatives: Snowflake, BigQuery, DuckDB, Postgres read replica
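
A sketch of the application-side guard in front of the executor. This is defense in depth only — the warehouse role itself must still be read-only — and the keyword/semicolon checks are deliberately crude heuristics:

```python
import re

FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|grant|merge|truncate)\b", re.I)

def guard_query(sql: str, row_limit: int = 10_000) -> str:
    """Reject anything that isn't a single SELECT/CTE and wrap it with a
    hard row cap. Crude string checks; the read-only role is the real wall."""
    if FORBIDDEN.search(sql):
        raise ValueError("write/DDL statements are not allowed")
    stripped = sql.rstrip().rstrip(";")
    if ";" in stripped:  # heuristic: a remaining semicolon = multiple statements
        raise ValueError("multiple statements are not allowed")
    if not re.match(r"\s*(select|with)\b", stripped, re.I):
        raise ValueError("only SELECT queries are allowed")
    return f"SELECT * FROM ({stripped}) AS guarded LIMIT {row_limit}"
```

The query timeout is set on the warehouse session, not here.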

Result Critique

Reviews SQL + result summary for obvious errors (empty result, wrong date window, cross-join blow-up). Retries with fix on failure.

Alternatives: Separate Opus 4 critic, Rule-based validator
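
The generate → execute → critique → retry loop, sketched with hypothetical callables (`generate`, `execute`, `critique` stand in for the LLM, the sandboxed executor, and the critic; retries are capped at 2 as discussed in the FAQ):

```python
def answer_with_critique(question, generate, execute, critique, max_retries=2):
    """Self-critique loop: generate SQL, run it, let the critic inspect
    query + result; retry with the critic's feedback up to max_retries
    times, then return the best attempt flagged as unverified."""
    feedback = None
    last = None
    for _attempt in range(max_retries + 1):
        sql = generate(question, feedback)
        result = execute(sql)
        verdict = critique(sql, result)       # None means "looks fine"
        last = {"sql": sql, "result": result, "ok": verdict is None}
        if verdict is None:
            return last
        feedback = verdict                    # feed the critique back in
    last["ok"] = False                        # cap hit: present with caveat
    return last
```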

Chart + Narrative Generator

Picks chart type, generates Vega-Lite spec, writes a 2–3 sentence insight with year-over-year framing.

Alternatives: GPT-4o, Gemini 2.5 Pro
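
A minimal Vega-Lite spec for the time-series case, built as a plain dict. Keeping the structure deterministic and letting the LLM fill only title and labels is the pattern described above; field names in the test are hypothetical:

```python
def line_chart_spec(rows, x_field, y_field, title):
    """Minimal Vega-Lite v5 spec for a time-series line chart. The
    structure is fixed; the LLM only supplies title and axis labels."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v5.json",
        "title": title,
        "data": {"values": rows},
        "mark": "line",
        "encoding": {
            "x": {"field": x_field, "type": "temporal"},
            "y": {"field": y_field, "type": "quantitative"},
        },
    }
```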

Chart + Answer

Rendered chart, data table, written insight, and the underlying SQL (shown collapsed by default).

Alternatives: Slack post, Dashboard embed, Notion block

The stack

SQL generator · Claude Sonnet 4

Sonnet 4 is the strongest general-purpose text-to-SQL model in 2026 evals. GPT-4o is close. Gemini 2.5 Pro is weaker on multi-CTE queries.

Alternatives: GPT-4o, Gemini 2.5 Pro

Semantic layer · Cube.js or dbt Semantic Layer

LLMs hallucinate metrics when they have to compose them from raw tables. A semantic layer makes 'MRR' a real, unambiguous primitive. This single change moves accuracy from ~60% to ~90% on typical business questions.

Alternatives: LookML, MetricFlow, Custom YAML

Warehouse · Snowflake or BigQuery

Cloud warehouses give you role-based scoping and query governance. DuckDB is great for embedded analytics on <100GB datasets.

Alternatives: Postgres read replica, DuckDB, Databricks

Schema retrieval · pgvector + column metadata

Embeddings over table/column descriptions plus past successful queries outperform dumping the full schema into the prompt. Cheap, fast, and keeps context small.

Alternatives: Pinecone namespace per warehouse, Qdrant

Executor sandbox · Read-only warehouse role

Never let the agent run write queries. Explicit role with only SELECT on exposed schemas. Query timeout at 30s, row limit at 10k.

Alternatives: Materialized view cache, DuckDB on snapshot

Chart rendering · Vega-Lite + React wrapper

Vega-Lite's declarative spec is easy for the LLM to generate, and renders in any JS environment. Plotly works but its spec is more verbose and LLMs drift more.

Alternatives: Plotly, Observable Plot, Recharts

Evals · Braintrust with SQL golden set

Eval on a curated set of 100–500 real business questions with known-correct SQL. Regression-test every prompt change — accuracy is non-monotonic with prompt tweaks.

Alternatives: Langfuse, Custom eval harness
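
A sketch of the regression harness, with one deliberate design choice: compare result sets, not SQL text, since two different queries can be equally correct. `cases`, `generate_sql`, and `execute` are hypothetical callables standing in for the golden set, the agent, and the sandboxed executor:

```python
def eval_golden_set(cases, generate_sql, execute):
    """Score the agent against a golden set of (question, known-correct
    SQL) pairs by comparing executed result sets order-insensitively."""
    passed = 0
    for question, gold_sql in cases:
        got = execute(generate_sql(question))
        want = execute(gold_sql)
        if sorted(map(tuple, got)) == sorted(map(tuple, want)):
            passed += 1
    return passed / len(cases)
```

Run this on every prompt change; a tweak that helps one question class often regresses another.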

Cost at each scale

Prototype

500 queries/mo

$45/mo

Sonnet 4 generator + critique · $18
Haiku 4 router · $1
Embeddings · $2
Supabase + pgvector · $0
Warehouse compute (shared) · $20
Infra · $4

Startup

30,000 queries/mo

$3,400/mo

Sonnet 4 (with caching) · $1,400
Haiku 4 router · $70
Embeddings + rerank · $150
Supabase Pro · $120
Snowflake compute · $1,200
Cube.js Cloud · $300
Observability + evals · $160

Scale

1,000,000 queries/mo

$85,000/mo

Sonnet 4 (heavy caching) · $32,000
Haiku 4 router · $1,800
Embeddings + rerank · $3,500
Snowflake compute (dedicated WH) · $28,000
Cube.js Enterprise · $9,000
Postgres + vector infra · $4,500
Observability + evals · $6,200

Latency budget

Total P50: 8,350ms
Total P95: 20,100ms
Router + schema retrieval: 400ms median · 900ms p95
Sonnet SQL generation: 2,500ms median · 5,000ms p95
Warehouse execution: 1,800ms median · 7,000ms p95
Result critique: 1,200ms median · 2,400ms p95
Chart + narrative generation: 2,200ms median · 4,200ms p95
Render + deliver: 250ms median · 600ms p95

Tradeoffs

Raw SQL vs semantic layer

Letting the agent query raw tables gives unlimited flexibility but accuracy drops below 60% on anything beyond simple aggregations. A semantic layer constrains the problem (here are the metrics, here are the dimensions, here are the valid joins) and accuracy jumps to 90%+. Build the semantic layer first or accept unreliable results.

Self-critique vs single-shot

Self-critique (run the query, review the result, retry on detected issues) adds 1–3s and ~30% cost but catches silent failures like empty results from wrong date filters. For production use, ship with self-critique. For internal dashboards, single-shot is fine.

Warehouse vs DuckDB

DuckDB on exported snapshots is dramatically cheaper and faster for read-only analytics on <100GB. Snowflake/BigQuery are necessary at petabyte scale or when you need fresh data. Hybrid is viable: DuckDB for exploratory, warehouse for ground-truth refreshes.

Failure modes & guardrails

Query returns 0 rows silently, agent presents 'no data' as answer

Mitigation: In the critique pass, flag any aggregate returning 0 rows or NULL as suspicious. Re-examine the WHERE clause, especially date filters and string equality. Present 'no matching data with current filter' explicitly rather than an empty chart.
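
The zero-row/NULL check is simple enough to run as deterministic code before the LLM critic ever sees the result — a sketch, assuming rows arrive as tuples:

```python
def flag_suspicious_result(rows):
    """Pre-critique flags: an empty or all-NULL result usually means a
    wrong filter (bad date window, string mismatch), not 'no data'."""
    flags = []
    if not rows:
        flags.append("empty_result: re-check WHERE clause, especially date filters")
    elif all(all(value is None for value in row) for row in rows):
        flags.append("all_null: aggregate over zero matching rows?")
    return flags
```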

Cartesian join blows up query to billions of rows

Mitigation: Enforce a hard row limit (EXPLAIN + abort above 100M estimated rows). Require the semantic layer to define valid join paths — never let the agent invent a join key. Set warehouse query timeout at 30s and cost limit per query.
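
The EXPLAIN-then-abort check, sketched under the assumption that the row estimate has already been parsed out of the warehouse's plan output (each warehouse reports it differently):

```python
MAX_ESTIMATED_ROWS = 100_000_000

def check_plan(estimated_rows: int) -> None:
    """Abort before execution if the plan estimate suggests a cartesian
    blow-up. `estimated_rows` is assumed to be parsed from EXPLAIN output."""
    if estimated_rows > MAX_ESTIMATED_ROWS:
        raise RuntimeError(
            f"plan estimates {estimated_rows:,} rows; likely a cartesian join"
        )
```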

Wrong column semantics — 'users' means signups, not MAU

Mitigation: Every dimension and metric in the semantic layer must have a human-readable description and sample values. Include the 5 most semantically similar past successful queries as few-shot examples. Ambiguous terms ('users', 'revenue') should trigger a clarification question before execution.

Chart type mismatches data shape

Mitigation: Deterministic chart-type picker based on result shape: 1 numeric + time = line, 2 categorical + numeric = bar, 1 categorical + numeric = bar or pie (≤5 categories), 2 numeric = scatter. LLM only tunes titles and labels; it does not choose the chart type.
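
The deterministic picker above reduces to a small lookup on result shape; a sketch implementing exactly those rules, with a table fallback for shapes the rules don't cover:

```python
def pick_chart(numeric_cols, categorical_cols, has_time, n_categories=None):
    """Map result shape to chart type per the rules above; the LLM only
    writes titles and labels, never the type."""
    if numeric_cols == 1 and has_time:
        return "line"                 # 1 numeric + time
    if numeric_cols == 2 and categorical_cols == 0:
        return "scatter"              # 2 numeric
    if numeric_cols == 1 and categorical_cols == 2:
        return "bar"                  # 2 categorical + numeric
    if numeric_cols == 1 and categorical_cols == 1:
        # 1 categorical + numeric: pie only for few categories
        return "pie" if (n_categories or 99) <= 5 else "bar"
    return "table"                    # anything else: show the raw table
```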

Leaked PII or restricted data in results

Mitigation: Use warehouse row-level security + column masking. Route queries through a role tied to the requesting user's ACLs. Audit-log every query. Never let the agent access tables marked 'restricted' without an explicit override + approver.

Frequently asked questions

Which LLM is best for text-to-SQL in 2026?

Claude Sonnet 4 leads on public text-to-SQL benchmarks (Spider 2, BIRD) and real-world evals. GPT-4o is within a couple percentage points. Gemini 2.5 Pro is behind on complex multi-CTE queries. Fine-tuned open-weight models can match Sonnet on narrow domains but cost more to maintain.

Do I need a semantic layer?

For anything beyond toy demos, yes. Raw-schema text-to-SQL peaks around 55–65% accuracy on real business questions because of ambiguous column semantics. A semantic layer (Cube.js, dbt Semantic Layer, LookML) encodes your definitions and pushes accuracy to 85–95%.

Can the agent update data (INSERT/UPDATE)?

Don't let it. Use a read-only warehouse role. For analytics agents, write operations are out of scope, and every incident in this category ends up screenshotted on Twitter. If you truly need writes, require explicit user confirmation per query and log everything.

What's the right accuracy bar for production?

For internal exploration, 80%+ accuracy is usable (analysts verify). For customer-facing or executive dashboards, 95%+ — get there with semantic layer, self-critique, and a curated few-shot example bank. Anything below 80% creates more confusion than value.

How do you handle ambiguous questions?

Detect ambiguity at the router stage (two plausible interpretations, undefined term). Ask a single clarifying question before running SQL. Example: 'by active users, do you mean DAU, WAU, or MAU?' — one question saves a wrong chart.

Self-critique loop — how many retries?

Cap at 2 retries. Most real issues are caught on the first retry. Beyond that you're hallucinating fixes or fighting an ambiguous question. Emit a metric when the cap is hit and present the best attempt with a confidence caveat.

Does it replace data analysts?

No. It replaces the first hour of their day (metric lookups, 'what was the number for Q3'). Analysts do the harder work: defining what to measure, investigating anomalies, communicating findings. The agent gives them back 30–50% of their time for higher-value work.
