Reference Architecture · generation
Text-to-SQL Agent
Last updated: April 15, 2026
Quick answer
The production stack loads schema plus sample values into context, uses Claude Sonnet 4 or a fine-tuned text-to-SQL model, validates generated SQL with a dry-run EXPLAIN, and executes on a read-only role with an injected LIMIT clause. Expect $0.02 to $0.08 per query.
The problem
Non-technical users need to ask questions like 'what were our top 10 customers last quarter' and get a correct answer from a production database — without hallucinated column names, without DROP TABLE, and without leaking sensitive data to the LLM provider.
Architecture
Schema Loader
Fetches table names, column names, types, and constraints from information_schema. For large databases, retrieves only tables relevant to the query.
Alternatives: Full schema dump, Embedding-based schema retrieval for 1000+ table databases
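The loader's two halves can be sketched as follows: the `information_schema` query it would run against Postgres, and a formatter that turns the resulting rows into a compact prompt section. The query text and the `TABLE name (col type, ...)` output format are assumptions for illustration, not a prescribed wire format.

```python
# Hypothetical sketch of the Schema Loader. SCHEMA_SQL is the query the loader
# would run against Postgres; format_schema condenses the rows for the prompt.

SCHEMA_SQL = """
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position
"""

def format_schema(rows):
    """rows: (table_name, column_name, data_type, is_nullable) tuples."""
    tables = {}
    for table, col, dtype, nullable in rows:
        suffix = "" if nullable == "YES" else " NOT NULL"
        tables.setdefault(table, []).append(f"{col} {dtype}{suffix}")
    return "\n".join(
        f"TABLE {name} ({', '.join(cols)})"
        for name, cols in sorted(tables.items())
    )

# Stand-in rows, as if fetched from information_schema:
rows = [
    ("customers", "id", "integer", "NO"),
    ("customers", "name", "text", "YES"),
    ("orders", "id", "integer", "NO"),
    ("orders", "customer_id", "integer", "NO"),
]
print(format_schema(rows))
```

Constraints (primary keys, foreign keys) would come from a second query against `information_schema.table_constraints`; they are omitted here to keep the sketch short.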
Sample Value Injector
Fetches 3 to 5 sample values for each column likely to appear in filters. This dramatically reduces hallucinated enum values.
Alternatives: Skip (risky for filter-heavy queries), Full value enumeration for low-cardinality columns
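A minimal sketch of the injector, shown here against an in-memory SQLite database for runnability (production would hit the real warehouse). The `sample_values` helper name is an assumption.

```python
import sqlite3

def sample_values(conn, table, column, k=5):
    # Pull up to k distinct values so the model sees real enum spellings
    # ("shipped", not "SHIPPED" or "complete"). The table/column names are
    # interpolated directly, so they must come from the schema loader,
    # never from user input.
    cur = conn.execute(f"SELECT DISTINCT {column} FROM {table} LIMIT {k}")
    return [row[0] for row in cur.fetchall()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?)",
                 [("shipped",), ("pending",), ("shipped",), ("cancelled",)])
print(sample_values(conn, "orders", "status", k=3))
```

On a large table, `SELECT DISTINCT ... LIMIT k` can still scan widely; sampling from a `TABLESAMPLE` clause or a nightly materialized summary is a common mitigation.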
SQL Generator
Generates SQL from the natural-language question using schema and sample values as context.
Alternatives: GPT-4o, DeepSeek-V3 (strong on SQL, cheaper)
SQL Validator
Runs EXPLAIN on the generated query before execution. Rejects queries with syntax errors, missing table references, or DDL statements.
Alternatives: sqlglot parse-only validation, LLM self-critique pass
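The validator's core loop can be sketched as below, again using SQLite so the example runs standalone; against Postgres the `EXPLAIN` call is the same idea. The first-keyword DDL check is deliberately crude (a compound statement can hide DDL later in the string), which is why the stack pairs it with a real parser.

```python
import sqlite3

DDL_KEYWORDS = {"DROP", "CREATE", "ALTER", "TRUNCATE"}

def validate_sql(conn, sql):
    # Reject obvious DDL up front, then dry-run EXPLAIN so the database
    # itself checks table and column references without executing the query.
    first = sql.lstrip().split(None, 1)[0].upper()
    if first in DDL_KEYWORDS:
        return False, f"DDL statement rejected: {first}"
    try:
        conn.execute(f"EXPLAIN {sql}")  # plans the query, runs nothing
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
print(validate_sql(conn, "SELECT id FROM users"))    # accepted
print(validate_sql(conn, "SELECT nope FROM users"))  # rejected: unknown column
print(validate_sql(conn, "DROP TABLE users"))        # rejected: DDL
```

Note the first-token check alone would miss `SELECT 1; DROP TABLE users`; the parser-based layer described under "Alternatives" exists for exactly that gap.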
Read-Only Executor
Executes the validated query on a read-only database role with LIMIT injection and a 10-second timeout.
Alternatives: Connection pool with max_rows setting, Query result cache layer
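The LIMIT-injection half of the executor can be sketched as a small rewrite step. This regex version is intentionally naive and labeled as such; a production version should use a real SQL parser (e.g. sqlglot) so a `LIMIT` inside a subquery is not mistaken for an outer cap. The read-only role and timeout live in the database/connection config, not in application code.

```python
import re

MAX_ROWS = 1000  # assumed cap; tune per deployment

def inject_limit(sql, max_rows=MAX_ROWS):
    # Naive sketch: cap result size only if the model did not add its own
    # LIMIT. A parser-based rewrite is safer against LIMITs in subqueries.
    if re.search(r"\blimit\s+\d+", sql, re.IGNORECASE):
        return sql
    return f"{sql.rstrip().rstrip(';')} LIMIT {max_rows}"

print(inject_limit("SELECT * FROM orders"))
print(inject_limit("SELECT * FROM orders LIMIT 10"))
```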
Result Formatter
Converts query results into a natural-language answer or table view. Redacts columns tagged as sensitive.
Alternatives: Raw table display, Chart renderer for numeric results
The stack
Schema context
For databases under 200 tables, dump the full schema. Above 200 tables, embed each table description and retrieve the top 10 by semantic similarity to the user query.
Alternatives: Full schema dump, LLM-guided table selection
Claude Sonnet 4
Best performance on complex JOIN queries, subqueries, and window functions in 2026. DeepSeek-V3 is 40% cheaper and competitive for single-table queries.
Alternatives: GPT-4o, DeepSeek-V3
EXPLAIN plus sqlglot
EXPLAIN catches semantic errors (missing columns, wrong joins). sqlglot catches syntax errors and DDL statements before they reach the database.
Alternatives: EXPLAIN only, LLM self-critique
Read-only role, LIMIT, timeout
A read-only role prevents any write operations at the database level. LIMIT injection prevents runaway queries on large tables. A 10-second timeout prevents connection pool exhaustion.
Alternatives: Connection pool with max_rows, Query sandbox
Summary plus table
Numbers in isolation are harder to act on. A one-sentence summary plus the raw data gives users both context and the ability to verify.
Alternatives: Raw table only, Chart renderer
Cost at each scale
Prototype: 100 queries/day, $45/mo
Startup: 5k queries/day, $750/mo
Scale: 100k queries/day, $9,500/mo
Tradeoffs
Schema in context vs schema retrieval
For databases under 200 tables, put the full schema in context: it is simpler and gives the model full visibility. Above 200 tables, use embedding-based retrieval to select the 10 most relevant tables. The retrieval approach risks missing a necessary table; the full-context approach works until the schema exceeds the context window.
Fine-tuned model vs prompting
Fine-tuning a smaller model (Llama 3.1 8B or CodeLlama) on your specific schema can match Claude quality at 20% of the cost, but requires labeling 500+ query/SQL pairs and a training pipeline. For most teams, prompting Claude is faster to ship and easier to maintain.
Always add LIMIT, never allow DDL
These are non-negotiable guardrails. LIMIT prevents full-table scans that lock production. Blocking DDL (DROP, CREATE, ALTER, TRUNCATE) at the validation layer prevents the most common injection attack patterns.
Failure modes & guardrails
Hallucinated column names
Mitigation: Validate every column name the LLM emits against the actual schema before execution. For databases with ambiguous column naming conventions (e.g., both user_id and userId), include the full column list in the system prompt.
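A lightweight version of that mitigation is to diff the identifiers the model emitted against the live schema before execution. The regex tokenizer and keyword list below are simplifications (a real implementation would walk a sqlglot AST to distinguish columns from aliases), so treat this as a sketch of the idea.

```python
import re

def unknown_columns(sql, schema):
    # schema: {table_name: [column_name, ...]} from the schema loader.
    # Anything the model emitted that is neither a keyword, a table,
    # nor a column is a likely hallucination.
    known = {t.lower() for t in schema}
    known |= {c.lower() for cols in schema.values() for c in cols}
    keywords = {"select", "from", "where", "and", "or", "order",
                "by", "group", "limit", "as", "join", "on"}
    idents = {t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql)}
    return idents - keywords - known

schema = {"users": ["id", "email", "created_at"]}
print(unknown_columns("SELECT id, emial FROM users", schema))  # {'emial'}
```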
Wrong table join produces incorrect results
Mitigation: Add a self-critique step: after generating SQL, prompt the model to verify each JOIN condition makes semantic sense. Log queries where the join condition is on non-primary-key columns for manual review.
N+1 query patterns from generated SQL
Mitigation: Run EXPLAIN ANALYZE (not just EXPLAIN) on a representative sample of queries in staging. Alert when any query produces a sequential scan on a table with more than 100k rows.
PII columns exposed in results
Mitigation: Maintain a column-level sensitivity registry. Strip sensitive columns from the result set before passing to the formatter and before logging. Never log raw query results.
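The registry-based redaction can be sketched in a few lines. The `SENSITIVE` set stands in for whatever column-level registry the deployment maintains; in practice it would be loaded from config or column comments rather than hard-coded.

```python
SENSITIVE = {"email", "ssn", "phone"}  # assumed sensitivity registry

def redact(rows, columns):
    # Drop sensitive columns from the result set BEFORE the formatter
    # or any logger sees the data.
    keep = [i for i, c in enumerate(columns) if c not in SENSITIVE]
    kept_cols = [columns[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_cols, kept_rows

cols, rows = redact([("Ada", "ada@example.com", 1)], ["name", "email", "id"])
print(cols, rows)  # ['name', 'id'] [['Ada', 1]]
```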
User asks ambiguous question, SQL is technically valid but semantically wrong
Mitigation: Return the generated SQL to the user alongside the answer. Add a 'Does this answer your question?' confirmation loop for queries involving aggregations over time ranges.
Frequently asked questions
Which LLM is best for text-to-SQL?
Claude Sonnet 4 leads on complex queries involving multiple JOINs, subqueries, and window functions. DeepSeek-V3 is competitive on single-table queries at roughly 40% lower cost. GPT-4o falls between the two. For production, test all three against a representative sample of your own queries.
How do I prevent destructive SQL queries?
Three layers: use a read-only database role at the connection level (no write privileges in the role), reject any query containing DDL or destructive DML keywords (DROP, CREATE, ALTER, TRUNCATE, DELETE) before execution, and add sqlglot static analysis to catch obfuscated DDL. Any one of these alone is insufficient.
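The keyword-rejection layer is worth showing because the naive version (checking only the first token) misses compound statements. A sketch that scans every token instead:

```python
import re

FORBIDDEN = {"DROP", "CREATE", "ALTER", "TRUNCATE", "DELETE",
             "UPDATE", "INSERT", "GRANT"}

def contains_forbidden(sql):
    # Scan every word-like token, so destructive statements hidden after a
    # semicolon or inside a compound query are still caught. This is the
    # middle layer only; the read-only role and the sqlglot AST walk are
    # the layers on either side of it.
    tokens = re.findall(r"[A-Za-z_]+", sql)
    return any(t.upper() in FORBIDDEN for t in tokens)

print(contains_forbidden("SELECT 1; DROP TABLE users"))  # True
print(contains_forbidden("SELECT name FROM products"))   # False
```

The cost of this bluntness is false positives: a column literally named `delete` would trip the scan. That is the argument for the sqlglot layer, which rejects based on statement type in the parsed AST rather than on raw tokens.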
How do I handle a 1000-table database?
Embed a description of each table (table name, column names, purpose) and retrieve the top 10 to 15 tables by cosine similarity to the user query. Feed only those tables' schemas into the prompt. This keeps context manageable and reduces hallucination on irrelevant tables.
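The retrieval step reduces to a cosine-similarity ranking over precomputed table-description embeddings. The vectors below are toy stand-ins; in production they would come from a real embedding model and live in a vector store.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_tables(query_vec, table_vecs, k=10):
    # table_vecs: {table_name: embedding of "name + columns + purpose"}.
    # Rank all tables by similarity to the embedded user question and keep k.
    ranked = sorted(table_vecs,
                    key=lambda t: cosine(query_vec, table_vecs[t]),
                    reverse=True)
    return ranked[:k]

# Toy 2-dimensional "embeddings" for illustration only:
vecs = {"orders": [0.9, 0.1], "users": [0.2, 0.8], "audit_log": [0.5, 0.5]}
print(top_tables([1.0, 0.0], vecs, k=2))  # ['orders', 'audit_log']
```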
How do I evaluate text-to-SQL quality?
Use execution accuracy: run the generated SQL and the gold-standard SQL independently, then compare result sets. A match means the model produced a semantically correct query even if the SQL differs syntactically. Track execution accuracy against a held-out set of 200 to 500 representative queries from your actual database.
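Execution accuracy as described reduces to a short harness: run both queries, compare result sets order-insensitively, count matches. The SQLite demo below stands in for the real database; `run` would be a thin wrapper over the read-only executor.

```python
import sqlite3

def execution_accuracy(pairs, run):
    # pairs: (generated_sql, gold_sql); run(sql) -> list of result tuples.
    # Result sets are compared sorted, so syntactically different but
    # semantically equivalent SQL still counts as a match.
    hits = sum(
        sorted(map(tuple, run(gen))) == sorted(map(tuple, run(gold)))
        for gen, gold in pairs
    )
    return hits / len(pairs)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
run = lambda sql: conn.execute(sql).fetchall()

pairs = [
    ("SELECT x FROM t ORDER BY x DESC", "SELECT x FROM t"),  # match
    ("SELECT x FROM t WHERE x = 1",     "SELECT x FROM t"),  # mismatch
]
print(execution_accuracy(pairs, run))  # 0.5
```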
Should I fine-tune a model on my database schema?
Fine-tuning makes sense when you have 500+ labeled query/SQL pairs, a stable schema, and volume above 50k queries per day. Below that threshold, the maintenance cost of a fine-tuning pipeline exceeds the savings from using a smaller model. Start with prompting Claude and add fine-tuning only when cost becomes a real constraint.
How do I deal with queries about data the user cannot access?
Filter at the schema level. Maintain a per-user or per-role schema view that shows only the tables and columns the user is permitted to query. Pass this restricted schema to the LLM. Never rely on the LLM to self-censor based on instructions — restrict at the database layer.
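The per-role schema view can be sketched as an intersection of the full schema with a grants registry; the registry format here is an assumption, and the point is that filtering happens before the prompt is ever built, with the database role still enforcing the same boundary underneath.

```python
FULL_SCHEMA = {
    "users":    ["id", "email", "created_at"],
    "salaries": ["user_id", "amount"],
}
ROLE_GRANTS = {  # assumed ACL registry: role -> {table: allowed columns}
    "analyst": {"users": ["id", "created_at"]},
}

def schema_for(role):
    # Restrict before prompting: the LLM only ever sees tables and columns
    # the role may query. This complements, never replaces, the database
    # role's own privileges.
    grants = ROLE_GRANTS.get(role, {})
    return {
        table: [c for c in cols if c in set(grants.get(table, []))]
        for table, cols in FULL_SCHEMA.items()
        if table in grants
    }

print(schema_for("analyst"))  # {'users': ['id', 'created_at']}
print(schema_for("intern"))   # {} -- no grants, empty schema, nothing to leak
```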