Reference Architecture · generation
Text-to-SQL Agent
Last updated: April 15, 2026
Quick answer
The production stack loads schema plus sample values into context, uses Claude Sonnet 4 or a fine-tuned text-to-SQL model, validates generated SQL with a dry-run EXPLAIN, and executes on a read-only role with an injected LIMIT clause. Expect $0.02 to $0.08 per query.
The problem
Non-technical users need to ask questions like 'what were our top 10 customers last quarter' and get a correct answer from a production database — without hallucinated column names, without DROP TABLE, and without leaking sensitive data to the LLM provider.
Architecture
Schema Loader
Fetches table names, column names, types, and constraints from information_schema. For large databases, retrieves only tables relevant to the query.
Alternatives: Full schema dump, Embedding-based schema retrieval for 1000+ table databases
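The loader's two halves can be sketched as follows: the `information_schema` query it would run against Postgres, and a formatter that turns the resulting rows into a compact prompt section. The query text and the `TABLE name (col type, ...)` output format are assumptions for illustration, not a prescribed wire format.

```python
# Hypothetical sketch of the Schema Loader. SCHEMA_SQL is the query the loader
# would run against Postgres; format_schema condenses the rows for the prompt.

SCHEMA_SQL = """
SELECT table_name, column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position
"""

def format_schema(rows):
    """rows: (table_name, column_name, data_type, is_nullable) tuples."""
    tables = {}
    for table, col, dtype, nullable in rows:
        suffix = "" if nullable == "YES" else " NOT NULL"
        tables.setdefault(table, []).append(f"{col} {dtype}{suffix}")
    return "\n".join(
        f"TABLE {name} ({', '.join(cols)})"
        for name, cols in sorted(tables.items())
    )

# Stand-in rows, as if fetched from information_schema:
rows = [
    ("customers", "id", "integer", "NO"),
    ("customers", "name", "text", "YES"),
    ("orders", "id", "integer", "NO"),
    ("orders", "customer_id", "integer", "NO"),
]
print(format_schema(rows))
```

Constraints (primary keys, foreign keys) would come from a second query against `information_schema.table_constraints`; they are omitted here to keep the sketch short.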
Sample Value Injector
Fetches 3 to 5 sample values for each column likely to appear in filters. This dramatically reduces hallucinated enum values.
Alternatives: Skip (risky for filter-heavy queries), Full value enumeration for low-cardinality columns
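A minimal sketch of the injector, shown here against an in-memory SQLite database for runnability (production would hit the real warehouse). The `sample_values` helper name is an assumption.

```python
import sqlite3

def sample_values(conn, table, column, k=5):
    # Pull up to k distinct values so the model sees real enum spellings
    # ("shipped", not "SHIPPED" or "complete"). The table/column names are
    # interpolated directly, so they must come from the schema loader,
    # never from user input.
    cur = conn.execute(f"SELECT DISTINCT {column} FROM {table} LIMIT {k}")
    return [row[0] for row in cur.fetchall()]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?)",
                 [("shipped",), ("pending",), ("shipped",), ("cancelled",)])
print(sample_values(conn, "orders", "status", k=3))
```

On a large table, `SELECT DISTINCT ... LIMIT k` can still scan widely; sampling from a `TABLESAMPLE` clause or a nightly materialized summary is a common mitigation.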
SQL Generator
Generates SQL from the natural-language question using schema and sample values as context.
Alternatives: GPT-4o, DeepSeek-V3 (strong on SQL, cheaper)
SQL Validator
Runs EXPLAIN on the generated query before execution. Rejects queries with syntax errors, missing table references, or DDL statements.
Alternatives: sqlglot parse-only validation, LLM self-critique pass
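The validator's core loop can be sketched as below, again using SQLite so the example runs standalone; against Postgres the `EXPLAIN` call is the same idea. The first-keyword DDL check is deliberately crude (a compound statement can hide DDL later in the string), which is why the stack pairs it with a real parser.

```python
import sqlite3

DDL_KEYWORDS = {"DROP", "CREATE", "ALTER", "TRUNCATE"}

def validate_sql(conn, sql):
    # Reject obvious DDL up front, then dry-run EXPLAIN so the database
    # itself checks table and column references without executing the query.
    first = sql.lstrip().split(None, 1)[0].upper()
    if first in DDL_KEYWORDS:
        return False, f"DDL statement rejected: {first}"
    try:
        conn.execute(f"EXPLAIN {sql}")  # plans the query, runs nothing
        return True, "ok"
    except sqlite3.Error as exc:
        return False, str(exc)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
print(validate_sql(conn, "SELECT id FROM users"))    # accepted
print(validate_sql(conn, "SELECT nope FROM users"))  # rejected: unknown column
print(validate_sql(conn, "DROP TABLE users"))        # rejected: DDL
```

Note the first-token check alone would miss `SELECT 1; DROP TABLE users`; the parser-based layer described under "Alternatives" exists for exactly that gap.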
Read-Only Executor
Executes the validated query on a read-only database role with LIMIT injection and a 10-second timeout.
Alternatives: Connection pool with max_rows setting, Query result cache layer
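The LIMIT-injection half of the executor can be sketched as a small rewrite step. This regex version is intentionally naive and labeled as such; a production version should use a real SQL parser (e.g. sqlglot) so a `LIMIT` inside a subquery is not mistaken for an outer cap. The read-only role and timeout live in the database/connection config, not in application code.

```python
import re

MAX_ROWS = 1000  # assumed cap; tune per deployment

def inject_limit(sql, max_rows=MAX_ROWS):
    # Naive sketch: cap result size only if the model did not add its own
    # LIMIT. A parser-based rewrite is safer against LIMITs in subqueries.
    if re.search(r"\blimit\s+\d+", sql, re.IGNORECASE):
        return sql
    return f"{sql.rstrip().rstrip(';')} LIMIT {max_rows}"

print(inject_limit("SELECT * FROM orders"))
print(inject_limit("SELECT * FROM orders LIMIT 10"))
```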
Result Formatter
Converts query results into a natural-language answer or table view. Redacts columns tagged as sensitive.
Alternatives: Raw table display, Chart renderer for numeric results
The stack
Schema context
For databases under 200 tables, dump the full schema. Above 200 tables, embed each table description and retrieve the top 10 by semantic similarity to the user query.
Alternatives: Full schema dump, LLM-guided table selection
Claude Sonnet 4
Best performance on complex JOIN queries, subqueries, and window functions in 2026. DeepSeek-V3 is 40% cheaper and competitive for single-table queries.
Alternatives: GPT-4o, DeepSeek-V3
EXPLAIN plus sqlglot
EXPLAIN catches semantic errors (missing columns, wrong joins). sqlglot catches syntax errors and DDL statements before they reach the database.
Alternatives: EXPLAIN only, LLM self-critique
Read-only role, LIMIT, timeout
A read-only role prevents any write operations at the database level. LIMIT injection prevents runaway queries on large tables. A 10-second timeout prevents connection pool exhaustion.
Alternatives: Connection pool with max_rows, Query sandbox
Summary plus table
Numbers in isolation are harder to act on. A one-sentence summary plus the raw data gives users both context and the ability to verify.
Alternatives: Raw table only, Chart renderer
Cost at each scale
Prototype: 100 queries/day, $45/mo
Startup: 5k queries/day, $750/mo
Scale: 100k queries/day, $9,500/mo
Tradeoffs
Schema in context vs schema retrieval
For databases under 200 tables, put the full schema in context: it is simpler and gives the model full visibility. Above 200 tables, use embedding-based retrieval to select the 10 most relevant tables. The retrieval approach risks missing a necessary table; the full-context approach works until the schema exceeds the context window.
Fine-tuned model vs prompting
Fine-tuning a smaller model (Llama 3.1 8B or CodeLlama) on your specific schema can match Claude quality at 20% of the cost, but requires labeling 500+ query/SQL pairs and a training pipeline. For most teams, prompting Claude is faster to ship and easier to maintain.
Always add LIMIT, never allow DDL
These are non-negotiable guardrails. LIMIT prevents full-table scans that lock production. Blocking DDL (DROP, CREATE, ALTER, TRUNCATE) at the validation layer prevents the most common injection attack patterns.
Failure modes & guardrails
Hallucinated column names
Mitigation: Validate every column name the LLM emits against the actual schema before execution. For databases with ambiguous column naming conventions (e.g., both user_id and userId), include the full column list in the system prompt.
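A lightweight version of that mitigation is to diff the identifiers the model emitted against the live schema before execution. The regex tokenizer and keyword list below are simplifications (a real implementation would walk a sqlglot AST to distinguish columns from aliases), so treat this as a sketch of the idea.

```python
import re

def unknown_columns(sql, schema):
    # schema: {table_name: [column_name, ...]} from the schema loader.
    # Anything the model emitted that is neither a keyword, a table,
    # nor a column is a likely hallucination.
    known = {t.lower() for t in schema}
    known |= {c.lower() for cols in schema.values() for c in cols}
    keywords = {"select", "from", "where", "and", "or", "order",
                "by", "group", "limit", "as", "join", "on"}
    idents = {t.lower() for t in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql)}
    return idents - keywords - known

schema = {"users": ["id", "email", "created_at"]}
print(unknown_columns("SELECT id, emial FROM users", schema))  # {'emial'}
```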
Wrong table join produces incorrect results
Mitigation: Add a self-critique step: after generating SQL, prompt the model to verify each JOIN condition makes semantic sense. Log queries where the join condition is on non-primary-key columns for manual review.
N+1 query patterns from generated SQL
Mitigation: Run EXPLAIN ANALYZE (not just EXPLAIN) on a representative sample of queries in staging. Alert when any query produces a sequential scan on a table with more than 100k rows.
PII columns exposed in results
Mitigation: Maintain a column-level sensitivity registry. Strip sensitive columns from the result set before passing to the formatter and before logging. Never log raw query results.
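The registry-based redaction can be sketched in a few lines. The `SENSITIVE` set stands in for whatever column-level registry the deployment maintains; in practice it would be loaded from config or column comments rather than hard-coded.

```python
SENSITIVE = {"email", "ssn", "phone"}  # assumed sensitivity registry

def redact(rows, columns):
    # Drop sensitive columns from the result set BEFORE the formatter
    # or any logger sees the data.
    keep = [i for i, c in enumerate(columns) if c not in SENSITIVE]
    kept_cols = [columns[i] for i in keep]
    kept_rows = [[row[i] for i in keep] for row in rows]
    return kept_cols, kept_rows

cols, rows = redact([("Ada", "ada@example.com", 1)], ["name", "email", "id"])
print(cols, rows)  # ['name', 'id'] [['Ada', 1]]
```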
User asks ambiguous question, SQL is technically valid but semantically wrong
Mitigation: Return the generated SQL to the user alongside the answer. Add a 'Does this answer your question?' confirmation loop for queries involving aggregations over time ranges.
Frequently asked questions
Which LLM is best for text-to-SQL?
Claude Sonnet 4 leads on complex queries involving multiple JOINs, subqueries, and window functions. DeepSeek-V3 is competitive on single-table queries at roughly 40% lower cost. GPT-4o falls between the two. For production, test all three against a representative sample of your own queries.
How do I prevent destructive SQL queries?
Three layers: use a read-only database role at the connection level (no write privileges in the role), reject any query containing DDL or destructive DML keywords (DROP, CREATE, ALTER, TRUNCATE, DELETE) before execution, and add sqlglot static analysis to catch obfuscated DDL. Any one of these alone is insufficient.
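The keyword-rejection layer is worth showing because the naive version (checking only the first token) misses compound statements. A sketch that scans every token instead:

```python
import re

FORBIDDEN = {"DROP", "CREATE", "ALTER", "TRUNCATE", "DELETE",
             "UPDATE", "INSERT", "GRANT"}

def contains_forbidden(sql):
    # Scan every word-like token, so destructive statements hidden after a
    # semicolon or inside a compound query are still caught. This is the
    # middle layer only; the read-only role and the sqlglot AST walk are
    # the layers on either side of it.
    tokens = re.findall(r"[A-Za-z_]+", sql)
    return any(t.upper() in FORBIDDEN for t in tokens)

print(contains_forbidden("SELECT 1; DROP TABLE users"))  # True
print(contains_forbidden("SELECT name FROM products"))   # False
```

The cost of this bluntness is false positives: a column literally named `delete` would trip the scan. That is the argument for the sqlglot layer, which rejects based on statement type in the parsed AST rather than on raw tokens.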
How do I handle a 1000-table database?
Embed a description of each table (table name, column names, purpose) and retrieve the top 10 to 15 tables by cosine similarity to the user query. Feed only those tables' schemas into the prompt. This keeps context manageable and reduces hallucination on irrelevant tables.
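The retrieval step reduces to a cosine-similarity ranking over precomputed table-description embeddings. The vectors below are toy stand-ins; in production they would come from a real embedding model and live in a vector store.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_tables(query_vec, table_vecs, k=10):
    # table_vecs: {table_name: embedding of "name + columns + purpose"}.
    # Rank all tables by similarity to the embedded user question and keep k.
    ranked = sorted(table_vecs,
                    key=lambda t: cosine(query_vec, table_vecs[t]),
                    reverse=True)
    return ranked[:k]

# Toy 2-dimensional "embeddings" for illustration only:
vecs = {"orders": [0.9, 0.1], "users": [0.2, 0.8], "audit_log": [0.5, 0.5]}
print(top_tables([1.0, 0.0], vecs, k=2))  # ['orders', 'audit_log']
```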
How do I evaluate text-to-SQL quality?
Use execution accuracy: run the generated SQL and the gold-standard SQL independently, then compare result sets. A match means the model produced a semantically correct query even if the SQL differs syntactically. Track execution accuracy against a held-out set of 200 to 500 representative queries from your actual database.
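Execution accuracy as described reduces to a short harness: run both queries, compare result sets order-insensitively, count matches. The SQLite demo below stands in for the real database; `run` would be a thin wrapper over the read-only executor.

```python
import sqlite3

def execution_accuracy(pairs, run):
    # pairs: (generated_sql, gold_sql); run(sql) -> list of result tuples.
    # Result sets are compared sorted, so syntactically different but
    # semantically equivalent SQL still counts as a match.
    hits = sum(
        sorted(map(tuple, run(gen))) == sorted(map(tuple, run(gold)))
        for gen, gold in pairs
    )
    return hits / len(pairs)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,)])
run = lambda sql: conn.execute(sql).fetchall()

pairs = [
    ("SELECT x FROM t ORDER BY x DESC", "SELECT x FROM t"),  # match
    ("SELECT x FROM t WHERE x = 1",     "SELECT x FROM t"),  # mismatch
]
print(execution_accuracy(pairs, run))  # 0.5
```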
Should I fine-tune a model on my database schema?
Fine-tuning makes sense when you have 500+ labeled query/SQL pairs, a stable schema, and volume above 50k queries per day. Below that threshold, the maintenance cost of a fine-tuning pipeline exceeds the savings from using a smaller model. Start with prompting Claude and add fine-tuning only when cost becomes a real constraint.
How do I deal with queries about data the user cannot access?
Filter at the schema level. Maintain a per-user or per-role schema view that shows only the tables and columns the user is permitted to query. Pass this restricted schema to the LLM. Never rely on the LLM to self-censor based on instructions — restrict at the database layer.
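The per-role schema view can be sketched as an intersection of the full schema with a grants registry; the registry format here is an assumption, and the point is that filtering happens before the prompt is ever built, with the database role still enforcing the same boundary underneath.

```python
FULL_SCHEMA = {
    "users":    ["id", "email", "created_at"],
    "salaries": ["user_id", "amount"],
}
ROLE_GRANTS = {  # assumed ACL registry: role -> {table: allowed columns}
    "analyst": {"users": ["id", "created_at"]},
}

def schema_for(role):
    # Restrict before prompting: the LLM only ever sees tables and columns
    # the role may query. This complements, never replaces, the database
    # role's own privileges.
    grants = ROLE_GRANTS.get(role, {})
    return {
        table: [c for c in cols if c in set(grants.get(table, []))]
        for table, cols in FULL_SCHEMA.items()
        if table in grants
    }

print(schema_for("analyst"))  # {'users': ['id', 'created_at']}
print(schema_for("intern"))   # {} -- no grants, empty schema, nothing to leak
```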