Reference Architecture · generation

Invoice Structured Extraction

Last updated: April 16, 2026

Quick answer

The production stack does OCR only when needed (native-text PDFs skip it), sends the text plus the raw page image to a vision-capable model like Claude Sonnet 4 or Gemini 2.5 Pro with a Pydantic or Zod schema for JSON mode, then runs deterministic validators (subtotal + tax = total, line-item sum = subtotal) before accepting the result. Expect $0.01 to $0.04 per invoice depending on page count, with near-zero math errors and 95 to 99 percent field-level accuracy.

The problem

You receive thousands of supplier invoices per month — PDFs, scans, photos, EDI — and need to push clean structured data into your AP system. Layouts vary wildly, line items wrap unpredictably, totals must reconcile to the penny, and a wrong vendor ID costs the finance team an hour to untangle. The system must extract with >98 percent field accuracy and flag anything it cannot verify arithmetically.

Architecture

Invoice Ingest (input) → Document Classifier (LLM). Text-native PDFs go straight to extraction; scanned documents pass through the OCR Fallback (infra) first. From there: Schema Extractor (LLM) → Schema Validator (infra) → Deterministic Math Validator (infra). When the math reconciles, the Vendor Resolver (data) runs; known vendors flow to the ERP / AP System Push (output), while math mismatches and new vendors route to the Human Review Queue (output).

Invoice Ingest

Accepts PDFs, TIFFs, JPGs, and email attachments. Normalizes to PDF and bursts multi-page invoices into per-page images.

Alternatives: Email forwarding (Mailgun, SendGrid Inbound), S3 drop zone, Direct vendor portal upload

Document Classifier

A fast vision model confirms the document is actually an invoice (not a PO, statement, or junk) and detects text-native vs scanned.

Alternatives: GPT-4o-mini vision, Claude Haiku 4, Rule-based keyword filter as first pass

OCR Fallback

Runs on scanned or image-only invoices. Produces text with bounding-box coordinates that can be reconciled with layout features.

Alternatives: AWS Textract (best for US tax forms), Google Document AI, Azure Form Recognizer, Tesseract for on-prem

Schema Extractor

Vision-capable LLM receives the page image, OCR text, and a strict JSON schema. Emits header fields, line items as a list, tax lines, and totals.

Alternatives: Gemini 2.5 Pro (strongest on long multi-page tables), GPT-4o, Mistral Pixtral Large for EU data residency

Schema Validator

Pydantic or Zod schema validates types, required fields, enum values (currency codes, tax categories). Rejects malformed output before it reaches business logic.

Alternatives: Instructor, Outlines (constrained decoding), BAML, TypeChat
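A minimal Pydantic sketch of such a schema (the field names are illustrative, not a fixed contract). The same class also yields the JSON schema to hand to a tool-use or structured-output API:

```python
from decimal import Decimal
from pydantic import BaseModel, Field, ValidationError

class LineItem(BaseModel):
    description: str
    quantity: Decimal
    unit_price: Decimal
    amount: Decimal

class Invoice(BaseModel):
    vendor_name: str
    invoice_number: str
    invoice_date: str                             # ISO 8601; a date type also works
    currency: str = Field(pattern=r"^[A-Z]{3}$")  # ISO 4217 code
    line_items: list[LineItem]
    subtotal: Decimal
    tax: Decimal
    total: Decimal

raw = """{
  "vendor_name": "Acme Corporation LLC",
  "invoice_number": "INV-1042",
  "invoice_date": "2026-03-31",
  "currency": "USD",
  "line_items": [
    {"description": "Widgets", "quantity": "10", "unit_price": "4.00", "amount": "40.00"},
    {"description": "Gadgets", "quantity": "3", "unit_price": "20.00", "amount": "60.00"}
  ],
  "subtotal": "100.00",
  "tax": "8.25",
  "total": "108.25"
}"""

# Reject malformed model output before it reaches business logic.
try:
    invoice = Invoice.model_validate_json(raw)
except ValidationError:
    invoice = None  # retry or route to human review

# JSON schema for the provider's tool-use / structured-output API.
tool_schema = Invoice.model_json_schema()
```

Keeping money fields as Decimal (parsed from JSON strings) avoids float coercion before the math validator ever runs.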

Deterministic Math Validator

Checks line-item subtotal equals sum of line items, tax computes against declared rates, grand total reconciles to the penny. Flags discrepancies above 1 cent for review.

Alternatives: Python Decimal arithmetic, Pure rule-based validator, LLM self-critique (slower, less reliable)

Vendor Resolver

Normalizes the extracted vendor name against your master vendor table via fuzzy match plus embedding similarity. Flags new vendors for AP setup.

Alternatives: Exact match only (brittle), Pure embedding match (over-matches), Deterministic fuzzy + embedding rerank (recommended)
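The production path uses rapidfuzz plus an embedding rerank; as a dependency-free illustration of the fuzzy layer alone, here is a sketch using stdlib difflib, where the legal-suffix list and the 0.85 threshold are assumptions to tune against your vendor master:

```python
from difflib import SequenceMatcher

def fuzzy_score(a: str, b: str) -> float:
    # Normalize: case-fold and drop common legal suffixes before scoring,
    # so "ACME CORP" and "Acme Corporation LLC" both reduce to "acme".
    def norm(s: str) -> str:
        s = s.lower()
        for suffix in (" llc", " inc", " corp", " corporation", " ltd"):
            s = s.removesuffix(suffix)
        return s.strip(" .,")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

def resolve_vendor(name: str, master: list[str], threshold: float = 0.85):
    best = max(master, key=lambda m: fuzzy_score(name, m))
    score = fuzzy_score(name, best)
    # Below threshold: surface as a suggested match, never auto-bind.
    return (best, score) if score >= threshold else (None, score)
```

In production the candidates surviving this step feed the embedding rerank rather than being accepted directly.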

Human Review Queue

Any invoice with math discrepancies, low-confidence extractions, or new vendors is surfaced to an AP clerk with the original PDF and the extracted JSON side-by-side.

Alternatives: Slack approval flow, Internal admin UI, Airtable or Retool front-end

ERP / AP System Push

Writes validated invoices to NetSuite, QuickBooks, SAP, or a custom AP ledger via API. Stores the extraction JSON and the source PDF alongside.

Alternatives: Direct DB insert, CSV batch export, Webhook to downstream system

The stack

Extraction model: Claude Sonnet 4 (vision)

Sonnet 4 is the most reliable at following a strict JSON schema on messy multi-column invoices in 2026. Gemini 2.5 Pro wins on very long multi-page invoices because of its larger context. GPT-4o trails on schema adherence when line items are in a wrapped table.

Alternatives: Gemini 2.5 Pro, GPT-4o

Structured output enforcement: Pydantic schema + Anthropic tool-use with forced JSON

Forcing the model to emit a tool call against a defined schema is more reliable than free-form JSON parsing, even with JSON mode. Pydantic gives runtime validation and auto-retry with errors as input — the pattern that works in production.

Alternatives: Instructor (Python), Zod + OpenAI structured outputs, BAML, Outlines for constrained decoding

OCR: AWS Textract for US invoices, Google Document AI elsewhere

Vision LLMs can read scans directly, but Textract returns bounding boxes which help when you need to highlight the source region during human review. For EU data residency, Google Document AI in europe-west regions is the common pick.

Alternatives: Azure Form Recognizer, Tesseract (on-prem only), Skip OCR and rely entirely on vision LLM

Math validation: Pure Python Decimal with 1-cent tolerance

Money math must be deterministic. Decimal with ROUND_HALF_UP and a 1-cent tolerance handles real-world invoices, including rounding quirks. Never use floats. Never trust the LLM to do arithmetic; have it emit numbers and verify them externally.

Alternatives: Integer cents (safest), LLM self-check
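A minimal sketch of that external check (function and field names are illustrative); amounts arrive as strings emitted by the model and are never converted through float:

```python
from decimal import Decimal, ROUND_HALF_UP

TOLERANCE = Decimal("0.01")  # 1-cent tolerance

def cents(x) -> Decimal:
    # Quantize to cents with ROUND_HALF_UP; input is a string or Decimal.
    return Decimal(x).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

def reconcile(line_amounts, subtotal, tax, total) -> list:
    """Return a list of discrepancies; empty means the invoice reconciles."""
    problems = []
    line_sum = sum((cents(a) for a in line_amounts), Decimal("0"))
    if abs(line_sum - cents(subtotal)) > TOLERANCE:
        problems.append(f"line items sum to {line_sum}, subtotal is {subtotal}")
    if abs(cents(subtotal) + cents(tax) - cents(total)) > TOLERANCE:
        problems.append(f"subtotal + tax != total ({subtotal} + {tax} != {total})")
    return problems
```

Anything this returns goes to the review queue as a reconciliation diff, never silently corrected.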

Vendor resolution: Fuzzy match (rapidfuzz) + embedding rerank (Voyage-3)

Invoice vendor names vary (ACME CORP vs Acme Corporation LLC). Fuzzy match handles typographic variance; embedding rerank handles semantic variance (dba names, subsidiaries). Both steps together get you to >99 percent match accuracy on a clean vendor master.

Alternatives: Trigram match in Postgres, Exact match with aliases table

Evaluation: Held-out labeled set of 500+ invoices + field-level accuracy tracking

The relevant metric is per-field accuracy, not per-document accuracy. Track vendor name, invoice number, invoice date, subtotal, tax, total, and line-item count separately. A 95 percent document accuracy can hide a 60 percent line-item accuracy.

Alternatives: Braintrust, Manual spot check
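The decomposition is simple to compute; a sketch, assuming predictions and labels are parallel lists of flat dicts:

```python
def field_accuracy(predictions, labels, fields) -> dict:
    """Per-field accuracy over a labeled held-out set; decompose, never average."""
    return {
        f: sum(p.get(f) == l.get(f) for p, l in zip(predictions, labels)) / len(labels)
        for f in fields
    }

preds  = [{"total": "108.25", "vendor_name": "Acme"},
          {"total": "50.00",  "vendor_name": "Beta"}]
truth  = [{"total": "108.25", "vendor_name": "Acme Corp"},
          {"total": "50.00",  "vendor_name": "Beta"}]
scores = field_accuracy(preds, truth, ["total", "vendor_name"])
```

Here a perfect totals score coexists with a 0.5 vendor-name score, which is exactly the signal a single document-level number would hide.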

Cost at each scale

Prototype

500 invoices/mo

$35/mo

Extraction (Sonnet 4 vision): $18
OCR (Textract on scanned share): $5
Classifier (Gemini 2.0 Flash): $1
Hosting (Vercel Hobby): $0
Observability + error logging: $11

Startup

25,000 invoices/mo

$980/mo

Extraction (Sonnet 4 vision, cached schemas): $520
OCR (Textract ~40 percent of docs): $200
Classifier + vendor resolver: $80
Storage (S3 for PDFs + extraction JSON): $25
Infra (Vercel Pro + queue): $50
Observability (Braintrust): $105

Scale

500,000 invoices/mo

$14,500/mo

Extraction (mixed Sonnet 4 / Haiku 4 by complexity): $8,500
OCR (Textract at enterprise rates): $2,800
Classifier + vendor resolver (embeddings): $900
Storage + retention: $600
Infra + queue (Inngest + Vercel Enterprise): $800
Observability + evals + human review tool: $900

Latency budget

Total P50: 4,650ms
Total P95: 10,280ms

Document classification: 350ms median · 700ms p95
OCR (when required): 1,800ms median · 4,500ms p95
LLM extraction (single page): 2,400ms median · 4,800ms p95
Schema + math validation: 20ms median · 60ms p95
Vendor resolution: 80ms median · 220ms p95

Tradeoffs

Vision LLM alone vs OCR plus LLM

Modern vision LLMs can read scans directly with 90-95 percent accuracy. Adding a dedicated OCR pass (Textract) costs an extra $0.015 per page but boosts accuracy to 98 percent on poor-quality scans and gives you bounding boxes for audit trails. For any invoice workflow that must survive a finance audit, use both.

Zero-shot extraction vs per-vendor templates

Zero-shot extraction with a strong schema handles the long tail of one-off vendors and is what makes the pipeline scalable. Templates are faster and cheaper for the top 20 vendors (who are often 80 percent of volume). A hybrid where the top-N vendors get a template-first extraction with LLM fallback wins on both cost and accuracy, but adds maintenance burden.
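The hybrid routing itself is a few lines; a sketch where every name is hypothetical and the per-vendor parsers are stand-ins for deterministic template code:

```python
def extract_with_llm(text: str) -> dict:
    # Stand-in for the zero-shot vision-LLM extraction path.
    return {"source": "llm"}

def acme_template(text: str):
    # Stand-in for a deterministic per-vendor parser; None means it failed.
    return {"source": "template"} if "ACME" in text else None

TEMPLATES = {"acme-001": acme_template}  # top-N vendors by volume

def extract(text: str, vendor_id) -> dict:
    # Template-first for known high-volume vendors, LLM fallback otherwise.
    template = TEMPLATES.get(vendor_id)
    if template and (result := template(text)) is not None:
        return result
    return extract_with_llm(text)
```

The fallback means a vendor changing their layout degrades to zero-shot accuracy instead of failing outright; the maintenance burden lives entirely in the template registry.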

Reject on math mismatch vs post-correct

Rejecting every invoice with a 1-cent discrepancy to human review is conservative and correct for finance workloads — you never want the AI silently fixing numbers on a supplier bill. Auto-correcting rounding errors saves human time but erodes trust the first time it corrects something it should not have. Reject and escalate.

Failure modes & guardrails

Line items wrap across pages, extractor emits duplicates or truncates

Mitigation: Pass all pages of a multi-page invoice in a single LLM call rather than per-page. Include an explicit instruction in the schema that line items must be listed once with the total quantity even if the description wraps. Validate line-item count against the line-item count declared in the invoice footer when present.
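Both checks (duplicates and the footer count) can run in the same validator; a sketch with illustrative field names:

```python
from collections import Counter

def line_item_issues(items, declared_count=None) -> list:
    issues = []
    # Same description + amount emitted twice is the classic symptom of
    # per-page extraction on a table that wraps across pages.
    keys = [(i["description"], i["amount"]) for i in items]
    for key, n in Counter(keys).items():
        if n > 1:
            issues.append(f"duplicate line item: {key[0]} x{n}")
    # Cross-check against the count printed in the invoice footer, if present.
    if declared_count is not None and len(items) != declared_count:
        issues.append(f"extracted {len(items)} items, footer declares {declared_count}")
    return issues
```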

Totals do not reconcile due to rounding or fee lines

Mitigation: Use Decimal arithmetic with 1-cent tolerance. If mismatch exceeds tolerance, do not silently accept — route to human review with both the extracted fields and a diff showing the reconciliation gap. Track mismatch rate per vendor to spot systematic extraction bugs.

Vendor name on invoice does not match master vendor table

Mitigation: Layered resolution: exact match on tax ID if present, then fuzzy match on name with rapidfuzz, then embedding rerank. If top match scores below a confidence threshold, create a 'suggested match' in the review queue rather than auto-binding. Track vendor auto-bind rate as a quality metric.

Schema violation on edge-case invoices (missing currency, foreign characters, negative line items)

Mitigation: Implement a retry loop: on Pydantic validation failure, send the original prompt plus the schema error back to the model up to 2 times. If still failing, route to human review with the raw model output. Log every schema-error-retry pair and periodically use them as fine-tuning or evaluation examples.
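A minimal sketch of that retry loop, assuming `call_model` wraps the provider call and returns a raw JSON string:

```python
from pydantic import BaseModel, ValidationError

MAX_RETRIES = 2

def extract_with_retry(call_model, prompt: str, schema):
    """call_model(prompt) -> raw JSON string; schema is a Pydantic model class."""
    feedback = ""
    for _ in range(1 + MAX_RETRIES):
        raw = call_model(prompt + feedback)
        try:
            return schema.model_validate_json(raw)
        except ValidationError as exc:
            # Feed the schema errors back so the model can self-correct.
            feedback = f"\n\nYour previous output failed validation:\n{exc}"
    # Still failing after retries: route to human review with the raw output.
    return None
```

Logging each (raw output, error) pair here is what builds the fine-tuning and evaluation set mentioned above.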

PII and tax ID exposure in observability tools

Mitigation: Redact tax IDs, bank account numbers, and anything in an 'account number' field before sending to your LLM observability vendor. Never log raw extraction JSON for invoices with PII flags. Keep raw invoice storage in your own encrypted S3 bucket with access logging.

Frequently asked questions

Which LLM is best for invoice extraction in 2026?

Claude Sonnet 4 leads on schema adherence for messy invoices and is the default choice. Gemini 2.5 Pro wins on very long multi-page invoices thanks to its context window. GPT-4o is close but slightly worse when line items are in a wrapped table. Always benchmark against a held-out set of 200-500 of your own invoices before choosing.

Do I still need OCR if vision LLMs can read images?

For clean text-native PDFs, no. For scanned images, noisy photos, or invoices you will ever audit, yes — OCR gives bounding boxes which let you highlight the source region in human review. A combined pipeline costs about $0.015 more per page and buys you audit defensibility plus a measurable accuracy lift on poor scans.

How accurate can invoice extraction realistically be?

With a production pipeline (vision LLM + schema validation + math reconciliation + vendor resolution) you can hit 97-99 percent on header fields and 94-98 percent on line items. Getting above 99 percent requires per-vendor templates for your top 10-20 vendors plus human review on everything else.

What is the right way to handle invoice totals that do not reconcile?

Never let the LLM silently fix totals. Use Decimal arithmetic with a 1-cent tolerance. If subtotal plus tax does not equal total within tolerance, route to human review with both the extraction and a reconciliation diff. Silent auto-correction destroys trust the first time it corrects something it should not have.

How do I enforce structured output?

Define a Pydantic (Python) or Zod (TypeScript) schema and use the provider's tool-use or structured-output API. Anthropic's tool-use with a forced tool choice is the most reliable. On validation failure, retry up to twice with the error as input — this catches 90+ percent of edge cases without code changes.

Should I fine-tune for invoice extraction?

Usually no. Strong vision LLMs with a good schema get you far enough that the maintenance cost of a fine-tune pipeline is not worth it until volumes exceed ~100k invoices/month. At that volume, fine-tuning a smaller model (Haiku 4 or Mistral) on 1,000+ labeled examples cuts cost ~40 percent with equivalent accuracy.

How do I evaluate invoice extraction quality?

Field-level accuracy on a labeled held-out set of 500+ invoices, tracked per field (vendor, date, invoice number, subtotal, tax, total, line-item count, per-line-item). A 95 percent document-level accuracy can hide a 60 percent line-item accuracy — always decompose.

Can I run this pipeline on-prem for data residency?

Yes. Use Tesseract or Azure Form Recognizer on-prem for OCR, and Mistral Pixtral Large or a self-hosted Llama 3.3 vision model for extraction. Accuracy is 5-10 points lower than Claude Sonnet 4 in 2026 benchmarks but acceptable for many workflows. EU-region Claude via Bedrock or Vertex is a simpler path for most teams.
