Reference Architecture · multimodal

OCR + Document Understanding Pipeline

Last updated: April 16, 2026

Quick answer

The production stack pairs a traditional OCR engine (Azure Form Recognizer, AWS Textract, or Google Document AI) for layout and raw text with a vision LLM (Gemini 2.5 Pro or Claude Sonnet 4 vision) for understanding, validation, and handwriting. Route 80-90% of volume through the cheaper OCR engine and use the vision LLM for edge cases and cross-validation. Expect $0.015-$0.08 per document at scale, with 93-97% field-level accuracy after tuning prompts on your specific document types.

The problem

You receive thousands of scanned documents per day - invoices, receipts, tax forms, IDs, shipping manifests, patient intake forms - in PDFs, phone photos, and faxes. You need to extract structured fields (invoice number, line items, totals, dates) with 95%+ accuracy across 20+ languages, handle handwriting and poor scans, preserve the layout for audit, and feed results into ERP/AP/EHR systems. Off-the-shelf OCR misses too much; LLM-only is too slow and expensive at scale.

Architecture

Document Intake → Document Type Classifier → OCR Engine (Layout + Text) → Structured Field Extractor → Field Validator

low OCR confidence: OCR Engine → Handwriting + Low-Quality Handler → Structured Field Extractor
validated: Field Validator → Downstream System
invalid or low confidence: Field Validator → Human Review Queue
all paths: original document and every intermediate output → Audit Log + Original Storage

Document Intake

Receives PDFs, images, fax, email attachments. Normalizes to PDF+image format, deskews, denoises.

Alternatives: Email parser, SFTP ingest, Web upload, RPA bots

Document Type Classifier

Classifies each document (invoice vs receipt vs W-2 vs ID vs freeform). Routes to the right extractor.

Alternatives: Claude Haiku 4, GPT-4o-mini, Fine-tuned DistilBERT

OCR Engine (Layout + Text)

Extracts text with bounding boxes, tables, and page structure. Handles printed multi-language text and common handwriting.

Alternatives: AWS Textract, Google Document AI, Tesseract 5, PaddleOCR, Mistral OCR

Handwriting + Low-Quality Handler

Vision LLM fallback for handwritten fields, faded scans, and poor-quality photos. Runs when OCR confidence is low.

Alternatives: Claude Sonnet 4 vision, GPT-4o vision
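The confidence-based routing into this fallback is simple enough to sketch. This is an illustrative Python sketch, not any particular SDK: the field names and thresholds are assumptions you would tune per document type.

```python
# Sketch of the OCR-confidence routing rule: fields the OCR engine is
# unsure about are re-read by the vision LLM; everything else passes
# through. Thresholds and field names are illustrative.
LOW_CONFIDENCE = 0.90        # stricter cutoff for critical fields
CRITICAL_FIELDS = {"total", "account_number", "invoice_date"}

def route_fields(ocr_fields: dict) -> tuple[dict, list[str]]:
    """Split OCR output into accepted fields and fields needing a
    vision-LLM second read. ocr_fields maps name -> (value, confidence)."""
    accepted, needs_vision = {}, []
    for name, (value, conf) in ocr_fields.items():
        # critical fields get a stricter bar than the rest
        cutoff = LOW_CONFIDENCE if name in CRITICAL_FIELDS else 0.80
        if conf >= cutoff:
            accepted[name] = value
        else:
            needs_vision.append(name)
    return accepted, needs_vision

accepted, fallback = route_fields({
    "total": ("1,500.00", 0.97),
    "memo": ("hand-written note", 0.55),
})
# total is accepted; memo is routed to the vision LLM
```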

Structured Field Extractor

Maps OCR output to the document-type schema (invoice: vendor, number, line items, total; receipt: merchant, date, items, tax).

Alternatives: GPT-4o, Gemini 2.5 Pro, Azure prebuilt invoice model

Field Validator

Validates extracted fields: total equals sum of line items, dates are plausible, currencies match. Rejects or flags inconsistencies.

Alternatives: Custom Python validator, Great Expectations, Pydantic schemas

Human Review Queue

Queue for documents below confidence threshold, validation failures, or high-value transactions (>$10k). Reviewer sees side-by-side OCR and structured output.

Alternatives: Rossum, HyperScience, Custom React + tldraw

Downstream System

Posts validated structured data to ERP (NetSuite, SAP), AP system (Coupa, Tipalti), or EHR (Epic, Cerner).

Alternatives: NetSuite API, SAP S/4, Coupa, Workday, Epic FHIR

Audit Log + Original Storage

Stores original document, OCR output, extracted fields, model versions, reviewer action. Required for SOX, HIPAA, tax compliance.

Alternatives: S3 immutable bucket, Azure Blob WORM, Snowflake

The stack

Document classifier: Gemini 2.0 Flash vision

Gemini 2.0 Flash is the cheapest vision classifier at $0.075/$0.30 per MTok. Good enough to distinguish invoice/receipt/W-2/ID with 95%+ accuracy. Fine-tune a DistilBERT only if you have 10k+ classifier examples per type.

Alternatives: Claude Haiku 4 vision, GPT-4o-mini vision, Fine-tuned DistilBERT on OCR output

OCR engine: Azure Form Recognizer

Azure Form Recognizer has the best prebuilt invoice/receipt/ID models and strong multilingual support. Textract is better for pure table extraction. Document AI wins on non-Latin scripts. Tesseract is free but misses tables and struggles with 20%+ of real-world documents. Mistral OCR is promising and cheap.

Alternatives: AWS Textract, Google Document AI, Mistral OCR, Tesseract 5, PaddleOCR

Handwriting + low-quality vision LLM: Gemini 2.5 Pro vision

Gemini 2.5 Pro has the best handwriting recognition in 2026 across English and non-Latin scripts. Claude Sonnet 4 is close and better at structured JSON output. GPT-4o is competitive but weaker on non-Latin handwriting.

Alternatives: Claude Sonnet 4 vision, GPT-4o vision

Field extractor: Claude Sonnet 4 + JSON schema

Sonnet 4 produces reliably typed JSON against your schema. Azure prebuilt models (invoice, receipt, W-2) are faster and cheaper when they cover your document type out-of-the-box. Use Sonnet 4 as the fallback and for custom document types.

Alternatives: GPT-4o with Structured Outputs, Gemini 2.5 Pro, Azure prebuilt models
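The schema-prompt approach can be sketched as follows. The schema fields and prompt wording are illustrative assumptions; in production you would pass the schema through the provider's structured-output or tool-use mechanism rather than raw prompt text.

```python
import json

# Illustrative canonical invoice schema handed to the extractor model.
# Field names are examples, not a standard.
INVOICE_SCHEMA = {
    "type": "object",
    "required": ["vendor", "invoice_number", "line_items", "total"],
    "properties": {
        "vendor": {"type": "string"},
        "invoice_number": {"type": "string"},
        "invoice_date": {"type": "string", "format": "date"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
            },
        },
        "total": {"type": "number"},
    },
}

def extraction_prompt(ocr_text: str) -> str:
    """Build the prompt that pins the model to the schema; the OCR
    text (layout preserved) is the only document context."""
    return (
        "Extract the fields below from this OCR output. "
        "Return ONLY JSON matching this schema:\n"
        + json.dumps(INVOICE_SCHEMA)
        + "\n\nOCR OUTPUT:\n" + ocr_text
    )
```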

Validator: Pydantic + custom rules

Pydantic catches type errors (dates, currencies, amounts) out of the box. Add custom rules for cross-field checks (sum of line items equals subtotal, tax rate is reasonable for the country, invoice date is not in the future). Validation catches 30-50% of bad extractions before they hit downstream systems.

Alternatives: Great Expectations, OPA rules, JSON Schema
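The cross-field rules look roughly like this. A minimal sketch using stdlib dataclasses instead of Pydantic to stay dependency-free; the one-cent tolerance and the field names are assumptions.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class Invoice:
    subtotal: float
    total: float
    tax: float
    invoice_date: date
    line_items: list[float] = field(default_factory=list)

def validate(inv: Invoice) -> list[str]:
    """Cross-field checks: line items must sum to the subtotal, the
    totals must reconcile, and the invoice date must not be in the
    future. A one-cent tolerance absorbs rounding."""
    errors = []
    if abs(sum(inv.line_items) - inv.subtotal) > 0.01:
        errors.append("line items do not sum to subtotal")
    if abs(inv.subtotal + inv.tax - inv.total) > 0.01:
        errors.append("subtotal + tax does not equal total")
    if inv.invoice_date > date.today():
        errors.append("invoice date is in the future")
    return errors

bad = Invoice(subtotal=100.0, total=110.0, tax=8.0,
              invoice_date=date(2024, 1, 5),
              line_items=[40.0, 50.0])
# validate(bad) flags the subtotal mismatch and the total mismatch
```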

Human review UI: Custom React + tldraw overlay

Reviewers need the image side-by-side with structured fields, with click-to-highlight on the source bounding box. Rossum and HyperScience are excellent turnkey solutions for invoices but bring vendor lock-in and are expensive. Build custom if you have 5+ document types or non-standard workflows.

Alternatives: Rossum, HyperScience, Unstructured.io UI

Cost at each scale

Prototype

10,000 documents/mo

$280/mo

Gemini 2.0 Flash classifier: $15
Azure Form Recognizer (prebuilt): $100
Gemini 2.5 Pro handwriting (10%): $65
Claude Sonnet 4 extraction: $85
Storage + hosting: $15

Startup

500,000 documents/mo

$12,500/mo

Gemini 2.0 Flash classifier: $650
Azure Form Recognizer (bulk): $4,500
Gemini 2.5 Pro handwriting (8%): $2,400
Claude Sonnet 4 extraction + validation: $3,200
S3 WORM storage: $550
Human review + tooling: $900
Observability: $300

Scale

20,000,000 documents/mo

$380,000/mo

Self-hosted classifier cluster: $16,000
Azure Form Recognizer enterprise: $120,000
Gemini 2.5 Pro handwriting tail (5%): $72,000
Claude Sonnet 4 extraction: $95,000
Validation + orchestration: $12,000
Human review BPO: $38,000
Storage + audit retention: $16,000
SRE + compliance: $11,000

Latency budget

Document classification (Gemini Flash): 450ms median · 900ms p95
OCR engine (Azure Form Recognizer): 2,200ms median · 4,500ms p95
Handwriting vision LLM (tail only): 2,800ms median · 5,500ms p95
Structured extraction (Sonnet 4): 1,800ms median · 3,800ms p95
Validation: 60ms median · 180ms p95
End-to-end per document: 5,200ms median · 11,000ms p95

End-to-end latency is lower than the sum of the stages because the handwriting pass runs only on the low-confidence tail of documents.

Tradeoffs

Traditional OCR vs vision LLM end-to-end

Running every document through Gemini 2.5 Pro vision simplifies the pipeline and handles handwriting natively, but costs $0.10-$0.30 per document vs $0.02-$0.05 for Azure Form Recognizer. At low volume (<100k docs/month) or edge cases, vision LLM is fine. At scale, a hybrid (OCR engine for 80%, vision LLM for 20%) is 3-5x cheaper with equal accuracy.
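The arithmetic behind that comparison, using the midpoints of the quoted per-document prices (which lands at the low end of the 3-5x range; cheaper OCR tiers push it higher):

```python
# Hybrid-vs-LLM-only cost, using midpoints of the prices quoted above.
ocr_cost, llm_cost = 0.035, 0.20   # midpoints of $0.02-0.05 and $0.10-0.30
llm_share = 0.20                   # fraction routed to the vision LLM

hybrid = (1 - llm_share) * ocr_cost + llm_share * llm_cost
# hybrid = 0.8 * 0.035 + 0.2 * 0.20 = $0.068 per document
ratio = llm_cost / hybrid          # roughly 2.9x cheaper than LLM-only
```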

Prebuilt model vs custom extractor

Azure/AWS/Google all ship prebuilt invoice, receipt, and ID models that are 90-95% accurate out of the box. They are cheap and fast but cannot be customized. For custom document types (industry-specific forms, proprietary workflows) or last-mile accuracy gains, a vision LLM with a schema prompt is the right choice.

Confidence threshold - auto-accept vs always-review

Auto-accepting extractions above 95% confidence cuts human review costs 3-5x, but 1 in 300 errors slip through. For AP (accounts payable) that is a $5k wire transfer to the wrong vendor. For medical records, it is a compliance incident. Calibrate the auto-accept threshold per document type based on the downstream cost of an error.
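Calibrating per document type reduces to comparing expected costs. A sketch with illustrative numbers; note that the error rate at a given confidence level should come from measurements on your golden set, not from the model's self-reported confidence.

```python
def auto_accept(error_rate_at_conf: float, cost_of_error: float,
                cost_of_review: float) -> bool:
    """Auto-accept only when the expected cost of a slipped error is
    below the cost of a human review. error_rate_at_conf is the
    measured error rate of extractions at this confidence level."""
    return error_rate_at_conf * cost_of_error < cost_of_review

# AP invoice: 1-in-300 error rate at the threshold, $5,000 wire at
# risk, $2 per human review -> expected error cost ~$16.67 >> $2.
auto_accept(1 / 300, 5_000, 2.0)   # False: send to review

# Low-stakes receipt: same error rate, $15 at risk -> auto-accept.
auto_accept(1 / 300, 15, 2.0)      # True
```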

Failure modes & guardrails

Handwriting misread as similar-looking characters (0 vs O, 1 vs l vs I)

Mitigation: Use character-level confidence scores from the OCR engine. For critical fields (account numbers, amounts, dates), require confidence above 0.9 or route to human review. Run a second pass with Gemini 2.5 Pro vision and compare; disagreements trigger review.
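The second-pass comparison can be sketched as a normalize-then-compare step; the confusable map and the return labels are illustrative.

```python
# Cross-check the OCR read against the vision-LLM read of a critical
# field. Folding the classic confusables (0/O, 1/l/I) distinguishes
# glyph-level ambiguity from a genuine disagreement.
CONFUSABLE = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})

def compare(ocr_value: str, vision_value: str) -> str:
    """'accept' when both engines agree exactly; 'review-confusable'
    when they differ only by a confusable glyph (reviewer hint);
    'review' for any other disagreement."""
    if ocr_value == vision_value:
        return "accept"
    if ocr_value.translate(CONFUSABLE) == vision_value.translate(CONFUSABLE):
        return "review-confusable"
    return "review"

compare("ACCT-1O234", "ACCT-10234")   # "review-confusable": O vs 0
compare("ACCT-10234", "ACCT-70234")   # "review": genuine disagreement
```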

Rotated, skewed, or upside-down photos fail OCR completely

Mitigation: Run a deskew and rotation-detection pass (OpenCV or a small ML model) before OCR. Reject documents below a minimum DPI (150+) with a user-facing error. Mobile uploads should preview the cropped/deskewed version so users can re-scan before submitting.

Amounts extracted with wrong thousand/decimal separator (European 1.500,00 vs US 1,500.00)

Mitigation: Detect the document's locale first (country from address, currency symbol, tax ID format). Apply locale-appropriate number parsing. Validate: line items sum to the stated subtotal - if they don't, the separator was probably misparsed.
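A minimal separator heuristic, assuming the decimal part is always two digits; a real pipeline should prefer the locale signals above and fall back to something like this only when locale detection fails.

```python
def parse_amount(raw: str) -> float:
    """Parse an amount whose thousand/decimal separators are unknown
    (European '1.500,00' vs US '1,500.00'). Heuristic: the LAST
    separator is the decimal point if exactly two digits follow it;
    every other separator is a grouping character."""
    cleaned = raw.strip().lstrip("$€£")
    sep = max(cleaned.rfind("."), cleaned.rfind(","))
    if sep != -1 and len(cleaned) - sep - 1 == 2:
        int_part = cleaned[:sep].replace(".", "").replace(",", "")
        return float(int_part + "." + cleaned[sep + 1:])
    # no two-digit decimal tail: treat all separators as grouping
    return float(cleaned.replace(".", "").replace(",", ""))

parse_amount("1.500,00")   # 1500.0 (European)
parse_amount("1,500.00")   # 1500.0 (US)
parse_amount("$1.500")     # ambiguous; read here as 1500.0 (grouping)
```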

PII/PHI leaked into LLM logs and third-party provider

Mitigation: For HIPAA/PII-sensitive documents: use zero-data-retention endpoints (Anthropic ZDR, Vertex AI), sign BAAs, and redact obvious PII (SSNs, DOBs, account numbers) from prompt context when possible. For highest sensitivity (health records, legal docs), self-host Llama 3 or Mistral Large.

Compliance failure - regulator asks for original and extracted side-by-side

Mitigation: Store the original image/PDF, the OCR output with bounding boxes, the extracted fields, model versions, prompt versions, validator results, and any human edits. Keep for the longer of 7 years (SOX), 10 years (HIPAA), or the jurisdiction's tax retention rule. S3 with object lock or Azure Blob WORM is the default.

Frequently asked questions

Can I just use Gemini 2.5 Pro for everything?

For low volume (<10k docs/month) or complex mixed document types - yes, it's simple and capable. At high volume, it's 5-10x more expensive than Azure Form Recognizer and slower. Most production teams use a hybrid: cheap OCR for the 80% common case, Gemini 2.5 Pro for handwriting, low-quality scans, and custom document types.

Azure vs AWS vs Google for OCR?

Azure Form Recognizer: best invoice/receipt/ID prebuilt models. AWS Textract: best pure table extraction. Google Document AI: best for specialized forms (W-9, 1099), non-Latin scripts, and the widest language support. Run side-by-side evals on 500 of your actual documents - accuracy varies wildly by document type.

How do I handle handwriting?

Azure Form Recognizer and Textract both handle printed-looking handwriting. For cursive or messy handwriting (doctor's notes, forms filled in pen), route to Gemini 2.5 Pro vision - it's the best in 2026. Expect 85-92% character-level accuracy on handwriting vs 97-99% on printed text.

How much does a document cost to process?

At scale (20M docs/month), budget $0.015-$0.025 per document all-in. Prebuilt OCR is the majority of cost ($0.005-$0.015), vision LLM tail adds $0.003-$0.015, storage and review add $0.003-$0.008. At low volume (under 100k/month), budget $0.03-$0.10 per doc - prebuilt models don't get bulk pricing until 1M+.

What document types are easiest vs hardest?

Easiest: US invoices in English, receipts, W-2s, standard IDs - 96%+ accuracy with prebuilt models. Medium: multilingual invoices, European VAT forms, insurance claims - 90-94%. Hardest: handwritten medical notes, faxed legal documents, multi-page shipping manifests with tables, rotated phone photos - 75-88%. Budget more human review for the hardest types.

How do I eval OCR accuracy?

Maintain a golden set of 500-2000 human-labeled documents across your types. Re-run the pipeline after any model or prompt change. Measure field-level accuracy (what % of extracted fields match ground truth) and document-level accuracy (what % of documents are fully correct). Document-level accuracy is always much lower - use it as the real metric.
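Both metrics fall out of one pass over the golden set. A sketch with toy data:

```python
def accuracy(golden: list[dict], predicted: list[dict]) -> tuple[float, float]:
    """Field-level accuracy (% of fields matching ground truth) and
    document-level accuracy (% of documents with EVERY field correct).
    golden and predicted are parallel lists of field dicts."""
    field_hits = field_total = doc_hits = 0
    for truth, pred in zip(golden, predicted):
        matches = [pred.get(k) == v for k, v in truth.items()]
        field_hits += sum(matches)
        field_total += len(matches)
        doc_hits += all(matches)
    return field_hits / field_total, doc_hits / len(golden)

golden = [{"vendor": "ACME", "total": 100.0},
          {"vendor": "Globex", "total": 250.0}]
pred   = [{"vendor": "ACME", "total": 100.0},
          {"vendor": "Globex", "total": 25.0}]   # one bad field
accuracy(golden, pred)   # (0.75, 0.5): field-level beats document-level
```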

How does this work with ERPs like SAP or NetSuite?

Extract fields into an intermediate JSON (your canonical invoice schema), validate, then map to the ERP API. Most ERPs have webhooks or an inbox flow - NetSuite's SuiteTalk API, SAP's OData - that accepts structured invoice JSON. Keep the ERP mapping logic separate from extraction so you can retarget between ERPs.
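Keeping the mapping separate might look like this; the target payload keys below are hypothetical, not the actual SuiteTalk or OData field names.

```python
# Sketch of the extraction/ERP separation: the canonical invoice schema
# is ours; each ERP gets its own mapper. Payload keys are illustrative.
def to_erp_payload(canonical: dict, erp: str) -> dict:
    mappers = {
        "netsuite": lambda c: {            # hypothetical field names
            "tranId": c["invoice_number"],
            "entity": c["vendor"],
            "total": c["total"],
        },
        "sap": lambda c: {                 # hypothetical field names
            "DocumentNumber": c["invoice_number"],
            "SupplierName": c["vendor"],
            "GrossAmount": c["total"],
        },
    }
    return mappers[erp](canonical)

inv = {"invoice_number": "INV-42", "vendor": "ACME", "total": 1500.0}
to_erp_payload(inv, "netsuite")["tranId"]   # "INV-42"
```

Retargeting to a new ERP then means writing one more mapper, with no change to extraction or validation.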
