Reference Architecture · multimodal
OCR + Document Understanding Pipeline
Last updated: April 16, 2026
Quick answer
The production stack pairs a traditional OCR engine (Azure Form Recognizer, AWS Textract, or Google Document AI) for layout and raw text with a vision LLM (Gemini 2.5 Pro or Claude Sonnet 4 vision) for understanding, validation, and handwriting. Route 80-90% of volume through the cheaper OCR engine, and use the vision LLM for edge cases and cross-validation. Expect $0.015-$0.08 per document at scale with 93-97% field-level accuracy after tuning prompts on your specific document types.
The problem
You receive thousands of scanned documents per day - invoices, receipts, tax forms, IDs, shipping manifests, patient intake forms - in PDFs, phone photos, and faxes. You need to extract structured fields (invoice number, line items, totals, dates) with 95%+ accuracy across 20+ languages, handle handwriting and poor scans, preserve the layout for audit, and feed results into ERP/AP/EHR systems. Off-the-shelf OCR misses too much; LLM-only is too slow and expensive at scale.
Architecture
Document Intake
Receives PDFs, images, fax, email attachments. Normalizes to PDF+image format, deskews, denoises.
Alternatives: Email parser, SFTP ingest, Web upload, RPA bots
Document Type Classifier
Classifies each document (invoice vs receipt vs W-2 vs ID vs freeform). Routes to the right extractor.
Alternatives: Claude Haiku 4, GPT-4o-mini, Fine-tuned DistilBERT
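The classifier step can be sketched as a prompt plus a reply normalizer. The prompt wording, the type list, and the model call are assumptions for illustration, not any provider's API; the normalizer is deliberately loose (substring match) so a chatty model reply still maps to a label.

```python
# Document-type classification sketch. The type list and prompt are
# illustrative; the actual vision-model call is out of scope here.
DOC_TYPES = ["invoice", "receipt", "w2", "id", "shipping_manifest", "freeform"]

CLASSIFY_PROMPT = (
    "Classify this document image as exactly one of: "
    + ", ".join(DOC_TYPES)
    + ". Reply with the label only."
)

def normalize_label(raw_reply: str) -> str:
    """Map a free-text model reply onto a known type; default to freeform.

    Substring matching is intentionally forgiving of replies like
    'This looks like a W2 form', at the cost of occasional false hits.
    """
    reply = raw_reply.strip().lower()
    for doc_type in DOC_TYPES:
        if doc_type in reply:
            return doc_type
    return "freeform"
```

Routing on the normalized label keeps the rest of the pipeline insulated from whichever model (Gemini Flash, Haiku, DistilBERT) produced the raw reply.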
OCR Engine (Layout + Text)
Extracts text with bounding boxes, tables, and page structure. Handles printed multi-language text and common handwriting.
Alternatives: AWS Textract, Google Document AI, Tesseract 5, PaddleOCR, Mistral OCR
Handwriting + Low-Quality Handler
Vision LLM fallback for handwritten fields, faded scans, and poor-quality photos. Runs when OCR confidence is low.
Alternatives: Claude Sonnet 4 vision, GPT-4o vision
Structured Field Extractor
Maps OCR output to the document-type schema (invoice: vendor, number, line items, total; receipt: merchant, date, items, tax).
Alternatives: GPT-4o, Gemini 2.5 Pro, Azure prebuilt invoice model
Field Validator
Validates extracted fields: total equals sum of line items, dates are plausible, currencies match. Rejects or flags inconsistencies.
Alternatives: Custom Python validator, Great Expectations, Pydantic schemas
Human Review Queue
Queue for documents below confidence threshold, validation failures, or high-value transactions (>$10k). Reviewer sees side-by-side OCR and structured output.
Alternatives: Rossum, HyperScience, Custom React + tldraw
Downstream System
Posts validated structured data to ERP (NetSuite, SAP), AP system (Coupa, Tipalti), or EHR (Epic, Cerner).
Alternatives: NetSuite API, SAP S/4, Coupa, Workday, Epic FHIR
Audit Log + Original Storage
Stores original document, OCR output, extracted fields, model versions, reviewer action. Required for SOX, HIPAA, tax compliance.
Alternatives: S3 immutable bucket, Azure Blob WORM, Snowflake
The stack
Gemini 2.0 Flash is the cheapest vision classifier at $0.075/$0.30 per MTok. Good enough to distinguish invoice/receipt/W-2/ID with 95%+ accuracy. Fine-tune a DistilBERT only if you have 10k+ classifier examples per type.
Alternatives: Claude Haiku 4 vision, GPT-4o-mini vision, Fine-tuned DistilBERT on OCR output
Azure Form Recognizer has the best prebuilt invoice/receipt/ID models and strong multilingual support. Textract is better for pure table extraction. Document AI wins on non-Latin scripts. Tesseract is free but misses tables and struggles with 20%+ of real-world documents. Mistral OCR is promising and cheap.
Alternatives: AWS Textract, Google Document AI, Mistral OCR, Tesseract 5, PaddleOCR
Gemini 2.5 Pro has the best handwriting recognition in 2026 across English and non-Latin scripts. Claude Sonnet 4 is close and better at structured JSON output. GPT-4o is competitive but weaker on non-Latin handwriting.
Alternatives: Claude Sonnet 4 vision, GPT-4o vision
Sonnet 4 produces reliably typed JSON against your schema. Azure prebuilt models (invoice, receipt, W-2) are faster and cheaper when they cover your document type out of the box. Use Sonnet 4 as the fallback and for custom document types.
Alternatives: GPT-4o with Structured Outputs, Gemini 2.5 Pro, Azure prebuilt models
Pydantic catches type errors (dates, currencies, amounts) out of the box. Add custom rules for cross-field checks (sum of line items equals subtotal, tax rate is reasonable for the country, invoice date is not in the future). Validation catches 30-50% of bad extractions before they hit downstream systems.
Alternatives: Great Expectations, OPA rules, JSON Schema
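The cross-field checks described above fit in a small stdlib-only validator (Pydantic would add the type coercion on top). The field names (line_items, subtotal, tax, total, invoice_date) are an assumed canonical schema, not any vendor's output format.

```python
from datetime import date
from decimal import Decimal

def validate_invoice(inv: dict) -> list[str]:
    """Return a list of cross-field problems; empty list means valid.

    Decimal avoids float rounding on money; dates are assumed ISO-8601.
    """
    problems = []
    line_sum = sum(Decimal(str(i["amount"])) for i in inv["line_items"])
    subtotal = Decimal(str(inv["subtotal"]))
    if line_sum != subtotal:
        problems.append(f"line items sum to {line_sum}, subtotal says {subtotal}")
    if subtotal + Decimal(str(inv["tax"])) != Decimal(str(inv["total"])):
        problems.append("subtotal + tax != total")
    if date.fromisoformat(inv["invoice_date"]) > date.today():
        problems.append("invoice date is in the future")
    return problems
```

Returning a list of problems (rather than raising on the first) matters for the review queue: the reviewer should see every inconsistency at once.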
Reviewers need the image side-by-side with structured fields, with click-to-highlight on the source bounding box. Rossum and HyperScience are excellent turnkey solutions for invoices, but they are expensive and lock you in. Build custom if you have 5+ document types or non-standard workflows.
Alternatives: Rossum, HyperScience, Unstructured.io UI
Cost at each scale
Prototype: 10,000 documents/mo, $280/mo
Startup: 500,000 documents/mo, $12,500/mo
Scale: 20,000,000 documents/mo, $380,000/mo
Tradeoffs
Traditional OCR vs vision LLM end-to-end
Running every document through Gemini 2.5 Pro vision simplifies the pipeline and handles handwriting natively, but costs $0.10-$0.30 per document vs $0.02-$0.05 for Azure Form Recognizer. At low volume (<100k docs/month) or for edge cases, a vision LLM end-to-end is fine. At scale, a hybrid (OCR engine for 80%, vision LLM for 20%) is 3-5x cheaper with equal accuracy.
Prebuilt model vs custom extractor
Azure/AWS/Google all ship prebuilt invoice, receipt, and ID models that are 90-95% accurate out of the box. They are cheap and fast but cannot be customized. For custom document types (industry-specific forms, proprietary workflows) or last-mile accuracy gains, a vision LLM with a schema prompt is the right choice.
Confidence threshold - auto-accept vs always-review
Auto-accepting extractions above 95% confidence cuts human review costs 3-5x, but 1 in 300 errors slip through. For AP (accounts payable) that is a $5k wire transfer to the wrong vendor. For medical records, it is a compliance incident. Calibrate the auto-accept threshold per document type based on the downstream cost of an error.
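"Calibrate per document type based on the downstream cost of an error" is a one-line expected-value calculation. All the inputs below (review cost, error cost) are illustrative assumptions, not benchmarks; the 1-in-300 slip rate is the figure from the paragraph above.

```python
def expected_cost_per_doc(auto_accept_rate, slip_rate, error_cost, review_cost):
    """Expected cost per document under an auto-accept policy.

    Auto-accepted docs incur error_cost at slip_rate; the rest go to
    human review at review_cost each.
    """
    auto_risk = auto_accept_rate * slip_rate * error_cost
    manual = (1 - auto_accept_rate) * review_cost
    return auto_risk + manual
```

With a 1-in-300 slip rate and a $2.50 review, auto-accepting 80% of $20 receipts is a clear win, while the same policy on $5k wire-risk invoices costs more in expected errors than reviewing everything, which is exactly why the threshold must be set per document type.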
Failure modes & guardrails
Handwriting misread as similar-looking characters (0 vs O, 1 vs l vs I)
Mitigation: Use character-level confidence scores from the OCR engine. For critical fields (account numbers, amounts, dates), require confidence above 0.9 or route to human review. Run a second pass with Gemini 2.5 Pro vision and compare; disagreements trigger review.
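The dual-pass comparison above reduces to: low OCR confidence on a critical field, or any OCR/vision-LLM disagreement, routes the field to review. A sketch, with a hypothetical normalization step so cosmetic differences (spacing, case) don't trigger false alarms:

```python
def needs_review(ocr_value: str, llm_value: str, ocr_conf: float,
                 conf_floor: float = 0.9) -> bool:
    """Flag a critical field for human review.

    True when OCR confidence is below the floor, or when the OCR engine
    and the second-pass vision LLM disagree after normalization.
    """
    def normalize(s: str) -> str:
        return "".join(s.split()).upper()

    if ocr_conf < conf_floor:
        return True
    return normalize(ocr_value) != normalize(llm_value)
```

Note this deliberately does not try to pick a winner between the two readings: a 0/O or 1/l disagreement means neither source is trustworthy, so a human decides.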
Rotated, skewed, or upside-down photos fail OCR completely
Mitigation: Run a deskew and rotation-detection pass (OpenCV or a small ML model) before OCR. Reject documents below a minimum DPI (150+) with a user-facing error. Mobile uploads should preview the cropped/deskewed version so users can re-scan before submitting.
Amounts extracted with wrong thousand/decimal separator (European 1.500,00 vs US 1,500.00)
Mitigation: Detect the document's locale first (country from address, currency symbol, tax ID format). Apply locale-appropriate number parsing. Validate: line items sum to the stated subtotal - if they don't, the separator was probably misparsed.
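Locale-appropriate number parsing is short but easy to get backwards. A minimal sketch, assuming the locale was already detected upstream; real documents need more conventions than the two handled here (Swiss apostrophes, Indian lakh grouping).

```python
def parse_amount(raw: str, locale: str) -> float:
    """Parse a money string using the document's locale convention.

    'de'-style locales use . for thousands and , for decimals (1.500,00);
    everything else is treated as US-style (1,500.00).
    """
    cleaned = raw.strip().replace(" ", "")
    if locale in {"de", "fr", "es", "it", "nl"}:
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    return float(cleaned)
```

Pair this with the line-item sum check: if the parsed amounts don't add up to the stated subtotal under one convention but do under the other, the locale guess was wrong.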
PII/PHI leaked into LLM logs and third-party provider
Mitigation: For HIPAA/PII-sensitive documents: use zero-data-retention endpoints (Anthropic ZDR, Vertex AI), sign BAAs, and redact obvious PII (SSNs, DOBs, account numbers) from prompt context when possible. For highest sensitivity (health records, legal docs), self-host Llama 3 or Mistral Large.
Compliance failure - regulator asks for original and extracted side-by-side
Mitigation: Store the original image/PDF, the OCR output with bounding boxes, the extracted fields, model versions, prompt versions, validator results, and any human edits. Keep for the longer of 7 years (SOX), 10 years (HIPAA), or the jurisdiction's tax retention rule. S3 with object lock or Azure Blob WORM is the default.
Frequently asked questions
Can I just use Gemini 2.5 Pro for everything?
For low volume (<10k docs/month) or complex mixed document types - yes, it's simple and capable. At high volume, it's 5-10x more expensive than Azure Form Recognizer and slower. Most production teams use a hybrid: cheap OCR for the 80% common case, Gemini 2.5 Pro for handwriting, low-quality scans, and custom document types.
Azure vs AWS vs Google for OCR?
Azure Form Recognizer: best invoice/receipt/ID prebuilt models, strongest non-Latin script. AWS Textract: best pure table extraction. Google Document AI: best for specialized forms (W-9, 1099) and the widest language support. Run side-by-side evals on 500 of your actual documents - accuracy varies wildly by document type.
How do I handle handwriting?
Azure Form Recognizer and Textract both handle printed-looking handwriting. For cursive or messy handwriting (doctor's notes, forms filled in pen), route to Gemini 2.5 Pro vision - it's the best in 2026. Expect 85-92% character-level accuracy on handwriting vs 97-99% on printed text.
How much does a document cost to process?
At scale (20M docs/month), budget $0.015-$0.025 per document all-in. Prebuilt OCR is the majority of cost ($0.005-$0.015), vision LLM tail adds $0.003-$0.015, storage and review add $0.003-$0.008. At low volume (under 100k/month), budget $0.03-$0.10 per doc - prebuilt models don't get bulk pricing until 1M+.
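As a sanity check, picking values inside the ranges above reproduces the $380k/mo Scale figure. These are the article's own estimates, not quotes from any provider:

```python
# Back-of-envelope per-document cost at the 20M docs/month tier.
docs = 20_000_000
ocr_cost      = 0.009   # prebuilt OCR, within $0.005-$0.015
llm_tail_cost = 0.005   # vision-LLM tail, within $0.003-$0.015
ops_cost      = 0.005   # storage + review, within $0.003-$0.008

per_doc = ocr_cost + llm_tail_cost + ops_cost   # ~$0.019 all-in
monthly = per_doc * docs                         # ~$380,000/mo
```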
What document types are easiest vs hardest?
Easiest: US invoices in English, receipts, W-2s, standard IDs - 96%+ accuracy with prebuilt models. Medium: multilingual invoices, European VAT forms, insurance claims - 90-94%. Hardest: handwritten medical notes, faxed legal documents, multi-page shipping manifests with tables, rotated phone photos - 75-88%. Budget more human review for the hardest types.
How do I eval OCR accuracy?
Maintain a golden set of 500-2000 human-labeled documents across your types. Re-run the pipeline after any model or prompt change. Measure field-level accuracy (what % of extracted fields match ground truth) and document-level accuracy (what % of documents are fully correct). Document-level accuracy is always much lower - use it as the real metric.
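Both metrics fall out of one pass over the golden set. This sketch assumes each document is a flat {field_name: value} mapping and that extracted and golden documents are aligned by index:

```python
def accuracy_metrics(extracted: list[dict], golden: list[dict]):
    """Return (field_level_accuracy, document_level_accuracy).

    A document counts as correct only if every golden field matches,
    which is why document-level accuracy is always the lower number.
    """
    field_hits = field_total = doc_hits = 0
    for ext, gold in zip(extracted, golden):
        all_ok = True
        for field, truth in gold.items():
            field_total += 1
            if ext.get(field) == truth:
                field_hits += 1
            else:
                all_ok = False
        doc_hits += all_ok
    return field_hits / field_total, doc_hits / len(golden)
```

The gap between the two numbers compounds with field count: at 97% per-field accuracy, a 10-field invoice is fully correct only about 74% of the time (0.97^10).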
How does this work with ERPs like SAP or NetSuite?
Extract fields into an intermediate JSON (your canonical invoice schema), validate, then map to the ERP API. Most ERPs have webhooks or an inbox flow - NetSuite's SuiteTalk API, SAP's OData - that accepts structured invoice JSON. Keep the ERP mapping logic separate from extraction so you can retarget between ERPs.
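Keeping the ERP mapping separate means one small mapper per target system. The target field names below are hypothetical stand-ins, not NetSuite's or SAP's real API fields; the canonical schema is the same assumed one used throughout.

```python
def to_erp_payload(canonical: dict) -> dict:
    """Map the canonical invoice schema to one ERP's payload shape.

    Swapping ERPs means writing a new mapper, not touching extraction.
    Target keys here are illustrative placeholders.
    """
    return {
        "vendorRef": canonical["vendor_id"],
        "invoiceNumber": canonical["invoice_number"],
        "invoiceDate": canonical["invoice_date"],
        "currency": canonical["currency"],
        "lines": [
            {"description": li["description"], "amount": li["amount"]}
            for li in canonical["line_items"]
        ],
        "total": canonical["total"],
    }
```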
Related
Architectures
Contract Clause Extraction Pipeline
Resume Screening Pipeline
Image-Based Search (Visual Similarity + Text Query)