
AI for OCR and Structured Data Extraction

Use LLMs to extract structured data from scanned documents, handwritten forms, and invoices with confidence scoring and human review thresholds — going far beyond what traditional OCR engines can achieve.

Updated Apr 16, 2026 · 5 workflows · ~$1–$15 per 1,000 requests

Quick answer

The best stack combines a vision model (GPT-4o or Claude Sonnet 4 with vision) for layout understanding, a structured output layer (JSON schema enforcement), and a confidence-scored human review queue. Typical cost runs $1-8 per 1,000 pages depending on document complexity and model choice, with field-level accuracy reaching 96-99% on typed documents.
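The "structured output layer" in this stack is just a JSON schema handed to the model's structured-output mode. As a minimal sketch (the field names below are illustrative assumptions, not a fixed standard), an invoice schema and a basic presence check look like this:

```python
# Illustrative invoice schema for structured-output enforcement.
# Field names are assumptions; adapt them to your documents.
INVOICE_SCHEMA = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "invoice_total": {"type": "number"},
        "currency": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "amount": {"type": "number"},
                },
                "required": ["description", "amount"],
            },
        },
    },
    "required": ["vendor", "invoice_total", "line_items"],
}


def missing_required(payload: dict, schema: dict) -> list[str]:
    """Return required top-level fields absent from an extraction payload."""
    return [f for f in schema.get("required", []) if f not in payload]
```

The same schema object can be passed to OpenAI structured outputs or an Anthropic tool definition; the local check is a cheap guard for responses produced without schema enforcement.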

The problem

Traditional OCR engines like Tesseract top out at 85-92% field-level accuracy on real-world documents, meaning 1 in 10 fields requires manual correction. Finance and operations teams processing 5,000+ invoices per month spend 200+ hours on manual data entry corrections alone. Handwritten forms, mixed layouts, and low-resolution scans reduce accuracy further to 60-70%, making traditional pipelines economically unviable at scale.

Core workflows

Invoice Data Extraction

Extract vendor, line items, totals, tax, and payment terms from PDF/image invoices into structured JSON. Cuts AP processing time from 8 minutes to under 30 seconds per invoice.

gpt-4o · reducto

Handwritten Form Processing

Convert handwritten intake forms, surveys, and insurance claims into structured records. Vision LLMs handle cursive and mixed print-and-handwriting layouts that rule-based OCR cannot parse.

claude-sonnet-4 · aws-textract

Confidence-Scored Review Queue

Flag fields below a confidence threshold (e.g., <0.85) for human review. Reduces manual review volume by 70-80% while maintaining SLA accuracy guarantees.

claude-haiku-3-5 · humanloop
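The routing logic behind this workflow is small. As a hedged sketch, assuming the model returns a per-field confidence alongside each value:

```python
def fields_needing_review(
    extraction: dict[str, dict], threshold: float = 0.85
) -> list[str]:
    """Given {field: {"value": ..., "confidence": float}}, return the
    names of fields whose confidence falls below the review threshold."""
    return [name for name, f in extraction.items() if f["confidence"] < threshold]
```

Documents where this list is empty flow straight through; anything else goes to the human queue with only the flagged fields highlighted, which is where the 70-80% review-volume reduction comes from.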

Multi-Document Classification + Extraction

Classify incoming document type (invoice, PO, receipt, contract) then route to the appropriate extraction schema. Handles mixed document batches without manual sorting.

claude-sonnet-4 · unstructured
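Classify-then-route reduces to a lookup from document class to extraction schema. A minimal sketch (the type labels and schema names here are hypothetical):

```python
# Hypothetical mapping from classified document type to extraction schema.
SCHEMA_BY_DOC_TYPE = {
    "invoice": "invoice_v2",
    "purchase_order": "po_v1",
    "receipt": "receipt_v1",
    "contract": "contract_v1",
}


def route(doc_type: str) -> str:
    """Pick the extraction schema for a classified document; fall back to
    a generic key-value schema for unrecognized types."""
    return SCHEMA_BY_DOC_TYPE.get(doc_type, "generic_kv_v1")
```

The fallback matters in mixed batches: an unrecognized type should land in a generic extractor (or a review queue) rather than fail the batch.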

Contract Clause Extraction

Pull key dates, parties, payment terms, and SLA clauses from signed contracts. Reduces legal review time per contract from 4 hours to under 20 minutes.

claude-sonnet-4 · llamaparse

Top tools

  • reducto
  • aws-textract
  • llamaparse
  • unstructured
  • azure-document-intelligence
  • google-document-ai

Top models

  • gpt-4o
  • claude-sonnet-4
  • claude-haiku-3-5
  • gemini-2-0-flash

FAQs

How accurate is LLM-based OCR compared to Tesseract or AWS Textract?

Traditional OCR engines like Tesseract achieve 85-92% field-level accuracy on clean typed documents but degrade significantly on handwritten text, low-resolution scans, and non-standard layouts. AWS Textract and Google Document AI improve this to 93-96% by combining OCR with layout analysis. Vision LLMs (GPT-4o, Claude Sonnet 4 with vision) typically reach 96-99% on typed documents and 88-94% on handwritten forms by understanding context — they can infer that '1,O00' is '$1,000' even if the OCR layer misread the zero. The tradeoff is cost: Textract runs ~$1.50/1,000 pages vs $4-8/1,000 for a vision LLM pipeline.

What confidence scoring approach should I use?

The most practical approach is to ask the model to return a per-field confidence score (0-1) alongside each extracted value, then route documents below a threshold (commonly 0.80-0.90 depending on risk tolerance) to a human review queue. For structured output, use JSON schema enforcement (OpenAI's structured outputs or Anthropic's tool_use) to guarantee field presence. Combine model self-reported confidence with a secondary validation layer: regex checks on dates, checksum validation on amounts, and cross-field consistency checks (e.g., line item totals must sum to invoice total).
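The secondary validation layer described above is deterministic code, not another model call. A minimal sketch, assuming dates are normalized to ISO-8601 upstream and using the illustrative field names from earlier:

```python
import re

# Assumes the extraction step normalizes dates to ISO-8601.
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def validate_invoice(fields: dict) -> list[str]:
    """Deterministic checks layered on top of model self-reported
    confidence. Returns a list of human-readable issues."""
    issues = []
    if not DATE_RE.match(fields.get("invoice_date", "")):
        issues.append("invoice_date: not ISO formatted")
    line_sum = round(sum(li["amount"] for li in fields.get("line_items", [])), 2)
    total = fields.get("invoice_total")
    if total is None or abs(line_sum - total) > 0.01:
        issues.append(f"total mismatch: line items sum to {line_sum}, total is {total}")
    return issues
```

Any non-empty issue list should force the document into the review queue regardless of how confident the model claimed to be.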

Which documents are hardest for AI OCR to handle?

The most challenging categories are: (1) carbon copy forms with faded ink and misaligned layers, (2) handwritten tables where column alignment is inconsistent, (3) multi-column layouts with footnotes that visually interrupt the reading order, (4) documents with stamps, redactions, or watermarks obscuring key fields, and (5) mixed-language documents. For these cases, a preprocessing step (deskew, denoise, upscale with an SR model) before sending to the vision LLM improves accuracy by 8-15 percentage points.

How do I handle multi-page documents like long contracts or medical records?

Split the document into logical chunks before sending to the LLM — don't dump 50 pages into a single context window. Tools like LlamaParse, Reducto, and Unstructured handle PDF segmentation automatically. For contracts, extract a table of contents first, then process each section independently with a targeted schema. For medical records, process by document section (admission notes, labs, discharge summary). Maintain a document-level metadata object that aggregates across page-level extractions and flags cross-page consistency issues.
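The document-level metadata object mentioned above can be as simple as a merge with conflict tracking. A sketch under the assumption that each chunk yields a flat field dict:

```python
def aggregate_pages(pages: list[dict]) -> dict:
    """Merge page-level extractions into one document-level record,
    flagging fields whose values disagree across pages.
    First occurrence of a field wins; disagreements are recorded."""
    merged: dict = {}
    conflicts: list[str] = []
    for page in pages:
        for field, value in page.items():
            if field in merged and merged[field] != value:
                conflicts.append(field)
            else:
                merged[field] = value
    return {"fields": merged, "conflicts": sorted(set(conflicts))}
```

Conflicted fields are exactly the cross-page consistency issues worth surfacing to a reviewer, since they often indicate an amendment page or a misread.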

What is the right threshold for sending documents to human review?

There is no universal answer — it depends on downstream consequences. For payment processing (high financial risk), route anything below 0.92 confidence. For internal data warehousing (lower risk), 0.75-0.80 is often sufficient. A practical approach: start with 0.85 across all fields, measure your human reviewer's correction rate over 1,000 documents, then adjust per field type. Amount fields typically need a higher threshold than address fields. Track false negatives (confident but wrong) separately from false positives (uncertain but correct) to calibrate over time.
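The "measure, then adjust" loop can be automated from review logs. As a hedged sketch: given (confidence, was_correct) pairs for previously auto-accepted fields, scan candidate thresholds and pick the lowest one that keeps the accepted-set error rate within a target:

```python
def suggest_threshold(
    review_log: list[tuple[float, bool]], target_error: float = 0.02
) -> float:
    """From (confidence, was_correct) pairs, return the lowest threshold
    whose auto-accepted set keeps error rate <= target_error."""
    for threshold in [round(t * 0.01, 2) for t in range(50, 100)]:
        accepted = [ok for conf, ok in review_log if conf >= threshold]
        if accepted and (1 - sum(accepted) / len(accepted)) <= target_error:
            return threshold
    return 0.99  # nothing met the target; review almost everything
```

Run this per field type, since (as noted above) amount fields and address fields rarely share the same safe threshold.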

Can I fine-tune a model specifically for my document types?

Yes, and it is worth doing if you process more than ~50,000 documents per month of the same type. Fine-tuned models on domain-specific documents (e.g., utility bills, insurance EOBs, customs declarations) typically outperform prompted frontier models by 3-7 percentage points at 60-80% lower inference cost. Use GPT-4o or Claude to generate a labeled training set from your first 500-1,000 documents with human-verified ground truth, then fine-tune a smaller model (GPT-4o-mini, Haiku) on that dataset. Evaluate on a held-out set monthly and retrain when document layouts change.
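The labeled training set reduces to one JSONL line per human-verified document. A sketch following the common chat-style fine-tuning layout (the exact record shape varies by provider, so treat this as an assumption to check against their docs):

```python
import json


def to_training_example(doc_text: str, verified_fields: dict) -> str:
    """Serialize one human-verified extraction as a chat-format JSONL
    line: document text as the user turn, ground-truth JSON as the
    assistant turn."""
    record = {
        "messages": [
            {"role": "system", "content": "Extract invoice fields as JSON."},
            {"role": "user", "content": doc_text},
            {"role": "assistant", "content": json.dumps(verified_fields)},
        ]
    }
    return json.dumps(record)
```

Writing one such line per verified document yields the 500-1,000 example starter set; the monthly held-out evaluation can reuse the same records with the assistant turn held back as ground truth.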
