Reference Architecture · classification
Contract Clause Extraction Pipeline
Last updated: April 16, 2026
Quick answer
The production stack uses Gemini 2.5 Pro for layout-aware PDF parsing, Claude Sonnet 4 for clause classification against a 100-200 item taxonomy (CUAD, LEDGAR, or custom), exact-span citation with page and paragraph references, and human review for low-confidence extractions. Expect $0.50-$2.50 per contract at scale, with 92-96% clause-level F1 after tuning prompts against 200-500 example contracts.
The problem
Legal, finance, and procurement teams sit on thousands of contracts in PDF, DOCX, and scanned formats. They need to answer questions like 'which contracts auto-renew in the next 90 days?' and 'which agreements contain an unlimited indemnity?' without reading every contract. The system must extract clauses with exact citations, preserve legal precision (no paraphrasing), handle redlines and amendments, and provide a complete audit trail for every extraction.
Architecture
Contract Intake
Receives contracts from DocuSign, CLM (Ironclad, Icertis), shared drives. Normalizes to PDF, captures metadata (counterparty, effective date, CLM ID).
Alternatives: DocuSign API, Ironclad, Icertis, SharePoint, S3 bucket
Layout-Aware Parser
Parses PDF/DOCX preserving page numbers, section hierarchy, tables, and redline marks. Extracts clean text plus layout coordinates for citation.
Alternatives: Claude Sonnet 4 vision, GPT-4o vision, Textract + LLM cleanup, Unstructured.io
Section + Clause Splitter
Splits the contract into numbered sections and individual clauses. Handles non-standard numbering (1.1.a, Schedule A, Exhibit 3).
Alternatives: Custom parser, Spacy + rules, LLM-assisted splitter
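A custom splitter is often a few hundred lines of pattern matching. A minimal sketch of the regex approach, with a deliberately small heading pattern (real contracts need a larger pattern library, and the field names here are illustrative):

```python
import re

# Hypothetical heading patterns: decimal sections (1., 1.1, 1.1.a) plus
# schedules/exhibits/annexes. Extend for your contract population.
HEADING = re.compile(
    r"^(?:(?P<num>\d+(?:\.\d+)*(?:\.[a-z])?)[.)]?\s+"
    r"|(?P<sched>(?:Schedule|Exhibit|Annex)\s+[A-Z0-9]+)\b)",
    re.MULTILINE,
)

def split_clauses(text: str) -> list[dict]:
    """Split contract text into clauses keyed by section number."""
    matches = list(HEADING.finditer(text))
    clauses = []
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        clauses.append({
            "section": m.group("num") or m.group("sched"),
            "start": m.start(),  # char offset, reused later for citations
            "text": text[m.start():end].strip(),
        })
    return clauses
```

Keeping the character offset on every clause is what makes span-level citation cheap downstream.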
Clause Classifier
Classifies each clause against a taxonomy (indemnity, limitation of liability, termination, auto-renewal, assignment, IP ownership, etc.). Multi-label with confidence.
Alternatives: GPT-4o, Gemini 2.5 Pro, Fine-tuned Legal-BERT
Clause Attribute Extractor
For each classified clause, extracts typed attributes: indemnity cap amount, termination notice days, auto-renewal period, governing law.
Alternatives: GPT-4o, Gemini 2.5 Pro
Citation Validator
Verifies every extracted attribute cites an exact span in the source document. Rejects or flags extractions that cannot be grounded.
Alternatives: Custom Python validator, Regex span check
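The core of the validator is an exact substring check against the source text. A minimal sketch (return shape is an assumption):

```python
def validate_citation(source: str, quoted_span: str) -> dict:
    """An attribute is stored only if its quoted span appears verbatim in the source."""
    idx = source.find(quoted_span)
    if idx == -1:
        # Span not found: never store the attribute; flag for human review.
        return {"grounded": False, "action": "route_to_review"}
    return {"grounded": True, "start": idx, "end": idx + len(quoted_span)}
```

In practice you would normalize whitespace on both sides before the `find`, since parsers and models rarely agree on line breaks.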
Legal Review Queue
Queue for paralegals/lawyers to review low-confidence extractions, high-value clauses (indemnity, LoL), and new contract types.
Alternatives: Custom React UI, Airtable, Ironclad review
Structured Clause Database
Queryable store of clauses with attributes, citations, and links to source contracts. Powers CLM dashboards and risk reports.
Alternatives: Postgres + pgvector, Snowflake, BigQuery
Extraction Audit Log
Append-only log of every extraction: contract hash, model versions, prompt version, output, citation, reviewer action. Legal teams need full traceability.
Alternatives: Postgres, Snowflake, S3 immutable bucket
The stack
Gemini 2.5 Pro handles multi-column, table-heavy, and scanned contracts with page-coordinate grounding. Unstructured.io is cheaper for native PDFs but misses scanned contracts. Textract is solid for tables but needs LLM cleanup for legal formatting.
Alternatives: Claude Sonnet 4 vision, GPT-4o vision, Unstructured.io, Textract + cleanup
Sonnet 4 follows the 41-category CUAD taxonomy with citations better than GPT-4o in 2026 benchmarks. A fine-tuned Legal-BERT is faster at scale but requires 1,000+ labeled contracts per category. Start with Sonnet 4 few-shot; graduate to Legal-BERT at 50k+ contract volume.
Alternatives: GPT-4o, Gemini 2.5 Pro, Legal-BERT fine-tuned
Sonnet 4 with structured output mode reliably produces typed JSON (amounts, dates, durations). GPT-4o is nearly as good and has native Structured Outputs. Test both on your specific clause types - the difference is often within eval noise.
Alternatives: GPT-4o, Gemini 2.5 Pro
Relational queries dominate contract Q&A (show me all contracts where indemnity cap > $10M). Postgres handles this plus vector search for similarity. Snowflake if you already have a data warehouse and want legal data there.
Alternatives: Snowflake, BigQuery, Pinecone + Postgres
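The canonical query shape is plain SQL over a clause table. A sketch using SQLite in place of Postgres (same query works on either; table and column names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE clauses (
        contract_id TEXT, clause_type TEXT,
        indemnity_cap_usd INTEGER, page INTEGER, citation TEXT
    )
""")
conn.executemany(
    "INSERT INTO clauses VALUES (?, ?, ?, ?, ?)",
    [
        ("msa-001", "indemnity", 25_000_000, 14, "Section 9.2"),
        ("msa-002", "indemnity", 2_000_000, 11, "Section 8.1"),
        ("nda-003", "term", None, 2, "Section 3"),
    ],
)
# The canonical contract-Q&A query: indemnity caps above $10M.
rows = conn.execute(
    "SELECT contract_id, indemnity_cap_usd FROM clauses "
    "WHERE clause_type = 'indemnity' AND indemnity_cap_usd > 10000000"
).fetchall()
```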
CUAD (41 clause types across 510 contracts) is the standard benchmark. LEDGAR has more categories but less community traction. Start with CUAD, add 20-30 company-specific clauses (e.g. 'MFN pricing', 'data residency') over time.
Alternatives: LEDGAR, Custom, Atticus
Paralegals need to see the PDF with highlighted extraction spans side-by-side with structured output. Generic tools (Airtable, Notion) lose the PDF context and slow review down 3-5x.
Alternatives: Airtable, Ironclad review, Notion
Cost at each scale
Prototype: 500 contracts/mo, ~$220/mo
Startup: 20,000 contracts/mo, ~$8,500/mo
Scale: 500,000 contracts/mo, ~$165,000/mo
Tradeoffs
Whole-contract LLM vs clause-by-clause
Feeding the entire contract to the LLM for extraction is simpler but loses accuracy on long MSAs (30+ pages). Splitting into clauses first and extracting per-clause gives 8-15% higher F1 at the cost of 3-5x more LLM calls. For high-value contract types (MSA, data processing agreement), always split. For short NDAs, whole-contract is fine.
Structured output vs freeform with post-parsing
Anthropic tool use and OpenAI Structured Outputs enforce JSON schema compliance but can slightly degrade clause-attribute recall on edge cases. Freeform output + regex post-parse catches more edge cases but fails 2-5% of calls on invalid JSON. Use structured mode for amounts, dates, durations; freeform for narrative fields.
Citation at span-level vs clause-level
Span-level citations (exact characters) are what lawyers want but add 20-30% parsing cost and sometimes break on tables. Clause-level citations (clause ID + page number) are easier to implement and sufficient for most workflows. Upgrade to span-level for high-stakes reviews (M&A due diligence, regulatory filings).
Failure modes & guardrails
Extraction says 'unlimited indemnity' but the clause actually has a carve-out capping it
Mitigation: Always extract the full clause text alongside the attribute. Require legal review on any attribute marked 'unlimited', 'perpetual', 'exclusive', or 'irrevocable'. These are the clauses that cost companies money when extraction is wrong.
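The routing rule above is simple to encode. A sketch, assuming a confidence score from the classifier (the threshold is illustrative):

```python
# Attribute values that always force legal review, per the mitigation above.
FORCE_REVIEW_VALUES = {"unlimited", "perpetual", "exclusive", "irrevocable"}

def needs_legal_review(attribute_value: str, confidence: float,
                       threshold: float = 0.85) -> bool:
    """Route to review on low confidence or any always-review value."""
    value = attribute_value.strip().lower()
    if any(term in value for term in FORCE_REVIEW_VALUES):
        return True
    return confidence < threshold
```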
Model cannot find a citation span, so hallucinates the clause
Mitigation: Require exact-span citations as a contract of the extraction. If the model cannot locate the claimed fact in the source, reject the extraction and route to review. Never store an attribute without a verified citation.
Amendments and redlines change the effective terms but pipeline processes only the base contract
Mitigation: Detect amendments (explicit reference to a parent agreement, 'this amends', redline marks). Process in chronological order and merge. Flag contracts where amendment-chain reconciliation failed for human review.
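The merge step reduces to applying amendments in effective-date order, with later documents overriding earlier ones. A sketch over extracted attribute dicts (the data shape is an assumption):

```python
from datetime import date

def merge_amendment_chain(base: dict, amendments: list[dict]) -> dict:
    """Apply amendments in effective-date order; later documents win."""
    effective = dict(base["attributes"])
    for amendment in sorted(amendments, key=lambda a: a["effective_date"]):
        effective.update(amendment["attributes"])  # amended terms override
    return effective

base = {"attributes": {"term_months": 12, "indemnity_cap_usd": 1_000_000}}
amendments = [
    {"effective_date": date(2025, 6, 1), "attributes": {"indemnity_cap_usd": 5_000_000}},
    {"effective_date": date(2024, 3, 1), "attributes": {"term_months": 24}},
]
effective = merge_amendment_chain(base, amendments)
```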
Non-English contracts processed with English-tuned prompts yield garbage
Mitigation: Detect contract language first. Route Spanish, French, German, Mandarin, and Japanese contracts through language-specific prompts. Use Gemini 2.5 Pro (strongest multilingual) or fine-tuned local models. Do not silently run English prompts on Japanese contracts.
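Even a crude script check catches the worst failure (English prompts on CJK text) before a proper language-ID model runs. A sketch using Unicode ranges (thresholds are illustrative; production systems should use a real language detector):

```python
def detect_script(text: str) -> str:
    """Crude script check: enough to stop English prompts hitting CJK contracts."""
    cjk = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    kana = sum(1 for ch in text if "\u3040" <= ch <= "\u30ff")
    if kana > 0:
        return "japanese"
    if cjk / max(len(text), 1) > 0.2:
        return "chinese"
    return "latin"
```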
Attorney-client privilege leaked to third-party model provider
Mitigation: Use zero-data-retention endpoints (Anthropic ZDR, OpenAI ZDR, Vertex AI). Sign BAA or DPA. Redact client names, matter numbers, and privileged communications before sending. For highly sensitive contracts (M&A, litigation), use self-hosted models (Llama 3, Mistral Large).
Frequently asked questions
Which LLM is best for contract extraction?
Claude Sonnet 4 leads on clause classification and structured extraction in 2026 benchmarks on CUAD. Gemini 2.5 Pro is strongest for parsing complex layouts and multilingual contracts. GPT-4o is a solid generalist. For high-volume, high-accuracy production work, many teams run Sonnet 4 as primary with GPT-4o as fallback for consistency checks.
How do I handle scanned PDFs?
Gemini 2.5 Pro handles scans directly with reasonable accuracy. For cleaner output, pre-process with Azure Form Recognizer or AWS Textract to extract text + layout coordinates, then feed to Sonnet 4 for classification. Do not rely on Tesseract - it misses tables and complex layouts that are ubiquitous in contracts.
What clause taxonomy should I use?
Start with CUAD (41 categories, widely benchmarked, reasonable coverage for commercial contracts). Add company-specific clauses: MFN pricing, data residency, SOC 2 commitments, AI use restrictions. Aim for 60-100 categories total - more than that dilutes classifier signal.
How accurate is contract extraction in 2026?
For top clauses (parties, governing law, effective date, term): 95-99% F1. For medium clauses (indemnity, LoL, termination): 88-94%. For rare clauses (change of control, MFN, non-compete): 75-85%. Always human-review high-value extractions regardless of confidence - the cost of missing an indemnity cap is far higher than the cost of review.
Should I fine-tune a model on legal text?
Probably not for the primary model. Sonnet 4 and GPT-4o few-shot outperform most fine-tuned Legal-BERT setups. Fine-tune when: (1) you have 5k+ labeled contracts, (2) latency matters more than quality, (3) you need to run on-prem. Otherwise prompt engineering + CUAD few-shots is more cost-effective.
How do I prove my extraction is accurate for audit?
Three things: (1) maintain a golden test set of 200-500 contracts with expert-labeled clauses, (2) re-run the eval on every prompt or model change and store the report, (3) log every extraction with model version, prompt version, input hash, and citation. Lawyers and regulators want a versioned trail, not just an accuracy number.
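The eval itself can be as simple as set comparison over (contract, clause-type) pairs. A sketch of the clause-level metric:

```python
def clause_prf(gold: set, predicted: set) -> dict:
    """Precision / recall / F1 over (contract_id, clause_type) pairs."""
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

Store the returned dict alongside the prompt and model versions in the audit log so every accuracy number is reproducible.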
Can I replace paralegals with this pipeline?
No. Expect to replace 40-70% of paralegal review time on routine contracts (NDAs, standard SOWs, vendor agreements) and free them to work on MSAs, M&A due diligence, and exception handling. Full replacement fails because 5-10% of contracts require judgment that no 2026 model has (ambiguous drafting, industry-specific conventions, precedent cases).