Reference Architecture · rag

Legal Document Search

Last updated: April 16, 2026

Quick answer

Use Voyage-3-large embeddings (top MTEB for legal in 2026), chunk at paragraph boundaries with clause-type metadata, store in Qdrant with hybrid BM25, rerank with Voyage Rerank 2.5, and synthesize with Claude Opus 4 for citation-critical answers and Claude Sonnet 4 for the rest. Enforce a citation validator that rejects any answer whose references don’t match the retrieved chunks verbatim. Expect $0.30 to $1.50 per query — precision is expensive, but cheaper than malpractice.

The problem

Lawyers need to search 1M+ pages of contracts, filings, memos, and case law with a natural-language question — ‘find every MSA with a cap on indirect damages above $2M signed after 2023’ — and get paragraph-level citations, a confidence score, and zero fabricated references. Every answer has to be defensible in a deposition. Precision beats recall: a missed document is recoverable, a fabricated citation is a career event.

Architecture

Document Ingester (input) → Clause-aware Chunker (data) → Legal-tuned Embeddings (llm) → Vector + Metadata Store (data) → Hybrid + Filter Retriever (infra) → Cross-encoder Reranker (llm) → Answer Synthesizer (llm) → Citation Validator (infra) → Review Interface (output) → answer + verified citations

Document Ingester

Pulls contracts from iManage, NetDocuments, SharePoint, and local PDFs. OCRs scanned documents with Azure Document Intelligence or Unstructured Premium. Extracts clause-level structure and signature metadata.

Alternatives: iManage API, NetDocuments API, Kira Systems, Harvey connector

Clause-aware Chunker

Splits at clause boundaries (numbered sections, indemnification, limitation of liability, etc.). Each chunk carries doc_id, page, paragraph_id, clause_type, signature_date, and counterparty.

Alternatives: Semantic chunking, Tree-sitter for structured contracts, Contextual retrieval (Anthropic)
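A minimal sketch of the chunk record and a naive clause splitter, assuming numbered-section headings mark clause boundaries. The `ClauseChunk` fields mirror the metadata listed above; the regex-based splitter and the `unclassified` placeholder are illustrative stand-ins for real layout-aware extraction and a clause-type classifier.

```python
import re
from dataclasses import dataclass

@dataclass
class ClauseChunk:
    """One clause-level chunk plus the metadata the retriever filters on."""
    doc_id: str
    page: int
    paragraph_id: str
    clause_type: str      # e.g. "limitation_of_liability"; classified at ingest
    signature_date: str   # ISO date string; empty until extracted
    counterparty: str
    text: str

# Lines beginning with a section number like "12." or "12.3)" start a new clause.
SECTION_HEADER = re.compile(r"^\s*\d+(\.\d+)*[.)]\s")

def split_on_numbered_sections(doc_id: str, page: int, raw: str) -> list[ClauseChunk]:
    """Naive splitter: start a new chunk at each numbered section heading.
    Real ingestion adds layout cues and a clause-type classifier on top."""
    chunks: list[ClauseChunk] = []
    current: list[str] = []

    def flush() -> None:
        if current:
            chunks.append(ClauseChunk(doc_id, page, f"p{len(chunks) + 1}",
                                      "unclassified", "", "",
                                      "\n".join(current).strip()))
            current.clear()

    for line in raw.splitlines():
        if SECTION_HEADER.match(line):
            flush()
        current.append(line)
    flush()
    return chunks
```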

Legal-tuned Embeddings

Voyage-3-large is the top 2026 embedding model on legal retrieval benchmarks (LegalBench-RAG, ContractNLI). Outperforms general-purpose models by 12-18% on clause retrieval.

Alternatives: Cohere Embed v3, OpenAI text-embedding-3-large, voyage-law-2

Vector + Metadata Store

Qdrant with native hybrid search. Metadata payload includes clause_type, matter_id, client_id, privilege_flag, and signature_date for fast structured filtering before semantic ranking.

Alternatives: Weaviate, pgvector, Pinecone

Hybrid + Filter Retriever

Applies mandatory metadata filters (matter_id, client, privilege) first. Then runs dense + BM25 in parallel and merges with RRF. Returns top 100 for legal queries vs top 20 for general RAG — precision requires a deeper candidate pool.

Alternatives: ColBERT late-interaction, SPLADE sparse, RRF with weighted merging
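The dense + BM25 merge step can be sketched with plain Reciprocal Rank Fusion. This is a generic implementation over chunk IDs (the mandatory metadata filters are assumed to have run upstream); `k = 60` is the conventional RRF constant, not a tuned value.

```python
def rrf_merge(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(id) = sum over rankings of 1 / (k + rank).
    Ranks are 1-based; items appearing in both lists accumulate both terms."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; this merged list feeds the reranker.
    return sorted(scores, key=scores.get, reverse=True)
```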

Cross-encoder Reranker

Voyage Rerank 2.5 ranks top 100 down to top 10. Cross-encoder scoring is ~3x more accurate than cosine-only ranking on long legal passages.

Alternatives: Cohere Rerank v3, Jina Reranker v2

Answer Synthesizer

Claude Opus 4 for high-stakes citation answers (M&A diligence, litigation). Claude Sonnet 4 for routine contract Q&A. Both are instructed to quote verbatim with paragraph-level citations.

Alternatives: Claude Sonnet 4 for all queries, GPT-4o, Gemini 2.5 Pro (2M context for whole-contract questions)

Citation Validator

Post-generation pass that extracts every citation from the answer and verifies it exists in the retrieved chunks. Any citation that does not match verbatim is flagged and the answer is either regenerated or returned with a warning.

Alternatives: Custom regex validator, Guardrails AI, NeMo Guardrails
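A minimal sketch of the validation pass. It assumes the synthesizer is prompted to cite as `[chunk-id] "verbatim quote"`; that format, and the exact-substring check, are illustrative conventions to adapt to your own prompt.

```python
import re

# Assumed citation convention: [chunk-id] "verbatim quote"
CITATION = re.compile(r'\[(?P<chunk>[\w\-]+)\]\s*"(?P<quote>[^"]+)"')

def validate_citations(answer: str, retrieved: dict[str, str]) -> list[dict]:
    """Check every citation in the answer against the chunks that were
    actually in the LLM's context. `retrieved` maps chunk_id -> chunk text."""
    results = []
    for m in CITATION.finditer(answer):
        chunk_id, quote = m.group("chunk"), m.group("quote")
        source = retrieved.get(chunk_id, "")
        results.append({
            "chunk_id": chunk_id,
            "in_context": chunk_id in retrieved,
            "verbatim": quote in source,   # exact substring, no fuzzy match
        })
    return results

def answer_is_verified(results: list[dict]) -> bool:
    """True only if at least one citation exists and all pass both checks."""
    return bool(results) and all(r["in_context"] and r["verbatim"] for r in results)
```

A failed check feeds the regenerate-or-warn branch described above.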

Review Interface

Answer with side-by-side source viewer. Each citation is clickable and highlights the exact paragraph in the PDF. Confidence score visible. Export to Word for inclusion in a memo or brief.

Alternatives: iManage Insight, Harvey UI, Custom React + PSPDFKit

The stack

OCR + ingestion: Azure Document Intelligence (Layout model)

Legal documents are often scanned PDFs with complex layouts — multi-column text, signature blocks, exhibits. Azure’s Layout model outperforms general OCR by reading table structure and reading order correctly, which matters for clause extraction.

Alternatives: Unstructured Premium, AWS Textract, Google Document AI

Chunking: Clause-aware + contextual retrieval

Legal text lives at the clause level. Splitting mid-clause destroys meaning. Contextual retrieval (prepend a 50-token summary of the containing agreement) adds another ~35% recall boost on LegalBench-RAG.

Alternatives: Paragraph-level, Fixed 512-token
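The contextual-retrieval step reduces to prepending a document summary before embedding. A sketch, assuming the ~50-token summary has already been produced by one LLM call per document at ingest:

```python
def contextualize(chunk_text: str, doc_summary: str) -> str:
    """Prepend a short whole-document summary so the chunk's embedding carries
    the containing agreement's context (Anthropic-style contextual retrieval).
    The bracketed prefix format is an arbitrary convention."""
    return f"[Document context: {doc_summary}]\n\n{chunk_text}"
```

The contextualized string is what gets embedded; the original `chunk_text` is what gets quoted and cited.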

Embeddings: Voyage-3-large

Voyage-3-large tops MTEB on legal subsets and is cheaper than voyage-law-2 while being only ~2% behind on clause-retrieval tasks. For pure contract review, voyage-law-2 is worth the premium.

Alternatives: voyage-law-2 (legal-specific), Cohere Embed v3

Vector DB: Qdrant with payload filtering

Metadata filtering must run before vector search (matter, client, privilege). Qdrant’s payload indexes make this fast even at 50M+ vectors, and native BM25 removes a whole extra service.

Alternatives: Weaviate, pgvector

Reranker: Voyage Rerank 2.5

Adds 200-300ms but lifts precision-at-10 from ~65% to ~88% on legal queries. Precision is the metric that matters here — missing a relevant document is recoverable, surfacing an irrelevant one is wasted lawyer time.

Alternatives: Cohere Rerank v3, ColBERTv2

Answer LLM: Claude Opus 4 for high-stakes, Sonnet 4 for routine

Claude Opus 4 has the best verbatim-quote accuracy and rarely fabricates citations. Use Gemini 2.5 Pro (2M context) when the question requires reasoning across an entire 500-page agreement without chunking.

Alternatives: GPT-4o, Gemini 2.5 Pro

Citation validator: Custom regex + exact-match validator

Every citation the model emits must resolve to a chunk ID that was in the context. Any mismatch triggers regeneration or a visible ‘unverified citation’ warning. Non-negotiable for legal deployments.

Alternatives: Guardrails AI, NeMo Guardrails

Cost at each scale

Prototype

50k pages · 1k queries/mo · 1 matter

$480/mo

OCR + layout parse (Azure DI): $75
One-time embedding (Voyage-3-large): $35
Query embeddings (1k): $1
Voyage Rerank 2.5 (1k × top-100): $10
Claude Sonnet 4 answers (1k × ~6k tokens): $150
Qdrant Cloud starter: $79
Hosting + observability + PDF viewer: $130

Startup

500k pages · 20k queries/mo · 50 matters

$9,800/mo

OCR + layout (new + re-OCR): $400
Incremental embedding (Voyage-3-large): $120
Query embeddings (20k): $15
Voyage Rerank 2.5 (20k × top-100): $200
Claude Opus 4 high-stakes (5k × ~8k tok): $3,500
Claude Sonnet 4 routine (15k × ~6k tok): $2,200
Qdrant Cloud standard: $900
Ingestion + validator + observability: $1,400
Infra + SOC2 hosting: $1,065

Scale

10M pages · 250k queries/mo · 2000 matters

$145,000/mo

OCR + layout (ongoing): $6,000
Embeddings (churn + new): $4,500
Query embeddings (250k): $180
Voyage Rerank 2.5 (250k × top-100): $2,500
Claude Opus 4 + Sonnet 4 mix: $85,000
Qdrant Enterprise (self-hosted, HA): $14,000
Ingestion + validator + audit logging: $12,000
SOC2 + HIPAA hosting + observability: $20,820

Latency budget

Total P50: 4,840ms · Total P95: 8,810ms

Metadata pre-filter: 30ms median · 80ms p95
Query embedding: 110ms median · 220ms p95
Hybrid retrieval (top-100): 160ms median · 340ms p95
Rerank to top-10: 280ms median · 520ms p95
LLM answer (Opus 4, streamed): 4,200ms median · 7,500ms p95
Citation validator: 60ms median · 150ms p95

Tradeoffs

Precision vs recall

General RAG optimizes recall — miss nothing. Legal RAG inverts this: a false positive (hallucinated citation) is catastrophic, a false negative is recoverable via broader search. Bias toward fewer, higher-confidence citations; always show the retrieved-but-not-cited candidates so a human can drill deeper.

Opus 4 vs Sonnet 4 routing

Opus 4 is 5x the cost of Sonnet 4 but produces noticeably cleaner citations on complex cross-document reasoning. Route by matter value or user role: partner-level diligence uses Opus, associate research uses Sonnet. Blended cost lands ~2.3x Sonnet-only — worth it for the quality floor.
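A sketch of the routing rule, assuming matter value and user role are available on the request. The model identifiers and the $10M threshold are illustrative, not values from the original text.

```python
def pick_model(matter_value_usd: float, user_role: str, high_stakes: bool) -> str:
    """Route by stakes: partner-level work, flagged high-stakes queries, or
    high-value matters get Opus; routine associate research gets Sonnet."""
    if high_stakes or user_role == "partner" or matter_value_usd >= 10_000_000:
        return "claude-opus-4"
    return "claude-sonnet-4"
```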

Chunk-level RAG vs long-context whole-document

Gemini 2.5 Pro has 2M-token context. For ‘summarize this 500-page MSA’, stuffing the whole doc beats chunked retrieval. Use long-context for whole-document questions; use chunked RAG for cross-corpus search. Build both paths and route by query type.
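One way to implement the two-path routing, assuming query parsing has already resolved which named documents (if any) the question is scoped to. A heuristic sketch: go long-context when the scoped documents fit the window, otherwise fall back to chunked retrieval.

```python
def pick_path(scoped_doc_ids: list[str], doc_token_counts: dict[str, int],
              context_budget: int = 2_000_000) -> str:
    """'long_context' when the question targets named documents that fit the
    model's window; 'chunked_rag' for cross-corpus search or oversized scope."""
    total = sum(doc_token_counts.get(d, 0) for d in scoped_doc_ids)
    if scoped_doc_ids and total <= context_budget:
        return "long_context"
    return "chunked_rag"
```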

Failure modes & guardrails

LLM fabricates paragraph numbers or page citations

Mitigation: Every citation must resolve to a chunk ID included in the context. A post-generation validator extracts citations, verifies verbatim match against retrieved chunks, and either regenerates or marks the answer ‘unverified’. Never surface an unvalidated citation.

Privileged documents leak across matters

Mitigation: Matter ID and privilege flag are MANDATORY metadata filters applied before retrieval, not after. Treat like multi-tenancy: a query from matter A can never retrieve a chunk from matter B. Log every retrieval with matter context for audit.
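The hard pre-filter can be expressed as a pure function over candidate chunks (in production this becomes a payload filter in the vector store, but the invariant is the same). The dict keys mirror the metadata fields named above.

```python
def matter_filter(chunks: list[dict], matter_id: str,
                  allow_privileged: bool) -> list[dict]:
    """Mandatory pre-retrieval filter: drop everything outside the matter,
    and privileged material unless the caller is cleared. Runs BEFORE any
    semantic ranking, never as a post-hoc cleanup."""
    return [c for c in chunks
            if c["matter_id"] == matter_id
            and (allow_privileged or not c["privilege_flag"])]
```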

Scanned PDFs have garbage OCR text

Mitigation: Run confidence scoring on OCR output. Any page below 85% confidence gets flagged for human re-OCR or reprocessing through Azure DI’s Layout model. Never index low-confidence text — it pollutes retrieval for every query touching that doc.
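The gate itself is simple once per-page OCR confidence is available; a sketch with the 85% threshold from above:

```python
def pages_to_reprocess(page_confidences: dict[int, float],
                       threshold: float = 0.85) -> list[int]:
    """Return page numbers below the OCR-confidence threshold. These pages
    are routed to human re-OCR or a second layout-model pass and are never
    indexed as-is."""
    return sorted(p for p, conf in page_confidences.items() if conf < threshold)
```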

Clause extraction misses renamed sections

Mitigation: Maintain a clause-type classifier (indemnification, LOL, confidentiality, term, etc.) trained on a labeled corpus. Run at ingest to normalize metadata. Every 90 days, re-classify and diff against the previous pass to catch drift in drafting patterns.

Answer confidence is opaque to reviewing attorney

Mitigation: Surface a confidence signal: rerank score of top citation, number of supporting chunks, and a binary ‘verified citation’ flag. Lawyers must see at a glance whether this answer is high-confidence or exploratory. Hide the signal and they stop trusting the tool.
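The three signals can be collapsed into one display payload. The bucket thresholds below are illustrative placeholders, not calibrated values:

```python
def confidence_signal(top_rerank_score: float, supporting_chunks: int,
                      citations_verified: bool) -> dict:
    """Combine rerank score, supporting-chunk count, and the binary verified
    flag into the at-a-glance signal shown to the reviewing attorney."""
    if citations_verified and top_rerank_score >= 0.8 and supporting_chunks >= 2:
        level = "high"
    elif citations_verified:
        level = "medium"
    else:
        level = "exploratory"
    return {"level": level,
            "top_rerank_score": top_rerank_score,
            "supporting_chunks": supporting_chunks,
            "verified": citations_verified}
```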

Frequently asked questions

How is legal RAG different from general enterprise search?

Three ways: (1) precision beats recall — fabricated citations are catastrophic; (2) clause-level structure matters more than page-level, so chunking must respect section boundaries; (3) privilege and matter-level access control is legally mandatory, not optional. A general-purpose RAG stack will fail on all three without legal-specific tuning.

Which embedding model is best for legal text in 2026?

Voyage-3-large is the best general-purpose choice — top MTEB on legal subsets. voyage-law-2 is ~2% better on pure contract clause retrieval but costs more. Cohere Embed v3 and OpenAI text-embedding-3-large trail by 12-18% on legal benchmarks like LegalBench-RAG.

Should I use Claude Opus 4 or Sonnet 4 for legal answers?

Route by stakes. Opus 4 for M&A diligence, litigation, and any answer going into a memo or brief. Sonnet 4 for associate research and quick contract lookups. Opus 4 costs ~5x more per token but has materially cleaner citation behavior — it almost never fabricates paragraph numbers, which is non-negotiable for litigation prep.

How do I handle the 2M-token context window in Gemini 2.5 Pro?

Use it for whole-document questions — ‘summarize this 500-page agreement’, ‘list every indemnification clause in this MSA’. Stuffing a full contract into Gemini beats chunked retrieval for cross-section reasoning inside one doc. For cross-corpus search across 1000 contracts, chunked RAG is still the right tool — no context window replaces retrieval at that scale.

How do I prevent privilege or matter leakage across queries?

Matter ID and privilege flag must be mandatory metadata filters applied BEFORE vector search, not after ranking. Architecturally treat this like multi-tenant SaaS: a request from matter A can never see a chunk from matter B. Log every retrieval with matter context for audit and malpractice insurance review.

Do I need OCR or can I just use the PDF text layer?

For older filings and scanned contracts, the PDF text layer is often garbage or absent. Use Azure Document Intelligence’s Layout model or Unstructured Premium. Run confidence scoring and re-OCR anything under 85% — low-quality OCR silently poisons retrieval quality for every query that touches that document.

How do I validate that a citation is real?

Post-generation, extract every (doc_id, page, paragraph) tuple the model emits. Verify each resolves to a chunk included in the LLM’s context. Verify the quoted text appears verbatim in that chunk. Any mismatch: regenerate the answer or return it with a visible ‘unverified citation’ warning. Never silently surface an unvalidated citation.

What does this cost for a mid-sized law firm?

A 50-lawyer firm running 500k pages and 20k queries/month lands around $9-12k/month all-in. Scale to an AmLaw 100 (10M+ pages, 250k queries/month) and it is $130-160k/month — still well under a single associate’s billable equivalent.
