Reference Architecture · RAG

Multimodal RAG (Text + Images + PDFs)

Last updated: April 16, 2026

Quick answer

Use a VLM-based parser (ColPali, Unstructured Premium, or Landing AI Document Extraction) to split PDFs into layout-aware regions. Embed text with Voyage-3-large and images with Voyage-Multimodal-3 or Cohere Embed Multimodal v3, store both in the same vector space in Qdrant, rerank with a cross-encoder, and synthesize with Claude Sonnet 4 or GPT-4o (both strong vision models). Expect $0.20 to $0.80 per query. The single biggest lever is replacing PyMuPDF text extraction with a VLM-based parser that preserves layout — it lifts end-to-end accuracy by 30-50%.

The problem

You have 100k research papers, product manuals, or catalog PDFs where the real information lives in figures, tables, schematics, and photos — not just text. A user asks ‘show me all valves with a diameter above 50mm and a PTFE seat’ and the answer is encoded in a table on page 47 of a scanned datasheet. Pure text RAG drops this content on the floor. You need a pipeline that understands layout, embeds images, OCRs tables, and cites the exact page region.

Architecture

Pipeline: PDF input → VLM Document Parser → Multimodal Chunker → Text Embedding Model + Multimodal Embedding Model → Unified Vector Store → Multimodal Retriever → Multimodal Reranker → VLM Answer Synthesizer → Viewer + Citations (output: answer + bbox citations).

VLM Document Parser

Runs each PDF page through a layout-aware VLM. Extracts text, tables (as structured JSON), figures (as cropped images), captions, and reading order. Crucial for scanned or layout-complex docs.

Alternatives: Unstructured Premium, ColPali (self-hosted), Nougat, Azure Document Intelligence

Multimodal Chunker

Groups text paragraphs, table JSON, and figure images by page region. Each chunk has type (text/table/figure), page_number, bbox coordinates, and a caption reference.

Alternatives: Unstructured chunking, Layout-aware regex, Contextual retrieval for tables
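A chunk record along these lines captures what the chunker emits per page region. Field names here are illustrative, not a fixed schema from any particular library:

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Chunk:
    """One retrievable unit produced by the multimodal chunker."""
    chunk_id: str
    doc_id: str
    modality: str                            # "text" | "table" | "figure"
    page_number: int
    bbox: tuple[float, float, float, float]  # (x0, y0, x1, y1) in page coordinates
    content: str                             # text, table-as-markdown, or image URI
    caption_ref: str | None = None           # id of the sibling caption chunk, if any

# A figure chunk: the content field points at the cropped image,
# while the caption lives in its own text chunk.
fig = Chunk(
    chunk_id="doc1-p47-fig2",
    doc_id="doc1",
    modality="figure",
    page_number=47,
    bbox=(72.0, 310.5, 540.0, 610.0),
    content="s3://corpus/doc1/p47_fig2.png",
    caption_ref="doc1-p47-cap2",
)
```

Keeping the same record shape across modalities is what lets one retriever, one reranker, and one citation format serve all three chunk types.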

Text Embedding Model

Embeds text chunks and table-as-markdown serializations. Voyage-3-large handles both well in a single model.

Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3
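Table chunks reach the text embedder as markdown. A minimal serializer, assuming the parser's table JSON is a list of row dicts with uniform keys, could look like:

```python
def table_to_markdown(rows: list[dict]) -> str:
    """Serialize a parser's table JSON (list of row dicts) into a
    markdown table suitable for text embedding."""
    if not rows:
        return ""
    headers = list(rows[0].keys())
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row.get(h, "")) for h in headers) + " |")
    return "\n".join(lines)

md = table_to_markdown([
    {"part": "V-203", "diameter_mm": 80, "seat": "PTFE"},
    {"part": "V-101", "diameter_mm": 40, "seat": "EPDM"},
])
```

Real parser output is messier (merged cells, multi-row headers), which is exactly why VLM parsers that emit structured JSON matter upstream.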

Multimodal Embedding Model

Embeds cropped figures and full-page renders into the SAME vector space as text, so a text query can match an image directly.

Alternatives: Cohere Embed Multimodal v3, OpenCLIP ViT-L, Jina CLIP v2
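The "same vector space" property means a text query vector can be scored directly against image vectors with plain cosine similarity. The toy 4-d vectors below stand in for real model output; the comments show which call would have produced each one:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for model output. In a shared space the
# text query scores against image vectors with no modality routing.
query_vec = [0.9, 0.1, 0.0, 0.1]               # embed_text("valve cross-section")
candidates = {
    "figure:p47_fig2": [0.8, 0.2, 0.1, 0.0],   # embed_image(cropped figure)
    "text:p12_para3":  [0.1, 0.9, 0.1, 0.2],   # embed_text(paragraph)
}
ranked = sorted(candidates, key=lambda k: cosine(query_vec, candidates[k]), reverse=True)
```

With older CLIP-style stacks, the text encoder and image encoder were weaker at document content, so this direct comparison worked poorly on figures with dense text.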

Unified Vector Store

Qdrant collection holding text, table, and image vectors side by side, each tagged with a modality payload field. Retrieval can filter by modality or span all of them.

Alternatives: Weaviate (multi2vec), pgvector + separate tables, LanceDB
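The modality filter is an ordinary Qdrant payload filter. The dict below is the filter body in the shape Qdrant's search API accepts; the `modality` field name is this architecture's convention, not anything Qdrant requires:

```python
# Restrict retrieval to figure chunks only. Omit the filter entirely
# to search across all modalities in the same collection.
figure_only = {
    "must": [
        {"key": "modality", "match": {"value": "figure"}}
    ]
}
```

One collection plus a payload filter beats separate per-modality collections here: the reranker needs to see text and image candidates side by side, and a single collection gives you that for free.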

Multimodal Retriever

Runs the text query against both text and image indexes in parallel. Merges with RRF. Returns a blend of text passages and image regions — the retriever does not have to pick a modality upfront.

Alternatives: Modality-specific routers, ColPali late-interaction (image-native), Weighted RRF
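The RRF merge itself is a few lines. A sketch with the standard k=60 constant, taking one ranked id list per index:

```python
def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge per-index rankings (best first)
    into a single ranked list of chunk ids."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)

merged = rrf_merge([
    ["t1", "f1", "t2"],   # text index results
    ["f1", "f2", "t1"],   # image index results
])
```

Because RRF only looks at ranks, not raw scores, it sidesteps the problem that text-index and image-index similarity scores are not calibrated against each other.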

Multimodal Reranker

For text-only candidates use Voyage Rerank 2.5. For mixed candidates, pass the top 20 to a small VLM (GPT-4o-mini vision or Claude Haiku 4 vision) with the query and let it rank.

Alternatives: voyage-rerank-2.5, claude-haiku-4-vision
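The routing decision described above reduces to one check on the candidate set. A sketch, where `"vlm-rerank"` is a placeholder label for whichever small vision model you pick:

```python
def pick_reranker(candidates: list[dict]) -> str:
    """Route top-k candidates: a text cross-encoder when everything is
    text or serialized tables, a small VLM when any figure is present."""
    modalities = {c["modality"] for c in candidates}
    return "voyage-rerank-2.5" if modalities <= {"text", "table"} else "vlm-rerank"

path = pick_reranker([{"modality": "text"}, {"modality": "figure"}])
```

Routing this way keeps the expensive VLM call off the hot path for the (usually majority) queries whose candidates are all textual.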

VLM Answer Synthesizer

Receives text, tables (as markdown), and image regions as input. Generates an answer with citations that point to (doc, page, bbox) so the UI can highlight the exact region.

Alternatives: GPT-4o, Gemini 2.5 Pro

Viewer + Citations

Answer next to a PDF viewer that highlights the exact bounding boxes referenced. Image crops shown inline so the user can verify without scrolling.

Alternatives: PSPDFKit, PDF.js custom, Custom React + canvas overlay

The stack

PDF parsing: Landing AI ADE or Unstructured Premium

VLM-based parsers preserve reading order and extract tables as structured JSON, not as broken text runs. Single biggest quality lever in multimodal RAG. PyMuPDF and PyPDF2 are a starting point but leak 30-50% of the information in real-world PDFs.

Alternatives: ColPali (self-hosted), Azure Document Intelligence, Nougat

Text embeddings: Voyage-3-large

Best general text embeddings in 2026 MTEB. Also handles table-as-markdown better than most, which matters when ~30% of your chunks are serialized tables.

Alternatives: OpenAI text-embedding-3-large, Cohere Embed v3

Image embeddings: Voyage-Multimodal-3

Voyage-Multimodal-3 and Cohere Embed Multimodal v3 both project text and images into the SAME vector space — a text query retrieves matching images directly, no modality-specific routing needed. Major simplification over older CLIP-style stacks.

Alternatives: Cohere Embed Multimodal v3, OpenCLIP ViT-L, Jina CLIP v2

Vector store: Qdrant with type-filtered collections

One collection with a ‘modality’ payload field keeps the retrieval pipeline simple and lets the reranker see candidates across modalities. Weaviate multi2vec is competitive but adds operational complexity.

Alternatives: Weaviate multi2vec, LanceDB

Reranker: Voyage Rerank 2.5 for text, VLM for mixed

For text-heavy queries, a text reranker is fine. When the top 20 candidates mix text and images, a small vision model ranks across modalities correctly — it understands whether a figure or a paragraph actually answers the question.

Alternatives: claude-haiku-4-vision, gpt-4o-mini-vision

Answer VLM: Claude Sonnet 4 or GPT-4o

Both Claude Sonnet 4 and GPT-4o have strong vision reasoning. Claude is slightly better at tables and layout; GPT-4o edges ahead on general image understanding. Pick one and stick with it — mixing creates prompt-engineering drift.

Alternatives: Gemini 2.5 Pro, Claude Opus 4

Ingestion compute: GPU workers for VLM parsing

VLM parsing is slow — 5-15 seconds per page. At 100k+ documents, you need parallel GPU workers. Managed services (Landing AI ADE, Unstructured Premium) are the sane default unless you have ML infra already.

Alternatives: Managed (Landing AI, Unstructured), Modal or RunPod
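Sizing the worker pool is simple arithmetic on pages, per-page latency, and parallelism. A sketch; the 500k-page, 10 s/page, 20-worker numbers below are illustrative assumptions, not benchmarks:

```python
def ingest_days(pages: int, secs_per_page: float, workers: int) -> float:
    """Wall-clock days to parse a corpus with parallel VLM workers,
    assuming perfect parallelism and no retries."""
    return pages * secs_per_page / workers / 86_400

# 500k pages at 10 s/page on 20 parallel workers → roughly three days.
days = ingest_days(500_000, 10.0, 20)
```

Real runs take longer than the formula says: retries on parser failures, rate limits on managed APIs, and low-DPI pages needing a super-resolution pass all add overhead.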

Cost at each scale

Prototype

10k pages · 2k queries/mo

$420/mo

VLM parsing (Landing AI, 10k pages): $150
Text embeddings (Voyage-3-large): $8
Image embeddings (Voyage Multimodal-3): $15
Query embeddings (2k): $2
Reranker (Voyage + Haiku vision mix): $20
Claude Sonnet 4 answers (vision, 2k × ~10k tok): $100
Qdrant Cloud starter: $79
Hosting + viewer + observability: $46

Startup

500k pages · 50k queries/mo

$9,600/mo

VLM parsing (churn + new pages): $1,500
Text + image embeddings: $400
Query embeddings (50k): $50
Reranker (50k × mixed): $300
Claude Sonnet 4 answers (vision): $4,800
Qdrant Cloud standard: $900
GPU ingestion workers + observability: $900
Infra + hosting: $750

Scale

10M pages · 500k queries/mo

$118,000/mo

VLM parsing (ongoing + churn): $18,000
Text + image embeddings: $4,500
Query embeddings (500k): $400
Multimodal rerank (500k): $3,500
Claude Sonnet 4 vision answers: $55,000
Qdrant Enterprise self-hosted: $9,000
GPU ingestion pool + image CDN: $12,000
Evals + observability + hosting: $15,600

Latency budget

Total P50: 4,050ms · Total P95: 7,540ms

Document parsing (offline, batch): 0ms P50 · 0ms P95
Query embedding (text + image projection): 120ms P50 · 260ms P95
Parallel retrieval (text + image indexes): 180ms P50 · 380ms P95
Multimodal rerank top-20 to top-5: 550ms P50 · 1,100ms P95
VLM answer synthesis (streamed): 3,200ms P50 · 5,800ms P95

Tradeoffs

Unified vector space vs modality-specific routers

Modern multimodal embeddings (Voyage Multimodal-3, Cohere Embed Multimodal v3) project text and images into the SAME space, so a single query retrieves across modalities. Older stacks used separate indexes + a modality router, which is more code and more failure modes. Use unified embeddings unless you have benchmark data showing your domain needs separation.

VLM parsing cost vs pypdf/text-only

PyMuPDF is free and fast but drops 30-50% of the information in layout-heavy PDFs (tables, figures, multi-column). VLM parsing is 50-100x more expensive at ingest but lifts end-to-end retrieval accuracy by 30-50%. Pay the ingest cost; cheap parsing is the hidden killer of multimodal RAG quality.

Claude Sonnet 4 vs GPT-4o vs Gemini 2.5 Pro for vision

Claude Sonnet 4 leads on tables and structured layouts. GPT-4o is best on general image understanding (photos, scenes). Gemini 2.5 Pro wins on long docs thanks to 2M context. Pick by content type: engineering docs → Claude, product catalogs → GPT-4o, book-length → Gemini.

Failure modes & guardrails

Tables are extracted as broken text runs

Mitigation: Replace PyMuPDF/PyPDF2 with a VLM-based parser (Landing AI ADE, Unstructured Premium, ColPali). Validate table extraction by sampling 50 random tables weekly and diffing against the source PDF — alert on any run where >5% of sampled tables are malformed.

Image embeddings miss text baked into figures

Mitigation: Run OCR on every extracted figure at ingest. Store the OCRed text as a sibling text chunk with the same page/bbox metadata. Embed both the image and the OCRed text — the text index catches keyword queries, the image index catches conceptual ones.

Scanned low-DPI PDFs produce unusable crops

Mitigation: Detect page DPI at ingest. For pages under 200 DPI, trigger a super-resolution pass (Real-ESRGAN or commercial equivalent) before parsing. Flag any doc where >25% of pages are low-DPI and send for human review before indexing.

Citations point to wrong page regions

Mitigation: Every chunk stores (page_number, bbox). The LLM is instructed to emit citations as (doc_id, page, bbox_id). A post-generation validator confirms the bbox_id exists in the retrieved context. The UI highlights exactly that region — no approximate highlighting.
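The post-generation validator is a set-membership check. A sketch, assuming both the retrieved chunks and the model's citations are dicts carrying `doc_id`, `page`, and `bbox_id`:

```python
def validate_citations(citations: list[dict], retrieved_chunks: list[dict]) -> list[dict]:
    """Keep only citations whose (doc_id, page, bbox_id) triple exists
    in the retrieved context; anything else is a hallucinated reference."""
    known = {(c["doc_id"], c["page"], c["bbox_id"]) for c in retrieved_chunks}
    return [c for c in citations if (c["doc_id"], c["page"], c["bbox_id"]) in known]

kept = validate_citations(
    [{"doc_id": "d1", "page": 47, "bbox_id": "b2"},
     {"doc_id": "d1", "page": 99, "bbox_id": "b9"}],  # b9 was never retrieved
    [{"doc_id": "d1", "page": 47, "bbox_id": "b2"}],
)
```

Dropping invalid citations silently is usually wrong; log them and alert on their rate, since a rising rate means the synthesis prompt is drifting.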

Cross-modal retrieval is biased toward one modality

Mitigation: Track per-modality recall weekly against a labeled eval set. If text consistently wins over images (or vice versa), tune the RRF weights or retrieve top-k separately per modality before merging. Do not ship a retriever you have not measured per-modality.
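The weekly per-modality recall check can be a small eval function. A sketch, assuming relevant chunk ids in the labeled set are prefixed with their modality (e.g. `figure:...`, `text:...`):

```python
def per_modality_recall(results: dict, labels: dict, k: int = 10) -> dict:
    """Recall@k split by modality. `results` maps query id to the
    retrieved id list (best first); `labels` maps query id to the set
    of relevant ids, each prefixed 'modality:'."""
    hits_by_mod: dict[str, list[float]] = {}
    for qid, relevant in labels.items():
        top_k = results.get(qid, [])[:k]
        for rel_id in relevant:
            mod = rel_id.split(":", 1)[0]
            hits_by_mod.setdefault(mod, []).append(1.0 if rel_id in top_k else 0.0)
    return {mod: sum(hits) / len(hits) for mod, hits in hits_by_mod.items()}

recalls = per_modality_recall(
    {"q1": ["text:a", "figure:b"]},
    {"q1": {"text:a", "figure:c"}},
)
```

A large gap between the modalities (e.g. text recall 0.8 vs figure recall 0.3) is the signal to reweight the RRF merge or retrieve top-k per modality before fusing.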

Frequently asked questions

What is multimodal RAG and when do I need it?

Multimodal RAG indexes and retrieves across text, tables, figures, and images — not just paragraphs. You need it whenever more than ~20% of your answer-worthy information lives in non-text form: product catalogs, technical manuals, research papers with figures, datasheets, medical imaging reports. If your docs are essentially wall-of-text articles, plain text RAG is fine.

Which multimodal embedding model should I use in 2026?

Voyage Multimodal-3 and Cohere Embed Multimodal v3 are the top two. Both project text and images into a shared vector space so a text query retrieves images directly. Voyage edges ahead on MTEB multimodal benchmarks; Cohere is competitive and often cheaper at volume. CLIP variants (OpenCLIP, Jina CLIP v2) trail on complex document images.

Do I really need VLM-based PDF parsing?

For layout-complex documents (tables, multi-column, figures, scans), yes. PyMuPDF and PyPDF2 extract ~50-70% of the information; a VLM parser like Landing AI ADE, Unstructured Premium, or ColPali extracts 90%+. The ingest cost is higher but it’s a one-time hit and lifts end-to-end retrieval accuracy by 30-50%.

Which LLM is best for vision-based answers?

Claude Sonnet 4 leads on tables and engineering layouts. GPT-4o is best on general imagery (photos, product shots). Gemini 2.5 Pro wins when you need to reason across a whole 500+ page document thanks to the 2M context window. Don’t mix — pick one per content type and stabilize your prompts.

How do I cite a specific region of a PDF page?

Store (doc_id, page_number, bounding_box) as chunk metadata at ingest. Instruct the LLM to emit citations in that format. On render, the UI uses the bbox to highlight the exact region in a PDF viewer (PSPDFKit or PDF.js with a canvas overlay). This is what turns a demo into a product lawyers and engineers will actually trust.

Can I use ColPali instead of chunked embeddings?

Yes. ColPali is a late-interaction multimodal retriever that operates on page-level patches and tends to beat chunked approaches on visually complex documents. Tradeoff: higher storage cost and more compute per query. Use it when retrieval quality on figures is the bottleneck and you can afford the index overhead.

How much does multimodal RAG cost compared to text-only?

2-4x more expensive per query, driven by VLM parsing at ingest and vision tokens at answer time. At 500k pages and 50k queries/month, budget ~$9-11k/month all-in vs ~$3-4k for text-only RAG on the same corpus. The quality gain is usually worth it when images actually carry the information.

Can I skip embeddings and just stuff images into Gemini 2.5 Pro’s 2M context?

For single-document questions, yes — it works and often beats chunked RAG on layout reasoning. For cross-document search across 100k+ PDFs, you still need retrieval. The right pattern is hybrid: chunked multimodal RAG for retrieval, then long-context VLM for reasoning inside the top-few docs.
