Reference Architecture · multimodal

Image-Based Search (Visual Similarity + Text Query)

Last updated: April 16, 2026

Quick answer

Index images with a multimodal embedding model (OpenCLIP, SigLIP, or Voyage-multimodal-3), store them in a vector DB with ANN search (Pinecone, Qdrant, pgvector+HNSW), and layer a reranker over the top-K results. For nuanced queries, use Gemini 2.5 Pro or GPT-4o vision to rerank the top 50 candidates. Combine with structured filters (price, size, brand) stored as metadata. Expect $0.0001-$0.001 per query at scale, with fast-path P95 around 300ms.

The problem

Users want to find products, media, or images using either an uploaded photo ('find me shoes like these') or a descriptive text query ('red leather ankle boots under $150') - or both combined with filters. Traditional keyword search misses 40-70% of matches because catalog titles don't describe what images actually look like. You need a system that indexes images by visual content, supports text-to-image and image-to-image queries, and handles millions to billions of items with sub-200ms latency.

Architecture

Indexing path: Catalog Ingest (input) → Image Preprocessing (infra) → Image Embedding Model (LLM) and Attribute Extractor, vision LLM (LLM) → Vector Index, ANN (data).

Query path: Query Encoder (LLM) → Filter + Vector Merge (infra, applies metadata filters) → Cross-Encoder Reranker (LLM, re-scores the top 50) → Search Results UI (output).

Catalog Ingest

Receives new products/images with metadata (title, price, brand, category). Triggers re-indexing.

Alternatives: Shopify webhooks, Algolia ingest, Custom queue

Image Preprocessing

Normalizes images (resize, center-crop, strip EXIF), removes backgrounds for product shots, generates thumbnails.

Alternatives: Cloudinary, imgproxy, Sharp.js, Pillow
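A minimal preprocessing sketch using Pillow, assuming a 336px square target (a common ViT input size); production pipelines would add background removal and thumbnail generation on top of this.

```python
from PIL import Image

def preprocess(img: Image.Image, size: int = 336) -> Image.Image:
    """Resize the shorter side to `size`, center-crop to size x size,
    and rebuild the image so no EXIF metadata survives."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BICUBIC)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    img = img.crop((left, top, left + size, top + size))
    # Copying pixels into a fresh Image drops EXIF and other metadata.
    clean = Image.new(img.mode, img.size)
    clean.putdata(list(img.getdata()))
    return clean
```

Run this at ingest time, before embedding, so the index only ever sees normalized images.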

Image Embedding Model

Encodes each image into a dense vector (512-1024 dim). Uses a multimodal model so images and text share the embedding space.

Alternatives: OpenCLIP ViT-L/14, SigLIP 2, CLIP ViT-H/14, Cohere Embed v3 multimodal

Attribute Extractor (Vision LLM)

Runs once per product at index time. Extracts structured attributes (color, material, style, pattern) for filter search.

Alternatives: GPT-4o vision, Claude Sonnet 4 vision, Gemini 2.0 Flash for cost
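Whatever vision LLM you use, validate its output against a closed vocabulary before indexing, so a bad extraction never becomes a filterable facet. A sketch with an illustrative vocabulary (the LLM call itself is out of frame here):

```python
# Allowed values per attribute -- illustrative; build yours from the catalog taxonomy.
ATTRIBUTE_VOCAB = {
    "color": {"red", "brown", "black", "white", "blue", "tan"},
    "material": {"leather", "suede", "canvas", "synthetic"},
    "style": {"ankle boot", "sneaker", "loafer", "sandal"},
}

def clean_attributes(raw: dict) -> dict:
    """Keep only known attributes with in-vocabulary values; drop the rest."""
    out = {}
    for key, allowed in ATTRIBUTE_VOCAB.items():
        value = str(raw.get(key, "")).strip().lower()
        if value in allowed:
            out[key] = value
    return out
```

For example, an extraction of `{"color": "Red", "material": "pleather", "style": "ankle boot"}` keeps color and style but drops the out-of-vocabulary material.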

Vector Index (ANN)

HNSW or IVF-PQ index for fast approximate nearest neighbor search. Stores image vectors plus metadata filters.

Alternatives: Qdrant, pgvector + HNSW, Weaviate, Milvus

Query Encoder

Encodes the user's query (text or image) into the same vector space. For text-to-image: runs text encoder. For image-to-image: runs image encoder on the uploaded photo.

Alternatives: OpenCLIP text encoder, SigLIP 2 text encoder
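Because images and text share one embedding space, text-to-image search reduces to cosine similarity between the encoded query and the indexed image vectors. A toy sketch with 4-dim stand-in vectors (real CLIP/SigLIP outputs are 512-1024 dim):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical encoder outputs standing in for real model embeddings.
text_query = [0.9, 0.1, 0.0, 0.4]          # e.g. "red ankle boots"
catalog = {
    "red-ankle-boot": [0.8, 0.2, 0.1, 0.5],
    "white-sneaker":  [0.1, 0.9, 0.3, 0.0],
}
best = max(catalog, key=lambda k: cosine(text_query, catalog[k]))
```

In production the `max` over a dict becomes an ANN query against the vector index; the math is the same.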

Filter + Vector Merge

Combines vector similarity results with structured filters (price range, size, in-stock, brand). Either pre-filter (metadata first, then vectors) or post-filter (vectors first, then metadata).

Alternatives: Pinecone hybrid, Qdrant filters, Typesense hybrid
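The two strategies can be sketched with a brute-force list; real engines replace the sorts with ANN queries, but the ordering of operations is the point:

```python
def search(items, query_vec, score, filters, prefilter=True, k=10):
    """items: list of (vector, metadata) tuples; score: similarity function.
    Pre-filter: apply the metadata predicate first, rank the survivors.
    Post-filter: rank everything, then drop non-matching candidates."""
    passes = lambda meta: all(meta.get(f) == v for f, v in filters.items())
    if prefilter:
        pool = [it for it in items if passes(it[1])]
        ranked = sorted(pool, key=lambda it: score(query_vec, it[0]), reverse=True)
    else:
        ranked = sorted(items, key=lambda it: score(query_vec, it[0]), reverse=True)
        ranked = [it for it in ranked if passes(it[1])]
    return ranked[:k]
```

Both paths return the same results; they differ only in cost, which is why engines pick the order adaptively based on filter selectivity.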

Cross-Encoder Reranker

Takes the top-K ANN candidates and re-scores using a stronger (slower) model that attends to query and candidate together. For nuanced queries, uses a vision LLM.

Alternatives: GPT-4o vision reranker, Claude Sonnet 4 vision, Voyage Rerank
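The shape of the rerank step, with the expensive model abstracted behind a `strong_score` callable (in practice a Cohere/Voyage API call or a vision-LLM prompt):

```python
def rerank(query, candidates, strong_score, top_k=50, final_n=10):
    """candidates arrive ANN-ranked; re-score only the head with the
    expensive cross-encoder and return the best final_n."""
    head = sorted(candidates[:top_k],
                  key=lambda c: strong_score(query, c), reverse=True)
    return head[:final_n]
```

Capping the re-scored set at `top_k` is what keeps the reranker's latency and cost bounded regardless of catalog size.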

Search Results UI

Grid of results with highlight on matched attributes, similar-item carousel, and visual filter chips.

Alternatives: Algolia UI Library, Custom React, Nuxt + shadcn

The stack

Multimodal embedding · Voyage-multimodal-3

Voyage-multimodal-3 leads MTEB multimodal retrieval in 2026. SigLIP 2 is the strongest open-source model and self-hostable. OpenCLIP is the community default but 2-3 years old. For e-commerce specifically, fine-tuning SigLIP 2 on your catalog typically yields a 10-15% accuracy bump.

Alternatives: OpenCLIP ViT-L/14, SigLIP 2, Cohere Embed v3 multimodal

Attribute extraction · Gemini 2.5 Pro vision (batched)

Attribute extraction happens once per product at index time, so cost matters less than quality. Gemini 2.5 Pro extracts structured attributes (color, pattern, material) reliably. For massive catalogs, Gemini 2.0 Flash at $0.075/$0.30 per MTok is 15x cheaper with acceptable accuracy.

Alternatives: GPT-4o vision, Claude Sonnet 4 vision, Gemini 2.0 Flash

Vector index · Pinecone (managed) or Qdrant (self-hosted)

Pinecone is the easiest managed option with proven scale. Qdrant is the best self-hosted choice - faster filtering than Weaviate and simpler ops than Milvus. pgvector works up to ~10M vectors but degrades past that. For 100M+ vectors, Qdrant or Milvus.

Alternatives: pgvector + HNSW, Weaviate, Milvus

Reranker · Cohere Rerank v3

Cohere Rerank v3 and Voyage Rerank give 15-30% better top-5 accuracy than ANN alone at $1-$2 per 1k queries. For highly nuanced queries ('statement piece for a rooftop dinner party'), a vision LLM reranker on the top-20 gives another 5-10% but costs 50-100x more.

Alternatives: Voyage Rerank, GPT-4o vision, Claude Sonnet 4 vision, Custom cross-encoder

Serving + CDN · Cloudinary or imgproxy + Cloudflare

Cloudinary handles transformations, formats (WebP, AVIF), and CDN. imgproxy is the best open alternative. Serve product thumbnails at 200-400px; full images lazy-loaded. Fast image delivery is as important as fast search.

Alternatives: AWS CloudFront + S3, BunnyCDN, Fastly

Observability + evals · Algolia Analytics + custom

Track click-through rate on position 1-10, zero-result queries, and query-to-purchase conversion. Maintain a golden query set (100-500 queries with expected results) and re-eval on every embedding or reranker change.

Alternatives: Typesense analytics, Custom ClickHouse, Braintrust
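A golden-set eval can be as simple as recall@k over labeled queries; run it in CI on every embedding or reranker change:

```python
def recall_at_k(golden, run, k=5):
    """golden: {query: set of expected item ids};
    run: {query: ranked list of returned ids}.
    Returns the fraction of queries whose top-k hits at least one expected item."""
    hits = sum(1 for q, expected in golden.items()
               if expected & set(run.get(q, [])[:k]))
    return hits / len(golden)
```

Pair this offline metric with the online ones above (CTR@5, zero-result rate) so a regression shows up before it reaches users.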

Cost at each scale

Prototype

50,000 catalog items, 100k queries/mo

$95/mo

One-time embedding (Voyage multimodal): $25
Gemini 2.5 Pro attribute extraction: $30
Pinecone Starter: $0
Query encoding (batched): $12
Cohere Rerank v3: $18
Hosting: $10

Startup

2M items, 5M queries/mo

$4,200/mo

Voyage multimodal (new + refreshes): $650
Gemini 2.0 Flash attributes: $450
Pinecone Standard (10M vectors): $800
Query encoding: $350
Cohere Rerank v3: $950
Image CDN (Cloudinary): $700
Hosting + observability: $300

Scale

100M items, 500M queries/mo

$145,000/mo

Self-hosted SigLIP 2 fine-tuned (GPU cluster): $22,000
Gemini 2.0 Flash attribute extraction: $14,000
Self-hosted Qdrant cluster: $18,000
Query encoding at edge: $8,500
Cohere Rerank + vision LLM tail: $48,000
Image CDN + transform: $24,000
SRE + observability: $10,500

Latency budget

Query encoding (text or image): 25ms median · 70ms p95
ANN vector search: 15ms median · 45ms p95
Filter merge: 8ms median · 22ms p95
Cohere rerank (top 50): 90ms median · 180ms p95
LLM vision rerank (top 10, optional): 850ms median · 1,600ms p95
End-to-end fast path (no LLM rerank): 140ms median · 320ms p95
End-to-end with LLM vision rerank: ~990ms median · ~1,920ms p95

Tradeoffs

Pre-filter vs post-filter with metadata

Pre-filter (apply metadata filter first, then vector search) is faster when filters narrow results aggressively (e.g., 'size 9 boots' from a 10M catalog drops to 50k). Post-filter (vector search first, then filter) is faster when filters are weak or the vector search is very selective. Pinecone and Qdrant both do this adaptively; watch your P95 and tune per query pattern.

CLIP/SigLIP vs fine-tuned embeddings

Out-of-the-box CLIP/SigLIP works well for general catalogs but underperforms on niche domains (fashion, furniture, medical imaging) by 10-25% top-5 accuracy. Fine-tune on 50k-200k in-domain pairs for a substantial bump. The tradeoff is infrastructure: fine-tuning and redeploying requires GPU ops.

Vision-LLM reranker vs cross-encoder reranker

A vision LLM reranker (GPT-4o, Claude Sonnet 4) gives the best quality on nuanced queries ('something that would look good at a wedding') but costs 50-100x more than Cohere/Voyage Rerank. Use the LLM reranker on only the top 5-10 candidates and only for queries the classifier identifies as nuanced - not for 'red sneakers size 9'.
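One way to route is a cheap heuristic classifier in front of the reranker; the keyword lists below are illustrative, and a small trained classifier would replace them in production:

```python
import re

# Patterns that signal a precise, attribute-style query.
CONCRETE = re.compile(r"size \d+|under \$?\d+")
# Words that signal a subjective, taste-based query.
SUBJECTIVE = {"look", "vibe", "occasion", "wedding", "party", "statement", "aesthetic"}

def needs_vision_rerank(query: str) -> bool:
    """Send only subjective queries to the 50-100x pricier vision-LLM rerank."""
    q = query.lower()
    if CONCRETE.search(q):
        return False  # precise attribute query: the cheap cross-encoder is enough
    return any(w in SUBJECTIVE for w in q.split())
```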

Failure modes & guardrails

Visual ambiguity - user uploads photo with multiple objects

Mitigation: Run a segmentation pass (SAM 2 or a YOLO model) to identify distinct objects. Present them to the user as tappable chips: 'Search by this shirt / these shoes / this bag'. Do not silently pick one. For product search, assume the largest centered object unless user refines.

Text query has precise attributes that visual search misses (e.g., 'under $150')

Mitigation: Parse attributes out of the text query with a small LLM (GPT-4o-mini). Apply $150 as a price filter, not as part of the embedding vector. Embeddings capture visual style and product type; structured filters capture price/size/stock.
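The split can be sketched with a regex before reaching for an LLM; this toy version handles only 'under $N', and a small LLM generalizes it to sizes, brands, and colors:

```python
import re

def split_query(q: str):
    """Pull structured constraints out of the text query; the remainder
    goes to the embedding encoder. Illustrative: handles only 'under $N'."""
    filters = {}
    m = re.search(r"under \$?(\d+(?:\.\d+)?)", q, re.I)
    if m:
        filters["max_price"] = float(m.group(1))
        q = (q[:m.start()] + q[m.end():]).strip()
    return q, filters
```

So 'red leather ankle boots under $150' embeds as 'red leather ankle boots' and filters on `max_price = 150`.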

Zero-result queries hurt UX and signal catalog gaps

Mitigation: Track zero-result queries in observability. Never return an empty page - show closest matches with a 'no exact match, here are similar items' banner. Feed zero-result queries to merchandising and the fine-tuning pipeline.

Adult/unsafe image uploaded for image-to-image search

Mitigation: Run a moderation pass on any user-uploaded image (AWS Rekognition or Gemini safety classifier) before embedding and search. Reject or sandbox unsafe queries. This protects your catalog index from adversarial inputs and your team from exposure.

Catalog drift - new products added but index not refreshed

Mitigation: Index new products within 5-15 min of catalog write, not nightly. Use Pinecone/Qdrant streaming upserts. Track the index-freshness lag as a SLO and alert when it exceeds 30 min. Stale indexes hurt conversion on new-arrival collections immediately.

Frequently asked questions

CLIP, SigLIP, or Voyage multimodal?

Voyage-multimodal-3 leads on benchmarks in 2026 and is managed API. SigLIP 2 is the best open-source model and self-hostable. OpenCLIP ViT-L/14 is the community default but showing age. For new builds: Voyage if you want managed, SigLIP 2 if you want self-hosted and fine-tuning.

Pinecone, Qdrant, or pgvector?

Pinecone: best managed experience, strong at 10M+ vectors, $70-$500/month at small-medium scale. Qdrant: best self-hosted, great filter performance, harder ops. pgvector: fine if you're already on Postgres and have under 5-10M vectors. Above 100M, Qdrant or Milvus.

Do I need a reranker?

Yes for anything past 1M items. ANN alone returns candidates in the right neighborhood but the top 10 ordering is often off by 20-40%. Cohere Rerank v3 or Voyage Rerank on the top 50 fixes this for $1-$2 per 1k queries. Skip only if you are optimizing for ultra-low latency at small catalogs.

Can I combine text and image in one query?

Yes - 'these shoes but in brown' is the ideal multimodal query. Encode the image, encode the text delta ('in brown'), and average or concatenate with weights. Voyage-multimodal-3 supports this natively. For best results, let the user upload image + type a delta; then fine-tune the weighting on clicks.
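The weighted-average approach from the answer above can be sketched in a few lines; the 0.7 default favoring the uploaded image is an assumption to tune on click data:

```python
import math

def combine(image_vec, text_vec, alpha=0.7):
    """Weighted sum of image and text embeddings, re-normalized to unit
    length so downstream cosine similarity behaves consistently."""
    mixed = [alpha * i + (1 - alpha) * t for i, t in zip(image_vec, text_vec)]
    norm = math.sqrt(sum(x * x for x in mixed)) or 1.0
    return [x / norm for x in mixed]
```

The combined vector is then used as the ANN query exactly like a single-modality embedding.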

How do I evaluate search quality?

Track CTR@5 (did the user click one of the top 5 results?), zero-result rate, and query-to-purchase conversion. Maintain a golden query set of 200-500 labeled queries with expected results. Re-eval on every embedding change, reranker change, or filter rule change. Braintrust is a good fit.

How much does image search cost?

At 500M queries/month with a 100M item catalog, budget $100k-$180k/month all-in (embeddings, vector index, rerank, CDN). Per-query cost: $0.0002-$0.0005. The big cost driver is image storage/CDN, not the search layer.
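The per-query figure follows directly from the Scale tier above:

```python
# Back-of-envelope using the Scale-tier numbers from this page.
monthly_cost = 145_000            # $/mo all-in
monthly_queries = 500_000_000
cost_per_query = monthly_cost / monthly_queries  # lands inside the $0.0002-$0.0005 band
```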

Should I fine-tune the embedding model?

Only if you have 50k+ in-domain training pairs (image plus relevant text or image-image pairs) and a clear 10%+ accuracy gap. Fine-tuning SigLIP 2 or CLIP on 100k fashion pairs commonly yields 15-25% better top-5 accuracy on fashion catalogs. General stuff? Don't bother - Voyage multimodal is good enough.
