Reference Architecture · multimodal
Image-Based Search (Visual Similarity + Text Query)
Last updated: April 16, 2026
Quick answer
Index images with a multimodal embedding model (OpenCLIP, SigLIP, or Voyage-multimodal-3), store the vectors in a vector DB with ANN search (Pinecone, Qdrant, pgvector+HNSW), and layer a reranker over the top-K candidates. For nuanced queries, use Gemini 2.5 Pro or GPT-4o vision to rerank the top 50 candidates. Apply structured filters (price, size, brand) as metadata. Expect $0.0001-$0.001 per query at scale with P95 under 180ms on the fast path.
The problem
Users want to find products, media, or images using an uploaded photo ('find me shoes like these'), a descriptive text query ('red leather ankle boots under $150'), or both at once, combined with structured filters. Traditional keyword search misses 40-70% of matches because catalog titles don't describe what the images actually look like. You need a system that indexes images by visual content, supports both text-to-image and image-to-image queries, and handles millions to billions of items at sub-200ms latency.
Architecture
Catalog Ingest
Receives new products/images with metadata (title, price, brand, category). Triggers re-indexing.
Alternatives: Shopify webhooks, Algolia ingest, Custom queue
Image Preprocessing
Normalizes images (resize, center-crop, strip EXIF), removes backgrounds for product shots, generates thumbnails.
Alternatives: Cloudinary, imgproxy, Sharp.js, Pillow
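The resize-and-center-crop step is plain coordinate math, whichever imaging library runs it. A minimal sketch (pure Python; `center_crop_box` is a hypothetical helper name):

```python
def center_crop_box(width, height):
    """Return the (left, top, right, bottom) box for a centered square crop."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return (left, top, left + side, top + side)

# A 1920x1080 landscape image crops to a centered 1080x1080 square.
print(center_crop_box(1920, 1080))  # (420, 0, 1500, 1080)
```

With Pillow, you would pass this box straight to `Image.crop` before resizing to the model's input resolution.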
Image Embedding Model
Encodes each image into a dense vector (512-1024 dim). Uses a multimodal model so images and text share the embedding space.
Alternatives: OpenCLIP ViT-L/14, SigLIP 2, CLIP ViT-H/14, Cohere Embed v3 multimodal
Attribute Extractor (Vision LLM)
Runs once per product at index time. Extracts structured attributes (color, material, style, pattern) for filter search.
Alternatives: GPT-4o vision, Claude Sonnet 4 vision, Gemini 2.0 Flash for cost
Vector Index (ANN)
HNSW or IVF-PQ index for fast approximate nearest neighbor search. Stores image vectors plus metadata filters.
Alternatives: Qdrant, pgvector + HNSW, Weaviate, Milvus
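HNSW and IVF-PQ approximate the exact nearest-neighbor scan below. A pure-Python sketch of the exact version, with toy 2-d vectors for illustration; it is also useful in production as a ground-truth oracle when measuring ANN recall:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def exact_top_k(query_vec, index, k=3):
    """index: list of (item_id, vector). Returns top-k by cosine similarity."""
    scored = [(item_id, cosine(query_vec, vec)) for item_id, vec in index]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:k]

index = [("boot", [1.0, 0.0]), ("sneaker", [0.9, 0.1]), ("hat", [0.0, 1.0])]
print(exact_top_k([1.0, 0.05], index, k=2))  # boot, then sneaker
```

An ANN index trades a small recall loss against this exact scan for orders-of-magnitude lower latency at millions of vectors.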
Query Encoder
Encodes the user's query (text or image) into the same vector space. For text-to-image: runs text encoder. For image-to-image: runs image encoder on the uploaded photo.
Alternatives: OpenCLIP text encoder, SigLIP 2 text encoder
Filter + Vector Merge
Combines vector similarity results with structured filters (price range, size, in-stock, brand). Either pre-filter (metadata first, then vectors) or post-filter (vectors first, then metadata).
Alternatives: Pinecone hybrid, Qdrant filters, Typesense hybrid
Cross-Encoder Reranker
Takes the top-K ANN candidates and re-scores using a stronger (slower) model that attends to query and candidate together. For nuanced queries, uses a vision LLM.
Alternatives: GPT-4o vision reranker, Claude Sonnet 4 vision, Voyage Rerank
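The retrieve-then-rerank pattern is the same regardless of which reranker you call. A sketch with a stand-in scoring function (swap in a Cohere/Voyage call or a vision-LLM prompt; `toy_score` is purely illustrative):

```python
def rerank(query, candidates, score_fn, top_k=10):
    """Re-score ANN candidates with a stronger (slower) model and re-sort.

    candidates: item dicts from the ANN stage.
    score_fn:   placeholder for the cross-encoder / vision-LLM call.
    """
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for _, c in scored[:top_k]]

# Stand-in scorer: in production this is one API call per (query, candidate)
# pair or a batched cross-encoder forward pass.
def toy_score(query, candidate):
    return len(set(query.split()) & set(candidate["title"].split()))

ann_hits = [{"title": "red suede boots"}, {"title": "red leather ankle boots"}]
print(rerank("red leather ankle boots", ann_hits, toy_score, top_k=1))
```

The key design point: the reranker sees query and candidate together, which the bi-encoder embedding stage by construction cannot.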
Search Results UI
Grid of results with highlight on matched attributes, similar-item carousel, and visual filter chips.
Alternatives: Algolia UI Library, Custom React, Nuxt + shadcn
The stack
Voyage-multimodal-3 leads MTEB multimodal retrieval in 2026. SigLIP 2 is the strongest open-source model and is self-hostable. OpenCLIP is the community default, but its checkpoints are 2-3 years old. For e-commerce specifically, fine-tune SigLIP 2 on your catalog for a 10-15% accuracy bump.
Alternatives: OpenCLIP ViT-L/14, SigLIP 2, Cohere Embed v3 multimodal
Attribute extraction happens once per product at index time, so cost matters less than quality. Gemini 2.5 Pro extracts structured attributes (color, pattern, material) reliably. For massive catalogs, Gemini 2.0 Flash at $0.075/$0.30 per MTok is 15x cheaper with acceptable accuracy.
Alternatives: GPT-4o vision, Claude Sonnet 4 vision, Gemini 2.0 Flash
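Whichever vision model extracts attributes, treat its output as untrusted JSON and validate it against an attribute vocabulary before writing to the index. A minimal sketch; the `ALLOWED` schema and the hardcoded sample response (standing in for the API call) are hypothetical:

```python
import json

ALLOWED = {
    "color": {"red", "brown", "black", "white", "blue"},
    "material": {"leather", "suede", "canvas", "mesh"},
}

def parse_attributes(raw_llm_output):
    """Parse the model's JSON and keep only attributes in the allowed vocab."""
    try:
        data = json.loads(raw_llm_output)
    except json.JSONDecodeError:
        return {}  # in production: log and queue the product for retry
    return {k: v for k, v in data.items()
            if k in ALLOWED and v in ALLOWED[k]}

# Stand-in for the vision-LLM response at index time.
sample = '{"color": "red", "material": "leather", "style": "chelsea"}'
print(parse_attributes(sample))  # "style" dropped: not in the allowed vocab
```

Constraining values to a fixed vocabulary keeps filter facets clean; free-text attribute values fragment into near-duplicates ('dark red', 'crimson', 'maroon') that break filtering.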
Pinecone is the easiest managed option with proven scale. Qdrant is the best self-hosted choice - faster filtering than Weaviate and simpler ops than Milvus. pgvector works up to ~10M vectors but degrades past that. For 100M+ vectors, Qdrant or Milvus.
Alternatives: pgvector + HNSW, Weaviate, Milvus
Cohere Rerank v3 and Voyage Rerank give 15-30% better top-5 accuracy than ANN alone at $1-$2 per 1k queries. For highly nuanced queries ('statement piece for a rooftop dinner party'), a vision LLM reranker on the top-20 gives another 5-10% but costs 50-100x more.
Alternatives: Voyage Rerank, GPT-4o vision, Claude Sonnet 4 vision, Custom cross-encoder
Cloudinary handles transformations, formats (WebP, AVIF), and CDN. imgproxy is the best open alternative. Serve product thumbnails at 200-400px; full images lazy-loaded. Fast image delivery is as important as fast search.
Alternatives: AWS CloudFront + S3, BunnyCDN, Fastly
Track click-through rate on position 1-10, zero-result queries, and query-to-purchase conversion. Maintain a golden query set (100-500 queries with expected results) and re-eval on every embedding or reranker change.
Alternatives: Typesense analytics, Custom ClickHouse, Braintrust
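Both headline metrics fall out of a simple query log. A sketch, assuming each log row records the result count and the 1-based position of the click (if any):

```python
def search_metrics(query_log):
    """query_log: dicts with 'results' (result count) and 'click_position'
    (1-based clicked position, or None if nothing was clicked)."""
    n = len(query_log)
    zero = sum(1 for q in query_log if q["results"] == 0)
    ctr5 = sum(1 for q in query_log
               if q["click_position"] is not None and q["click_position"] <= 5)
    return {"zero_result_rate": zero / n, "ctr_at_5": ctr5 / n}

log = [
    {"results": 12, "click_position": 2},
    {"results": 0,  "click_position": None},
    {"results": 30, "click_position": 8},
    {"results": 5,  "click_position": 1},
]
print(search_metrics(log))  # {'zero_result_rate': 0.25, 'ctr_at_5': 0.5}
```

Run the same computation over the golden query set on every embedding or reranker change and diff the numbers before shipping.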
Cost at each scale
Prototype: 50,000 catalog items, 100k queries/mo, $95/mo
Startup: 2M items, 5M queries/mo, $4,200/mo
Scale: 100M items, 500M queries/mo, $145,000/mo
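The tier figures above imply per-query costs inside the $0.0001-$0.001 range from the quick answer; a quick sanity check of the arithmetic:

```python
tiers = {
    "Prototype": (95, 100_000),          # monthly USD, queries/mo
    "Startup": (4_200, 5_000_000),
    "Scale": (145_000, 500_000_000),
}
for name, (monthly_usd, queries) in tiers.items():
    print(f"{name}: ${monthly_usd / queries:.5f}/query")
# Prototype: $0.00095/query
# Startup: $0.00084/query
# Scale: $0.00029/query
```

Note the per-query cost falls with scale: fixed index and serving costs amortize over more queries.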
Latency budget
Tradeoffs
Pre-filter vs post-filter with metadata
Pre-filter (apply metadata filter first, then vector search) is faster when filters narrow results aggressively (e.g., 'size 9 boots' from a 10M catalog drops to 50k). Post-filter (vector search first, then filter) is faster when filters are weak or the vector search is very selective. Pinecone and Qdrant both do this adaptively; watch your P95 and tune per query pattern.
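The two strategies differ only in where the filter sits relative to the vector search. A pure-Python sketch with a toy exact scorer standing in for the ANN call:

```python
def top_k(query_vec, items, k):
    """Toy exact scorer standing in for the ANN call."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return sorted(items, key=lambda it: dot(query_vec, it["vec"]), reverse=True)[:k]

def pre_filter(query_vec, items, pred, k):
    """Filter on metadata first, then vector-search the survivors.
    Fast when the filter is selective (e.g. 'size 9 boots')."""
    return top_k(query_vec, [it for it in items if pred(it)], k)

def post_filter(query_vec, items, pred, k, overfetch=4):
    """Vector-search first with over-fetch, then apply the filter.
    Fast when the filter is weak; can under-fill k if overfetch is too small."""
    return [it for it in top_k(query_vec, items, k * overfetch) if pred(it)][:k]

catalog = [
    {"id": "a", "vec": [1.0, 0.0], "price": 120},
    {"id": "b", "vec": [0.9, 0.1], "price": 200},
    {"id": "c", "vec": [0.2, 0.9], "price": 90},
]
under_150 = lambda it: it["price"] < 150
q = [1.0, 0.0]
print([it["id"] for it in pre_filter(q, catalog, under_150, k=2)])   # ['a', 'c']
print([it["id"] for it in post_filter(q, catalog, under_150, k=2)])  # ['a', 'c']
```

Both return the same results when over-fetch is sufficient; the `overfetch` factor is the knob you tune against your zero-result and under-fill rates.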
CLIP/SigLIP vs fine-tuned embeddings
Out-of-the-box CLIP/SigLIP works well for general catalogs but underperforms on niche domains (fashion, furniture, medical imaging) by 10-25% top-5 accuracy. Fine-tune on 50k-200k in-domain pairs for a substantial bump. The tradeoff is infrastructure: fine-tuning and redeploying requires GPU ops.
Vision-LLM reranker vs cross-encoder reranker
A vision LLM reranker (GPT-4o, Claude Sonnet 4) gives the best quality on nuanced queries ('something that would look good at a wedding') but costs 50-100x more than Cohere/Voyage Rerank. Use the LLM reranker on only the top 5-10 candidates and only for queries the classifier identifies as nuanced - not for 'red sneakers size 9'.
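Routing between the two rerankers needs a query classifier. A hedged heuristic sketch (in production this would be a small trained classifier; the word lists and regex here are illustrative assumptions, not a shipped rule set):

```python
import re

# Structured signals: explicit sizes and prices route to the cheap path.
STRUCTURED = re.compile(r"size \d+|\$\d+|\bunder\b|\bover\b", re.I)
# Open-ended style vocabulary routes to the vision-LLM reranker.
STYLE_WORDS = {"wedding", "party", "statement", "vibe", "aesthetic", "occasion"}

def pick_reranker(query):
    words = set(query.lower().split())
    if words & STYLE_WORDS and not STRUCTURED.search(query):
        return "vision_llm"      # run on the top 5-10 candidates only
    return "cross_encoder"       # default cheap path

print(pick_reranker("red sneakers size 9"))                         # cross_encoder
print(pick_reranker("statement piece for a rooftop dinner party"))  # vision_llm
```

Because the expensive path fires only on a minority of queries, blended rerank cost stays close to the cross-encoder baseline.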
Failure modes & guardrails
Visual ambiguity - user uploads photo with multiple objects
Mitigation: Run a segmentation pass (SAM 2 or a YOLO model) to identify distinct objects. Present them to the user as tappable chips: 'Search by this shirt / these shoes / this bag'. Do not silently pick one. For product search, default to the largest centered object unless the user refines.
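The 'largest centered object' default can be scored directly from the detector's bounding boxes. A sketch; the area-minus-distance scoring is one plausible heuristic, not the only one:

```python
def default_object(boxes, image_w, image_h):
    """Pick the largest, most-centered detection as the default search target.
    boxes: list of (label, x1, y1, x2, y2) from the segmentation pass.
    Score = box area minus a penalty for distance from the image center."""
    cx, cy = image_w / 2, image_h / 2
    def score(box):
        _, x1, y1, x2, y2 = box
        area = (x2 - x1) * (y2 - y1)
        bx, by = (x1 + x2) / 2, (y1 + y2) / 2
        dist = ((bx - cx) ** 2 + (by - cy) ** 2) ** 0.5
        return area - dist * min(image_w, image_h)
    return max(boxes, key=score)[0]

boxes = [("shirt", 300, 100, 700, 600), ("shoes", 20, 800, 180, 950)]
print(default_object(boxes, 1000, 1000))  # shirt: larger and more central
```

The other detections still become the tappable refinement chips, so a wrong default is one tap away from correction.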
Text query has precise attributes that visual search misses (e.g., 'under $150')
Mitigation: Parse attributes out of the text query with a small LLM (GPT-4o-mini). Apply $150 as a price filter, not as part of the embedding vector. Embeddings capture visual style and product type; structured filters capture price/size/stock.
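For the price case specifically, even a regex covers most of the traffic before the small-LLM parser is needed. A sketch (the `lte`/`gte` filter shape is an assumption; match it to your vector DB's filter syntax):

```python
import re

def extract_price_filter(query):
    """Pull 'under $150' / 'over $50' out of the query text so price becomes
    a metadata filter instead of polluting the embedding input."""
    m = re.search(r"\b(under|over|below|above)\s*\$?(\d+)", query, re.I)
    if not m:
        return query, None
    op = "lte" if m.group(1).lower() in ("under", "below") else "gte"
    cleaned = (query[:m.start()] + query[m.end():]).strip()
    return cleaned, {"price": {op: int(m.group(2))}}

print(extract_price_filter("red leather ankle boots under $150"))
# ('red leather ankle boots', {'price': {'lte': 150}})
```

The cleaned query text is what goes to the text encoder; the extracted constraint goes to the filter stage.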
Zero-result queries hurt UX and signal catalog gaps
Mitigation: Track zero-result queries in observability. Never return an empty page - show closest matches with a 'no exact match, here are similar items' banner. Feed zero-result queries to merchandising and the fine-tuning pipeline.
Adult/unsafe image uploaded for image-to-image search
Mitigation: Run a moderation pass on any user-uploaded image (AWS Rekognition or Gemini safety classifier) before embedding and search. Reject or sandbox unsafe queries. This protects your catalog index from adversarial inputs and your team from exposure.
Catalog drift - new products added but index not refreshed
Mitigation: Index new products within 5-15 min of catalog write, not nightly. Use Pinecone/Qdrant streaming upserts. Track the index-freshness lag as a SLO and alert when it exceeds 30 min. Stale indexes hurt conversion on new-arrival collections immediately.
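The freshness SLO reduces to comparing two timestamps. A sketch, assuming you track the latest catalog write and the latest successful index upsert:

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_SLO = timedelta(minutes=30)

def freshness_alert(last_catalog_write, last_index_upsert, now=None):
    """True when the index has lagged the catalog beyond the SLO."""
    now = now or datetime.now(timezone.utc)
    if last_index_upsert >= last_catalog_write:
        return False  # index has caught up with the newest write
    return (now - last_catalog_write) > FRESHNESS_SLO

t0 = datetime(2026, 4, 16, 12, 0, tzinfo=timezone.utc)
# Write at t0, last upsert before it, 45 minutes elapsed: SLO breached.
print(freshness_alert(t0, t0 - timedelta(minutes=5), now=t0 + timedelta(minutes=45)))
```

Wire this check into the same alerting pipeline as latency and error-rate SLOs; freshness regressions are otherwise invisible until conversion dips.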
Frequently asked questions
CLIP, SigLIP, or Voyage multimodal?
Voyage-multimodal-3 leads on benchmarks in 2026 and is a managed API. SigLIP 2 is the best open-source model and self-hostable. OpenCLIP ViT-L/14 is the community default but showing its age. For new builds: Voyage if you want managed, SigLIP 2 if you want self-hosting and fine-tuning.
Pinecone, Qdrant, or pgvector?
Pinecone: best managed experience, strong at 10M+ vectors, $70-$500/month at small-medium scale. Qdrant: best self-hosted, great filter performance, harder ops. pgvector: fine if you're already on Postgres and have under 5-10M vectors. Above 100M, Qdrant or Milvus.
Do I need a reranker?
Yes for anything past 1M items. ANN alone returns candidates in the right neighborhood but the top 10 ordering is often off by 20-40%. Cohere Rerank v3 or Voyage Rerank on the top 50 fixes this for $1-$2 per 1k queries. Skip only if you are optimizing for ultra-low latency at small catalogs.
Can I combine text and image in one query?
Yes - 'these shoes but in brown' is the ideal multimodal query. Encode the image, encode the text delta ('in brown'), and average or concatenate with weights. Voyage-multimodal-3 supports this natively. For best results, let the user upload image + type a delta; then fine-tune the weighting on clicks.
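The weighted-average fusion is a few lines once both embeddings live in the same space. A sketch with toy 2-d vectors; the 0.7/0.3 split is a starting assumption to tune on click data:

```python
import math

def fuse(image_vec, text_vec, w_image=0.7, w_text=0.3):
    """Weighted average of image and text embeddings (same space),
    renormalized to unit length for cosine/dot-product search."""
    fused = [w_image * i + w_text * t for i, t in zip(image_vec, text_vec)]
    norm = math.sqrt(sum(x * x for x in fused))
    return [x / norm for x in fused]

q = fuse([1.0, 0.0], [0.0, 1.0])
print([round(x, 3) for x in q])  # [0.919, 0.394]
```

Renormalizing matters: without it, the fused vector's magnitude skews dot-product scores against items indexed with unit-norm vectors.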
How do I evaluate search quality?
Track CTR@5 (did the user click one of the top 5 results?), zero-result rate, and query-to-purchase conversion. Maintain a golden query set of 200-500 labeled queries with expected results. Re-eval on every embedding change, reranker change, or filter rule change. Braintrust is a good fit.
How much does image search cost?
At 500M queries/month with a 100M item catalog, budget $100k-$180k/month all-in (embeddings, vector index, rerank, CDN). Per-query cost: $0.0002-$0.0005. The big cost driver is image storage/CDN, not the search layer.
Should I fine-tune the embedding model?
Only if you have 50k+ in-domain training pairs (image-text or image-image) and a clear 10%+ accuracy gap. Fine-tuning SigLIP 2 or CLIP on 100k fashion pairs commonly yields 15-25% better top-5 accuracy on fashion catalogs. For general catalogs, don't bother; Voyage multimodal is good enough out of the box.
Related
Architectures
Video Summarization Pipeline
Reference architecture for turning YouTube videos, meetings, and webinars into chapters, transcripts, and key ...
OCR + Document Understanding Pipeline
Reference architecture for turning scanned documents, invoices, receipts, forms, and handwritten notes into st...
Enterprise Document Search
Reference architecture for semantic search across 1M+ enterprise documents (PDFs, Confluence, Notion, Google D...