RAG for Codebase Search
Last updated: April 15, 2026
Quick answer
The production stack uses Voyage Code-3 for embeddings, tree-sitter for AST-aware function-level chunking, Qdrant with language and repo metadata filters, and Claude Sonnet 4 for cited answers. Expect $0.05 to $0.15 per query. The biggest quality gain comes from switching to symbol-level chunking — not from changing the LLM.
The problem
Engineers need to query their codebase in natural language: "where do we handle refund webhooks", "show all Stripe API call sites", "how is the User type defined". The system must handle million-line monorepos, understand code semantics beyond token matching, and respond in under 2 seconds.
Architecture
Git Ingester
Clones the repo and walks the file tree. Tracks commits to detect changed files for incremental reindexing.
Alternatives: GitHub API (for cloud), GitLab webhooks, Gitea self-hosted
AST Chunker (tree-sitter)
Splits code at function, class, and interface boundaries using tree-sitter. Preserves symbol names and file paths as metadata.
Alternatives: Recursive character splitter (50% worse recall), Language-specific regex parsers
Code Embedding Model
Converts code chunks and natural-language queries to vectors in the same semantic space.
Alternatives: Nomic Embed Code (self-hostable), OpenAI text-embedding-3-large (weaker on code)
Vector DB
Stores embeddings with metadata: repo, file path, language, symbol name, line range.
Alternatives: pgvector, Pinecone
Hybrid Retriever
Runs dense search plus sparse BM25 in parallel. Merges with reciprocal rank fusion. Metadata filters restrict to requested repo or language.
Alternatives: Dense-only (misses symbol names), BM25-only (misses semantic matches)
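Reciprocal rank fusion is simple enough to sketch in full. The snippet below is plain Python with no library dependencies; the chunk IDs are invented for illustration, and k = 60 is the damping constant from the original RRF formulation:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per item; k damps the
    influence of any single list's top ranks.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Dense search and BM25 each return ranked chunk IDs (illustrative).
dense = ["auth.py:login", "billing.py:refund", "user.py:User"]
sparse = ["billing.py:refund", "webhooks.py:handle", "auth.py:login"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Metadata filtering (repo, language) is applied inside each retriever before fusion, so both lists are already scoped to the requested subset.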
Code Reranker
A cross-encoder reranks the top 50 retrieved code chunks down to the 5 most relevant to the query.
Alternatives: Voyage Rerank-2 Code, Cohere Rerank 3
Answer Synthesizer
Generates answer with exact file:line citations from the retrieved code chunks.
Alternatives: GPT-4o, Gemini 2.5 Pro
Query Interface
CLI, IDE plugin, or web interface. Displays the answer with clickable file:line references.
Alternatives: VS Code extension, CLI, Web app
The stack
Git Ingester
Native git access gives you commit-level change detection for incremental reindexing. No need for a managed connector at this layer.
Alternatives: GitHub API (for cloud), GitLab webhooks
AST Chunker (tree-sitter)
Splitting at function and class boundaries means each chunk is a complete, meaningful unit of code. Character-based splitting cuts across function boundaries and degrades retrieval recall by 40 to 50%.
Alternatives: Recursive character splitter, Language-specific regex parsers
Code Embedding Model (Voyage Code-3)
Trained on code-query pairs, it outperforms general text embedding models by 15 to 25% on code retrieval benchmarks.
Alternatives: Nomic Embed Code (self-hostable), OpenAI text-embedding-3-large
Vector DB (Qdrant)
Metadata filtering on language, repo, and file path is fast and built in. pgvector works for single-repo deployments under 2M chunks.
Alternatives: pgvector, Pinecone
Code Reranker
Adds about 150ms and lifts answer relevance by 30% on code Q&A. Worth it within a developer tool's latency budget.
Alternatives: Cohere Rerank 3, Jina Reranker
Answer Synthesizer (Claude Sonnet 4)
Best at following file:line citation formats and distinguishing between similar symbol names; rarely conflates methods from different classes.
Alternatives: GPT-4o, Gemini 2.5 Pro
Cost at each scale
Prototype: 1 repo · 100k LOC · 1k queries/mo · $60/mo
Startup: 10 repos · 5M LOC · 50k queries/mo · $1,800/mo
Scale: 1000 repos · 500M LOC · 2M queries/mo · $28,000/mo
Tradeoffs
AST chunking vs character chunking
Character-based splitting at 512 tokens cuts across function boundaries, producing chunks that lack context. tree-sitter chunking at symbol boundaries improves retrieval recall by 40 to 50% on real codebases. The extra setup time is worth it.
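A minimal sketch of symbol-boundary chunking, using Python's stdlib `ast` module as a single-language stand-in for tree-sitter (tree-sitter does the same job across languages); the function and class names in the sample source are invented:

```python
import ast

def chunk_python_source(source: str, path: str):
    """Split a Python file at top-level function/class boundaries.

    Each chunk carries the symbol name, file path, and line range
    as metadata, mirroring what a tree-sitter chunker would emit.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({
                "symbol": node.name,
                "path": path,
                "lines": (node.lineno, node.end_lineno),
                "text": text,
            })
    return chunks

src = '''\
def refund_webhook(event):
    return process(event)

class User:
    name: str
'''
chunks = chunk_python_source(src, "billing.py")
```

Note that every chunk here is a complete definition with its signature intact, which is exactly what a 512-token character splitter fails to guarantee.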
Code embeddings vs text embeddings
Text embedding models like OpenAI text-embedding-3-large work, but underperform by 15 to 25% on code retrieval. If you have access to Voyage Code-3 or Nomic Embed Code, use them. The quality difference is measurable.
API embeddings vs self-hosted
Voyage Code-3 via API costs about $0.00012 per 1k tokens and outperforms most self-hosted alternatives. Nomic Embed Code is competitive and free to self-host if data residency matters.
Failure modes & guardrails
Index is stale after merge
Mitigation: Track the last-indexed commit hash per file. On each repo event, reindex only changed files. For monorepos with millions of files, full reindexes are too slow — incremental is a requirement.
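A sketch of the incremental plan, with per-file content hashes standing in for commit-level change detection (in production you would key on the last-indexed commit hash; the paths and contents below are illustrative):

```python
import hashlib

def plan_reindex(index_state: dict, current_files: dict):
    """Decide which files to (re)embed and which stale entries to drop.

    index_state maps path -> content hash recorded at last index time;
    current_files maps path -> current file content.
    """
    changed, removed = [], []
    new_state = {}
    for path, content in current_files.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        new_state[path] = digest
        if index_state.get(path) != digest:
            changed.append(path)  # new or modified: re-embed its chunks
    for path in index_state:
        if path not in current_files:
            removed.append(path)  # deleted: drop its chunks from the index
    return changed, removed, new_state

old_state = {
    "billing.py": hashlib.sha256(b"def refund(): ...").hexdigest(),
    "legacy.py": "deadbeef",  # file since deleted
}
files = {"billing.py": "def refund(): ...", "webhooks.py": "def handle(): ..."}
changed, removed, new_state = plan_reindex(old_state, files)
```

Unchanged files (here, billing.py) are never re-embedded, which is what keeps monorepo reindexing tractable.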
Large files overflow context
Mitigation: Cap chunk size at 800 tokens. For files above 5000 lines (generated code, fixtures), skip indexing entirely unless explicitly included.
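One possible shape for the cap and skip rules, with a rough 4-characters-per-token heuristic standing in for a real tokenizer:

```python
MAX_CHUNK_TOKENS = 800
MAX_FILE_LINES = 5000

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for code.
    return len(text) // 4

def should_index(path: str, source: str, include_override=frozenset()):
    """Skip huge files (generated code, fixtures) unless explicitly included."""
    if path in include_override:
        return True
    return source.count("\n") + 1 <= MAX_FILE_LINES

def split_oversized(chunk_text: str):
    """Cap chunk size: halve oversized chunks at a line boundary until they fit."""
    if approx_tokens(chunk_text) <= MAX_CHUNK_TOKENS:
        return [chunk_text]
    lines = chunk_text.splitlines()
    if len(lines) <= 1:
        return [chunk_text]  # single enormous line: keep whole rather than recurse
    mid = len(lines) // 2
    return (split_oversized("\n".join(lines[:mid]))
            + split_oversized("\n".join(lines[mid:])))
```

Splitting at the midpoint line is a fallback for the rare symbol that exceeds the cap; most chunks from the AST chunker already fit.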
Multi-file context breaks answers
Mitigation: Build a lightweight import graph alongside the vector index. When the top chunk imports from another file, add the imported symbol's chunk to the context automatically.
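A rough sketch of the import-graph expansion for Python files, again using the stdlib `ast` module; the module names, chunk shapes, and graph below are illustrative, and a real system would resolve import paths per language:

```python
import ast

def local_imports(source: str, known_modules: set):
    """Collect imports that resolve to modules inside this repo."""
    mods = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            mods.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module)
    return mods & known_modules

def expand_context(top_chunk, import_graph, chunks_by_module):
    """If the top chunk's module imports a repo module, pull that
    module's chunks into the LLM context alongside it."""
    extra = []
    for mod in import_graph.get(top_chunk["module"], ()):
        extra.extend(chunks_by_module.get(mod, ()))
    return [top_chunk] + extra

# Illustrative data: webhooks.py imports from billing.py.
src = "from billing import refund\nimport os\n"
mods = local_imports(src, {"billing", "users"})
graph = {"webhooks": {"billing"}}
chunks_by_module = {"billing": [{"module": "billing", "symbol": "refund"}]}
ctx = expand_context({"module": "webhooks", "symbol": "handle"}, graph, chunks_by_module)
```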
LLM fabricates file paths
Mitigation: Post-process every file path the LLM emits. Validate each path against the actual repo file tree. Reject answers that reference non-existent paths.
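A minimal version of that guardrail, assuming a simple file:line citation pattern; the regex and sample answer are illustrative, not the production format:

```python
import re

# Matches citations like src/billing.py:42 (path with extension, colon, line).
CITATION = re.compile(r"([\w./-]+\.[A-Za-z]+):(\d+)")

def validate_citations(answer: str, repo_files: set):
    """Check every file:line citation the LLM emitted against the real
    file tree. Returns (all_valid, list_of_invalid_paths)."""
    invalid = []
    for path, _line in CITATION.findall(answer):
        if path not in repo_files:
            invalid.append(path)
    return (not invalid, invalid)

repo_files = {"src/billing.py", "src/auth.py"}
ok, bad = validate_citations(
    "Refund webhooks are handled in src/billing.py:42.", repo_files)
```

When validation fails, the safest behavior is to reject the answer and re-prompt with the retrieved chunks' real paths inlined, rather than silently stripping the bad citation.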
Poor recall on non-English code comments
Mitigation: Run a language-detection pass at ingest. For non-English comment-heavy code, add a machine-translated English summary to the chunk metadata before embedding.
Frequently asked questions
How much does codebase RAG cost at scale?
At 1000 repos and 2M queries per month, budget $25k to $30k per month. The LLM synthesis step dominates at around 50% of total cost. Prompt caching reduces this by 20 to 30% on repeated codebases.
Which embedding model is best for code?
Voyage Code-3 leads code retrieval benchmarks in 2026. Nomic Embed Code is the best self-hostable option and is competitive with Voyage. OpenAI text-embedding-3-large works but underperforms by 15 to 25% on code-specific queries.
How does Cursor do codebase search?
Cursor uses a combination of symbol-level chunking, code-aware embeddings, and local indexing with a background sync daemon. They index only changed files on each save to keep the index fresh without expensive full reindexes.
Why use tree-sitter instead of plain text chunking?
Plain character-based chunking cuts through function boundaries, returning fragments that lack the signature, body, and context needed to answer questions accurately. tree-sitter splits at symbol boundaries — functions, classes, interfaces — so each chunk is a complete semantic unit. Retrieval recall improves 40 to 50%.
Should I use a managed API for embeddings or self-host?
For most teams, Voyage Code-3 via API is the right choice: no infrastructure to run, better quality, and low cost at $0.00012 per 1k tokens. Self-host Nomic Embed Code if you have data residency requirements or are processing more than 500M tokens per month.
How do I handle a monorepo with millions of files?
Index only source files — exclude generated code, test fixtures, and vendored dependencies. Use incremental reindexing keyed on commit hashes so only changed files are reembedded. A monorepo with 1M source files typically produces 3 to 5M chunks after filtering.
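The filtering step above can be sketched with stdlib `fnmatch`; the glob list is a hypothetical starting point to tune per repo, not a recommended default:

```python
from fnmatch import fnmatch

# Hypothetical exclude list: generated code, vendored deps, fixtures.
EXCLUDE_GLOBS = [
    "*_pb2.py", "*.min.js", "*.lock",               # generated code
    "vendor/*", "node_modules/*", "third_party/*",  # vendored dependencies
    "testdata/*", "*/fixtures/*",                   # test fixtures
]

def is_source_file(path: str) -> bool:
    return not any(fnmatch(path, pattern) for pattern in EXCLUDE_GLOBS)

files = ["src/billing.py", "vendor/lib.js", "api_pb2.py", "tests/fixtures/big.json"]
indexed = [f for f in files if is_source_file(f)]
```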