
RAG for Codebase Search

Last updated: April 15, 2026

Quick answer

The production stack uses Voyage Code-3 for embeddings, tree-sitter for AST-aware function-level chunking, Qdrant with language and repo metadata filters, and Claude Sonnet 4 for cited answers. Expect roughly $0.01 to $0.06 per query depending on scale. The biggest quality gain comes from switching to symbol-level chunking, not from changing the LLM.

The problem

Engineers need to query their codebase with natural language: "where do we handle refund webhooks", "show all Stripe API call sites", "how is the User type defined". The system must handle million-line monorepos, understand code semantics beyond token matching, and respond in under 2 seconds.

Architecture

Pipeline: Git Ingester (input) → AST Chunker, tree-sitter (data) → Code Embedding Model (llm) → Vector DB (data) → Hybrid Retriever (infra) → Code Reranker (llm) → Answer Synthesizer (llm) → Query Interface (output) → answer + citations

Git Ingester

Clones the repo and walks the file tree. Tracks commits to detect changed files for incremental reindexing.

Alternatives: GitHub API (for cloud), GitLab webhooks, Gitea self-hosted

AST Chunker (tree-sitter)

Splits code at function, class, and interface boundaries using tree-sitter. Preserves symbol names and file paths as metadata.

Alternatives: Recursive character splitter (50% worse recall), Language-specific regex parsers

Code Embedding Model

Converts code chunks and natural-language queries to vectors in the same semantic space.

Alternatives: Nomic Embed Code (self-hostable), OpenAI text-embedding-3-large (weaker on code)

Vector DB

Stores embeddings with metadata: repo, file path, language, symbol name, line range.

Alternatives: pgvector, Pinecone

Hybrid Retriever

Runs dense search plus sparse BM25 in parallel. Merges with reciprocal rank fusion. Metadata filters restrict to requested repo or language.

Alternatives: Dense-only (misses symbol names), BM25-only (misses semantic matches)
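
The fusion step above can be sketched in a few lines. This is a minimal reciprocal rank fusion implementation; the function name `rrf_merge` and the example document IDs are our own, and `k = 60` is the conventional default constant from the original RRF formulation, not something this architecture mandates.

```python
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# One ranking from dense search, one from BM25 (hypothetical chunk IDs).
dense = ["auth.py:login", "user.py:User", "billing.py:refund"]
bm25 = ["billing.py:refund", "auth.py:login", "stripe.py:charge"]
print(rrf_merge([dense, bm25])[:3])
# → ['auth.py:login', 'billing.py:refund', 'user.py:User']
```

Chunks that appear high in both rankings win, which is why the hybrid setup recovers both exact symbol-name hits (BM25) and semantic matches (dense).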

Code Reranker

A cross-encoder reranks the top 50 retrieved code chunks down to the 5 most relevant to the query.

Alternatives: Voyage Rerank-2 Code, Cohere Rerank 3

Answer Synthesizer

Generates answer with exact file:line citations from the retrieved code chunks.

Alternatives: GPT-4o, Gemini 2.5 Pro

Query Interface

CLI, IDE plugin, or web interface. Displays the answer with clickable file:line references.

Alternatives: VS Code extension, CLI, Web app

The stack

Ingestion: git clone + custom file walker

Native git access gives you commit-level change detection for incremental reindexing. No need for a managed connector at this layer.

Alternatives: GitHub API (for cloud), GitLab webhooks

Chunking: tree-sitter (symbol-level)

Splitting at function and class boundaries means each chunk is a complete, meaningful unit of code. Character-based splitting cuts across function boundaries and degrades retrieval by 40 to 50%.

Alternatives: Recursive character splitter, Language-specific regex parsers
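
The production chunker uses tree-sitter so the same logic works across languages. As a minimal single-language illustration of the same idea, Python's stdlib `ast` module can split a file at top-level function and class boundaries; the function name `chunk_symbols` and the metadata shape are ours, chosen to mirror the metadata fields listed above.

```python
import ast

def chunk_symbols(source: str, path: str) -> list[dict]:
    """Split Python source at top-level function/class boundaries.

    Each chunk keeps the symbol name, file path, and line range as
    metadata -- the same shape a tree-sitter chunker would emit.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "symbol": node.name,
                "path": path,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                "text": "\n".join(lines[node.lineno - 1 : node.end_lineno]),
            })
    return chunks

src = '''def refund_webhook(event):
    return handle(event)

class User:
    id: int
'''
for c in chunk_symbols(src, "billing/webhooks.py"):
    print(c["symbol"], f'{c["path"]}:{c["start_line"]}-{c["end_line"]}')
# → refund_webhook billing/webhooks.py:1-2
# → User billing/webhooks.py:4-5
```

Because each chunk carries `path` plus a line range, the synthesizer can emit exact file:line citations without re-parsing anything at query time.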

Code embeddings: Voyage Code-3

Trained on code-query pairs. Outperforms text embedding models by 15 to 25% on code retrieval benchmarks. Self-hostable alternative: Nomic Embed Code.

Alternatives: Nomic Embed Code (self-hostable), OpenAI text-embedding-3-large

Vector DB: Qdrant

Metadata filtering on language, repo, and file path is fast and built in. pgvector works for single-repo deployments under 2M chunks.

Alternatives: pgvector, Pinecone

Reranker: Voyage Rerank-2

Adds about 150ms and lifts answer relevance by 30% on code Q&A; the quality gain is worth the extra latency within a developer-tool budget.

Alternatives: Cohere Rerank 3, Jina Reranker

Answer LLM: Claude Sonnet 4

Best at following file:line citation formats and distinguishing between similar symbol names. Rarely conflates methods from different classes.

Alternatives: GPT-4o, Gemini 2.5 Pro

Cost at each scale

Prototype

1 repo · 100k LOC · 1k queries/mo

$60/mo

Embedding one-time (Voyage Code-3): $15
Query embeddings (1k/mo): $1
Reranking (1k queries, top-50): $5
Claude Sonnet 4 synthesis: $20
Qdrant Cloud (free tier): $0
Hosting: $19

Startup

10 repos · 5M LOC · 50k queries/mo

$1,800/mo

Incremental reindex (Voyage Code-3): $30
Query embeddings (50k/mo): $25
Reranking (50k queries, top-50): $80
Claude Sonnet 4 synthesis: $900
Qdrant Cloud (standard): $400
Infra + observability: $365

Scale

1000 repos · 500M LOC · 2M queries/mo

$28,000/mo

Reindex pipeline (ongoing): $2,000
Query embeddings (2M/mo): $800
Reranking (2M queries, top-50): $3,200
Claude Sonnet 4 synthesis: $14,000
Qdrant Enterprise: $4,000
Infra + observability: $4,000
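
Dividing each tier's monthly total by its query volume (figures from the tables above) shows how per-query cost falls with scale:

```python
# (monthly bill in USD, queries per month) for each tier, from the tables above
tiers = {
    "prototype": (60, 1_000),
    "startup": (1_800, 50_000),
    "scale": (28_000, 2_000_000),
}
for name, (monthly_usd, queries) in tiers.items():
    print(f"{name}: ${monthly_usd / queries:.3f}/query")
# → prototype: $0.060/query
# → startup: $0.036/query
# → scale: $0.014/query
```

The drop is driven mostly by LLM synthesis, whose unit price benefits from prompt caching on repeated codebases.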

Latency budget

Total P50: 1,400ms
Total P95: 2,420ms

Git ingestion + parse (offline, async): 0ms median · 0ms p95
Query embedding: 60ms median · 140ms p95
Hybrid retrieval (top-50): 90ms median · 200ms p95
Reranking to top-5: 150ms median · 280ms p95
LLM answer synthesis (streamed): 1,100ms median · 1,800ms p95

Tradeoffs

AST chunking vs character chunking

Character-based splitting at 512 tokens cuts across function boundaries, producing chunks that lack context. tree-sitter chunking at symbol boundaries improves retrieval recall by 40 to 50% on real codebases. The extra setup time is worth it.

Code embeddings vs text embeddings

Text embedding models like OpenAI text-embedding-3-large work, but underperform by 15 to 25% on code retrieval. If you have access to Voyage Code-3 or Nomic Embed Code, use them. The quality difference is measurable.

API embeddings vs self-hosted

Voyage Code-3 via API costs about $0.00012 per 1k tokens and outperforms most self-hosted alternatives. Nomic Embed Code is competitive and free to self-host if data residency matters.

Failure modes & guardrails

Index is stale after merge

Mitigation: Track the last-indexed commit hash per file. On each repo event, reindex only changed files. For monorepos with millions of files, full reindexes are too slow — incremental is a requirement.
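
In production the list of changed files would come from `git diff --name-only <last_indexed_commit> HEAD`; the following is a minimal sketch of the same bookkeeping using content hashes, with function and variable names of our own invention.

```python
import hashlib

def changed_files(index_state: dict[str, str], worktree: dict[str, str]) -> dict:
    """Compare stored content hashes against the current worktree to find
    which files need reindexing (added/changed) or deletion (removed)."""
    def digest(text: str) -> str:
        return hashlib.sha1(text.encode()).hexdigest()

    current = {path: digest(text) for path, text in worktree.items()}
    return {
        "added": [p for p in current if p not in index_state],
        "changed": [p for p in current
                    if p in index_state and index_state[p] != current[p]],
        "removed": [p for p in index_state if p not in current],
    }

# Stored state from the last index run vs. the worktree after a merge.
state = {"a.py": hashlib.sha1(b"old").hexdigest(),
         "b.py": hashlib.sha1(b"same").hexdigest()}
tree = {"a.py": "new", "b.py": "same", "c.py": "fresh"}
print(changed_files(state, tree))
# → {'added': ['c.py'], 'changed': ['a.py'], 'removed': []}
```

Only `added` and `changed` files are re-chunked and re-embedded, so a merge touching 10 files costs 10 files of work regardless of repo size.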

Large files overflow context

Mitigation: Cap chunk size at 800 tokens. For files above 5000 lines (generated code, fixtures), skip indexing entirely unless explicitly included.

Multi-file context breaks answers

Mitigation: Build a lightweight import graph alongside the vector index. When the top chunk imports from another file, add the imported symbol's chunk to the context automatically.
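
A minimal sketch of that expansion step, assuming the import graph is a simple file-to-imported-files mapping built at ingest; the function name `expand_context` and the example paths are hypothetical.

```python
def expand_context(top_paths: list[str], import_graph: dict[str, list[str]],
                   chunks_by_path: dict[str, str], budget: int = 5) -> list[str]:
    """Augment retrieved chunks with chunks from files they directly import,
    stopping once the context budget is reached."""
    selected = list(top_paths)
    for path in top_paths:
        for imported in import_graph.get(path, []):
            if imported in chunks_by_path and imported not in selected:
                selected.append(imported)
            if len(selected) >= budget:
                return selected
    return selected

# refund.py imports its Stripe client and the User model (hypothetical files).
graph = {"billing/refund.py": ["billing/stripe_client.py", "models/user.py"]}
chunks = {"billing/refund.py": "...", "billing/stripe_client.py": "...",
          "models/user.py": "..."}
print(expand_context(["billing/refund.py"], graph, chunks))
# → ['billing/refund.py', 'billing/stripe_client.py', 'models/user.py']
```

One hop of imports is usually enough; transitively closing the graph tends to blow the context budget without improving answers.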

LLM fabricates file paths

Mitigation: Post-process every file path the LLM emits. Validate each path against the actual repo file tree. Reject answers that reference non-existent paths.
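
The validation pass can be as simple as checking each cited path against a set of known repo files; this sketch (names ours) assumes citations arrive in `file:line` form.

```python
def validate_citations(answer_paths: list[str],
                       repo_files: set[str]) -> tuple[list[str], list[str]]:
    """Split LLM-cited references into verified and fabricated ones.
    A `file:line` citation is kept only if the file exists in the repo tree."""
    valid, fabricated = [], []
    for ref in answer_paths:
        path = ref.rsplit(":", 1)[0]  # strip a trailing :line if present
        (valid if path in repo_files else fabricated).append(ref)
    return valid, fabricated

# repo_files would come from the indexed file tree (hypothetical paths here).
repo = {"src/auth.py", "src/billing.py"}
cited = ["src/auth.py:42", "src/payments.py:10"]
ok, bad = validate_citations(cited, repo)
print(ok, bad)
# → ['src/auth.py:42'] ['src/payments.py:10']
```

An answer containing any fabricated reference is rejected and regenerated rather than shown with the bad citation silently dropped.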

Poor recall on non-English code comments

Mitigation: Run a language-detection pass at ingest. For non-English comment-heavy code, add a machine-translated English summary to the chunk metadata before embedding.

Frequently asked questions

How much does codebase RAG cost at scale?

At 1000 repos and 2M queries per month, budget $25k to $30k per month. The LLM synthesis step dominates at around 50% of total cost. Prompt caching reduces this by 20 to 30% on repeated codebases.

Which embedding model is best for code?

Voyage Code-3 leads code retrieval benchmarks in 2026. Nomic Embed Code is the best self-hostable option and is competitive with Voyage. OpenAI text-embedding-3-large works but underperforms by 15 to 25% on code-specific queries.

How does Cursor do codebase search?

Cursor uses a combination of symbol-level chunking, code-aware embeddings, and local indexing with a background sync daemon. They index only changed files on each save to keep the index fresh without expensive full reindexes.

Why use tree-sitter instead of plain text chunking?

Plain character-based chunking cuts through function boundaries, returning fragments that lack the signature, body, and context needed to answer questions accurately. tree-sitter splits at symbol boundaries — functions, classes, interfaces — so each chunk is a complete semantic unit. Retrieval recall improves 40 to 50%.

Should I use a managed API for embeddings or self-host?

For most teams, Voyage Code-3 via API is the right choice: no infrastructure to run, better quality, and low cost at $0.00012 per 1k tokens. Self-host Nomic Embed Code if you have data residency requirements or are processing more than 500M tokens per month.

How do I handle a monorepo with millions of files?

Index only source files — exclude generated code, test fixtures, and vendored dependencies. Use incremental reindexing keyed on commit hashes so only changed files are reembedded. A monorepo with 1M source files typically produces 3 to 5M chunks after filtering.
