RAG for Codebase Search
Last updated: April 15, 2026
Quick answer
The production stack uses Voyage Code-3 for embeddings, tree-sitter for AST-aware function-level chunking, Qdrant with language and repo metadata filters, and Claude Sonnet 4 for cited answers. Expect $0.05 to $0.15 per query. The biggest quality gain comes from switching to symbol-level chunking — not from changing the LLM.
The problem
Engineers need to query their codebase in natural language: "where do we handle refund webhooks", "show all Stripe API call sites", "how is the User type defined". The system must handle million-line monorepos, understand code semantics beyond token matching, and respond in under 2 seconds.
Architecture
Git Ingester
Clones the repo and walks the file tree. Tracks commits to detect changed files for incremental reindexing.
Alternatives: GitHub API (for cloud), GitLab webhooks, Gitea self-hosted
AST Chunker (tree-sitter)
Splits code at function, class, and interface boundaries using tree-sitter. Preserves symbol names and file paths as metadata.
Alternatives: Recursive character splitter (50% worse recall), Language-specific regex parsers
Code Embedding Model
Converts code chunks and natural-language queries to vectors in the same semantic space.
Alternatives: Nomic Embed Code (self-hostable), OpenAI text-embedding-3-large (weaker on code)
Vector DB
Stores embeddings with metadata: repo, file path, language, symbol name, line range.
Alternatives: pgvector, Pinecone
Hybrid Retriever
Runs dense search plus sparse BM25 in parallel. Merges with reciprocal rank fusion. Metadata filters restrict to requested repo or language.
Alternatives: Dense-only (misses symbol names), BM25-only (misses semantic matches)
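Reciprocal rank fusion is simple enough to sketch in full. The snippet below is plain Python with no library dependencies; the chunk IDs are invented for illustration, and k = 60 is the damping constant from the original RRF formulation:

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge several ranked lists of chunk IDs into one ranking.

    Each list contributes 1 / (k + rank) per item; k damps the
    influence of any single list's top ranks.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Dense search and BM25 each return ranked chunk IDs (illustrative).
dense = ["auth.py:login", "billing.py:refund", "user.py:User"]
sparse = ["billing.py:refund", "webhooks.py:handle", "auth.py:login"]
fused = reciprocal_rank_fusion([dense, sparse])
```

Metadata filtering (repo, language) is applied inside each retriever before fusion, so both lists are already scoped to the requested subset.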
Code Reranker
A cross-encoder reranks the top 50 retrieved code chunks down to the 5 most relevant to the query.
Alternatives: Voyage Rerank-2 Code, Cohere Rerank 3
Answer Synthesizer
Generates answer with exact file:line citations from the retrieved code chunks.
Alternatives: GPT-4o, Gemini 2.5 Pro
Query Interface
CLI, IDE plugin, or web interface. Displays the answer with clickable file:line references.
Alternatives: VS Code extension, CLI, Web app
The stack
Git Ingester
Native git access gives you commit-level change detection for incremental reindexing. No need for a managed connector at this layer.
Alternatives: GitHub API (for cloud), GitLab webhooks
AST Chunker (tree-sitter)
Splitting at function and class boundaries means each chunk is a complete, meaningful unit of code. Character-based splitting cuts across function boundaries and degrades retrieval recall by 40 to 50%.
Alternatives: Recursive character splitter, Language-specific regex parsers
Code Embedding Model (Voyage Code-3)
Trained on code-query pairs, it outperforms general text embedding models by 15 to 25% on code retrieval benchmarks.
Alternatives: Nomic Embed Code (self-hostable), OpenAI text-embedding-3-large
Vector DB (Qdrant)
Metadata filtering on language, repo, and file path is fast and built in. pgvector works for single-repo deployments under 2M chunks.
Alternatives: pgvector, Pinecone
Code Reranker
Adds about 150ms and lifts answer relevance by 30% on code Q&A. Worth it within a developer tool's latency budget.
Alternatives: Cohere Rerank 3, Jina Reranker
Answer Synthesizer (Claude Sonnet 4)
Best at following file:line citation formats and distinguishing between similar symbol names; rarely conflates methods from different classes.
Alternatives: GPT-4o, Gemini 2.5 Pro
Cost at each scale
Prototype: 1 repo · 100k LOC · 1k queries/mo · $60/mo
Startup: 10 repos · 5M LOC · 50k queries/mo · $1,800/mo
Scale: 1000 repos · 500M LOC · 2M queries/mo · $28,000/mo
Tradeoffs
AST chunking vs character chunking
Character-based splitting at 512 tokens cuts across function boundaries, producing chunks that lack context. tree-sitter chunking at symbol boundaries improves retrieval recall by 40 to 50% on real codebases. The extra setup time is worth it.
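A minimal sketch of symbol-boundary chunking, using Python's stdlib `ast` module as a single-language stand-in for tree-sitter (tree-sitter does the same job across languages); the function and class names in the sample source are invented:

```python
import ast

def chunk_python_source(source: str, path: str):
    """Split a Python file at top-level function/class boundaries.

    Each chunk carries the symbol name, file path, and line range
    as metadata, mirroring what a tree-sitter chunker would emit.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({
                "symbol": node.name,
                "path": path,
                "lines": (node.lineno, node.end_lineno),
                "text": text,
            })
    return chunks

src = '''\
def refund_webhook(event):
    return process(event)

class User:
    name: str
'''
chunks = chunk_python_source(src, "billing.py")
```

Note that every chunk here is a complete definition with its signature intact, which is exactly what a 512-token character splitter fails to guarantee.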
Code embeddings vs text embeddings
Text embedding models like OpenAI text-embedding-3-large work, but underperform by 15 to 25% on code retrieval. If you have access to Voyage Code-3 or Nomic Embed Code, use them. The quality difference is measurable.
API embeddings vs self-hosted
Voyage Code-3 via API costs about $0.00012 per 1k tokens and outperforms most self-hosted alternatives. Nomic Embed Code is competitive and free to self-host if data residency matters.
Failure modes & guardrails
Index is stale after merge
Mitigation: Track the last-indexed commit hash per file. On each repo event, reindex only changed files. For monorepos with millions of files, full reindexes are too slow — incremental is a requirement.
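A sketch of the incremental plan, with per-file content hashes standing in for commit-level change detection (in production you would key on the last-indexed commit hash; the paths and contents below are illustrative):

```python
import hashlib

def plan_reindex(index_state: dict, current_files: dict):
    """Decide which files to (re)embed and which stale entries to drop.

    index_state maps path -> content hash recorded at last index time;
    current_files maps path -> current file content.
    """
    changed, removed = [], []
    new_state = {}
    for path, content in current_files.items():
        digest = hashlib.sha256(content.encode()).hexdigest()
        new_state[path] = digest
        if index_state.get(path) != digest:
            changed.append(path)  # new or modified: re-embed its chunks
    for path in index_state:
        if path not in current_files:
            removed.append(path)  # deleted: drop its chunks from the index
    return changed, removed, new_state

old_state = {
    "billing.py": hashlib.sha256(b"def refund(): ...").hexdigest(),
    "legacy.py": "deadbeef",  # file since deleted
}
files = {"billing.py": "def refund(): ...", "webhooks.py": "def handle(): ..."}
changed, removed, new_state = plan_reindex(old_state, files)
```

Unchanged files (here, billing.py) are never re-embedded, which is what keeps monorepo reindexing tractable.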
Large files overflow context
Mitigation: Cap chunk size at 800 tokens. For files above 5000 lines (generated code, fixtures), skip indexing entirely unless explicitly included.
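One possible shape for the cap and skip rules, with a rough 4-characters-per-token heuristic standing in for a real tokenizer:

```python
MAX_CHUNK_TOKENS = 800
MAX_FILE_LINES = 5000

def approx_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token for code.
    return len(text) // 4

def should_index(path: str, source: str, include_override=frozenset()):
    """Skip huge files (generated code, fixtures) unless explicitly included."""
    if path in include_override:
        return True
    return source.count("\n") + 1 <= MAX_FILE_LINES

def split_oversized(chunk_text: str):
    """Cap chunk size: halve oversized chunks at a line boundary until they fit."""
    if approx_tokens(chunk_text) <= MAX_CHUNK_TOKENS:
        return [chunk_text]
    lines = chunk_text.splitlines()
    if len(lines) <= 1:
        return [chunk_text]  # single enormous line: keep whole rather than recurse
    mid = len(lines) // 2
    return (split_oversized("\n".join(lines[:mid]))
            + split_oversized("\n".join(lines[mid:])))
```

Splitting at the midpoint line is a fallback for the rare symbol that exceeds the cap; most chunks from the AST chunker already fit.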
Multi-file context breaks answers
Mitigation: Build a lightweight import graph alongside the vector index. When the top chunk imports from another file, add the imported symbol's chunk to the context automatically.
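A rough sketch of the import-graph expansion for Python files, again using the stdlib `ast` module; the module names, chunk shapes, and graph below are illustrative, and a real system would resolve import paths per language:

```python
import ast

def local_imports(source: str, known_modules: set):
    """Collect imports that resolve to modules inside this repo."""
    mods = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            mods.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            mods.add(node.module)
    return mods & known_modules

def expand_context(top_chunk, import_graph, chunks_by_module):
    """If the top chunk's module imports a repo module, pull that
    module's chunks into the LLM context alongside it."""
    extra = []
    for mod in import_graph.get(top_chunk["module"], ()):
        extra.extend(chunks_by_module.get(mod, ()))
    return [top_chunk] + extra

# Illustrative data: webhooks.py imports from billing.py.
src = "from billing import refund\nimport os\n"
mods = local_imports(src, {"billing", "users"})
graph = {"webhooks": {"billing"}}
chunks_by_module = {"billing": [{"module": "billing", "symbol": "refund"}]}
ctx = expand_context({"module": "webhooks", "symbol": "handle"}, graph, chunks_by_module)
```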
LLM fabricates file paths
Mitigation: Post-process every file path the LLM emits. Validate each path against the actual repo file tree. Reject answers that reference non-existent paths.
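A minimal version of that guardrail, assuming a simple file:line citation pattern; the regex and sample answer are illustrative, not the production format:

```python
import re

# Matches citations like src/billing.py:42 (path with extension, colon, line).
CITATION = re.compile(r"([\w./-]+\.[A-Za-z]+):(\d+)")

def validate_citations(answer: str, repo_files: set):
    """Check every file:line citation the LLM emitted against the real
    file tree. Returns (all_valid, list_of_invalid_paths)."""
    invalid = []
    for path, _line in CITATION.findall(answer):
        if path not in repo_files:
            invalid.append(path)
    return (not invalid, invalid)

repo_files = {"src/billing.py", "src/auth.py"}
ok, bad = validate_citations(
    "Refund webhooks are handled in src/billing.py:42.", repo_files)
```

When validation fails, the safest behavior is to reject the answer and re-prompt with the retrieved chunks' real paths inlined, rather than silently stripping the bad citation.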
Poor recall on non-English code comments
Mitigation: Run a language-detection pass at ingest. For non-English comment-heavy code, add a machine-translated English summary to the chunk metadata before embedding.
Frequently asked questions
How much does codebase RAG cost at scale?
At 1000 repos and 2M queries per month, budget $25k to $30k per month. The LLM synthesis step dominates at around 50% of total cost. Prompt caching reduces this by 20 to 30% on repeated codebases.
Which embedding model is best for code?
Voyage Code-3 leads code retrieval benchmarks in 2026. Nomic Embed Code is the best self-hostable option and is competitive with Voyage. OpenAI text-embedding-3-large works but underperforms by 15 to 25% on code-specific queries.
How does Cursor do codebase search?
Cursor uses a combination of symbol-level chunking, code-aware embeddings, and local indexing with a background sync daemon. They index only changed files on each save to keep the index fresh without expensive full reindexes.
Why use tree-sitter instead of plain text chunking?
Plain character-based chunking cuts through function boundaries, returning fragments that lack the signature, body, and context needed to answer questions accurately. tree-sitter splits at symbol boundaries — functions, classes, interfaces — so each chunk is a complete semantic unit. Retrieval recall improves 40 to 50%.
Should I use a managed API for embeddings or self-host?
For most teams, Voyage Code-3 via API is the right choice: no infrastructure to run, better quality, and low cost at $0.00012 per 1k tokens. Self-host Nomic Embed Code if you have data residency requirements or are processing more than 500M tokens per month.
How do I handle a monorepo with millions of files?
Index only source files — exclude generated code, test fixtures, and vendored dependencies. Use incremental reindexing keyed on commit hashes so only changed files are reembedded. A monorepo with 1M source files typically produces 3 to 5M chunks after filtering.
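The filtering step above can be sketched with stdlib `fnmatch`; the glob list is a hypothetical starting point to tune per repo, not a recommended default:

```python
from fnmatch import fnmatch

# Hypothetical exclude list: generated code, vendored deps, fixtures.
EXCLUDE_GLOBS = [
    "*_pb2.py", "*.min.js", "*.lock",               # generated code
    "vendor/*", "node_modules/*", "third_party/*",  # vendored dependencies
    "testdata/*", "*/fixtures/*",                   # test fixtures
]

def is_source_file(path: str) -> bool:
    return not any(fnmatch(path, pattern) for pattern in EXCLUDE_GLOBS)

files = ["src/billing.py", "vendor/lib.js", "api_pb2.py", "tests/fixtures/big.json"]
indexed = [f for f in files if is_source_file(f)]
```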