
Production RAG Checklist 2026: 42 Things to Do Before You Ship

Shipping a RAG demo is easy. Shipping a RAG system that works reliably in production — that handles edge cases, monitors itself, stays within cost budgets, and degrades gracefully — is a different challenge entirely. This checklist covers everything you need before you go live.

Quick Answer

Most RAG systems fail in production because teams skip evaluation, observability, and failure handling. Run this checklist before launch, not after.


Section 1: Data and Chunking

  • [ ] 1. Chunking strategy is validated against your retrieval metrics. Fixed 512-token chunks are a starting point, not a solution. Run hit-rate evals on at least 100 test questions before deciding.

  • [ ] 2. Chunk overlap is set appropriately. Overlap of 10-15% prevents information loss at boundaries. Too much overlap increases index size and cost.

  • [ ] 3. Metadata is stored with each chunk. Source URL, document ID, section title, creation date, last modified date. You'll need this for filtering and debugging.

  • [ ] 4. Document parsing is tested on all file types in scope. PDF extraction fails differently than DOCX. Tables are particularly problematic — test tables explicitly.

  • [ ] 5. Chunk quality has been manually reviewed. Sample 50 chunks randomly. If you find chunks that are incomplete sentences, HTML artifacts, or pure whitespace, your parser needs work.

  • [ ] 6. A document freshness strategy is defined. How do you handle updated documents? Full re-index? Delta updates? Documents with stale embeddings will cause retrieval drift.

  • [ ] 7. Index size is monitored and projected. Embedding storage costs money. Project growth and set a budget alert at 80% of your planned index size.
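To make items 1 and 2 concrete, here is a minimal sketch of fixed-size chunking with overlap. The 512-token size and 12.5% overlap are illustrative defaults, not recommendations — validate them against your own hit-rate evals before settling.

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.125):
    """Split a token list into fixed-size chunks with ~12.5% overlap.

    chunk_size and overlap_ratio are starting points (items 1-2);
    run retrieval evals before treating them as final.
    """
    step = int(chunk_size * (1 - overlap_ratio))  # tokens to advance per chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last chunk reached the end of the document
    return chunks
```

A 1,000-token document with these defaults yields three chunks, with 64 tokens (12.5%) shared across each boundary.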


Section 2: Embedding and Retrieval

  • [ ] 8. Embedding model is selected based on benchmarks, not hype. Test at least two embedding models on your domain. BGE-M3, text-embedding-3-large, and Cohere embed-v3 perform differently across domains.

  • [ ] 9. Hit rate at k=5 is above 0.85 on your eval set. If it's not, don't ship. Improve chunking or embedding strategy first.

  • [ ] 10. Hybrid search (dense + BM25) is enabled if your queries contain exact terms. Product IDs, error codes, person names — pure vector search fails on these.

  • [ ] 11. A reranker is evaluated. Cohere rerank-v3.5 or a cross-encoder typically improves NDCG@5 by 5-10%. Worth the $20-50/day for most production systems.

  • [ ] 12. Embedding batch sizes are tuned for throughput. Don't embed documents one at a time. Batch size 96-256 is usually optimal for most embedding APIs.

  • [ ] 13. ANN index parameters are tuned (ef, m for HNSW). Default parameters are not optimal. Higher ef improves recall at the cost of latency. Test the tradeoff.

  • [ ] 14. Metadata filters are used to scope retrieval. Don't search the entire index if you can filter by department, product line, or date range. This reduces noise and improves precision.
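The hit-rate gate in item 9 is simple to implement. A sketch of the eval loop, assuming an `eval_set` of (question, gold chunk id) pairs and a `retrieve` function returning ranked chunk ids — both are assumptions about your harness, not a specific library API:

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose gold chunk appears in the top-k results.

    eval_set: list of (question, gold_chunk_id) pairs.
    retrieve: callable returning a ranked list of chunk ids.
    """
    hits = 0
    for question, gold_id in eval_set:
        top_k = retrieve(question)[:k]
        if gold_id in top_k:
            hits += 1
    return hits / len(eval_set)
```

If this returns below 0.85 on your eval set, fix chunking or embeddings before shipping (item 9).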


Section 3: Generation

  • [ ] 15. System prompt is explicit about using only the provided context. "Answer based only on the provided context. If the information is not in the context, say so." This reduces hallucination significantly.

  • [ ] 16. Context formatting is clean. Number each chunk, include source identifiers. "[Doc 1]: ..." helps the model attribute and reason about multiple sources.

  • [ ] 17. max_tokens is set appropriately. Don't leave it at the model default. Estimate your expected output length and set max_tokens to 1.5x that value.

  • [ ] 18. Temperature is set for your use case. Factual QA: temperature 0.0-0.3. Creative/conversational: 0.5-0.8. High temperature on factual tasks increases hallucination.

  • [ ] 19. Faithfulness score is above 0.85 on eval set. If the model is making up facts not in the context, the prompt needs work.

  • [ ] 20. Answer relevancy score is above 0.85 on eval set. If answers don't address the question, check for retrieval failures and prompt clarity.

  • [ ] 21. Out-of-scope queries are handled gracefully. What happens when a user asks something completely outside your knowledge base? The system should say so clearly, not hallucinate.

  • [ ] 22. Conflicting information in retrieved chunks is handled. Two chunks may contradict each other. The prompt should instruct the model to acknowledge the conflict.
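Items 15 and 16 can be combined in a single prompt builder. A minimal sketch — the `chunks` shape (dicts with `source` and `text` keys) is an assumption; adapt it to your retrieval output:

```python
def build_prompt(question, chunks):
    """Build a grounded QA prompt with numbered, source-attributed context.

    Implements the context-only instruction (item 15) and the
    "[Doc N]" formatting convention (item 16).
    """
    context = "\n\n".join(
        f"[Doc {i}] (source: {c['source']}):\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer based only on the provided context. "
        "If the information is not in the context, say so. "
        "If sources conflict, acknowledge the conflict.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

Numbered blocks also make it easy to ask the model for inline citations ("per [Doc 2]...") that you can verify against the retrieved set.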


Section 4: Latency and Cost

  • [ ] 23. End-to-end P50 and P99 latency is measured. Embedding query + retrieval + generation. P50 < 1s is achievable. P99 < 3s for most use cases.

  • [ ] 24. Embedding latency is under 100ms for real-time queries. Batching is right for offline indexing (item 12) but adds latency to single queries. For real-time, use a fast embedding API or a local model.

  • [ ] 25. Prompt caching is enabled where applicable. If your system prompt plus static context exceeds the provider's caching minimum (1,024 tokens for both Anthropic and OpenAI), use prompt caching. 50-90% cost reduction on the static portion.

  • [ ] 26. Cost per query is calculated and within budget. Embedding + retrieval + LLM inference. Calculate at 10K, 100K, 1M queries/day. The number often surprises teams.

  • [ ] 27. Response caching is implemented for repeated queries. A Redis cache for exact-match queries or semantic deduplication saves real money on production loads.

  • [ ] 28. Model routing is considered. Route simple queries to a cheap model (Gemini Flash at $0.10/1M, GPT-4o-mini at $0.15/1M). Reserve expensive models for complex queries.
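The cost calculation in item 26 is back-of-envelope arithmetic, but teams skip it. A sketch — all prices here are illustrative placeholders, not current rates; substitute your provider's pricing:

```python
def cost_per_query(input_tokens, output_tokens,
                   input_price_per_m=3.00, output_price_per_m=15.00,
                   embed_tokens=50, embed_price_per_m=0.02):
    """Estimate USD cost per query: LLM input + output + query embedding.

    All *_price_per_m values are placeholder $/1M-token rates;
    plug in your provider's actual pricing.
    """
    llm = (input_tokens * input_price_per_m
           + output_tokens * output_price_per_m) / 1_000_000
    embed = embed_tokens * embed_price_per_m / 1_000_000
    return llm + embed
```

At these placeholder rates, a 4,000-token prompt with a 500-token answer costs about $0.0195 per query — roughly $1,950/day at 100K queries/day, which is the kind of number that surprises teams.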


Section 5: Observability

  • [ ] 29. Every query is logged — query, retrieved chunks (with scores), generated answer, latency breakdown, token counts.

  • [ ] 30. Retrieval quality signals are captured. Log retrieval scores. Set an alert when average retrieval score drops (suggests index drift or embedding model mismatch).

  • [ ] 31. User feedback is captured. At minimum: thumbs up/down per response. This is your most valuable signal for detecting silent failures.

  • [ ] 32. Cost monitoring is set up. Per-query cost tracked, daily budget alerts configured. LLM costs can spike 10x with unexpected usage patterns.

  • [ ] 33. A dashboard shows key metrics. Hit rate, faithfulness score (sampled), latency P50/P99, cost per query, feedback ratio. Check it daily.

  • [ ] 34. Failed queries are flagged and reviewed. Define failure criteria (user gave negative feedback, system returned "I don't know," hallucination detected). Sample and review these weekly.
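Item 29 amounts to one structured log record per query. A sketch of the record shape — the field names are illustrative, not a standard; align them with your logging pipeline:

```python
import json
import time

def log_query(query, retrieved, answer, latency_ms, tokens):
    """Serialize one query's full trace as a JSON log line (item 29).

    retrieved: list of dicts with 'id' and 'score' (retrieval scores
    feed the drift alert in item 30).
    latency_ms / tokens: dicts breaking down stage latencies and
    prompt/completion token counts.
    """
    record = {
        "ts": time.time(),
        "query": query,
        "retrieved": [{"id": c["id"], "score": c["score"]} for c in retrieved],
        "answer": answer,
        "latency_ms": latency_ms,  # e.g. {"embed": 40, "retrieve": 25, "generate": 800}
        "tokens": tokens,          # e.g. {"prompt": 3200, "completion": 180}
    }
    return json.dumps(record)
```

Logging retrieval scores per query is what makes the index-drift alert in item 30 possible: average them over a rolling window and alert on a sustained drop.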


Section 6: Security and Data Governance

  • [ ] 35. Access control is enforced at the chunk level. Can user A retrieve chunks that user B shouldn't see? If your knowledge base has permissions, the retrieval layer must enforce them.

  • [ ] 36. Sensitive data is scrubbed before indexing. PII, secrets, internal pricing, draft content. Use a pre-indexing scan tool (Presidio, custom regex) to catch obvious issues.

  • [ ] 37. Prompt injection risks are understood. If users can upload documents that get indexed, those documents could contain adversarial instructions. Sanitize or isolate user-uploaded content.

  • [ ] 38. LLM provider data retention policies are reviewed. Does your provider train on your API calls? Check and configure accordingly (OpenAI zero data retention, Anthropic API policy).
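Item 35 in its simplest form is a post-retrieval filter. A sketch, assuming each chunk carries an `allowed_groups` metadata set — where your vector DB supports pre-filtering on metadata, enforce the ACL inside the query instead, so unauthorized chunks never consume top-k slots:

```python
def authorized_chunks(chunks, user_groups):
    """Drop retrieved chunks the requesting user may not see (item 35).

    chunks: dicts with an 'allowed_groups' set in their metadata
    (an assumed schema). user_groups: the user's group memberships.
    """
    return [
        c for c in chunks
        if c["allowed_groups"] & user_groups  # any shared group grants access
    ]
```

The post-filter is a safety net, not a substitute: test it explicitly with a user who should see nothing.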


Section 7: Failure Handling

  • [ ] 39. Retrieval failures return a graceful fallback. If the vector DB is down, the system should fail clearly rather than hallucinate from no context.

  • [ ] 40. LLM API failures have retry logic with exponential backoff. Rate limit errors (429) should be retried automatically. Timeouts should trigger a fallback model or graceful degradation.

  • [ ] 41. The system handles empty retrieval gracefully. When no chunks are retrieved (no relevant content in the index), the response should say so, not attempt to answer from parametric memory.

  • [ ] 42. Load testing is completed at 2x expected peak traffic. Include embedding, retrieval, and generation in the load test. Find your bottleneck before your users do.
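Item 40's retry logic is short enough to get right. A sketch with exponential backoff and jitter — the retry counts and delays are illustrative defaults, and in production you would catch your client library's specific rate-limit exception rather than bare `Exception`:

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry a flaky call with exponential backoff plus jitter (item 40).

    fn raises on failure (e.g. a 429 rate-limit error).
    Delays grow as base_delay * 2^attempt, jittered to avoid
    synchronized retry storms.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error, don't degrade silently
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            time.sleep(delay)
```

On final failure this re-raises, which is the right default per items 39 and 41: a clear error beats an answer generated from no context.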


Pre-Launch Scorecard

Count your checked boxes:

  • 35-42: Production-ready. Ship with confidence.
  • 25-34: Acceptable for internal beta. Fix the gaps before public launch.
  • 15-24: Demo quality. Not ready for production.
  • <15: Back to the drawing board.

The items teams most commonly skip: #10 (hybrid search), #21 (out-of-scope handling), #29-34 (observability), and #39-41 (failure handling). These are also the ones that cause the worst production incidents.
