RAG Evaluation: Measuring Retrieval and Generation Quality (2026)
RAG systems have two failure points: retrieval (wrong chunks) and generation (hallucinations given correct chunks). Evaluate both separately. For retrieval: measure context precision and context recall. For generation: measure faithfulness (is the answer supported by the context?) and answer relevance (does it answer the question?). RAGAS is the standard framework for automated RAG evaluation in 2026.
When to Use
- ✓ Before deploying a RAG system to validate it meets quality thresholds
- ✓ After changing chunking strategy, embedding model, or retrieval parameters — to measure impact
- ✓ When users report wrong answers — to diagnose whether the issue is retrieval or generation
- ✓ Setting up a CI/CD regression gate so new deployments don't degrade answer quality
- ✓ Comparing multiple RAG architectures (naive RAG vs. hybrid RAG vs. agentic RAG) on the same dataset
How It Works
1. Create a golden dataset: 50–200 query/answer/source-document triples. For each query, identify which document chunks contain the answer. This is your ground truth.
2. Measure retrieval metrics: Context Precision (what fraction of retrieved chunks are relevant?) and Context Recall (what fraction of relevant chunks were retrieved?). Use your golden dataset for ground truth.
3. Measure generation metrics: Faithfulness (are all claims in the answer supported by the retrieved context?) and Answer Relevance (how well does the answer address the question?). Use LLM-as-judge for both.
4. The RAGAS framework automates all four metrics. Install with pip install ragas. It uses an LLM judge (configurable) to score faithfulness and relevance, and standard IR metrics for retrieval.
5. Monitor in production with sampling: log 1% of real queries, run them through your eval pipeline, track metric trends over time. Alert if faithfulness drops below 0.85 or context precision drops below 0.6.
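The sampling-and-alerting step above can be sketched as a rolling-window monitor. This is a minimal sketch, assuming sampled scores arrive one at a time; the MetricMonitor class and window size are illustrative, not part of RAGAS:

```python
from collections import deque

class MetricMonitor:
    """Track a rolling mean of a sampled production metric and flag
    when it drops below a floor (e.g. faithfulness < 0.85)."""

    def __init__(self, floor: float, window: int = 100):
        self.floor = floor
        self.scores = deque(maxlen=window)  # only keep the latest window

    def record(self, score: float) -> bool:
        """Record one sampled score; return True if the rolling mean
        has fallen below the alert floor."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        return mean < self.floor
```

A rolling window smooths out single-query noise, so one bad judge score does not fire an alert by itself.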
Examples
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset
# Your RAG output
data = {
    'question': ['What is prompt caching?', ...],
    'answer': ['Prompt caching stores computed...', ...],
    'contexts': [['Prompt caching is a feature...', 'The cache stores...'], ...],
    'ground_truth': ['Prompt caching reduces costs by...', ...]
}
result = evaluate(
    Dataset.from_dict(data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(result)

import numpy as np

def evaluate_retrieval(queries, ground_truth_chunks, retriever, k=5):
    """Compute precision@k and recall@k against ground-truth chunk IDs."""
    precision_scores = []
    recall_scores = []
    for query, gt_chunks in zip(queries, ground_truth_chunks):
        retrieved = retriever.get_relevant_documents(query)[:k]
        retrieved_ids = {r.metadata['chunk_id'] for r in retrieved}
        gt_ids = set(gt_chunks)
        precision = len(retrieved_ids & gt_ids) / len(retrieved_ids)
        recall = len(retrieved_ids & gt_ids) / len(gt_ids)
        precision_scores.append(precision)
        recall_scores.append(recall)
    return {f'precision@{k}': np.mean(precision_scores),
            f'recall@{k}': np.mean(recall_scores)}
Common Mistakes
- ✗ Evaluating only generation quality (final answer) without evaluating retrieval — if retrieval is wrong, no prompt engineering will fix the answer. Always evaluate the two stages separately.
- ✗ Small golden datasets (under 30 examples) that produce unstable metrics — a single query can swing precision@5 by 3 percentage points. Use at least 100 examples for reliable benchmarks.
- ✗ Using the same LLM for generation and RAGAS evaluation — the judge model should be different from the generator to avoid self-serving bias. If using Claude 3.5 Sonnet for RAG, use GPT-4o as the RAGAS judge.
- ✗ Not versioning your eval dataset — if you update the golden dataset, you can't compare metrics across versions. Version control your eval data and never modify existing examples (only append).
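The append-only versioning rule can be enforced with a content fingerprint recorded alongside each metric run, so scores are only ever compared within the same dataset version. A minimal sketch; dataset_fingerprint and append_example are hypothetical helpers, not part of RAGAS:

```python
import hashlib
import json

def dataset_fingerprint(examples: list) -> str:
    """Stable content hash of the golden dataset. Log this next to
    every metric run; runs with different fingerprints aren't comparable."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode('utf-8')).hexdigest()[:12]

def append_example(examples: list, new_example: dict) -> list:
    """Append-only update: returns a new list, never mutates or edits
    existing examples."""
    return examples + [new_example]
```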
FAQ
What RAGAS scores should I target?
As rough targets: faithfulness >0.90 (answers are grounded in context), answer relevancy >0.85 (answers address the question), context precision >0.75 (retrieved chunks are mostly relevant), context recall >0.70 (most relevant chunks were retrieved). Context recall below 0.60 is a red flag requiring chunking or retrieval improvements.
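The targets above can be encoded as a simple pass/fail check against an evaluation result. A sketch, assuming scores come back as a plain dict of metric name to value; the TARGETS dict and below_target helper are illustrative:

```python
# Rough targets from the answer above (illustrative, tune per domain).
TARGETS = {
    'faithfulness': 0.90,
    'answer_relevancy': 0.85,
    'context_precision': 0.75,
    'context_recall': 0.70,
}

def below_target(scores: dict) -> dict:
    """Return the metrics that miss their target, mapped to the gap."""
    return {m: round(t - scores.get(m, 0.0), 3)
            for m, t in TARGETS.items() if scores.get(m, 0.0) < t}
```

Used as a CI gate, a non-empty result fails the build.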
How do I build a golden dataset efficiently?
Three approaches: (1) Sample real user queries from production logs and manually annotate. (2) Use an LLM to generate synthetic Q&A pairs from your documents, then manually validate a subset. (3) Use existing domain benchmarks if available. Manual annotation of 100 examples takes 4–8 hours but produces the most reliable evaluations.
Is RAGAS reliable?
RAGAS is reliable for directional comparisons (is version A better than version B?) but not for absolute quality claims. The LLM judge introduces variance — the same answer can receive different faithfulness scores across runs. Run each eval 3 times and report the mean. For production monitoring, use consistent judge model and temperature.
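The run-three-times advice can be wrapped in a small helper that reports mean and spread per metric. A sketch, assuming the eval is exposed as a zero-argument callable returning a metric dict; run_eval_n_times is a hypothetical name:

```python
import statistics

def run_eval_n_times(eval_fn, n: int = 3) -> dict:
    """Run a stochastic eval n times and report mean/stdev per metric,
    since LLM-judge scores vary across runs."""
    runs = [eval_fn() for _ in range(n)]
    return {
        metric: {
            'mean': statistics.mean(r[metric] for r in runs),
            'stdev': statistics.stdev(r[metric] for r in runs),
        }
        for metric in runs[0]
    }
```

A large stdev relative to the gap between two versions means the comparison is noise, not signal.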
What should I do when faithfulness is low?
Low faithfulness means the LLM is adding information not present in the context (hallucinating). Fixes: (1) Add explicit grounding instruction: 'Answer only using the provided context. If the context doesn't contain the answer, say so.' (2) Reduce temperature. (3) Check if the relevant context is actually being retrieved — low faithfulness is often a retrieval problem in disguise.
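Fix (1) amounts to a grounding wrapper around the user's question and the retrieved context. A minimal sketch; the template wording follows the answer above, and the helper name is illustrative:

```python
GROUNDED_PROMPT = (
    "Answer only using the provided context. "
    "If the context doesn't contain the answer, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_grounded_prompt(context: str, question: str) -> str:
    """Wrap retrieved context and the question in an explicit
    grounding instruction to reduce hallucinated claims."""
    return GROUNDED_PROMPT.format(context=context, question=question)
```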
How do I evaluate RAG on multi-hop questions?
Multi-hop questions require combining information from multiple chunks. Standard RAGAS doesn't measure multi-hop performance well. Use MultiHopQA-style evaluation: create questions that require 2-3 reasoning steps across documents, and check if all required source chunks are retrieved (not just one). LlamaIndex provides multi-hop evaluation tools.
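The all-chunks-retrieved check described above is a stricter criterion than ordinary recall: a minimal sketch, assuming chunks are identified by IDs as in the retrieval example earlier (the multihop_answerable name is illustrative):

```python
def multihop_answerable(retrieved_ids: set, required_ids: set) -> bool:
    """A multi-hop question only counts as answerable if *every*
    required source chunk was retrieved, not just one of them."""
    return required_ids <= retrieved_ids
```

Averaging this boolean over a multi-hop question set gives a pass rate that standard per-chunk recall would overstate.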