How to Evaluate RAG Systems in 2026: Metrics, Frameworks, and Real Benchmarks
Most teams ship RAG pipelines without proper evaluation. They eyeball 20 examples, declare it "good enough," and discover six months later that the system hallucinates on 30% of real user queries. This guide covers how to build a proper eval system for RAG — one that catches problems before production.
The Core RAG Evaluation Problem
RAG has two distinct failure modes, and you need to measure both:
- Retrieval failures — The right chunks aren't retrieved. The model never had the information to answer correctly.
- Generation failures — The right chunks were retrieved, but the model generated an incorrect or hallucinated answer anyway.
Most teams only measure final answer quality and miss retrieval failures entirely. This is a mistake.
The Four Core RAG Metrics
1. Context Recall
What it measures: Did the retrieved chunks contain the information needed to answer the question?
Formula: Fraction of ground-truth answer statements that can be attributed to the retrieved context.
Target: > 0.85 for production systems
How to measure:
def context_recall(question, ground_truth_answer, retrieved_chunks, llm):
    prompt = f"""
    For each statement in the answer, determine if it can be attributed to the context.
    Answer: {ground_truth_answer}
    Context: {retrieved_chunks}
    Return: fraction of statements supported by context (0.0 to 1.0)"""
    return float(llm.generate(prompt))
2. Context Precision
What it measures: Are the retrieved chunks actually relevant, or is there noise in the retrieved context?
Formula: Fraction of retrieved chunks that are relevant to answering the question.
Target: > 0.80
Low context precision means you're filling the context window with irrelevant chunks, increasing cost and confusing the model.
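Context precision can be scored with the same LLM-as-judge pattern as context recall. This is a minimal sketch, assuming an `llm` object with a `generate(prompt)` method like the one above; the prompt wording is illustrative, not RAGAS's actual prompt:

```python
def context_precision(question, retrieved_chunks, llm):
    """Fraction of retrieved chunks judged relevant to the question."""
    relevant = 0
    for chunk in retrieved_chunks:
        prompt = f"""Does the following context help answer the question?
Question: {question}
Context: {chunk}
Answer strictly 'yes' or 'no'."""
        # Count the chunk as relevant if the judge says yes
        if llm.generate(prompt).strip().lower().startswith("yes"):
            relevant += 1
    return relevant / len(retrieved_chunks) if retrieved_chunks else 0.0
```

Judging each chunk individually (rather than the whole retrieved set at once) makes the score easy to debug: you can log exactly which chunks were flagged as noise.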
3. Faithfulness
What it measures: Does the generated answer make claims supported by the retrieved context? This is your hallucination detector.
Formula: Fraction of answer statements that are directly supported by the retrieved context.
Target: > 0.90 for factual QA
def faithfulness(answer, retrieved_chunks, llm):
    # Extract claims from answer
    claims = extract_claims(answer, llm)
    # Verify each claim against context
    supported = 0
    for claim in claims:
        is_supported = verify_claim(claim, retrieved_chunks, llm)
        supported += int(is_supported)
    return supported / len(claims) if claims else 1.0
4. Answer Relevancy
What it measures: Does the answer actually address the question asked?
Formula: Semantic similarity between the question and generated answer (using embeddings).
Target: > 0.85
This catches answers that are factually correct but don't answer the actual question.
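Following the formula above, a minimal sketch of the similarity computation. The `embed` function is an assumption here: any callable that maps text to an embedding vector (for example, a wrapper around your embedding-model client):

```python
import math

def answer_relevancy(question, answer, embed):
    """Cosine similarity between question and answer embeddings.

    `embed` is assumed to map a string to a list of floats.
    """
    q, a = embed(question), embed(answer)
    dot = sum(x * y for x, y in zip(q, a))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in a))
    return dot / norm if norm else 0.0
```

Note that production frameworks compute this more robustly (e.g. by generating candidate questions from the answer and comparing those to the original question), but raw question-answer similarity is a reasonable first cut.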
RAGAS: The Standard Eval Framework
RAGAS (Retrieval Augmented Generation Assessment) has become the de facto standard for RAG evaluation. It implements all four metrics above and handles the LLM-as-judge scoring.
pip install ragas
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_recall,
    context_precision,
)
from datasets import Dataset

# Your eval dataset
data = {
    "question": ["What is the refund policy?", "How do I cancel my subscription?"],
    "answer": [generated_answers],        # Your RAG system's outputs
    "contexts": [retrieved_chunks_list],  # List of lists
    "ground_truth": [correct_answers],    # Your labeled answers
}
dataset = Dataset.from_dict(data)

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_recall, context_precision],
)
print(results)
# Output:
# {'faithfulness': 0.87, 'answer_relevancy': 0.91,
#  'context_recall': 0.83, 'context_precision': 0.79}
Building Your Eval Dataset
The hardest part of RAG evaluation is building the test dataset. Options:
Option 1: Manual golden set (best quality)
Have domain experts write 100-200 question/answer pairs from your actual knowledge base. Slow but the highest signal.
Option 2: Synthetic generation (fastest)
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, reasoning, multi_context

generator = TestsetGenerator.with_openai()
testset = generator.generate_with_langchain_docs(
    documents,
    test_size=100,
    distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
)
Synthetic datasets are fast but have a known problem: the LLM generates questions it can answer, missing the hard edge cases.
Option 3: Production query mining (most realistic)
Log real user queries and have humans review 5% of them to produce ground-truth answers. Slow to build, but it captures the actual distribution of query difficulty.
Recommendation: Start with 50 synthetic + 50 manually selected questions. Expand with production queries over time.
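The 5% sampling step can be as simple as a seeded random draw over your query logs. A trivial sketch (the function name and signature are this article's, not a library's):

```python
import random

def sample_for_review(logged_queries, fraction=0.05, seed=42):
    """Randomly sample a fraction of logged queries for human labeling.

    A fixed seed keeps the sample reproducible across runs, so the
    review queue doesn't change every time the script is re-run.
    """
    rng = random.Random(seed)
    k = max(1, round(len(logged_queries) * fraction))
    return rng.sample(logged_queries, k)
```

In practice you may want stratified sampling (by query length, topic, or user feedback signal) rather than uniform sampling, so rare-but-important query types are represented.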
Retrieval-Specific Evaluation
Evaluate your retriever independently before evaluating end-to-end:
Hit Rate
For each test question, does the correct chunk appear in the top-K retrieved results?
def hit_rate(questions, correct_chunk_ids, retriever, k=5):
    hits = 0
    for question, correct_id in zip(questions, correct_chunk_ids):
        retrieved_ids = [c.id for c in retriever.search(question, k=k)]
        if correct_id in retrieved_ids:
            hits += 1
    return hits / len(questions)
Target: > 0.90 at k=5
Mean Reciprocal Rank (MRR)
Not just whether the correct chunk is retrieved, but how high it ranks.
def mrr(questions, correct_chunk_ids, retriever, k=10):
    total_rr = 0
    for question, correct_id in zip(questions, correct_chunk_ids):
        retrieved_ids = [c.id for c in retriever.search(question, k=k)]
        for rank, rid in enumerate(retrieved_ids, 1):
            if rid == correct_id:
                total_rr += 1 / rank
                break
    return total_rr / len(questions)
Target: > 0.75
Chunking Strategy Evaluation
Chunking decisions dramatically affect retrieval quality. Evaluate different strategies:
| Strategy | Chunk Size | Overlap | Hit Rate (typical) | Cost per Query |
| --- | --- | --- | --- | --- |
| Fixed size | 512 tokens | 50 tokens | 0.78 | Low |
| Fixed size | 256 tokens | 25 tokens | 0.81 | Low |
| Sentence splitting | ~100 tokens | 0 | 0.74 | Low |
| Semantic chunking | Variable | 0 | 0.88 | Medium |
| Hierarchical | Variable | 0 | 0.91 | High |
Semantic chunking (splitting at semantic boundaries rather than fixed token counts) consistently outperforms fixed-size chunking by 5-15% on hit rate.
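A minimal sketch of the idea behind semantic chunking: embed consecutive sentences and start a new chunk wherever similarity to the previous sentence drops below a threshold. The `embed` function and the threshold value are assumptions; production implementations add window smoothing and size limits:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def semantic_chunks(sentences, embed, threshold=0.75):
    """Group consecutive sentences into chunks, splitting at the
    points where embedding similarity to the previous sentence
    drops below `threshold` (i.e. at topic boundaries)."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, vec) < threshold:
            chunks.append(" ".join(current))
            current = [sent]
        else:
            current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```

The threshold is the main tuning knob: too high and you fall back to sentence splitting, too low and chunks grow past the point where context precision suffers.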
Setting Up Continuous Evaluation
Point-in-time evaluation is not enough. You need continuous eval that runs on every pipeline change:
# GitHub Actions workflow
name: RAG Eval
on:
  push:
    paths:
      - 'rag/**'
      - 'data/**'
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run RAG eval
        run: python scripts/eval_rag.py
      - name: Check thresholds
        run: |
          python - <<'EOF'
          import json
          results = json.load(open('eval_results.json'))
          assert results['faithfulness'] > 0.85, f"Faithfulness {results['faithfulness']} below threshold"
          assert results['context_recall'] > 0.80, f"Context recall {results['context_recall']} below threshold"
          print('All thresholds passed')
          EOF
Common RAG Failure Patterns (and How to Detect Them)
Pattern 1: The Confident Wrong Answer
Faithfulness is high but answer relevancy is low. The model faithfully uses the context, but the retrieved context is the wrong context. Fix: Improve the embedding model or add metadata filtering.
Pattern 2: The Hedged Non-Answer
Context recall is high but answer relevancy is low. The model retrieves the right information but responds with "I'm not sure" or gives a vague answer. Fix: Adjust the system prompt to encourage direct answers; check whether the model temperature is too high.
Pattern 3: Context Confusion
Context precision is low. You're retrieving too many chunks and the model gets confused. Fix: Reduce top-K, add reranking, improve metadata filtering.
Pattern 4: The Staleness Problem
Hit rate drops over time. New content is added but not re-indexed, or the embedding model was changed without re-indexing. Fix: Build an indexing freshness monitor; alert when content age exceeds a threshold.
Benchmarks: What Good Looks Like
Based on production systems in 2026:
| Use Case | Faithfulness | Context Recall | Answer Relevancy | Context Precision |
| --- | --- | --- | --- | --- |
| Customer Support (simple FAQ) | 0.94 | 0.89 | 0.93 | 0.87 |
| Technical Documentation | 0.88 | 0.82 | 0.86 | 0.80 |
| Legal/Compliance | 0.97 | 0.91 | 0.89 | 0.84 |
| Code Search | 0.85 | 0.78 | 0.82 | 0.74 |
| Multi-hop QA | 0.79 | 0.71 | 0.77 | 0.68 |
Multi-hop questions (which require combining information from multiple documents) are significantly harder and should be tracked separately.
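Tracking categories separately just means tagging each eval question and aggregating per tag. A small sketch, assuming each eval row is a dict with a `category` tag and per-question metric scores (the field names are illustrative):

```python
from collections import defaultdict
from statistics import mean

def scores_by_category(rows):
    """Aggregate per-question faithfulness scores by question
    category (e.g. 'simple' vs 'multi_hop'), so harder classes
    aren't hidden inside one blended average."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row["category"]].append(row["faithfulness"])
    return {cat: round(mean(vals), 3) for cat, vals in buckets.items()}
```

A single blended score can look healthy while the multi-hop slice quietly regresses; per-category aggregation is what makes that visible.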
Quick Eval Checklist
Before shipping any RAG system:
- [ ] Minimum 100-question eval dataset (50+ manually selected)
- [ ] Retrieval metrics: hit rate > 0.85 at k=5, MRR > 0.70
- [ ] Generation metrics: faithfulness > 0.85, answer relevancy > 0.85
- [ ] Tested on adversarial queries (out-of-scope questions, ambiguous queries)
- [ ] CI pipeline runs eval on every relevant code change
- [ ] Production logging captures user feedback signals (thumbs up/down)
- [ ] Defined thresholds for prod alerts when metrics degrade
Methodology
All benchmarks, pricing, and performance figures cited in this article are sourced from publicly available data: provider pricing pages (verified 2026-04-16), LMSYS Chatbot Arena ELO leaderboard, MTEB retrieval benchmark, and independent API tests. Costs are listed as per-million-token input/output unless noted. Rankings reflect the publication date and change as models update.