Building an Evals Framework for LLM Applications (2026)
An evals framework is the testing infrastructure for LLM applications. It consists of: a dataset of input/expected output pairs, eval functions that score actual outputs, a runner that executes evals at scale, and a dashboard that tracks metrics over time. Unlike unit tests, evals accept imperfect scores — the goal is to detect regressions (score drops), not achieve 100%.
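These four pieces can be sketched in a few lines of Python. The names here (`Example`, `EvalResult`, `run_evals`) are illustrative, not from any particular library:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Example:
    """One row of the eval dataset: input, expected output, metadata."""
    input: str
    expected: str
    metadata: dict = field(default_factory=dict)

@dataclass
class EvalResult:
    """A score in [0, 1] rather than pass/fail, plus an optional reason."""
    score: float
    reason: str = ''

def run_evals(dataset: list, task: Callable, scorers: list) -> float:
    """Runner: execute the task on each example, score every output,
    and return the mean score for trend tracking."""
    results = []
    for ex in dataset:
        output = task(ex.input)
        for scorer in scorers:
            results.append(scorer(output, ex.expected))
    return sum(r.score for r in results) / len(results)
```

The dashboard layer is then just persistence and plotting of the number `run_evals` returns, per run, over time.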
When to Use
- ✓ Before shipping any LLM feature to production — validate it meets quality thresholds
- ✓ After every prompt change, model upgrade, or retrieval system change — catch regressions
- ✓ When debugging user-reported quality issues — run targeted evals to reproduce and measure the problem
- ✓ Comparing two approaches (different models, prompts, RAG strategies) to make a data-driven choice
- ✓ Setting up ongoing quality monitoring for production LLM applications
How It Works
1. Structure evals in three layers: unit evals (single input/output pair), scenario evals (multi-turn conversation), and end-to-end evals (full task completion). Start with unit evals and add complexity as needed.
2. Eval functions score a single output: exact match, substring match, regex match, JSON schema validation, LLM-as-judge, or custom programmatic checks. Each eval function returns a score (0-1) and optional metadata.
3. Build a dataset of (input, expected output, metadata) triples. Start with 50–100 examples per major use case. Add failure cases as you find them. Never delete examples — only append.
4. Run evals against multiple model versions simultaneously using a configuration system. This enables side-by-side comparison and ensures you only deploy improvements.
5. CI integration: run fast evals (under 2 minutes) on every PR. Run full evals (30+ minutes) on main branch merges. Block deploys if score drops more than 5% on any critical dimension.
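The deploy gate in step 5 can be sketched as a comparison of the current run against a stored baseline. The function name, score dictionaries, and dimension names here are hypothetical:

```python
def check_regression(baseline: dict, current: dict,
                     critical: set, tolerance: float = 0.05) -> list:
    """Return the critical dimensions whose score dropped by more
    than `tolerance` relative to the baseline run."""
    failures = []
    for dim in critical:
        if baseline[dim] - current.get(dim, 0.0) > tolerance:
            failures.append(dim)
    return failures

# In CI, a nonzero exit blocks the deploy, e.g.:
#   failures = check_regression(baseline, current, {'factuality', 'helpful'})
#   if failures:
#       sys.exit(f'Score regression on: {failures}')
```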
Examples
```python
import braintrust
from braintrust import Eval
from autoevals import Factuality, LLMClassifier

# Define your eval
Eval(
    'customer-support-qa',
    data=lambda: [
        {'input': {'query': 'How do I cancel my subscription?'},
         'expected': 'cancel subscription instructions'},
        {'input': {'query': 'What are your business hours?'},
         'expected': 'business hours information'},
        # ... 100+ examples
    ],
    task=lambda input: my_llm_pipeline(input['query']),
    scores=[
        Factuality,  # Is the answer factually correct?
        LLMClassifier(
            name='helpful',
            prompt_template='Is this response helpful? Yes/No. Response: {{output}}',
            choice_scores={'Yes': 1, 'No': 0}
        )
    ]
)
```

```python
import json

def eval_json_extraction(output: str, expected: dict) -> dict:
    '''
    Eval function for structured JSON extraction tasks.
    Returns score 0-1 and per-field breakdown.
    '''
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {'score': 0.0, 'reason': 'Invalid JSON', 'fields': {}}
    field_scores = {}
    for field, expected_value in expected.items():
        actual = parsed.get(field)
        if actual == expected_value:
            field_scores[field] = 1.0
        elif actual is not None and str(actual).lower() in str(expected_value).lower():
            field_scores[field] = 0.5  # Partial match
        else:
            field_scores[field] = 0.0
    overall = sum(field_scores.values()) / len(field_scores)
    return {
        'score': overall,
        'fields': field_scores,
        'reason': f'{sum(v == 1 for v in field_scores.values())}/{len(field_scores)} fields exact match'
    }
```
Common Mistakes
- ✗ Treating evals as pass/fail tests — LLM outputs are probabilistic. A 5% drop in score is significant; a 0.5% drop is noise. Set score thresholds with appropriate tolerances and track trends, not individual run scores.
- ✗ Deleting or modifying eval examples — eval datasets must be append-only to maintain historical comparability. If an expected answer was wrong, add a corrected version as a new example rather than modifying the existing one.
- ✗ Not separating train and eval data — examples used to develop/tune a prompt should not be in the eval set. The overfitting risk is the same as in classical ML — your eval set must be held out from prompt development.
- ✗ Only running evals before deploys, not in production — offline evals catch regressions but not distribution shift (real user queries differ from your examples). Add production sampling to your eval pipeline: randomly score 1% of production queries.
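The production-sampling idea from the last point can be sketched as a thin wrapper around the serving path. `sample_and_score` and the `judge` callable are illustrative names, and the `rng` parameter exists only to make the sketch deterministic for testing:

```python
import random
from typing import Callable, Optional

def sample_and_score(query: str, output: str, judge: Callable,
                     sample_rate: float = 0.01,
                     rng: Callable[[], float] = random.random) -> Optional[float]:
    """Score a random fraction of production traffic with an LLM judge.
    Returns the judge's score for sampled requests, None otherwise."""
    if rng() >= sample_rate:
        return None  # skip the vast majority of traffic
    return judge(query, output)
```

In practice the returned scores would be logged to the same dashboard as offline eval runs, so offline and online trends can be compared directly.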
FAQ
What eval framework should I use?
Braintrust is the most full-featured commercial option with strong CI integration. LangSmith (LangChain) is popular if you're already using LangChain. Promptfoo is a good open-source option for simpler needs. OpenAI Evals provides a standardized format. For simple applications, a custom eval runner with pytest is sufficient and avoids vendor lock-in.
How many eval examples do I need?
Minimum 50 examples per eval category for reliable metrics (fewer gives high variance). 200+ examples for production CI. For rare failure modes, add examples whenever you encounter failures in production. The eval dataset should grow continuously — treat every production failure as an eval example to add.
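A quick standard-error estimate shows where these sample sizes come from, treating each example's 0/1 score as roughly Bernoulli (an approximation):

```python
import math

def score_stderr(mean_score: float, n: int) -> float:
    """Approximate standard error of a mean 0/1 score over n examples."""
    return math.sqrt(mean_score * (1 - mean_score) / n)
```

At a mean score of 0.8, 50 examples give a standard error of roughly ±5.7%, the same magnitude as a meaningful regression; 200 examples cut it to roughly ±2.8%, enough to separate a real 5% drop from noise.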
What should I eval when I have no labeled data?
Start with model-generated evals: use a powerful model to generate 50 question/answer pairs from your domain documents, then manually validate 20% of them. Supplement with behavioral tests: is the output always JSON? Always under 500 words? Never contains [REDACTED]? These require no labels and catch obvious regressions.
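The behavioral tests mentioned above can be written as plain label-free checks; `behavioral_checks` is an illustrative name, not a library function:

```python
import json

def behavioral_checks(output: str) -> dict:
    """Label-free checks on an LLM output: each returns 1.0 (pass) or 0.0 (fail)."""
    def is_json(s: str) -> bool:
        try:
            json.loads(s)
            return True
        except json.JSONDecodeError:
            return False

    return {
        'valid_json': 1.0 if is_json(output) else 0.0,
        'under_500_words': 1.0 if len(output.split()) < 500 else 0.0,
        'no_redacted': 1.0 if '[REDACTED]' not in output else 0.0,
    }
```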
How do I run evals cost-efficiently?
Tiered eval strategy: (1) Fast evals (regex, schema validation, exact match) — run on every commit, cost <$0.01. (2) Medium evals (LLM judge with cheap model) — run on PR review, cost <$1. (3) Full evals (GPT-4o judge, large dataset) — run on main branch, cost $5-20. Only escalate to expensive evals when fast evals pass.
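The tiered strategy can be sketched as an escalation ladder, where each tier runs only if the cheaper one clears a threshold. All names and the 0.8 threshold here are assumptions:

```python
from typing import Callable

def tiered_eval(output: str, fast_checks: list,
                medium_judge: Callable, full_judge: Callable,
                threshold: float = 0.8) -> dict:
    """Run cheap programmatic checks first; escalate to pricier
    LLM judges only when the current tier clears the threshold."""
    fast = sum(check(output) for check in fast_checks) / len(fast_checks)
    if fast < threshold:
        return {'tier': 'fast', 'score': fast}      # fail cheaply, stop here
    medium = medium_judge(output)
    if medium < threshold:
        return {'tier': 'medium', 'score': medium}  # no need to pay for the full judge
    return {'tier': 'full', 'score': full_judge(output)}
```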
How do evals relate to monitoring?
Evals are offline batch processes run on curated datasets. Monitoring is online scoring of production traffic in real time. They're complementary: evals catch regressions before deploy; monitoring catches distribution shift after deploy. An ideal setup: evals in CI + 1% production sampling with LLM judge + alerting on score degradation.