Building an Evals Framework for LLM Applications (2026)
An evals framework is the testing infrastructure for LLM applications. It consists of: a dataset of input/expected output pairs, eval functions that score actual outputs, a runner that executes evals at scale, and a dashboard that tracks metrics over time. Unlike unit tests, evals accept imperfect scores — the goal is to detect regressions (score drops), not achieve 100%.
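These four pieces can be sketched in a few lines of Python. The names here (`Example`, `EvalResult`, `run_evals`) are illustrative, not from any particular library:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Example:
    """One row of the eval dataset: input, expected output, metadata."""
    input: str
    expected: str
    metadata: dict = field(default_factory=dict)

@dataclass
class EvalResult:
    """A score in [0, 1] rather than pass/fail, plus an optional reason."""
    score: float
    reason: str = ''

def run_evals(dataset: list, task: Callable, scorers: list) -> float:
    """Runner: execute the task on each example, score every output,
    and return the mean score for trend tracking."""
    results = []
    for ex in dataset:
        output = task(ex.input)
        for scorer in scorers:
            results.append(scorer(output, ex.expected))
    return sum(r.score for r in results) / len(results)
```

The dashboard layer is then just persistence and plotting of the number `run_evals` returns, per run, over time.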
When to Use
- ✓ Before shipping any LLM feature to production — validate it meets quality thresholds
- ✓ After every prompt change, model upgrade, or retrieval system change — catch regressions
- ✓ When debugging user-reported quality issues — run targeted evals to reproduce and measure the problem
- ✓ Comparing two approaches (different models, prompts, RAG strategies) to make a data-driven choice
- ✓ Setting up ongoing quality monitoring for production LLM applications
How It Works
1. Structure evals in three layers: unit evals (single input/output pair), scenario evals (multi-turn conversation), and end-to-end evals (full task completion). Start with unit evals and add complexity as needed.
2. Eval functions score a single output: exact match, substring match, regex match, JSON schema validation, LLM-as-judge, or custom programmatic checks. Each eval function returns a score (0-1) and optional metadata.
3. Build a dataset of (input, expected output, metadata) triples. Start with 50–100 examples per major use case. Add failure cases as you find them. Never delete examples — only append.
4. Run evals against multiple model versions simultaneously using a configuration system. This enables side-by-side comparison and ensures you only deploy improvements.
5. CI integration: run fast evals (under 2 minutes) on every PR. Run full evals (30+ minutes) on main branch merges. Block deploys if score drops more than 5% on any critical dimension.
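The deploy gate in step 5 can be sketched as a comparison of the current run against a stored baseline. The function name, score dictionaries, and dimension names here are hypothetical:

```python
def check_regression(baseline: dict, current: dict,
                     critical: set, tolerance: float = 0.05) -> list:
    """Return the critical dimensions whose score dropped by more
    than `tolerance` relative to the baseline run."""
    failures = []
    for dim in critical:
        if baseline[dim] - current.get(dim, 0.0) > tolerance:
            failures.append(dim)
    return failures

# In CI, a nonzero exit blocks the deploy, e.g.:
#   failures = check_regression(baseline, current, {'factuality', 'helpful'})
#   if failures:
#       sys.exit(f'Score regression on: {failures}')
```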
Examples
```python
import braintrust
from braintrust import Eval
from autoevals import Factuality, LLMClassifier

# Define your eval
Eval(
    'customer-support-qa',
    data=lambda: [
        {'input': {'query': 'How do I cancel my subscription?'},
         'expected': 'cancel subscription instructions'},
        {'input': {'query': 'What are your business hours?'},
         'expected': 'business hours information'},
        # ... 100+ examples
    ],
    task=lambda input: my_llm_pipeline(input['query']),
    scores=[
        Factuality,  # Is the answer factually correct?
        LLMClassifier(
            name='helpful',
            prompt_template='Is this response helpful? Yes/No. Response: {{output}}',
            choice_scores={'Yes': 1, 'No': 0}
        )
    ]
)
```

```python
import json

def eval_json_extraction(output: str, expected: dict) -> dict:
    '''
    Eval function for structured JSON extraction tasks.
    Returns score 0-1 and per-field breakdown.
    '''
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return {'score': 0.0, 'reason': 'Invalid JSON', 'fields': {}}
    field_scores = {}
    for field, expected_value in expected.items():
        actual = parsed.get(field)
        if actual == expected_value:
            field_scores[field] = 1.0
        elif actual is not None and str(actual).lower() in str(expected_value).lower():
            field_scores[field] = 0.5  # Partial match
        else:
            field_scores[field] = 0.0
    overall = sum(field_scores.values()) / len(field_scores)
    return {
        'score': overall,
        'fields': field_scores,
        'reason': f'{sum(v == 1 for v in field_scores.values())}/{len(field_scores)} fields exact match'
    }
```
Common Mistakes
- ✗ Treating evals as pass/fail tests — LLM outputs are probabilistic. A 5% drop in score is significant; a 0.5% drop is noise. Set score thresholds with appropriate tolerances and track trends, not individual run scores.
- ✗ Deleting or modifying eval examples — eval datasets must be append-only to maintain historical comparability. If an expected answer was wrong, add a corrected version as a new example rather than modifying the existing one.
- ✗ Not separating train and eval data — examples used to develop/tune a prompt should not be in the eval set. The overfitting risk is the same as in classical ML — your eval set must be held out from prompt development.
- ✗ Only running evals before deploys, not in production — offline evals catch regressions but not distribution shift (real user queries differ from your examples). Add production sampling to your eval pipeline: randomly score 1% of production queries.
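The production-sampling idea from the last point can be sketched as a thin wrapper around the serving path. `sample_and_score` and the `judge` callable are illustrative names, and the `rng` parameter exists only to make the sketch deterministic for testing:

```python
import random
from typing import Callable, Optional

def sample_and_score(query: str, output: str, judge: Callable,
                     sample_rate: float = 0.01,
                     rng: Callable[[], float] = random.random) -> Optional[float]:
    """Score a random fraction of production traffic with an LLM judge.
    Returns the judge's score for sampled requests, None otherwise."""
    if rng() >= sample_rate:
        return None  # skip the vast majority of traffic
    return judge(query, output)
```

In practice the returned scores would be logged to the same dashboard as offline eval runs, so offline and online trends can be compared directly.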
FAQ
What eval framework should I use?
Braintrust is the most full-featured commercial option with strong CI integration. LangSmith (LangChain) is popular if you're already using LangChain. Promptfoo is a good open-source option for simpler needs. OpenAI Evals provides a standardized format. For simple applications, a custom eval runner with pytest is sufficient and avoids vendor lock-in.
How many eval examples do I need?
Minimum 50 examples per eval category for reliable metrics (fewer gives high variance). 200+ examples for production CI. For rare failure modes, add examples whenever you encounter failures in production. The eval dataset should grow continuously — treat every production failure as an eval example to add.
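A quick standard-error estimate shows where these sample sizes come from, treating each example's 0/1 score as roughly Bernoulli (an approximation):

```python
import math

def score_stderr(mean_score: float, n: int) -> float:
    """Approximate standard error of a mean 0/1 score over n examples."""
    return math.sqrt(mean_score * (1 - mean_score) / n)
```

At a mean score of 0.8, 50 examples give a standard error of roughly ±5.7%, the same magnitude as a meaningful regression; 200 examples cut it to roughly ±2.8%, enough to separate a real 5% drop from noise.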
What should I eval when I have no labeled data?
Start with model-generated evals: use a powerful model to generate 50 question/answer pairs from your domain documents, then manually validate 20% of them. Supplement with behavioral tests: is the output always JSON? Always under 500 words? Never contains [REDACTED]? These require no labels and catch obvious regressions.
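The behavioral tests mentioned above can be written as plain label-free checks; `behavioral_checks` is an illustrative name, not a library function:

```python
import json

def behavioral_checks(output: str) -> dict:
    """Label-free checks on an LLM output: each returns 1.0 (pass) or 0.0 (fail)."""
    def is_json(s: str) -> bool:
        try:
            json.loads(s)
            return True
        except json.JSONDecodeError:
            return False

    return {
        'valid_json': 1.0 if is_json(output) else 0.0,
        'under_500_words': 1.0 if len(output.split()) < 500 else 0.0,
        'no_redacted': 1.0 if '[REDACTED]' not in output else 0.0,
    }
```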
How do I run evals cost-efficiently?
Tiered eval strategy: (1) Fast evals (regex, schema validation, exact match) — run on every commit, cost <$0.01. (2) Medium evals (LLM judge with cheap model) — run on PR review, cost <$1. (3) Full evals (GPT-4o judge, large dataset) — run on main branch, cost $5-20. Only escalate to expensive evals when fast evals pass.
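The tiered strategy can be sketched as an escalation ladder, where each tier runs only if the cheaper one clears a threshold. All names and the 0.8 threshold here are assumptions:

```python
from typing import Callable

def tiered_eval(output: str, fast_checks: list,
                medium_judge: Callable, full_judge: Callable,
                threshold: float = 0.8) -> dict:
    """Run cheap programmatic checks first; escalate to pricier
    LLM judges only when the current tier clears the threshold."""
    fast = sum(check(output) for check in fast_checks) / len(fast_checks)
    if fast < threshold:
        return {'tier': 'fast', 'score': fast}      # fail cheaply, stop here
    medium = medium_judge(output)
    if medium < threshold:
        return {'tier': 'medium', 'score': medium}  # no need to pay for the full judge
    return {'tier': 'full', 'score': full_judge(output)}
```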
How do evals relate to monitoring?
Evals are offline batch processes run on curated datasets. Monitoring is online scoring of production traffic in real time. They're complementary: evals catch regressions before deploy; monitoring catches distribution shift after deploy. An ideal setup: evals in CI + 1% production sampling with LLM judge + alerting on score degradation.