Regression Testing for LLM Applications (2026)
LLM regression testing runs a fixed evaluation set before and after a change to detect quality drops. Unlike software unit tests, regression tests for LLMs use probabilistic scoring (not exact match) and statistical significance testing to distinguish genuine regressions from random variance. A meaningful regression is a >5% drop in score that's consistent across multiple runs.
When to Use
- Before deploying any prompt change, even small wording tweaks that seem innocuous
- When upgrading to a new model version (gpt-4o-mini → gpt-4o-mini-2026) to verify no degradation
- After changing the retrieval strategy in a RAG system
- When a user reports a quality issue — add their case to the regression suite and verify the fix doesn't break other cases
- Before rolling out a new feature that touches the LLM pipeline
How It Works
1. Build a fixed regression suite: cases from historical failures, edge cases you've discovered, and representative positive examples. This suite never changes (only grows). Version-control it with your code.
2. Run tests against a baseline (current production) and the change candidate simultaneously. Compare scores statistically rather than by absolute values.
3. Use appropriate scoring for the task: exact match for structured outputs, LLM-as-judge for open-ended text, schema validation for JSON, functional tests for code. Choose the most deterministic scoring method available.
4. Set a regression threshold: block deployment if candidate scores drop more than 5% below baseline on any dimension, or if the drop is statistically significant (p < 0.05 with N=5 runs per test case).
5. Categorize tests by severity: P0 (must not regress — core functionality), P1 (should not regress — important features), P2 (track but allow regression with explanation). Block deploys only on P0 failures.
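The severity gating in the last step can be sketched as a small helper. This is a minimal sketch, not a real harness API: `SuiteResult`, its field names, and `gate_deploy` are all illustrative.

```python
from dataclasses import dataclass

@dataclass
class SuiteResult:
    severity: str    # 'P0', 'P1', or 'P2'
    regressed: bool  # output of your regression detector for this test
    name: str = ''

def gate_deploy(results: list[SuiteResult]) -> tuple[bool, list[str]]:
    """Block the deploy only on P0 regressions; P1/P2 regressions are surfaced, not blocking."""
    blockers = [r.name for r in results if r.severity == 'P0' and r.regressed]
    warnings = [r.name for r in results if r.severity != 'P0' and r.regressed]
    return (not blockers, blockers if blockers else warnings)

ok, flagged = gate_deploy([
    SuiteResult('P0', False, 'invoice-extraction'),
    SuiteResult('P1', True, 'tone-check'),
])
# ok is True: only a P1 case regressed, so the deploy proceeds with 'tone-check' flagged.
```

The point of the structure is that P1/P2 regressions stay visible in the output without ever flipping the blocking bit.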
Examples
```yaml
# promptfoo config: promptfooconfig.yaml
prompts:
  - 'prompts/v1/system.txt'
  - 'prompts/v2/system.txt' # candidate change
providers:
  - anthropic:claude-3-5-haiku-20241022
tests:
  - description: 'Extracts invoice number correctly'
    vars:
      document: '{{invoice_text}}'
    assert:
      - type: javascript
        value: 'output.includes("INV-2024")'
  - description: 'Handles missing date field gracefully'
    vars:
      document: '{{invoice_missing_date}}'
    assert:
      - type: llm-rubric
        value: 'Response acknowledges the missing date and does not hallucinate one'

# Run: promptfoo eval --no-cache
# Compare v1 vs v2 scores side by side
```

```python
from scipy import stats
import numpy as np

def detect_regression(
    baseline_scores: list[float],
    candidate_scores: list[float],
    threshold: float = 0.05,
) -> dict:
    """Return whether candidate is a statistically significant regression from baseline."""
    baseline_mean = np.mean(baseline_scores)
    candidate_mean = np.mean(candidate_scores)
    # Welch's t-test (equal_var=False handles unequal variance between the two samples)
    t_stat, p_value = stats.ttest_ind(baseline_scores, candidate_scores, equal_var=False)
    pct_change = (candidate_mean - baseline_mean) / baseline_mean
    is_regression = (
        p_value < 0.05 and       # statistically significant
        pct_change < -threshold  # meaningfully worse (>5% drop by default)
    )
    return {
        'is_regression': is_regression,
        'baseline_mean': baseline_mean,
        'candidate_mean': candidate_mean,
        'pct_change': pct_change,
        'p_value': p_value,
        'significant': p_value < 0.05,
    }
```
Common Mistakes
- Using exact string matching for open-ended outputs — 'The answer is 42' and '42' are the same answer. Use semantic equivalence checking (LLM judge or embedding similarity) rather than exact match for natural language outputs.
- Not running enough trials per test case — LLM outputs vary between runs, so running each case once gives unreliable scores. Run each case 3–5 times and compare means, especially for cases near the pass/fail threshold.
- Regression tests that are too sensitive — if 50% of your regression tests fail on every deploy due to trivial rewording, the team stops trusting them. Calibrate test sensitivity: important behaviors should fail hard; stylistic preferences should have wide tolerance.
- Not adding production failures to the regression suite — every time a user reports a quality issue, add that case to the regression suite before fixing it. This ensures the fix works and prevents recurrence.
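The multi-trial advice above can be sketched as a thin wrapper around your scorer. Here `run_case` is a hypothetical callable mapping a test case to a float score in [0, 1], not a real library API:

```python
import statistics

def score_with_trials(run_case, case, n_trials: int = 5) -> dict:
    """Run one test case n_trials times and aggregate, since a single LLM run is noisy."""
    scores = [run_case(case) for _ in range(n_trials)]
    return {
        'mean': statistics.mean(scores),
        'stdev': statistics.stdev(scores) if n_trials > 1 else 0.0,
        'scores': scores,  # keep raw scores for the significance test later
    }

# With a deterministic fake scorer, the mean equals the constant score:
result = score_with_trials(lambda case: 0.8, case='invoice-1', n_trials=3)
# result['mean'] == 0.8, result['stdev'] == 0.0
```

Keeping the raw `scores` list (not just the mean) is what makes per-case t-tests possible downstream.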
FAQ
How do I set regression thresholds?
Base thresholds on your quality SLAs and the cost of a regression. For a critical feature (customer-facing, affects revenue): threshold at 2–3% drop. For internal or auxiliary features: threshold at 5–10% drop. For exploratory/beta features: threshold at 15%+ or no blocking threshold. Review thresholds quarterly as your quality baseline improves.
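One way to keep these tiers honest is to encode them as data rather than scattering magic numbers through CI scripts. The tier names and values below just mirror the guidance above and are illustrative:

```python
from typing import Optional

# Hypothetical mapping from feature tier to the maximum tolerated score drop.
REGRESSION_THRESHOLDS: dict = {
    'critical': 0.03,  # customer-facing, affects revenue: block on a >3% drop
    'internal': 0.10,  # internal or auxiliary: block on a >10% drop
    'beta': None,      # exploratory: track only, never block
}

def threshold_for(tier: str) -> Optional[float]:
    """Look up the blocking threshold for a feature tier; None means never block."""
    return REGRESSION_THRESHOLDS[tier]
```

A single table like this also makes the quarterly threshold review a one-file diff.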
What should I do when a test starts failing intermittently?
Intermittent failures indicate: (1) The LLM behavior on this case is unstable — update the expected output or use a softer assertion, (2) The eval function is too strict — use semantic equivalence instead of exact match, (3) The underlying model is changing — some providers do silent model updates. Track intermittent failures in a flakiness dashboard and fix or remove them.
How do I test for regressions in agent systems?
Agent regression tests are harder because success is path-dependent. Use end-to-end task completion as the metric: run the full agent on a fixed task set and measure pass@1 rate. A regression is a drop in pass@1 rate. Supplement with step-level assertions on known critical decision points in the agent's typical trajectories.
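A minimal pass@1 comparison might look like this; the outcome lists are toy data, purely illustrative:

```python
def pass_at_1(task_outcomes: list) -> float:
    """Fraction of tasks the agent completed on its first attempt."""
    return sum(task_outcomes) / len(task_outcomes)

baseline = pass_at_1([True, True, False, True, True])    # 4/5 = 0.8
candidate = pass_at_1([True, False, False, True, True])  # 3/5 = 0.6
# A drop from 0.8 to 0.6 on the same fixed task set is a clear regression signal.
```

Because each outcome is binary, the per-task booleans can feed the same significance test used for scored outputs.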
Should I test prompts independently from the model?
No — always test the combination of (prompt, model) as a unit. A prompt optimized for Claude 3.5 Sonnet may perform worse on Claude 3.7 Sonnet despite the newer model being 'better'. Lock both prompt version and model version in your regression tests, and run the full suite when upgrading either.
How do I handle non-determinism in regression tests with temperature > 0?
For regression testing, prefer temperature=0 for deterministic scoring. If the production system uses temperature > 0, run regression tests at both temperature=0 (for baseline comparison) and production temperature (for representative scores). Report both: temp=0 results catch logical regressions; production-temp results catch distribution-level regressions.
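The dual-temperature setup can be sketched as follows. `run_suite` is a hypothetical callable that runs your full suite at a given temperature and returns a mean score; it is an assumption here, not a real API:

```python
def dual_temperature_report(run_suite, prod_temp: float = 0.7) -> dict:
    """Run the regression suite at temperature 0 and at production temperature.
    temp=0 results catch logical regressions; production-temp results catch
    distribution-level regressions."""
    return {
        'deterministic': run_suite(temperature=0.0),
        'production': run_suite(temperature=prod_temp),
    }

# With a fake suite runner for illustration:
report = dual_temperature_report(
    lambda temperature: 0.92 if temperature == 0.0 else 0.88,
    prod_temp=0.7,
)
# report == {'deterministic': 0.92, 'production': 0.88}
```

Reporting both numbers side by side keeps a deploy from being blocked (or waved through) on the wrong kind of evidence.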