
Automated LLM Evaluation Harness: CI/CD for AI Quality

Last updated: April 16, 2026

Quick answer

Build an eval harness with three layers: golden test sets (200-1,000 human-verified examples), LLM-as-judge for qualitative scoring (Claude Haiku 4 at $0.80/M input tokens), and CI/CD integration that blocks deploys when quality drops by more than 3%. A typical eval run over 500 examples takes 2-5 minutes and costs $0.50-$2. Run it on every PR that touches prompts or model config.

The problem

Teams ship prompt changes and model upgrades without systematic quality checks, then discover regressions only through user complaints. In production AI systems, 60-70% of quality degradations are caused by prompt drift (subtle wording changes), model version updates, or dependency changes — all invisible without automated evals. A typical mid-size AI product (10K users) loses $5,000-20,000 in churn per major regression before it's caught.

Architecture

Architecture (data flow): Golden Test Dataset → test cases → Eval Test Runner → prompt + input → Model Under Test → output + expected → Deterministic Scorer and output + rubric → LLM-as-Judge Scorer → metric/quality scores → Baseline Results Store → baseline scores → Regression Detector → pass/fail signal → CI/CD Gate, with results and a delta report flowing to the Eval Results Dashboard.

Golden Test Dataset

A curated, versioned set of (input, expected_output) pairs representing the full distribution of real user inputs. Should include edge cases, adversarial inputs, and regression cases from past bugs. Never auto-generate golden labels — require human review.

Alternatives: Labelbox, Scale AI, Custom Postgres table, Braintrust datasets

Eval Test Runner

Orchestrates the eval run: loads test cases, calls the model under test (with the candidate prompt/config), collects raw outputs, and passes them to scorers. Must support parallel execution (50-100 concurrent API calls) to keep eval time under 5 minutes.
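The runner's core is bounded concurrency. A minimal asyncio sketch, where `call_model` stands in for whatever API client you actually use:

```python
import asyncio

async def run_evals(cases, call_model, max_concurrency: int = 50):
    """Run the model under test over all cases with bounded parallelism."""
    sem = asyncio.Semaphore(max_concurrency)

    async def run_one(case):
        async with sem:
            try:
                output = await call_model(case["input"])
                return {"case_id": case["id"], "output": output, "error": None}
            except Exception as exc:  # record the failure, keep the run going
                return {"case_id": case["id"], "output": None, "error": str(exc)}

    # gather preserves input order, so results line up with cases
    return await asyncio.gather(*(run_one(c) for c in cases))
```

Recording per-case errors instead of raising keeps one flaky API call from invalidating the whole run.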

Alternatives: LangSmith evaluations, Custom Python script + asyncio, EleutherAI lm-evaluation-harness

Model Under Test

The candidate model/prompt combination being evaluated. Can be a new model version, a modified system prompt, or a new RAG configuration. The eval harness should abstract the model interface so you can swap models without rewriting evals.
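The abstraction can be as thin as a Protocol; each provider gets a small adapter behind it. A sketch (class and method names are illustrative):

```python
from typing import Protocol

class ModelUnderTest(Protocol):
    model_id: str
    def generate(self, system_prompt: str, user_input: str) -> str: ...

class CannedModel:
    """Trivial adapter for harness self-tests; real adapters wrap an API client."""
    def __init__(self, model_id: str, reply: str):
        self.model_id = model_id
        self._reply = reply

    def generate(self, system_prompt: str, user_input: str) -> str:
        return self._reply

def evaluate(model: ModelUnderTest, system_prompt: str, inputs: list[str]) -> list[str]:
    """The harness only sees the interface, so models swap freely."""
    return [model.generate(system_prompt, i) for i in inputs]
```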

Alternatives: gpt-4o, gemini-2-flash, fine-tuned Llama 3.1 8B

Deterministic Scorer

Computes exact metrics where ground truth is unambiguous: exact match, regex match, JSON schema validation, SQL execution correctness, code unit test pass rate. Fast (<1ms per example) and 100% reproducible — no LLM cost.
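These checks need nothing beyond the standard library. A sketch of three common ones (the required-keys check is a deliberate simplification of full JSON Schema validation):

```python
import json
import re

def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected.strip()

def regex_match(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None

def valid_json_with_keys(output: str, required_keys: set[str]) -> bool:
    """Cheap stand-in for JSON Schema validation: parses and checks keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()
```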

Alternatives: ROUGE-L, BERTScore, SacreBLEU, custom AST diff for code

LLM-as-Judge Scorer

Uses a powerful LLM to score model outputs on qualitative dimensions: helpfulness, accuracy, tone, safety, hallucination detection. Requires a carefully designed rubric (1-5 scale with anchored descriptions). LLM judge itself must be validated against human ratings.
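A minimal judge prompt with anchored scores, plus a strict parser for the reply (the rubric text and `<score>` tags are illustrative; the API call itself is omitted):

```python
import re

JUDGE_PROMPT = """You are grading an AI assistant's answer.
Question: {question}
Answer: {answer}

Score 1-5 using these anchors:
5: factually correct, addresses every part of the question, helpful tone
3: mostly correct but misses part of the question or is unclear
1: factually wrong or off-topic

Reply with only: <score>N</score>"""

def parse_judge_score(reply: str) -> int:
    """Extract the score; raise instead of guessing on malformed replies."""
    m = re.search(r"<score>([1-5])</score>", reply)
    if m is None:
        raise ValueError(f"unparseable judge reply: {reply!r}")
    return int(m.group(1))
```

Failing loudly on an unparseable reply matters: silently defaulting a score hides exactly the judge drift you are trying to detect.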

Alternatives: GPT-4o mini (judge), Gemini Flash 2.0, Claude Sonnet 4 for high-stakes evals

Baseline Results Store

Stores eval results from the current production model as a baseline. New candidate results are compared against this baseline to compute the delta. Stores per-example results, not just aggregate scores — enables drilling into which test cases regressed.

Alternatives: Braintrust experiments, MLflow tracking, S3 + DuckDB

Regression Detector

Compares candidate eval scores against the baseline. Raises a blocking failure if the overall score drops by more than 3%, or if any specific test case category (e.g., safety, refusals) drops by more than 1%. Outputs a diff report showing which examples improved and which regressed.
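The comparison logic is small. A sketch assuming scores are per-category means normalized to 0-1:

```python
def detect_regression(
    baseline: dict[str, float],   # category -> mean score, 0-1
    candidate: dict[str, float],
    overall_drop: float = 0.03,   # block if the aggregate falls >3%
    category_drop: float = 0.01,  # block if any category falls >1%
) -> tuple[bool, list[str]]:
    """Return (blocked, reasons) comparing candidate vs baseline."""
    reasons = []
    base_avg = sum(baseline.values()) / len(baseline)
    cand_avg = sum(candidate.values()) / len(candidate)
    if base_avg - cand_avg > overall_drop:
        reasons.append(f"overall: {base_avg:.3f} -> {cand_avg:.3f}")
    for cat, base_score in baseline.items():
        cand_score = candidate.get(cat, 0.0)
        if base_score - cand_score > category_drop:
            reasons.append(f"{cat}: {base_score:.3f} -> {cand_score:.3f}")
    return (len(reasons) > 0, reasons)
```

The returned reasons become the body of the PR comment, so a failing check tells the author which category to look at.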

Alternatives: Braintrust experiment comparison, Promptfoo CI assertions

CI/CD Gate

Runs the eval harness on every pull request that modifies prompts, model config, or RAG pipeline. Blocks merge if regression thresholds are violated. Posts eval summary as a PR comment with links to failing test cases.

Alternatives: GitLab CI, CircleCI, Buildkite

Eval Results Dashboard

Visual interface showing eval trends over time, per-category score breakdowns, and example-level diffs. Enables non-engineers (PMs, QA) to track AI quality without reading code.

Alternatives: Custom Grafana dashboard, Observable notebook, Metabase on Postgres

The stack

Eval Framework: Promptfoo (open-source, self-hosted)

Promptfoo is the most mature open-source eval framework: YAML-defined test suites, built-in LLM-as-judge providers, CI/CD integration, and a comparison UI. Runs 500 test cases in <3 minutes with 100 concurrent API calls. Free and self-hostable.

Alternatives: Braintrust (managed, $50-500/mo), LangSmith ($39-299/mo), Custom Python asyncio script

LLM Judge Model: Claude Haiku 4 (claude-haiku-4-5)

Claude Haiku 4 at $0.80/M input tokens + $4/M output tokens costs $0.001-0.003 per eval example. Running 500 examples costs $0.50-1.50. Haiku's judge quality correlates 0.82 with Claude Sonnet on most rubrics — acceptable for regression testing. Use Sonnet only when catching subtle hallucinations.
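The per-example cost follows directly from token counts. A quick check, assuming ~1,000 input tokens (output under review plus rubric) and ~100 output tokens per judgment:

```python
HAIKU_INPUT_PER_M = 0.80   # $ per million input tokens (figure from this article)
HAIKU_OUTPUT_PER_M = 4.00  # $ per million output tokens

def judge_cost(n_examples: int, in_tokens: int = 1000, out_tokens: int = 100) -> float:
    """Estimated LLM-as-judge spend for one eval run, in dollars."""
    per_example = (in_tokens * HAIKU_INPUT_PER_M + out_tokens * HAIKU_OUTPUT_PER_M) / 1e6
    return n_examples * per_example
```

At those assumptions a 500-example run is about $0.60, consistent with the $0.50-1.50 range above.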

Alternatives: GPT-4o mini, Gemini Flash 2.0, Claude Sonnet 4 (for high-stakes evals)

Test Dataset Management: Argilla (self-hosted on Railway or Render)

Argilla provides a human annotation UI, dataset versioning, and conflict resolution for multi-annotator datasets — all for free (open-source). Deploy on Railway for $20/mo. Critical: test datasets must be version-controlled alongside code — a dataset without provenance is unusable for regression comparison.

Alternatives: Labelbox, Scale AI (for large annotation budgets), Custom Postgres + simple UI, Braintrust datasets

Results Storage: Postgres (Supabase free tier) with per-example result rows

Store every (eval_run_id, test_case_id, model_id, prompt_hash, score, raw_output) row — this is the atomic unit of analysis. At 500 examples per run and 10 runs/mo, that's 5,000 rows/mo — trivially small for Postgres. Query patterns (which examples regressed?) require row-level granularity, not just aggregates.
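The same row shape works anywhere SQL runs. A SQLite sketch of the table and the "which examples regressed?" query (column names mirror the tuple above):

```python
import sqlite3

SCHEMA = """CREATE TABLE IF NOT EXISTS eval_results (
    eval_run_id TEXT, test_case_id TEXT, model_id TEXT,
    prompt_hash TEXT, score REAL, raw_output TEXT
)"""

REGRESSED = """SELECT b.test_case_id, b.score AS baseline, c.score AS candidate
FROM eval_results b JOIN eval_results c USING (test_case_id)
WHERE b.eval_run_id = ? AND c.eval_run_id = ? AND c.score < b.score"""

def regressed_cases(db, baseline_run: str, candidate_run: str):
    """Per-example rows where the candidate run scored below baseline."""
    return db.execute(REGRESSED, (baseline_run, candidate_run)).fetchall()
```

Because the store keeps row-level results, this query is a join rather than a re-run of the evals.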

Alternatives: MLflow tracking server, Braintrust experiments, DuckDB + S3

CI/CD Integration: GitHub Actions with eval job as required check

GitHub Actions runs eval on every PR touching prompt files (use path filters: `paths: ['src/prompts/**', 'src/config/models.ts']`). Mark the eval job as a required status check in branch protection rules. Average CI eval run: 2-4 minutes, $0.005 in GitHub Actions compute. Block merge on >3% quality regression.

Alternatives: GitLab CI/CD, CircleCI, Buildkite

Dashboard & Reporting: Braintrust (managed UI)

Braintrust's experiment comparison view shows side-by-side score diffs with example-level drill-down — the single most valuable feature for debugging regressions. Starter plan is free for up to 1,000 logged examples/mo.

Alternatives: LangSmith, Custom Grafana + PostgreSQL datasource, Observable notebooks

Cost at each scale

Prototype

10 eval runs/mo, 100 test cases each

$15/mo

LLM-as-judge (Claude Haiku 4, 1K examples/mo at ~$0.002 each): $2
Model-under-test API calls (100 cases x 10 runs, GPT-4o mini): $2
Argilla on Railway (self-hosted): $5
Supabase free tier (results storage): $0
GitHub Actions compute: $1
Braintrust starter (free tier): $0
Dev time to set up (amortized): $5

Growth

50 eval runs/mo, 500 test cases each

$230/mo

LLM-as-judge (Claude Haiku 4, 25K examples/mo): $50
Model-under-test API calls (500 x 50 runs): $75
Braintrust Teams ($50/mo): $50
Supabase Pro (results storage + DB): $25
GitHub Actions (additional minutes): $10
Argilla hosting upgrade: $20

Scale

200 eval runs/mo, 2,000 test cases each

$1,800/mo

LLM-as-judge (400K examples/mo, Claude Haiku 4): $800
Model-under-test API calls at scale: $600
Braintrust Enterprise or self-hosted LangSmith: $200
Dedicated Postgres (eval results): $100
CI compute (dedicated runners): $100

Latency budget

Total P50: 120,000 ms (2 min) · Total P95: 300,000 ms (5 min)


Failure modes & guardrails

Judge rubric drift: a wording change in the judge prompt shifts scores for reasons unrelated to the model under test. Mitigation: version-control the judge prompt separately and run judge calibration evals quarterly (score 50 human-labeled examples with the judge and compute the correlation with human ratings); alert if correlation drops below 0.80.

Eval gaming: teams optimize prompts specifically for the known test set, so eval scores rise while real quality does not. Mitigation: keep a held-out shadow test set that is never used for development or optimization, only for final promotion decisions, and rotate 10% of the main test set out each month.

Rate limits: eval runners hitting API rate limits cause timeouts and incomplete runs. Mitigation: implement exponential backoff with jitter in the test runner. Alternatively, use Anthropic's Batch API for all eval runs: it bypasses synchronous rate limits, costs 50% less, and returns results within 24 hours.

Flaky CI: LLM judge score variance causes valid PRs to fail. Mitigation: (1) run each test case twice and average the scores; (2) use statistical significance testing (e.g., a t-test at p < 0.05) rather than raw thresholds; (3) block only on aggregate drops of more than 3%, never on individual test cases.
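The significance test in (2) needs no stats library; a one-sided permutation test on per-example scores works with the standard library alone (a sketch of one valid choice, not the only one):

```python
import random

def permutation_p_value(baseline, candidate, n_perms=2000, seed=0):
    """One-sided p-value: chance of the observed mean-score drop under the null."""
    rng = random.Random(seed)
    observed = sum(baseline) / len(baseline) - sum(candidate) / len(candidate)
    pooled = list(baseline) + list(candidate)
    hits = 0
    for _ in range(n_perms):
        rng.shuffle(pooled)  # relabel scores at random
        a, b = pooled[:len(baseline)], pooled[len(baseline):]
        if sum(a) / len(a) - sum(b) / len(b) >= observed:
            hits += 1
    return hits / n_perms
```

With per-example scores already in the results store, this runs in milliseconds and makes the CI gate robust to judge noise.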


Frequently asked questions

How do I prevent prompt changes from breaking production without running evals on every commit?

Use GitHub Actions path filters to run evals only when prompt files, model config, or RAG pipeline code changes. For other changes, skip the eval. Structure your repo so prompts live in dedicated files (src/prompts/*.ts, not hardcoded in components) so the path filter is precise. This reduces eval CI runs by 80-90% while catching all prompt-related regressions.

How many test cases do I need in my golden set?

200 cases is the minimum viable golden set for detecting regressions with >80% confidence. For 95% confidence in detecting a 5% quality drop, you need ~400 cases. Beyond 1,000 cases, the marginal value per case decreases rapidly — focus on quality and distribution coverage over raw count. Ensure your test set includes: 60% typical cases, 20% edge cases, 20% adversarial/tricky cases.

How do I validate that my LLM judge is actually reliable?

Run a judge calibration: take 100 examples, have 2-3 humans rate them on the same 1-5 rubric, compute Spearman rank correlation between human ratings and LLM judge scores. Acceptable threshold: ≥0.75. Common failure mode: the judge rubric is underspecified — add explicit scoring anchors (e.g., 'A score of 5 means the response is factually correct AND addresses all parts of the question AND uses a helpful tone'). Re-calibrate quarterly.
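For calibration sets without tied ratings, Spearman correlation is small enough to compute by hand (a sketch; reach for a stats library when ratings contain ties):

```python
def spearman_no_ties(human, judge):
    """Spearman rank correlation for paired ratings with no tied values."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order, start=1):
            r[i] = rank
        return r

    hr, jr = ranks(human), ranks(judge)
    n = len(human)
    d2 = sum((a - b) ** 2 for a, b in zip(hr, jr))
    # classic formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    return 1 - 6 * d2 / (n * (n * n - 1))
```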

Should I use the same LLM that I'm evaluating as the judge?

Avoid using the model under test as its own judge — it creates self-serving bias and inflates scores by 0.3-0.8 points on average. Always use a different model family as judge. If you're evaluating Claude Sonnet 4, use GPT-4o or Gemini as judge. If evaluating GPT-4o, use Claude as judge. Using a weaker model as judge (Haiku judging Sonnet) is acceptable for regression testing but introduces quality ceiling issues — the judge cannot detect errors it can't recognize itself.
