Automated LLM Evaluation Harness: CI/CD for AI Quality
Last updated: April 16, 2026
Quick answer
Build an eval harness with three layers: golden test sets (200-1,000 human-verified examples), LLM-as-judge scoring for qualitative dimensions (Claude Haiku 4 at $0.80/M tokens), and CI/CD integration that blocks deploys when quality drops more than 3%. A typical eval run over 500 examples takes 2-5 minutes and costs $0.50-$2. Run it on every PR that touches prompts or model config.
The problem
Teams ship prompt changes and model upgrades without systematic quality checks, then discover regressions only through user complaints. In production AI systems, 60-70% of quality degradations are caused by prompt drift (subtle wording changes), model version updates, or dependency changes — all invisible without automated evals. A typical mid-size AI product (10K users) loses $5,000-20,000 in churn per major regression before it's caught.
Architecture
Golden Test Dataset
A curated, versioned set of (input, expected_output) pairs representing the full distribution of real user inputs. Should include edge cases, adversarial inputs, and regression cases from past bugs. Never auto-generate golden labels — require human review.
Alternatives: Labelbox, Scale AI, Custom Postgres table, Braintrust datasets
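A minimal loader sketch, assuming the golden set lives in a JSONL file checked into the repo; the `id`, `input`, `expected`, and `category` field names are illustrative, not a fixed format:

```python
import json
from dataclasses import dataclass

@dataclass(frozen=True)
class TestCase:
    id: str
    input: str
    expected: str
    category: str  # e.g. "typical", "edge", "adversarial"

def load_golden_set(path: str) -> list[TestCase]:
    """Load a version-controlled JSONL golden set, failing fast on malformed rows."""
    cases = []
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)  # raises on bad JSON: a broken dataset should not eval silently
            cases.append(TestCase(
                id=row["id"], input=row["input"],
                expected=row["expected"], category=row["category"],
            ))
    return cases
```

Keeping the file in git (not a database) is what gives each eval run provenance: the dataset version is the commit hash.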
Eval Test Runner
Orchestrates the eval run: loads test cases, calls the model under test (with the candidate prompt/config), collects raw outputs, and passes them to scorers. Must support parallel execution (50-100 concurrent API calls) to keep eval time under 5 minutes.
Alternatives: LangSmith evaluations, Custom Python script + asyncio, EleutherAI lm-evaluation-harness
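The parallel-execution requirement can be sketched with `asyncio` and a semaphore; `call_model` here stands in for whatever async API client you wrap:

```python
import asyncio
from typing import Awaitable, Callable

async def run_evals(
    cases: list[str],
    call_model: Callable[[str], Awaitable[str]],
    concurrency: int = 50,
) -> list[str]:
    """Run every test case against the model with bounded concurrency."""
    sem = asyncio.Semaphore(concurrency)

    async def run_one(case: str) -> str:
        async with sem:  # cap in-flight API calls at `concurrency`
            return await call_model(case)

    # gather preserves input order, so outputs line up with test cases
    return await asyncio.gather(*(run_one(c) for c in cases))
```

At 50-100 concurrent calls and a few seconds per completion, 500 cases finish comfortably inside the 5-minute budget.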
Model Under Test
The candidate model/prompt combination being evaluated. Can be a new model version, a modified system prompt, or a new RAG configuration. The eval harness should abstract the model interface so you can swap models without rewriting evals.
Alternatives: gpt-4o, gemini-2-flash, fine-tuned Llama 3.1 8B
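One way to abstract the model interface, using a hypothetical `ModelClient` protocol so candidates can be swapped without touching eval code (names are illustrative):

```python
from dataclasses import dataclass
from typing import Protocol

class ModelClient(Protocol):
    """The only surface the harness depends on; any API wrapper fits."""
    def generate(self, system: str, user: str) -> str: ...

@dataclass(frozen=True)
class Candidate:
    """A (client, system prompt) pair under evaluation."""
    client: ModelClient
    system_prompt: str

    def run(self, user_input: str) -> str:
        return self.client.generate(self.system_prompt, user_input)
```

A new model version, a modified system prompt, or a different provider are all just different `Candidate` values to the rest of the harness.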
Deterministic Scorer
Computes exact metrics where ground truth is unambiguous: exact match, regex match, JSON schema validation, SQL execution correctness, code unit test pass rate. Fast (<1ms per example) and 100% reproducible — no LLM cost.
Alternatives: ROUGE-L, BERTScore, SacreBLEU, custom AST diff for code
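Illustrative deterministic scorers; the JSON check is a cheap stand-in for full JSON Schema validation:

```python
import json
import re

def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

def regex_match(output: str, pattern: str) -> float:
    return 1.0 if re.search(pattern, output) else 0.0

def valid_json_with_keys(output: str, required_keys: set[str]) -> float:
    """Structured-output check: parses, is an object, and has the required keys."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(data, dict) and required_keys <= data.keys() else 0.0
```

Because these are pure functions over strings, they are free to run on every example and trivially reproducible across runs.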
LLM-as-Judge Scorer
Uses a powerful LLM to score model outputs on qualitative dimensions: helpfulness, accuracy, tone, safety, hallucination detection. Requires a carefully designed rubric (1-5 scale with anchored descriptions). LLM judge itself must be validated against human ratings.
Alternatives: GPT-4o mini (judge), Gemini Flash 2.0, Claude Sonnet 4 for high-stakes evals
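A sketch of the judge call, with a heavily abbreviated rubric (a production rubric needs fuller anchors per score) and `judge_model` standing in for whatever judge API wrapper you use:

```python
import re

RUBRIC = """Score the RESPONSE to the QUESTION on a 1-5 scale:
5 = factually correct, addresses all parts of the question, helpful tone
3 = partially correct or incomplete
1 = incorrect, off-topic, or unsafe
Reply with only the integer score.

QUESTION: {question}
RESPONSE: {response}"""

def judge_score(question: str, response: str, judge_model) -> int:
    """judge_model is any callable(prompt) -> str, e.g. a Haiku API wrapper."""
    raw = judge_model(RUBRIC.format(question=question, response=response))
    match = re.search(r"[1-5]", raw)
    if not match:
        # surface unparseable judge output rather than silently scoring 0
        raise ValueError(f"Unparseable judge output: {raw!r}")
    return int(match.group())
```

Version-control `RUBRIC` like any other prompt: a wording change here shifts scores for every candidate.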
Baseline Results Store
Stores eval results from the current production model as a baseline. New candidate results are compared against this baseline to compute the delta. Stores per-example results, not just aggregate scores — enables drilling into which test cases regressed.
Alternatives: Braintrust experiments, MLflow tracking, S3 + DuckDB
Regression Detector
Compares candidate eval scores vs baseline. Raises a blocking failure if overall score drops >3%, or if any specific test case category (e.g., safety, refusals) drops >1%. Outputs a diff report showing which examples improved and which regressed.
Alternatives: Braintrust experiment comparison, Promptfoo CI assertions
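The threshold logic might look like the following, assuming scores normalized to [0, 1] and a per-example category map (all names are illustrative):

```python
def compare_runs(
    baseline: dict[str, float],   # test_case_id -> score in [0, 1]
    candidate: dict[str, float],
    category_of: dict[str, str],  # test_case_id -> category, e.g. "safety"
    overall_drop: float = 0.03,
    category_drop: float = 0.01,
) -> tuple[bool, list[str]]:
    """Return (should_block, ids of individually regressed examples)."""
    shared = baseline.keys() & candidate.keys()
    regressed = [i for i in shared if candidate[i] < baseline[i]]
    base_avg = sum(baseline[i] for i in shared) / len(shared)
    cand_avg = sum(candidate[i] for i in shared) / len(shared)
    block = (base_avg - cand_avg) > overall_drop
    # per-category check with the stricter threshold
    for cat in set(category_of.values()):
        ids = [i for i in shared if category_of.get(i) == cat]
        if not ids:
            continue
        b = sum(baseline[i] for i in ids) / len(ids)
        c = sum(candidate[i] for i in ids) / len(ids)
        if (b - c) > category_drop:
            block = True
    return block, sorted(regressed)
```

The `regressed` list feeds the diff report: blocking is decided on aggregates, but debugging happens on individual examples.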
CI/CD Gate
Runs the eval harness on every pull request that modifies prompts, model config, or RAG pipeline. Blocks merge if regression thresholds are violated. Posts eval summary as a PR comment with links to failing test cases.
Alternatives: GitLab CI, CircleCI, Buildkite
Eval Results Dashboard
Visual interface showing eval trends over time, per-category score breakdowns, and example-level diffs. Enables non-engineers (PMs, QA) to track AI quality without reading code.
Alternatives: Custom Grafana dashboard, Observable notebook, Metabase on Postgres
The stack
Promptfoo is the most mature open-source eval framework: YAML-defined test suites, built-in LLM-as-judge providers, CI/CD integration, and a comparison UI. Runs 500 test cases in <3 minutes with 100 concurrent API calls. Free and self-hostable.
Alternatives: Braintrust (managed, $50-500/mo), LangSmith ($39-299/mo), Custom Python asyncio script
Claude Haiku 4 at $0.80/M input tokens + $4/M output tokens costs $0.001-0.003 per eval example, so running 500 examples costs $0.50-1.50. Haiku's judge quality correlates 0.82 with Claude Sonnet on most rubrics — acceptable for regression testing. Reserve Sonnet for evals that must catch subtle hallucinations.
Alternatives: GPT-4o mini, Gemini Flash 2.0, Claude Sonnet 4 (for high-stakes evals)
Argilla provides a human annotation UI, dataset versioning, and conflict resolution for multi-annotator datasets — all for free (open-source). Deploy on Railway for $20/mo. Critical: test datasets must be version-controlled alongside code — a dataset without provenance is unusable for regression comparison.
Alternatives: Labelbox, Scale AI (for large annotation budgets), Custom Postgres + simple UI, Braintrust datasets
Store every (eval_run_id, test_case_id, model_id, prompt_hash, score, raw_output) row — this is the atomic unit of analysis. At 500 examples per run and 10 runs/mo, that's 5,000 rows/mo — trivially small for Postgres. Query patterns (which examples regressed?) require row-level granularity, not just aggregates.
Alternatives: MLflow tracking server, Braintrust experiments, DuckDB + S3
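A sketch of the row-level schema and the "which examples regressed?" query, using SQLite in place of Postgres for brevity (the DDL translates directly):

```python
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS eval_results (
    eval_run_id  TEXT NOT NULL,
    test_case_id TEXT NOT NULL,
    model_id     TEXT NOT NULL,
    prompt_hash  TEXT NOT NULL,
    score        REAL NOT NULL,
    raw_output   TEXT NOT NULL,
    PRIMARY KEY (eval_run_id, test_case_id)
);
"""

def regressed_cases(conn, baseline_run: str, candidate_run: str):
    """Self-join two runs on test case; only possible with row-level storage."""
    return conn.execute(
        """SELECT b.test_case_id, b.score, c.score
           FROM eval_results b JOIN eval_results c USING (test_case_id)
           WHERE b.eval_run_id = ? AND c.eval_run_id = ? AND c.score < b.score""",
        (baseline_run, candidate_run),
    ).fetchall()
```

If you only stored aggregate scores per run, this query (and the PR diff report built on it) would be impossible.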
GitHub Actions runs eval on every PR touching prompt files (use path filters: `paths: ['src/prompts/**', 'src/config/models.ts']`). Mark the eval job as a required status check in branch protection rules. Average CI eval run: 2-4 minutes, $0.005 in GitHub Actions compute. Block merge on >3% quality regression.
Alternatives: GitLab CI/CD, CircleCI, Buildkite
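A minimal gate script the CI job could run, assuming baseline and candidate scores have been dumped as JSON maps of `test_case_id` to score (the file layout is an assumption, not a standard):

```python
import json
import sys

def gate(baseline_path: str, candidate_path: str, max_drop: float = 0.03) -> int:
    """Return a process exit code: nonzero fails the required status check."""
    with open(baseline_path) as f:
        baseline = json.load(f)   # {"test_case_id": score, ...}
    with open(candidate_path) as f:
        candidate = json.load(f)
    shared = baseline.keys() & candidate.keys()
    b = sum(baseline[i] for i in shared) / len(shared)
    c = sum(candidate[i] for i in shared) / len(shared)
    if (b - c) > max_drop:
        print(f"FAIL: quality dropped {b - c:.1%} (baseline {b:.3f} -> candidate {c:.3f})")
        return 1
    print(f"PASS: delta {c - b:+.1%}")
    return 0

if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(gate(sys.argv[1], sys.argv[2]))
```

Because the script exits nonzero on regression, marking its job as a required status check is all the branch-protection wiring needed.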
Braintrust's experiment comparison view shows side-by-side score diffs with example-level drill-down — the single most valuable feature for debugging regressions. Starter plan is free for up to 1,000 logged examples/mo.
Alternatives: LangSmith, Custom Grafana + PostgreSQL datasource, Observable notebooks
Cost at each scale
Prototype
10 eval runs/mo, 100 test cases each
$15/mo
Growth
50 eval runs/mo, 500 test cases each
$180/mo
Scale
200 eval runs/mo, 2,000 test cases each
$1,800/mo
Latency budget
Tradeoffs
Failure modes & guardrails
Mitigation: The judge rubric itself can 'drift' — a wording change in the judge prompt causes score shifts unrelated to the model under test. Version-control the judge prompt separately and run judge calibration evals quarterly: score 50 human-labeled examples with the judge and compute correlation. Alert if correlation drops below 0.80.
Mitigation: Teams optimize prompts specifically for the known test set, creating an 'eval-gaming' problem. Prevention: keep a held-out shadow test set that is never used for development or optimization — only for final promotion decisions. Rotate 10% of the main test set out each month.
Mitigation: Eval runners hitting API rate limits cause timeouts and incomplete runs. Implement exponential backoff with jitter in the test runner. Alternatively, use Anthropic's Batch API for scheduled (non-blocking) eval runs: it sidesteps synchronous rate limits at half the cost, but results can take up to 24 hours, which makes it unsuitable for PR-gating evals.
Mitigation: LLM judge score variance causes valid PRs to fail CI. Fix with: (1) run each test case 2x and average scores; (2) use statistical significance testing (t-test with p<0.05) rather than raw thresholds; (3) only block on drops >3% on the aggregate, not on individual test cases.
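Fix (2) can be sketched stdlib-only with a permutation test standing in for the t-test (with scipy available, `scipy.stats.ttest_ind` does this directly):

```python
import random
import statistics

def significant_drop(baseline: list[float], candidate: list[float],
                     n_perm: int = 2000, alpha: float = 0.05,
                     seed: int = 0) -> bool:
    """Permutation test: is the candidate's mean score significantly lower
    than the baseline's, or is the gap explainable by judge noise?"""
    observed = statistics.mean(baseline) - statistics.mean(candidate)
    if observed <= 0:
        return False  # candidate is not worse on average
    pooled = baseline + candidate
    rng = random.Random(seed)  # fixed seed keeps CI verdicts reproducible
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        b, c = pooled[:len(baseline)], pooled[len(baseline):]
        if statistics.mean(b) - statistics.mean(c) >= observed:
            hits += 1
    # hits / n_perm approximates the p-value under the null of "no difference"
    return hits / n_perm < alpha
```

Blocking only when the drop is both above threshold and statistically significant eliminates most judge-noise CI failures.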
Frequently asked questions
How do I prevent prompt changes from breaking production without running evals on every commit?
Use GitHub Actions path filters to run evals only when prompt files, model config, or RAG pipeline code changes. For other changes, skip the eval. Structure your repo so prompts live in dedicated files (src/prompts/*.ts, not hardcoded in components) so the path filter is precise. This reduces eval CI runs by 80-90% while catching all prompt-related regressions.
How many test cases do I need in my golden set?
200 cases is the minimum viable golden set for detecting regressions with >80% confidence. For 95% confidence in detecting a 5% quality drop, you need ~400 cases. Beyond 1,000 cases, the marginal value per case decreases rapidly — focus on quality and distribution coverage over raw count. Ensure your test set includes: 60% typical cases, 20% edge cases, 20% adversarial/tricky cases.
How do I validate that my LLM judge is actually reliable?
Run a judge calibration: take 100 examples, have 2-3 humans rate them on the same 1-5 rubric, compute Spearman rank correlation between human ratings and LLM judge scores. Acceptable threshold: ≥0.75. Common failure mode: the judge rubric is underspecified — add explicit scoring anchors (e.g., 'A score of 5 means the response is factually correct AND addresses all parts of the question AND uses a helpful tone'). Re-calibrate quarterly.
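Spearman correlation is just Pearson correlation on ranks, so the calibration check needs no dependencies; ties are handled with average ranks:

```python
def rank(values: list[float]) -> list[float]:
    """1-based average ranks (tied values share the mean of their positions)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(human: list[float], judge: list[float]) -> float:
    """Pearson correlation of the two rank vectors."""
    rx, ry = rank(human), rank(judge)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Feed it the mean human rating and the judge score per example; alert (or re-anchor the rubric) when the result falls below 0.75.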
Should I use the same LLM that I'm evaluating as the judge?
Avoid using the model under test as its own judge — it creates self-serving bias and inflates scores by 0.3-0.8 points on average. Always use a different model family as judge. If you're evaluating Claude Sonnet 4, use GPT-4o or Gemini as judge. If evaluating GPT-4o, use Claude as judge. Using a weaker model as judge (Haiku judging Sonnet) is acceptable for regression testing but introduces quality ceiling issues — the judge cannot detect errors it can't recognize itself.