How Should You Evaluate Your LLM Application?
For most teams: start with LLM-as-judge for fast iteration, build a golden dataset of 50–200 hand-curated examples for regression testing, and add A/B testing in production once you have meaningful traffic. Human evaluation is the gold standard but is only cost-effective for periodic deep dives.
FAQ
What is LLM-as-judge and how reliable is it?
LLM-as-judge means using a strong LLM (Claude Opus 4, GPT-4o) to score the outputs of another LLM. Studies show strong models achieve 70–85% agreement with human raters when given detailed rubrics. Known biases include: preference for longer responses, preference for outputs from the same model family, and positional bias (preferring the first option in pairwise comparisons). Use calibration techniques: random order shuffling, multi-metric rubrics, and human spot-checking.
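One of the calibration techniques above, random order shuffling, can be sketched in a few lines. This is an illustrative sketch, not a library API: `judge_fn` stands in for your actual LLM call, and the assumed interface (it takes two candidate outputs and returns `"first"` or `"second"`) is hypothetical.

```python
import random

def debiased_pairwise_judge(judge_fn, output_a, output_b, trials=4):
    """Run a pairwise judge several times with randomized presentation
    order, so positional bias cannot systematically favor one output.
    `judge_fn(first, second)` is a stand-in for an LLM-as-judge call
    that returns "first" or "second"."""
    votes = {"a": 0, "b": 0}
    for _ in range(trials):
        if random.random() < 0.5:
            winner = judge_fn(output_a, output_b)
            votes["a" if winner == "first" else "b"] += 1
        else:
            winner = judge_fn(output_b, output_a)
            votes["b" if winner == "first" else "a"] += 1
    if votes["a"] == votes["b"]:
        return "tie"
    return "a" if votes["a"] > votes["b"] else "b"

# A judge with a genuine preference (here: longer answers) still wins
# consistently under shuffling, while a purely positional judge
# (always answering "first") would wash out to roughly a tie.
prefers_longer = lambda first, second: (
    "first" if len(first) >= len(second) else "second"
)
print(debiased_pairwise_judge(prefers_longer, "a much longer answer", "short"))
```

Averaging over shuffled orderings separates real quality signal from presentation artifacts; human spot-checking then validates that the surviving signal matches human judgment.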
What is a golden dataset?
A golden dataset (or golden set) is a curated collection of inputs with known, correct, ideal outputs. These are human-reviewed and represent the full distribution of cases your system will encounter — including edge cases, ambiguous inputs, and failure modes. Golden datasets serve as regression tests: if a model or prompt change causes any golden examples to fail, you catch the regression before it reaches production.
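The regression-test loop described above can be sketched as follows. This is a minimal illustration under assumed interfaces: `generate_fn` is a placeholder for your model or pipeline call, and `score_fn` for whatever pass/fail check (exact match, LLM-as-judge, etc.) fits your task.

```python
def run_regression(golden_set, generate_fn, score_fn, threshold=1.0):
    """Run every golden example through the candidate system.

    golden_set: list of {"input": ..., "expected": ...} dicts.
    generate_fn: placeholder for your model/prompt under test.
    score_fn(output, expected): placeholder pass/fail check.
    Returns (passed, failures) so CI can block the change on regressions.
    """
    failures = []
    for example in golden_set:
        output = generate_fn(example["input"])
        if not score_fn(output, example["expected"]):
            failures.append({
                "input": example["input"],
                "got": output,
                "expected": example["expected"],
            })
    pass_rate = 1 - len(failures) / len(golden_set)
    return pass_rate >= threshold, failures

# Toy usage with a trivial "model" (uppercasing) and exact-match scoring:
golden = [{"input": "hello", "expected": "HELLO"}]
ok, fails = run_regression(golden, str.upper, lambda got, exp: got == exp)
```

The `threshold=1.0` default encodes the strict interpretation in the text: any golden example failing blocks the change; relax it if some examples are intentionally aspirational.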
How many examples do I need for a useful golden dataset?
50 examples is enough to start getting useful signal. 200 examples covers most common task distributions adequately. 500+ examples gives you statistical power to detect small quality regressions. Quality and diversity matter more than quantity — 50 carefully chosen examples that cover your failure modes are worth more than 500 easy, similar examples.
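A back-of-envelope check on why detecting small regressions needs 500+ examples: under a normal approximation to the binomial, the 95% confidence half-width of a measured pass rate shrinks with the square root of the sample size.

```python
import math

def ci_half_width(pass_rate: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% confidence half-width for a pass rate measured
    on n examples (normal approximation to the binomial)."""
    return z * math.sqrt(pass_rate * (1 - pass_rate) / n)

# At a 90% true pass rate, the margin of error on the measured rate:
for n in (50, 200, 500):
    print(f"n={n}: ±{ci_half_width(0.9, n):.3f}")
# n=50:  ±0.083  (an 8-point swing could be noise)
# n=200: ±0.042
# n=500: ±0.026  (a ~3-point regression becomes distinguishable)
```

So at 50 examples you can only trust large quality swings, which is consistent with reserving small-regression detection for sets of 500 or more.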
What tools should I use for LLM evaluation?
Popular options in 2026: Braintrust (best for teams, hosted, good LLM-as-judge tooling), LangSmith (tight LangChain integration, good tracing), Promptfoo (open-source, best for prompt A/B testing), RAGAS (specialized for RAG evaluation), and Helicone (production monitoring plus evaluation). For code, EvalPlus provides execution-based benchmarking. Many teams eventually build custom eval harnesses once they have specific domain metrics.