LLM-as-Judge: Using AI to Evaluate AI Outputs (2026)
LLM-as-judge prompts a capable LLM to evaluate another LLM's output on specific criteria. It's 10-100x cheaper than human evaluation and highly correlated with human judgments (0.8+ Spearman correlation on most NLP tasks). The key pitfalls: position bias (favoring the first response), self-preference bias (models favor their own style), and verbosity bias (longer answers score higher). Use these techniques to mitigate them.
When to Use
- ✓ Evaluating open-ended outputs (summaries, explanations, creative text) where rule-based metrics fail
- ✓ Scaling evaluation to thousands or millions of examples where human labeling is cost-prohibitive
- ✓ Building continuous eval pipelines that monitor production LLM quality automatically
- ✓ Comparing two model versions on the same task set to decide which to deploy
- ✓ Evaluating RAG faithfulness: checking whether answers are grounded in retrieved context
How It Works
1. Choose evaluation dimensions: correctness (factually accurate?), faithfulness (supported by context?), helpfulness (addressed the question?), harmlessness (no harmful content?). Each dimension needs a separate criterion in the judge prompt.
2. Design the judge prompt: provide the question, the answer to evaluate, and optionally a reference answer or context. Ask for a score (1-5 or 1-10) AND reasoning. Reasoning is essential: it catches judge errors and explains low scores.
3. Use a different, stronger model as judge than the one being evaluated. Evaluating GPT-4o outputs with GPT-4o introduces self-preference bias. Use Claude as judge for GPT-4o outputs and vice versa.
4. Mitigate position bias: for pairwise comparisons (A vs. B), run the comparison twice with A/B order swapped. A clear winner should score higher in both orderings.
5. Calibrate the judge: create a small set of human-labeled examples and measure judge-human agreement (Spearman correlation or Cohen's kappa). A well-calibrated judge achieves 0.75+ correlation with human experts.
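Step 5 can be sketched with a small stdlib-only helper (the score lists below are hypothetical; in practice you would collect 50+ human labels and can use `scipy.stats.spearmanr` instead):

```python
from statistics import mean

def _ranks(xs):
    """Average 1-based ranks; tied values share the mean of their positions."""
    sorted_idx = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[sorted_idx[j + 1]] == xs[sorted_idx[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[sorted_idx[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = _ranks(a), _ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

judge_scores = [5, 4, 2, 5, 1, 3, 4, 2]  # hypothetical judge outputs (1-5)
human_scores = [5, 5, 2, 4, 1, 3, 5, 1]  # hypothetical human labels (1-5)
rho = spearman(judge_scores, human_scores)  # ~0.84: acceptable agreement
```

If rho falls below about 0.75, revise the rubric wording or add scored few-shot examples to the judge prompt before trusting it unsupervised.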
Examples
```python
import json

import anthropic

client = anthropic.Anthropic()

# Note the doubled braces in the JSON template line: str.format() would
# otherwise treat {"score": ...} as a placeholder and raise KeyError.
FAITHFULNESS_PROMPT = '''\
You are evaluating whether a response is faithful to the provided context.

Context: {context}
Question: {question}
Response: {response}

Evaluate faithfulness: does the response contain ONLY information that can be
verified from the context?

Score:
1 = Response contains significant hallucinations or unsupported claims
2 = Response contains some unsupported claims
3 = Response is mostly faithful with minor unsupported details
4 = Response is faithful with trivial extensions
5 = Response is completely faithful, every claim verifiable from context

Return JSON: {{"score": N, "unsupported_claims": ["list of any claims not in context"], "reasoning": "brief explanation"}}'''

def judge_faithfulness(context, question, response):
    result = client.messages.create(
        model='claude-3-5-sonnet-20241022',
        max_tokens=300,
        messages=[{'role': 'user', 'content': FAITHFULNESS_PROMPT.format(
            context=context, question=question, response=response
        )}]
    )
    return json.loads(result.content[0].text)

def pairwise_compare(question, response_a, response_b, criteria):
    # judge_pairwise is a judge call analogous to judge_faithfulness,
    # returning {'winner': 'A' | 'B' | 'tie'}
    # Run A vs. B
    score_ab = judge_pairwise(question, response_a, response_b, criteria)
    # Run B vs. A (order swapped to control for position bias)
    score_ba = judge_pairwise(question, response_b, response_a, criteria)
    # Accept a winner only if it is consistent across both orderings
    if score_ab['winner'] == 'A' and score_ba['winner'] == 'B':
        return 'A'  # A wins regardless of position
    elif score_ab['winner'] == 'B' and score_ba['winner'] == 'A':
        return 'B'  # B wins regardless of position
    else:
        return 'tie'  # Inconsistent verdicts: likely a marginal difference
```
Common Mistakes
- ✗ Using the same model as judge and generator: self-evaluation inflates scores due to shared biases and writing-style preferences. Always use a different model family for judging than the one being evaluated.
- ✗ Asking for a single holistic score: holistic scores conflate multiple quality dimensions and produce less actionable feedback. Break into specific dimensions: correctness, completeness, conciseness, tone. Separate scores are more useful for diagnosing failures.
- ✗ Not validating judge calibration: an uncalibrated judge may systematically score 3-4 regardless of actual quality, or show strong recency/position bias. Always correlate judge scores with 50+ human labels before using the judge in production.
- ✗ Using an LLM judge for safety evaluation: LLM judges are unreliable for detecting subtle policy violations, jailbreaks, or harmful content. Use specialized safety classifiers (Llama Guard, OpenAI Moderation API) for safety evaluation, not general-purpose LLM judges.
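Dimension-by-dimension scoring can be sketched as a rubric prompt (the prompt text and dimension names here are illustrative, not a fixed standard; braces in the JSON template are doubled so `str.format()` leaves them intact):

```python
# Hypothetical multi-dimensional rubric: one score per dimension makes
# failures diagnosable, unlike a single holistic score.
MULTI_DIM_PROMPT = '''You are evaluating a response on four separate dimensions.

Question: {question}
Response: {response}

Score each dimension from 1 (poor) to 5 (excellent):
- correctness: are the factual claims accurate?
- completeness: does it address every part of the question?
- conciseness: is it free of filler and repetition?
- tone: is it appropriately professional?

Return JSON only:
{{"correctness": N, "completeness": N, "conciseness": N, "tone": N,
  "reasoning": "one sentence per dimension"}}'''

def build_multi_dim_prompt(question, response):
    """Fill the rubric template for a single (question, response) pair."""
    return MULTI_DIM_PROMPT.format(question=question, response=response)
```

A response that scores correctness=5 but conciseness=2 points at a different fix (tighter decoding or prompt constraints) than one scoring correctness=2 (better retrieval or a stronger model).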
FAQ
How correlated is LLM-as-judge with human evaluation?
On standard NLP benchmarks (MMLU, MT-Bench, Chatbot Arena), GPT-4 and Claude as judges achieve 0.8–0.9 Spearman correlation with human expert labels. For domain-specific tasks, correlation is lower (0.6–0.8) without domain-specific calibration. On safety and subtle quality issues, correlation drops to 0.5–0.7. Always validate against human labels for your specific task.
What's the cheapest model I can use as a judge?
For simple criteria (is this response on-topic?), GPT-4o-mini and Claude Haiku work well and cost ~$0.001/eval. For complex criteria (multi-dimensional quality scoring of technical responses), use GPT-4o or Claude Sonnet. The quality-cost tradeoff: cheap judges miss subtle errors; expensive judges are overkill for simple binary classifications.
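A cheap judge for a simple binary criterion can be sketched like this (the prompt wording and `judge_on_topic` helper are hypothetical; it assumes an `anthropic.Anthropic()` client like the one in the Examples section):

```python
# Minimal binary judge for a simple criterion: on-topic yes/no.
ON_TOPIC_PROMPT = '''Question: {question}
Response: {response}

Is the response on-topic for the question? Answer with exactly one word:
YES or NO.'''

def parse_binary_verdict(text):
    """Map the judge's one-word reply to a boolean; unknown replies fail closed."""
    return text.strip().upper() == 'YES'

def judge_on_topic(client, question, response):
    result = client.messages.create(
        model='claude-3-5-haiku-20241022',  # cheap model suffices here
        max_tokens=5,
        messages=[{'role': 'user', 'content': ON_TOPIC_PROMPT.format(
            question=question, response=response)}]
    )
    return parse_binary_verdict(result.content[0].text)
```

Constraining the reply to a single word keeps output tokens (and parsing errors) to a minimum, which is where most of the per-eval cost savings come from.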
Can LLM judges be gamed?
Yes — models can be fine-tuned to produce outputs that score highly with specific LLM judges without actually being higher quality. This is 'Goodhart's Law for LLMs.' Mitigations: rotate judge models periodically, include human spot-checks, and use diverse evaluation criteria rather than a single judge score.
What's the difference between LLM-as-judge and RLHF?
RLHF (Reinforcement Learning from Human Feedback) uses human or AI preferences to fine-tune the model's weights. LLM-as-judge is an inference-time evaluation that scores existing outputs without modifying the model. They're complementary: use LLM-as-judge for offline evaluation and monitoring; use RLHF (or RLAIF) when you want to permanently improve the model.
Should I use absolute scoring or pairwise comparison?
Pairwise comparison is more reliable and better calibrated (humans are better at saying 'A > B' than 'A is a 7'). Absolute scoring is needed when you don't have a reference to compare against (monitoring production quality). Use pairwise for model comparison tasks; use absolute scoring for production monitoring.