LLM-as-Judge: Using AI to Evaluate AI Outputs (2026)
LLM-as-judge prompts a capable LLM to evaluate another LLM's output on specific criteria. It's 10-100x cheaper than human evaluation and highly correlated with human judgments (0.8+ Spearman correlation on most NLP tasks). The key pitfalls: position bias (favoring the first response), self-preference bias (models favor their own style), and verbosity bias (longer answers score higher). Use these techniques to mitigate them.
When to Use
- ✓ Evaluating open-ended outputs (summaries, explanations, creative text) where rule-based metrics fail
- ✓ Scaling evaluation to thousands or millions of examples where human labeling is cost-prohibitive
- ✓ Building continuous eval pipelines that monitor production LLM quality automatically
- ✓ Comparing two model versions on the same task set to decide which to deploy
- ✓ Evaluating RAG faithfulness: checking whether answers are grounded in retrieved context
How It Works
1. Choose evaluation dimensions: correctness (factually accurate?), faithfulness (supported by context?), helpfulness (addressed the question?), harmlessness (no harmful content?). Each dimension needs a separate criterion in the judge prompt.
2. Design the judge prompt: provide the question, the answer to evaluate, and optionally a reference answer or context. Ask for a score (1-5 or 1-10) AND reasoning. Reasoning is essential: it catches judge errors and explains low scores.
3. Use a different, stronger model as judge than the one being evaluated. Evaluating GPT-4o outputs with GPT-4o introduces self-preference bias. Use Claude as judge for GPT-4o outputs and vice versa.
4. Mitigate position bias: for pairwise comparisons (A vs. B), run the comparison twice with A/B order swapped. A clear winner should score higher in both orderings.
5. Calibrate the judge: create a small set of human-labeled examples and measure judge-human agreement (Spearman correlation or Cohen's kappa). A well-calibrated judge achieves 0.75+ correlation with human experts.
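Step 5 can be sketched with a small stdlib-only helper (the score lists below are hypothetical; in practice you would collect 50+ human labels and can use `scipy.stats.spearmanr` instead):

```python
from statistics import mean

def _ranks(xs):
    """Average 1-based ranks; tied values share the mean of their positions."""
    sorted_idx = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[sorted_idx[j + 1]] == xs[sorted_idx[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i..j
        for k in range(i, j + 1):
            ranks[sorted_idx[k]] = avg
        i = j + 1
    return ranks

def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = _ranks(a), _ranks(b)
    ma, mb = mean(ra), mean(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5

judge_scores = [5, 4, 2, 5, 1, 3, 4, 2]  # hypothetical judge outputs (1-5)
human_scores = [5, 5, 2, 4, 1, 3, 5, 1]  # hypothetical human labels (1-5)
rho = spearman(judge_scores, human_scores)  # ~0.84: acceptable agreement
```

If rho falls below about 0.75, revise the rubric wording or add scored few-shot examples to the judge prompt before trusting it unsupervised.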
Examples
```python
import json

import anthropic

client = anthropic.Anthropic()

# Note the doubled braces in the JSON template line: str.format() would
# otherwise treat {"score": ...} as a placeholder and raise KeyError.
FAITHFULNESS_PROMPT = '''\
You are evaluating whether a response is faithful to the provided context.

Context: {context}
Question: {question}
Response: {response}

Evaluate faithfulness: does the response contain ONLY information that can be
verified from the context?

Score:
1 = Response contains significant hallucinations or unsupported claims
2 = Response contains some unsupported claims
3 = Response is mostly faithful with minor unsupported details
4 = Response is faithful with trivial extensions
5 = Response is completely faithful, every claim verifiable from context

Return JSON: {{"score": N, "unsupported_claims": ["list of any claims not in context"], "reasoning": "brief explanation"}}'''

def judge_faithfulness(context, question, response):
    result = client.messages.create(
        model='claude-3-5-sonnet-20241022',
        max_tokens=300,
        messages=[{'role': 'user', 'content': FAITHFULNESS_PROMPT.format(
            context=context, question=question, response=response
        )}]
    )
    return json.loads(result.content[0].text)

def pairwise_compare(question, response_a, response_b, criteria):
    # judge_pairwise is a judge call analogous to judge_faithfulness,
    # returning {'winner': 'A' | 'B' | 'tie'}
    # Run A vs. B
    score_ab = judge_pairwise(question, response_a, response_b, criteria)
    # Run B vs. A (order swapped to control for position bias)
    score_ba = judge_pairwise(question, response_b, response_a, criteria)
    # Accept a winner only if it is consistent across both orderings
    if score_ab['winner'] == 'A' and score_ba['winner'] == 'B':
        return 'A'  # A wins regardless of position
    elif score_ab['winner'] == 'B' and score_ba['winner'] == 'A':
        return 'B'  # B wins regardless of position
    else:
        return 'tie'  # Inconsistent verdicts: likely a marginal difference
```
Common Mistakes
- ✗ Using the same model as judge and generator: self-evaluation inflates scores due to shared biases and writing-style preferences. Always use a different model family for judging than the one being evaluated.
- ✗ Asking for a single holistic score: holistic scores conflate multiple quality dimensions and produce less actionable feedback. Break into specific dimensions: correctness, completeness, conciseness, tone. Separate scores are more useful for diagnosing failures.
- ✗ Not validating judge calibration: an uncalibrated judge may systematically score 3-4 regardless of actual quality, or show strong recency/position bias. Always correlate judge scores with 50+ human labels before using the judge in production.
- ✗ Using an LLM judge for safety evaluation: LLM judges are unreliable for detecting subtle policy violations, jailbreaks, or harmful content. Use specialized safety classifiers (Llama Guard, OpenAI Moderation API) for safety evaluation, not general-purpose LLM judges.
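Dimension-by-dimension scoring can be sketched as a rubric prompt (the prompt text and dimension names here are illustrative, not a fixed standard; braces in the JSON template are doubled so `str.format()` leaves them intact):

```python
# Hypothetical multi-dimensional rubric: one score per dimension makes
# failures diagnosable, unlike a single holistic score.
MULTI_DIM_PROMPT = '''You are evaluating a response on four separate dimensions.

Question: {question}
Response: {response}

Score each dimension from 1 (poor) to 5 (excellent):
- correctness: are the factual claims accurate?
- completeness: does it address every part of the question?
- conciseness: is it free of filler and repetition?
- tone: is it appropriately professional?

Return JSON only:
{{"correctness": N, "completeness": N, "conciseness": N, "tone": N,
  "reasoning": "one sentence per dimension"}}'''

def build_multi_dim_prompt(question, response):
    """Fill the rubric template for a single (question, response) pair."""
    return MULTI_DIM_PROMPT.format(question=question, response=response)
```

A response that scores correctness=5 but conciseness=2 points at a different fix (tighter decoding or prompt constraints) than one scoring correctness=2 (better retrieval or a stronger model).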
FAQ
How correlated is LLM-as-judge with human evaluation?
On standard NLP benchmarks (MMLU, MT-Bench, Chatbot Arena), GPT-4 and Claude as judges achieve 0.8–0.9 Spearman correlation with human expert labels. For domain-specific tasks, correlation is lower (0.6–0.8) without domain-specific calibration. On safety and subtle quality issues, correlation drops to 0.5–0.7. Always validate against human labels for your specific task.
What's the cheapest model I can use as a judge?
For simple criteria (is this response on-topic?), GPT-4o-mini and Claude Haiku work well and cost ~$0.001/eval. For complex criteria (multi-dimensional quality scoring of technical responses), use GPT-4o or Claude Sonnet. The quality-cost tradeoff: cheap judges miss subtle errors; expensive judges are overkill for simple binary classifications.
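A cheap judge for a simple binary criterion can be sketched like this (the prompt wording and `judge_on_topic` helper are hypothetical; it assumes an `anthropic.Anthropic()` client like the one in the Examples section):

```python
# Minimal binary judge for a simple criterion: on-topic yes/no.
ON_TOPIC_PROMPT = '''Question: {question}
Response: {response}

Is the response on-topic for the question? Answer with exactly one word:
YES or NO.'''

def parse_binary_verdict(text):
    """Map the judge's one-word reply to a boolean; unknown replies fail closed."""
    return text.strip().upper() == 'YES'

def judge_on_topic(client, question, response):
    result = client.messages.create(
        model='claude-3-5-haiku-20241022',  # cheap model suffices here
        max_tokens=5,
        messages=[{'role': 'user', 'content': ON_TOPIC_PROMPT.format(
            question=question, response=response)}]
    )
    return parse_binary_verdict(result.content[0].text)
```

Constraining the reply to a single word keeps output tokens (and parsing errors) to a minimum, which is where most of the per-eval cost savings come from.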
Can LLM judges be gamed?
Yes — models can be fine-tuned to produce outputs that score highly with specific LLM judges without actually being higher quality. This is 'Goodhart's Law for LLMs.' Mitigations: rotate judge models periodically, include human spot-checks, and use diverse evaluation criteria rather than a single judge score.
What's the difference between LLM-as-judge and RLHF?
RLHF (Reinforcement Learning from Human Feedback) uses human or AI preferences to fine-tune the model's weights. LLM-as-judge is an inference-time evaluation that scores existing outputs without modifying the model. They're complementary: use LLM-as-judge for offline evaluation and monitoring; use RLHF (or RLAIF) when you want to permanently improve the model.
Should I use absolute scoring or pairwise comparison?
Pairwise comparison is more reliable and better calibrated (humans are better at saying 'A > B' than 'A is a 7'). Absolute scoring is needed when you don't have a reference to compare against (monitoring production quality). Use pairwise for model comparison tasks; use absolute scoring for production monitoring.