
Agent Evaluation: Measuring and Improving Agent Performance (2026)

Quick Answer

Agent evaluation measures task completion rate, trajectory efficiency (steps taken vs. minimum steps), and decision quality at each step. Unlike single-turn LLM eval, agent eval must handle stochastic trajectories where there are multiple valid paths to success. The gold standard is end-to-end task completion on a diverse benchmark set, supplemented by step-level LLM-as-judge scoring for quality.

When to Use

  • Before deploying an agent to production — validate task completion rate on representative tasks
  • After modifying the agent's tools, prompts, or model — regression testing to catch degradations
  • Debugging a failing agent — trajectory evaluation pinpoints the step where the agent goes wrong
  • Comparing two agent architectures or models to select the better one
  • Setting SLAs for agent performance — you need measured baselines before you can commit to performance targets

How It Works

  1. Define success criteria: for each task in your benchmark, define what constitutes success (correct final answer, action taken, file created, API called with the right parameters). Success criteria must be verifiable programmatically or by LLM judge.
  2. Build a benchmark suite: 50–200 tasks across difficulty levels. Include easy tasks (agent should always pass), medium tasks (agent should pass 70%+), and hard tasks (agent should attempt but may fail). Balance task types across your use cases.
  3. Measure task completion rate: run each task 3–5 times (agents are stochastic) and report pass@1, pass@3, and pass@5. Pass@k means at least one of k runs succeeded. Report both means and variance.
  4. Trajectory evaluation: for each task run, record the full action sequence. Compare it to an optimal trajectory: extra steps = inefficiency; wrong actions = errors; loops = bugs. Use LLM-as-judge to score trajectory quality on a 1–5 scale.
  5. CI integration: run the benchmark (or a fast subset) on every code change. Alert on a >5% drop in task completion rate. This is 'eval-driven development' — changes must not regress benchmarks.
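The pass@k numbers in step 3 can be computed with the standard unbiased estimator (given n total runs of a task, c of which succeeded), sketched here:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples
    drawn (without replacement) from n runs with c successes succeeds."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with misses
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(5, 3, 1))  # 0.6 — matches the raw success rate 3/5
print(pass_at_k(5, 3, 3))  # 1.0 — every 3-run subset contains a success
```

Averaging this per-task estimate across the benchmark gives the suite-level pass@k score.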

Examples

Task evaluation harness
import asyncio
from dataclasses import dataclass

@dataclass
class AgentTask:
    task_id: str
    description: str
    expected_outcome: dict  # Verifiable success condition
    max_steps: int = 15
    difficulty: str = 'medium'

async def evaluate_task(agent, task: AgentTask, n_runs: int = 3) -> dict:
    results = []
    for run in range(n_runs):
        trajectory = await agent.run(task.description, max_steps=task.max_steps)
        # verify_outcome() checks the final state against the task's
        # expected_outcome (programmatic check or LLM judge)
        success = verify_outcome(trajectory.final_state, task.expected_outcome)
        results.append({
            'run': run,
            'success': success,
            'steps': len(trajectory.steps),
            'cost': trajectory.total_cost,
            'trajectory': trajectory.steps
        })
    
    return {
        'task_id': task.task_id,
        'pass_at_1': results[0]['success'],
        'pass_at_3': any(r['success'] for r in results),  # assumes n_runs >= 3
        'mean_steps': sum(r['steps'] for r in results) / n_runs,
        'success_rate': sum(r['success'] for r in results) / n_runs
    }
Output: Returns per-task metrics. Aggregate across all tasks for benchmark scores. Run in parallel with asyncio.gather() for speed — 100 tasks × 3 runs = 300 agent executions, parallelizable to ~10 minutes.
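The parallel aggregation can be sketched as follows, assuming an async per-task evaluator like evaluate_task() above (passed in as eval_fn); the semaphore bound of 20 is an illustrative choice to stay under provider rate limits:

```python
import asyncio

async def run_benchmark(eval_fn, tasks, concurrency: int = 20) -> dict:
    # eval_fn(task) -> per-task metrics dict with 'pass_at_1',
    # 'pass_at_3', and 'mean_steps' keys, as in evaluate_task() above
    sem = asyncio.Semaphore(concurrency)

    async def bounded(task):
        async with sem:  # cap concurrent agent executions
            return await eval_fn(task)

    per_task = await asyncio.gather(*(bounded(t) for t in tasks))
    n = len(per_task)
    return {
        'pass_at_1': sum(r['pass_at_1'] for r in per_task) / n,
        'pass_at_3': sum(r['pass_at_3'] for r in per_task) / n,
        'mean_steps': sum(r['mean_steps'] for r in per_task) / n,
        'per_task': per_task,
    }
```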
LLM judge for trajectory quality
TRAJECTORY_JUDGE_PROMPT = '''
Evaluate this agent trajectory for a {task_description} task.

Trajectory:
{formatted_trajectory}

Final outcome: {outcome}

Score on each dimension (1-5):
1. Efficiency: Did the agent take the most direct path? (1=many wasted steps, 5=minimal steps)
2. Correctness: Was each action the right choice given available information? (1=many wrong actions, 5=all correct)
3. Recovery: Did the agent recover well from errors? (1=crashed/looped, 5=graceful recovery)
4. Completion: Did the agent fully complete the task? (1=failed, 5=fully complete)

Return JSON: {{"efficiency": N, "correctness": N, "recovery": N, "completion": N, "notes": "..."}}'''
# Literal braces in the JSON template are doubled so they survive str.format()

# Use GPT-4o or Claude as judge (different from the agent being evaluated)
Output: Trajectory judge gives qualitative signals beyond pass/fail. Low efficiency scores indicate planning problems. Low correctness scores indicate prompt or tool description problems. Low recovery scores indicate error handling problems.
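One practical detail: judge models often wrap the JSON in prose or a markdown fence, so the scores need defensive parsing. A minimal sketch (the dimension names match the prompt above; the regex extraction is one simple approach, not the only one):

```python
import json
import re

DIMENSIONS = ('efficiency', 'correctness', 'recovery', 'completion')

def parse_judge_scores(raw: str) -> dict:
    """Pull the score object out of a judge reply that may contain
    surrounding prose or a code fence."""
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if match is None:
        raise ValueError('no JSON object found in judge output')
    scores = json.loads(match.group(0))
    for dim in DIMENSIONS:
        if dim not in scores or not 1 <= scores[dim] <= 5:
            raise ValueError(f'invalid score for {dim!r}: {scores.get(dim)}')
    return scores

reply = 'Evaluation:\n{"efficiency": 3, "correctness": 5, "recovery": 4, "completion": 5, "notes": "looped once"}'
print(parse_judge_scores(reply)['efficiency'])  # 3
```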

Common Mistakes

  • Evaluating only final output, not trajectory — an agent that reaches the right answer via a circuitous, expensive path is worse than one that does it efficiently. Always measure steps taken, tools called, and cost alongside correctness.
  • Using the same LLM for the agent and the evaluator — self-evaluation has bias. Use a different model (different provider if possible) as the judge. An agent built on Claude should be evaluated by GPT-4o as judge, and vice versa.
  • Benchmark tasks that are too similar — a benchmark of 100 variations of the same task type doesn't measure generalization. Include diverse task types, edge cases, and intentionally hard tasks that the agent is expected to fail.
  • Not measuring variance — running each task only once gives unreliable metrics due to LLM stochasticity. Run each task 3–5 times. An agent with 70% pass rate on 1 run may actually have 90% pass rate at 3 runs (pass@3).
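The last point follows from basic probability: if a task's per-run success probability is p and runs were independent, pass@k would be 1 − (1 − p)^k. In practice agent failures correlate (the same hard step tends to fail on every run), so treat this as an upper bound:

```python
def expected_pass_at_k(p: float, k: int) -> float:
    # Independence assumption: each run succeeds with probability p.
    # Real pass@k is usually lower because failures correlate.
    return 1 - (1 - p) ** k

print(round(expected_pass_at_k(0.7, 3), 3))  # 0.973
```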

FAQ

What's a good task completion rate for a production agent?

Depends on task complexity and tolerance for failure. For simple, well-defined tasks (data lookup, form filling): target >90% pass@1. For complex, multi-step tasks: target >70% pass@1, >90% pass@3. Below 60% pass@1, the agent shouldn't be in production — users will perceive it as broken.

How do I evaluate agents in environments where actions have real side effects?

Use sandboxed evaluation environments: mock APIs that simulate side effects without executing them, test database instances, shadow-mode deployment (agent runs but results are reviewed before applying). Never run evaluation tasks on production systems with real data.
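A sandboxed tool can be as simple as a recorder that mimics the real tool's response shape. The email tool below is a hypothetical example of the pattern, not a real API:

```python
class MockEmailTool:
    """Stand-in for a real email tool: records calls instead of sending."""

    def __init__(self):
        self.sent = []  # side effects land here for the verifier to inspect

    def send_email(self, to: str, subject: str, body: str) -> str:
        self.sent.append({'to': to, 'subject': subject, 'body': body})
        # Return the same shape of response the real tool would,
        # so the agent cannot tell it is sandboxed
        return f'Message queued for {to}'

mock = MockEmailTool()
mock.send_email('qa@example.com', 'weekly report', 'done')
# the success check can now inspect mock.sent instead of a live mailbox
```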

What's SWE-Bench and should I benchmark on it?

SWE-Bench is a benchmark of real GitHub issues requiring code changes to fix. It's the standard for coding agents. If your agent does code tasks, SWE-Bench resolved rate is a meaningful metric for comparing against published results. In 2026, top agents achieve 50-70% on SWE-Bench Verified. For other agent types, build domain-specific benchmarks.

How do I handle non-determinism in agent evaluation?

Three approaches: (1) Use temperature=0 for deterministic evaluation (easier to debug, less representative of production). (2) Run N trials and report pass@k metrics (more representative, more expensive). (3) Use fixed seeds if the model API supports them. For production benchmarks, use N=3 trials with production temperature settings.

What tools exist for agent evaluation?

LangSmith (LangChain) provides agent tracing and evaluation with LLM judges. Braintrust supports agent evals with scoring functions. AgentBench provides standardized benchmarks. For custom agents, build evaluation harnesses directly — they're not complex and give you full control over success criteria.
