# Why AI Agents Fail in Production (And How to Fix It)
AI agents work great in demos. They fail in production in predictable, fixable ways. After running agents in production and reviewing postmortems from teams across the industry, I keep seeing the same six failure modes kill agent deployments. Here is each one — and how to fix it.
## Failure Mode 1: Tool Errors That Cascade

### What Happens
A tool call fails. The model doesn't handle the error gracefully. It either gives up with a vague message, hallucinates a result, or — worse — calls the same failing tool in a retry loop until your API bill explodes.
```python
# This agent has no error handling
def execute_tools(tool_calls, tool_functions):
    results = []
    for call in tool_calls:
        result = tool_functions[call.name](**call.input)  # Unhandled exception here
        results.append({"id": call.id, "content": result})
    return results
```

When `tool_functions[call.name]` raises a `requests.Timeout`, the whole agent crashes.
### The Fix
Return errors as tool results, not exceptions. The model can reason about failures and adapt:
```python
def execute_tools(tool_calls, tool_functions):
    results = []
    for call in tool_calls:
        try:
            result = tool_functions[call.name](**call.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": call.id,
                "content": str(result)
            })
        except Exception as e:
            # Return error as a result — model can adapt
            results.append({
                "type": "tool_result",
                "tool_use_id": call.id,
                "content": f"Tool failed: {type(e).__name__}: {str(e)}. Please try a different approach.",
                "is_error": True
            })
    return results
```
Add to your system prompt:
```
If a tool fails, don't retry the same call. Try a different approach or
clearly explain to the user what you were unable to do and why.
```
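With the Anthropic SDK, this text goes in the `system` parameter of `messages.create`. A minimal sketch of wiring it in — the base prompt here is a placeholder for your agent's real one:

```python
# Hypothetical base prompt -- substitute your agent's actual system prompt.
BASE_PROMPT = "You are a helpful assistant with access to tools."

TOOL_FAILURE_GUIDANCE = (
    "If a tool fails, don't retry the same call. Try a different approach or "
    "clearly explain to the user what you were unable to do and why."
)

def build_system_prompt(base: str = BASE_PROMPT) -> str:
    """Append the failure-handling guidance to the base system prompt."""
    return f"{base}\n\n{TOOL_FAILURE_GUIDANCE}"

# Then pass it on each call:
#   client.messages.create(model=..., system=build_system_prompt(), ...)
```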
### Defense in Depth
Add timeouts to every tool call:
```python
import signal
from contextlib import contextmanager

@contextmanager
def timeout(seconds: int):
    def handler(signum, frame):
        raise TimeoutError(f"Tool exceeded {seconds}s timeout")
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)

def safe_tool_call(fn, timeout_secs=30, **kwargs):
    with timeout(timeout_secs):
        return fn(**kwargs)
```
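One caveat: `signal.SIGALRM` only works on Unix, and only in the main thread. A portable alternative (a sketch, not a drop-in replacement) runs the tool in a worker thread — it cannot kill the underlying call, only stop waiting for it:

```python
import concurrent.futures

def safe_tool_call_portable(fn, timeout_secs: float = 30, **kwargs):
    """Timeout via a worker thread: works on any OS and off the main thread.

    Caveat: a timed-out tool call is abandoned, not killed -- it keeps
    running in the background until it finishes on its own.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn, **kwargs)
        try:
            return future.result(timeout=timeout_secs)
        except concurrent.futures.TimeoutError:
            raise TimeoutError(f"Tool exceeded {timeout_secs}s timeout")
    finally:
        pool.shutdown(wait=False)  # Don't block waiting for the abandoned call
```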
## Failure Mode 2: Infinite Loops

### What Happens
The agent calls a tool. The tool returns something ambiguous. The model calls the tool again with slightly different parameters. This repeats until you hit your token limit or your credit card limit.
Real example: an agent tasked with "find the cheapest flight from NYC to London" keeps calling search_flights with different date combinations because no result is "clearly cheapest." 200 tool calls later, $40 spent, no result.
### The Fix
1. Hard iteration limit:
```python
MAX_ITERATIONS = 20
MAX_TOOL_CALLS = 50

tools_called = 0
for iteration in range(MAX_ITERATIONS):
    response = call_model(messages)
    if response.stop_reason == "end_turn":
        return extract_text(response)

    # Count tool calls
    new_calls = sum(1 for b in response.content if b.type == "tool_use")
    tools_called += new_calls

    if tools_called > MAX_TOOL_CALLS:
        # Force a final answer
        messages.append({
            "role": "user",
            "content": "You have made too many tool calls. Please summarize what you found so far and give your best answer."
        })
        final_response = call_model(messages)
        return extract_text(final_response)

raise RuntimeError(f"Agent exceeded {MAX_ITERATIONS} iterations")
```
2. Detect repeated tool calls:
```python
from collections import Counter

tool_call_history = []

def detect_loop(call_name: str, call_input: dict) -> bool:
    call_sig = f"{call_name}:{sorted(call_input.items())}"
    tool_call_history.append(call_sig)
    # If same call made 3 times, we're in a loop
    if Counter(tool_call_history)[call_sig] >= 3:
        return True
    return False

# In your tool execution loop:
if detect_loop(call.name, call.input):
    results.append({
        "type": "tool_result",
        "tool_use_id": call.id,
        "content": "This tool has already been called with these exact parameters. Please proceed with what you know or ask for clarification."
    })
    continue
```
3. Progress requirement:
Add a progress tracker to your system prompt:
```
After every 5 tool calls, evaluate: have you made progress toward the goal?
If not, stop and explain what you've tried and what's blocking you.
Do not call the same tool with the same arguments more than 2 times.
```
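The nudge can also be enforced in code rather than left to the prompt. A sketch that injects a checkpoint message into the conversation every 5 tool calls (the function name and message wording are illustrative):

```python
PROGRESS_CHECK_EVERY = 5  # tool calls between reminders

def maybe_inject_progress_check(messages: list, tools_called: int) -> bool:
    """Append a progress-check reminder every N tool calls.

    Returns True if a reminder was added, so the caller can log it.
    """
    if tools_called == 0 or tools_called % PROGRESS_CHECK_EVERY != 0:
        return False
    messages.append({
        "role": "user",
        "content": (
            f"Checkpoint: you have made {tools_called} tool calls. "
            "Have you made progress toward the goal? If not, stop and "
            "explain what you've tried and what's blocking you."
        ),
    })
    return True
```

Call it once per agent-loop iteration, right after counting the new tool calls.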
## Failure Mode 3: Context Window Overflow

### What Happens
The agent accumulates tool results, conversation history, and system prompt. Eventually the conversation exceeds the context window. Either the API call fails with a context length error, or — more subtly — the model starts losing track of earlier information.
### The Fix
1. Monitor context size actively:
```python
import anthropic  # used below for the summarization call

def estimate_tokens(messages: list, system: str = "") -> int:
    # Rough estimate: ~4 chars per token. For exact counts, use the SDK's
    # client.messages.count_tokens(); the estimate is fine for a guardrail.
    total = len(system)
    for msg in messages:
        if isinstance(msg["content"], str):
            total += len(msg["content"])
        elif isinstance(msg["content"], list):
            for block in msg["content"]:
                if isinstance(block, dict):
                    total += len(str(block))
    return total // 4

MAX_CONTEXT_TOKENS = 150_000  # Leave buffer for the model's response

# Before each model call:
current_tokens = estimate_tokens(messages, system_prompt)
if current_tokens > MAX_CONTEXT_TOKENS:
    messages = compress_history(messages)
```
2. Compress tool results that aren't needed:
```python
import anthropic

def compress_history(messages: list) -> list:
    """Summarize old tool results to free up context."""
    # Keep: system prompt, last N messages verbatim
    KEEP_RECENT = 10
    if len(messages) <= KEEP_RECENT:
        return messages

    old_messages = messages[:-KEEP_RECENT]
    recent_messages = messages[-KEEP_RECENT:]

    # Summarize old messages
    summary = summarize_with_cheap_model(old_messages)
    return [
        {"role": "user", "content": f"[Earlier context summary: {summary}]"},
        {"role": "assistant", "content": "Understood."},
        *recent_messages
    ]

def summarize_with_cheap_model(messages: list) -> str:
    """Use a cheap, fast model for summarization."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # Cheap and fast
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize key facts from this conversation in 200 words: {str(messages)}"
        }]
    )
    return response.content[0].text
```
3. Truncate large tool results:
```python
MAX_TOOL_RESULT_TOKENS = 4000  # ~16K characters

def truncate_result(result: str) -> str:
    MAX_CHARS = MAX_TOOL_RESULT_TOKENS * 4
    if len(result) <= MAX_CHARS:
        return result
    return result[:MAX_CHARS] + f"\n\n[Result truncated. Original length: {len(result)} chars]"
```
## Failure Mode 4: Hallucinated Tool Calls

### What Happens
The model invents tool parameters that don't exist, calls tools with the wrong argument types, or calls tools that were never defined at all. This manifests as `KeyError`, `TypeError`, or `ValidationError` at runtime.
Example: you define `get_user(user_id: str)`. The model calls `get_user(email="user@example.com")` — an argument that doesn't exist in your schema.
### The Fix
1. Strict input validation:
```python
from pydantic import BaseModel, ValidationError
from typing import Any

class ToolCall(BaseModel):
    name: str
    input: dict

def validate_tool_call(call_name: str, call_input: dict, tool_schema: dict) -> tuple[bool, str]:
    """Validate tool call against schema before executing."""
    required = tool_schema.get("required", [])
    properties = tool_schema.get("properties", {})

    # Check for missing required args
    for req in required:
        if req not in call_input:
            return False, f"Missing required argument: '{req}'"

    # Check for unknown args
    for key in call_input:
        if key not in properties:
            return False, f"Unknown argument: '{key}'. Valid args: {list(properties.keys())}"

    # Type validation
    for key, value in call_input.items():
        expected_type = properties[key].get("type")
        if expected_type == "integer" and not isinstance(value, int):
            return False, f"Argument '{key}' must be integer, got {type(value).__name__}"
        if expected_type == "string" and not isinstance(value, str):
            return False, f"Argument '{key}' must be string, got {type(value).__name__}"

    return True, ""

# In your tool execution:
for call in tool_calls:
    schema = get_tool_schema(call.name)  # Your tool definitions
    is_valid, error_msg = validate_tool_call(call.name, call.input, schema)
    if not is_valid:
        results.append({
            "type": "tool_result",
            "tool_use_id": call.id,
            "content": f"Invalid tool call: {error_msg}. Please check the tool schema and try again.",
            "is_error": True
        })
        continue
    # Execute validated call
    result = tool_functions[call.name](**call.input)
```
## Failure Mode 5: Stuck States

### What Happens
The agent reaches a state where it can't make progress but doesn't know how to exit cleanly. It might keep generating empty responses, calling tools with null inputs, or outputting text that isn't a valid final answer.
### The Fix
1. Stall detection:
```python
def detect_stall(messages: list, window: int = 4) -> bool:
    """Detect if the agent is stuck in a non-progress loop."""
    if len(messages) < window * 2:
        return False
    recent = messages[-window:]
    assistant_outputs = [
        str(m.get("content", ""))
        for m in recent
        if m["role"] == "assistant"
    ]
    if len(assistant_outputs) < 2:
        return False
    # If last 2 assistant outputs are nearly identical, we're stuck
    if assistant_outputs[-1] == assistant_outputs[-2]:
        return True
    return False

# In your agent loop:
if detect_stall(messages):
    messages.append({
        "role": "user",
        "content": "You seem to be stuck. Please take a step back: what do you know so far? "
                   "What specifically is blocking you? Give your best answer with what you have."
    })
```
2. Escape hatch tool:
```python
# Add a special "I'm done" tool
tools.append({
    "name": "task_complete",
    "description": "Call this when you have completed the task or cannot complete it. Required.",
    "input_schema": {
        "type": "object",
        "properties": {
            "status": {"type": "string", "enum": ["completed", "failed", "partial"]},
            "result": {"type": "string", "description": "Final answer or explanation"}
        },
        "required": ["status", "result"]
    }
})
```
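The escape hatch then becomes the loop's single exit point. A sketch of handling it, assuming tool-use blocks shaped like those in the earlier examples (objects with `.name` and `.input`):

```python
def handle_task_complete(tool_calls):
    """Return a final result dict if the model called task_complete, else None."""
    for call in tool_calls:
        if call.name == "task_complete":
            return {
                "status": call.input["status"],  # completed / failed / partial
                "result": call.input["result"],
            }
    return None
```

If this returns a dict, break out of the agent loop, surface `result` to the user, and feed `status` into your metrics.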
## Failure Mode 6: Cost Explosions

### What Happens
An agent handles a tricky edge case by making 50 tool calls and consuming 200K tokens. This is fine for one request. If that edge case hits 1,000 users simultaneously at $0.50/request, you've just spent $500 on a single incident.
### The Fix
1. Per-request cost tracking:
```python
from dataclasses import dataclass, field

@dataclass
class CostTracker:
    input_tokens: int = 0
    output_tokens: int = 0

    MAX_INPUT = 50_000    # Per request
    MAX_OUTPUT = 10_000   # Per request
    INPUT_PRICE = 3.00 / 1_000_000   # Claude Sonnet 4
    OUTPUT_PRICE = 15.00 / 1_000_000

    def add(self, input_t: int, output_t: int):
        self.input_tokens += input_t
        self.output_tokens += output_t

    @property
    def cost(self) -> float:
        return (self.input_tokens * self.INPUT_PRICE +
                self.output_tokens * self.OUTPUT_PRICE)

    def check_limits(self) -> bool:
        return (self.input_tokens < self.MAX_INPUT and
                self.output_tokens < self.MAX_OUTPUT)

tracker = CostTracker()
for iteration in range(MAX_ITERATIONS):
    if not tracker.check_limits():
        # Force final answer
        return force_completion(messages)

    response = call_model(messages)
    tracker.add(response.usage.input_tokens, response.usage.output_tokens)

    # Log to your monitoring
    log_metric("agent.cost", tracker.cost)
    log_metric("agent.tokens", tracker.input_tokens + tracker.output_tokens)
```
2. User-level spending limits:
```python
PER_USER_DAILY_LIMIT = 1.00  # $1 per user per day

def check_user_budget(user_id: str) -> bool:
    spent_today = get_user_spend_today(user_id)  # From your DB
    return spent_today < PER_USER_DAILY_LIMIT

# Before running agent:
if not check_user_budget(user_id):
    return "You've reached your daily AI usage limit. Resets at midnight UTC."
```
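`get_user_spend_today` is assumed to come from your storage layer; for illustration, here is an in-memory sketch of both the read and the write path (the dict stands in for a real table, and all names are illustrative):

```python
from collections import defaultdict
from datetime import datetime, timezone

# In-memory stand-in for the DB: user_id -> {ISO date: dollars spent}
_spend = defaultdict(lambda: defaultdict(float))

def _today() -> str:
    # Key by UTC date, so limits reset at midnight UTC
    return datetime.now(timezone.utc).date().isoformat()

def get_user_spend_today(user_id: str) -> float:
    return _spend[user_id][_today()]

def record_user_spend(user_id: str, dollars: float) -> None:
    # Call this after each agent run, e.g. with tracker.cost
    _spend[user_id][_today()] += dollars
```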
## Testing Your Agent for Reliability
```python
import pytest

# Test each failure mode explicitly

def test_handles_tool_failure():
    """Agent should recover gracefully from tool errors."""
    def broken_search(query: str):
        raise ConnectionError("Search API is down")

    result = run_agent(
        "Search for information about X",
        tools=[...],
        tool_functions={"search": broken_search}
    )
    # Should return a result, not raise an exception
    assert result is not None
    assert "error" in result.lower() or "unable" in result.lower()

def test_infinite_loop_prevention():
    """Agent should not loop forever on ambiguous tasks."""
    call_count = 0

    def ambiguous_tool(query: str):
        nonlocal call_count
        call_count += 1
        return "Result is unclear"

    result = run_agent(
        "Keep searching until you find the perfect answer",
        tool_functions={"search": ambiguous_tool}
    )
    assert call_count <= 10  # Should not loop forever
    assert result is not None

def test_cost_limits():
    """Agent should stop before exceeding cost limits."""
    # Simulate an expensive agent run
    result = run_agent(
        "Very complex task requiring many steps",
        max_cost=0.10  # $0.10 limit
    )
    # Should complete or fail gracefully, not exceed limit
    assert result.total_cost <= 0.12  # Allow small buffer
```
## The Reliability Checklist
Before deploying an agent to production:
- [ ] Every tool call wrapped in try/except, returning errors as results
- [ ] Hard iteration limit (never more than N steps)
- [ ] Duplicate tool call detection
- [ ] Context size monitoring with compression fallback
- [ ] Tool input validation before execution
- [ ] Per-request cost tracking with budget enforcement
- [ ] Stall detection that forces completion
- [ ] Tested with all 6 failure modes above
- [ ] Alerting on cost spikes and error rates
- [ ] Human review capability for high-cost requests
Agents that survive in production have all of these in place. The ones that don't have been burned by at least one of them.