# Why AI Agents Fail in Production (And How to Fix It)
AI agents work great in demos. They fail in production in predictable, fixable ways. After running agents in production and reviewing postmortems from teams across the industry, I keep seeing the same six failure modes kill agent deployments. Here is each one — and how to fix it.
## Failure Mode 1: Tool Errors That Cascade

### What Happens
A tool call fails. The model doesn't handle the error gracefully. It either gives up with a vague message, hallucinates a result, or — worse — calls the same failing tool in a retry loop until your API bill explodes.
```python
# This agent has no error handling
def execute_tools(tool_calls, tool_functions):
    results = []
    for call in tool_calls:
        result = tool_functions[call.name](**call.input)  # Unhandled exception here
        results.append({"id": call.id, "content": result})
    return results
```

When `tool_functions[call.name]` raises a `requests.Timeout`, the whole agent crashes.
### The Fix
Return errors as tool results, not exceptions. The model can reason about failures and adapt:
```python
def execute_tools(tool_calls, tool_functions):
    results = []
    for call in tool_calls:
        try:
            result = tool_functions[call.name](**call.input)
            results.append({
                "type": "tool_result",
                "tool_use_id": call.id,
                "content": str(result)
            })
        except Exception as e:
            # Return error as a result — model can adapt
            results.append({
                "type": "tool_result",
                "tool_use_id": call.id,
                "content": f"Tool failed: {type(e).__name__}: {str(e)}. Please try a different approach.",
                "is_error": True
            })
    return results
```
Add to your system prompt:
```
If a tool fails, don't retry the same call. Try a different approach or
clearly explain to the user what you were unable to do and why.
```
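With the Anthropic SDK, this text goes in the `system` parameter of `messages.create`. A minimal sketch of wiring it in — the base prompt here is a placeholder for your agent's real one:

```python
# Hypothetical base prompt -- substitute your agent's actual system prompt.
BASE_PROMPT = "You are a helpful assistant with access to tools."

TOOL_FAILURE_GUIDANCE = (
    "If a tool fails, don't retry the same call. Try a different approach or "
    "clearly explain to the user what you were unable to do and why."
)

def build_system_prompt(base: str = BASE_PROMPT) -> str:
    """Append the failure-handling guidance to the base system prompt."""
    return f"{base}\n\n{TOOL_FAILURE_GUIDANCE}"

# Then pass it on each call:
#   client.messages.create(model=..., system=build_system_prompt(), ...)
```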
### Defense in Depth
Add timeouts to every tool call:
```python
import signal
from contextlib import contextmanager

@contextmanager
def timeout(seconds: int):
    def handler(signum, frame):
        raise TimeoutError(f"Tool exceeded {seconds}s timeout")
    signal.signal(signal.SIGALRM, handler)
    signal.alarm(seconds)
    try:
        yield
    finally:
        signal.alarm(0)

def safe_tool_call(fn, timeout_secs=30, **kwargs):
    with timeout(timeout_secs):
        return fn(**kwargs)
```
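One caveat: `signal.SIGALRM` only works on Unix, and only in the main thread. A portable alternative (a sketch, not a drop-in replacement) runs the tool in a worker thread — it cannot kill the underlying call, only stop waiting for it:

```python
import concurrent.futures

def safe_tool_call_portable(fn, timeout_secs: float = 30, **kwargs):
    """Timeout via a worker thread: works on any OS and off the main thread.

    Caveat: a timed-out tool call is abandoned, not killed -- it keeps
    running in the background until it finishes on its own.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn, **kwargs)
        try:
            return future.result(timeout=timeout_secs)
        except concurrent.futures.TimeoutError:
            raise TimeoutError(f"Tool exceeded {timeout_secs}s timeout")
    finally:
        pool.shutdown(wait=False)  # Don't block waiting for the abandoned call
```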
## Failure Mode 2: Infinite Loops

### What Happens
The agent calls a tool. The tool returns something ambiguous. The model calls the tool again with slightly different parameters. This repeats until you hit your token limit or your credit card limit.
Real example: an agent tasked with "find the cheapest flight from NYC to London" keeps calling search_flights with different date combinations because no result is "clearly cheapest." 200 tool calls later, $40 spent, no result.
### The Fix
1. Hard iteration limit:
```python
MAX_ITERATIONS = 20
MAX_TOOL_CALLS = 50

tools_called = 0
for iteration in range(MAX_ITERATIONS):
    response = call_model(messages)
    if response.stop_reason == "end_turn":
        return extract_text(response)

    # Count tool calls
    new_calls = sum(1 for b in response.content if b.type == "tool_use")
    tools_called += new_calls

    if tools_called > MAX_TOOL_CALLS:
        # Force a final answer
        messages.append({
            "role": "user",
            "content": "You have made too many tool calls. Please summarize what you found so far and give your best answer."
        })
        final_response = call_model(messages)
        return extract_text(final_response)

raise RuntimeError(f"Agent exceeded {MAX_ITERATIONS} iterations")
```
2. Detect repeated tool calls:
```python
from collections import Counter

tool_call_history = []

def detect_loop(call_name: str, call_input: dict) -> bool:
    call_sig = f"{call_name}:{sorted(call_input.items())}"
    tool_call_history.append(call_sig)
    # If same call made 3 times, we're in a loop
    if Counter(tool_call_history)[call_sig] >= 3:
        return True
    return False

# In your tool execution loop:
if detect_loop(call.name, call.input):
    results.append({
        "type": "tool_result",
        "tool_use_id": call.id,
        "content": "This tool has already been called with these exact parameters. Please proceed with what you know or ask for clarification."
    })
    continue
```
3. Progress requirement:
Add a progress tracker to your system prompt:
```
After every 5 tool calls, evaluate: have you made progress toward the goal?
If not, stop and explain what you've tried and what's blocking you.
Do not call the same tool with the same arguments more than 2 times.
```
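The nudge can also be enforced in code rather than left to the prompt. A sketch that injects a checkpoint message into the conversation every 5 tool calls (the function name and message wording are illustrative):

```python
PROGRESS_CHECK_EVERY = 5  # tool calls between reminders

def maybe_inject_progress_check(messages: list, tools_called: int) -> bool:
    """Append a progress-check reminder every N tool calls.

    Returns True if a reminder was added, so the caller can log it.
    """
    if tools_called == 0 or tools_called % PROGRESS_CHECK_EVERY != 0:
        return False
    messages.append({
        "role": "user",
        "content": (
            f"Checkpoint: you have made {tools_called} tool calls. "
            "Have you made progress toward the goal? If not, stop and "
            "explain what you've tried and what's blocking you."
        ),
    })
    return True
```

Call it once per agent-loop iteration, right after counting the new tool calls.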
## Failure Mode 3: Context Window Overflow

### What Happens
The agent accumulates tool results, conversation history, and system prompt. Eventually the conversation exceeds the context window. Either the API call fails with a context length error, or — more subtly — the model starts losing track of earlier information.
### The Fix
1. Monitor context size actively:
```python
import anthropic  # used below for the summarization call

def estimate_tokens(messages: list, system: str = "") -> int:
    # Rough estimate: ~4 chars per token. For exact counts, use the SDK's
    # client.messages.count_tokens(); the estimate is fine for a guardrail.
    total = len(system)
    for msg in messages:
        if isinstance(msg["content"], str):
            total += len(msg["content"])
        elif isinstance(msg["content"], list):
            for block in msg["content"]:
                if isinstance(block, dict):
                    total += len(str(block))
    return total // 4

MAX_CONTEXT_TOKENS = 150_000  # Leave buffer for the model's response

# Before each model call:
current_tokens = estimate_tokens(messages, system_prompt)
if current_tokens > MAX_CONTEXT_TOKENS:
    messages = compress_history(messages)
```
2. Compress tool results that aren't needed:
```python
import anthropic

def compress_history(messages: list) -> list:
    """Summarize old tool results to free up context."""
    # Keep: system prompt, last N messages verbatim
    KEEP_RECENT = 10
    if len(messages) <= KEEP_RECENT:
        return messages

    old_messages = messages[:-KEEP_RECENT]
    recent_messages = messages[-KEEP_RECENT:]

    # Summarize old messages
    summary = summarize_with_cheap_model(old_messages)
    return [
        {"role": "user", "content": f"[Earlier context summary: {summary}]"},
        {"role": "assistant", "content": "Understood."},
        *recent_messages
    ]

def summarize_with_cheap_model(messages: list) -> str:
    """Use a cheap, fast model for summarization."""
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # Cheap and fast
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": f"Summarize key facts from this conversation in 200 words: {str(messages)}"
        }]
    )
    return response.content[0].text
```
3. Truncate large tool results:
```python
MAX_TOOL_RESULT_TOKENS = 4000  # ~16K characters

def truncate_result(result: str) -> str:
    MAX_CHARS = MAX_TOOL_RESULT_TOKENS * 4
    if len(result) <= MAX_CHARS:
        return result
    return result[:MAX_CHARS] + f"\n\n[Result truncated. Original length: {len(result)} chars]"
```
## Failure Mode 4: Hallucinated Tool Calls

### What Happens
The model invents tool parameters that don't exist, calls tools with the wrong argument types, or calls tools that were never defined at all. This manifests as `KeyError`, `TypeError`, or `ValidationError` at runtime.
Example: you define `get_user(user_id: str)`. The model calls `get_user(email="user@example.com")` — an argument that doesn't exist in your schema.
### The Fix
1. Strict input validation:
```python
from pydantic import BaseModel, ValidationError
from typing import Any

class ToolCall(BaseModel):
    name: str
    input: dict

def validate_tool_call(call_name: str, call_input: dict, tool_schema: dict) -> tuple[bool, str]:
    """Validate tool call against schema before executing."""
    required = tool_schema.get("required", [])
    properties = tool_schema.get("properties", {})

    # Check for missing required args
    for req in required:
        if req not in call_input:
            return False, f"Missing required argument: '{req}'"

    # Check for unknown args
    for key in call_input:
        if key not in properties:
            return False, f"Unknown argument: '{key}'. Valid args: {list(properties.keys())}"

    # Type validation
    for key, value in call_input.items():
        expected_type = properties[key].get("type")
        if expected_type == "integer" and not isinstance(value, int):
            return False, f"Argument '{key}' must be integer, got {type(value).__name__}"
        if expected_type == "string" and not isinstance(value, str):
            return False, f"Argument '{key}' must be string, got {type(value).__name__}"

    return True, ""

# In your tool execution:
for call in tool_calls:
    schema = get_tool_schema(call.name)  # Your tool definitions
    is_valid, error_msg = validate_tool_call(call.name, call.input, schema)
    if not is_valid:
        results.append({
            "type": "tool_result",
            "tool_use_id": call.id,
            "content": f"Invalid tool call: {error_msg}. Please check the tool schema and try again.",
            "is_error": True
        })
        continue
    # Execute validated call
    result = tool_functions[call.name](**call.input)
```
## Failure Mode 5: Stuck States

### What Happens
The agent reaches a state where it can't make progress but doesn't know how to exit cleanly. It might keep generating empty responses, calling tools with null inputs, or outputting text that isn't a valid final answer.
### The Fix
1. Stall detection:
```python
def detect_stall(messages: list, window: int = 4) -> bool:
    """Detect if the agent is stuck in a non-progress loop."""
    if len(messages) < window * 2:
        return False
    recent = messages[-window:]
    assistant_outputs = [
        str(m.get("content", ""))
        for m in recent
        if m["role"] == "assistant"
    ]
    if len(assistant_outputs) < 2:
        return False
    # If last 2 assistant outputs are nearly identical, we're stuck
    if assistant_outputs[-1] == assistant_outputs[-2]:
        return True
    return False

# In your agent loop:
if detect_stall(messages):
    messages.append({
        "role": "user",
        "content": "You seem to be stuck. Please take a step back: what do you know so far? "
                   "What specifically is blocking you? Give your best answer with what you have."
    })
```
2. Escape hatch tool:
```python
# Add a special "I'm done" tool
tools.append({
    "name": "task_complete",
    "description": "Call this when you have completed the task or cannot complete it. Required.",
    "input_schema": {
        "type": "object",
        "properties": {
            "status": {"type": "string", "enum": ["completed", "failed", "partial"]},
            "result": {"type": "string", "description": "Final answer or explanation"}
        },
        "required": ["status", "result"]
    }
})
```
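The escape hatch then becomes the loop's single exit point. A sketch of handling it, assuming tool-use blocks shaped like those in the earlier examples (objects with `.name` and `.input`):

```python
def handle_task_complete(tool_calls):
    """Return a final result dict if the model called task_complete, else None."""
    for call in tool_calls:
        if call.name == "task_complete":
            return {
                "status": call.input["status"],  # completed / failed / partial
                "result": call.input["result"],
            }
    return None
```

If this returns a dict, break out of the agent loop, surface `result` to the user, and feed `status` into your metrics.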
## Failure Mode 6: Cost Explosions

### What Happens
An agent handles a tricky edge case by making 50 tool calls and consuming 200K tokens. This is fine for one request. If that edge case hits 1,000 users simultaneously at $0.50/request, you've just spent $500 on a single incident.
### The Fix
1. Per-request cost tracking:
```python
from dataclasses import dataclass, field

@dataclass
class CostTracker:
    input_tokens: int = 0
    output_tokens: int = 0

    MAX_INPUT = 50_000    # Per request
    MAX_OUTPUT = 10_000   # Per request
    INPUT_PRICE = 3.00 / 1_000_000   # Claude Sonnet 4
    OUTPUT_PRICE = 15.00 / 1_000_000

    def add(self, input_t: int, output_t: int):
        self.input_tokens += input_t
        self.output_tokens += output_t

    @property
    def cost(self) -> float:
        return (self.input_tokens * self.INPUT_PRICE +
                self.output_tokens * self.OUTPUT_PRICE)

    def check_limits(self) -> bool:
        return (self.input_tokens < self.MAX_INPUT and
                self.output_tokens < self.MAX_OUTPUT)

tracker = CostTracker()
for iteration in range(MAX_ITERATIONS):
    if not tracker.check_limits():
        # Force final answer
        return force_completion(messages)

    response = call_model(messages)
    tracker.add(response.usage.input_tokens, response.usage.output_tokens)

    # Log to your monitoring
    log_metric("agent.cost", tracker.cost)
    log_metric("agent.tokens", tracker.input_tokens + tracker.output_tokens)
```
2. User-level spending limits:
```python
PER_USER_DAILY_LIMIT = 1.00  # $1 per user per day

def check_user_budget(user_id: str) -> bool:
    spent_today = get_user_spend_today(user_id)  # From your DB
    return spent_today < PER_USER_DAILY_LIMIT

# Before running agent:
if not check_user_budget(user_id):
    return "You've reached your daily AI usage limit. Resets at midnight UTC."
```
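`get_user_spend_today` is assumed to come from your storage layer; for illustration, here is an in-memory sketch of both the read and the write path (the dict stands in for a real table, and all names are illustrative):

```python
from collections import defaultdict
from datetime import datetime, timezone

# In-memory stand-in for the DB: user_id -> {ISO date: dollars spent}
_spend = defaultdict(lambda: defaultdict(float))

def _today() -> str:
    # Key by UTC date, so limits reset at midnight UTC
    return datetime.now(timezone.utc).date().isoformat()

def get_user_spend_today(user_id: str) -> float:
    return _spend[user_id][_today()]

def record_user_spend(user_id: str, dollars: float) -> None:
    # Call this after each agent run, e.g. with tracker.cost
    _spend[user_id][_today()] += dollars
```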
## Testing Your Agent for Reliability
```python
import pytest

# Test each failure mode explicitly

def test_handles_tool_failure():
    """Agent should recover gracefully from tool errors."""
    def broken_search(query: str):
        raise ConnectionError("Search API is down")

    result = run_agent(
        "Search for information about X",
        tools=[...],
        tool_functions={"search": broken_search}
    )
    # Should return a result, not raise an exception
    assert result is not None
    assert "error" in result.lower() or "unable" in result.lower()

def test_infinite_loop_prevention():
    """Agent should not loop forever on ambiguous tasks."""
    call_count = 0

    def ambiguous_tool(query: str):
        nonlocal call_count
        call_count += 1
        return "Result is unclear"

    result = run_agent(
        "Keep searching until you find the perfect answer",
        tool_functions={"search": ambiguous_tool}
    )
    assert call_count <= 10  # Should not loop forever
    assert result is not None

def test_cost_limits():
    """Agent should stop before exceeding cost limits."""
    # Simulate an expensive agent run
    result = run_agent(
        "Very complex task requiring many steps",
        max_cost=0.10  # $0.10 limit
    )
    # Should complete or fail gracefully, not exceed limit
    assert result.total_cost <= 0.12  # Allow small buffer
```
## The Reliability Checklist
Before deploying an agent to production:
- [ ] Every tool call wrapped in try/except, returning errors as results
- [ ] Hard iteration limit (never more than N steps)
- [ ] Duplicate tool call detection
- [ ] Context size monitoring with compression fallback
- [ ] Tool input validation before execution
- [ ] Per-request cost tracking with budget enforcement
- [ ] Stall detection that forces completion
- [ ] Tested with all 6 failure modes above
- [ ] Alerting on cost spikes and error rates
- [ ] Human review capability for high-cost requests
Agents that survive in production have all of these in place. The ones that don't have been burned by at least one of them.