
Error Recovery in LLM Agents (2026)

Quick Answer

LLM agents need explicit error recovery because they encounter: tool failures (API timeouts, invalid inputs), model errors (hallucinated arguments, invalid JSON), and logic loops (repeating the same tool call). The pattern is: catch errors, return meaningful error messages as tool results, allow the agent up to 2 retries with different approaches, and escalate to a fallback or human handoff after repeated failures.

When to Use

  • Building any production agent that calls external APIs which can fail or timeout
  • Agents that generate structured outputs (JSON, code) that need validation before use
  • Multi-step workflows where a partial failure should produce a partial result, not a full crash
  • High-stakes agents (financial, legal) where silent failures are worse than explicit errors
  • Any agent running in an unmonitored production environment where it must recover from issues without human intervention

How It Works

  1. Classify errors by recovery strategy: retryable (network timeout, rate limit), fixable (wrong argument format, missing required field), and fatal (invalid credentials, user not found). Handle each class differently.
  2. Return structured error messages as tool results, not Python tracebacks: {"error": "INVALID_PARAMETER", "field": "date", "message": "Date must be in YYYY-MM-DD format", "received": "04/16/2026"}. The model can understand and fix structured errors; it can't parse stack traces.
  3. Allow retry with backoff: for retryable errors, retry 2-3 times with exponential backoff (1s, 4s, 16s). Tell the model about the retry: 'Previous call failed with timeout. Retrying...'
  4. Implement a fallback path: if the primary approach fails 3 times, route to a simplified path (fewer tools, more constrained prompt) or return a partial result with a clear error message.
  5. Add a global step counter and exit strategy: if the agent exceeds N steps without reaching a final answer, emit a structured error response with the intermediate state captured so far.
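Steps 1 and 3 can be sketched together as a small retry helper. This is a minimal sketch, not a fixed API: the three error classes and the `call_with_backoff` helper are illustrative assumptions; real code would map library exceptions (e.g. HTTP 429/504) onto these classes.

```python
import time

# Illustrative error classes; real code maps library exceptions onto these.
class RetryableError(Exception): pass   # network timeouts, rate limits
class FixableError(Exception): pass     # bad arguments the model can correct
class FatalError(Exception): pass       # invalid credentials, user not found

def call_with_backoff(fn, *args, max_retries=3, base_delay=1.0, **kwargs):
    """Retry only retryable errors, with exponential backoff (1s, 4s, 16s)."""
    for attempt in range(max_retries):
        try:
            return fn(*args, **kwargs)
        except RetryableError:
            if attempt == max_retries - 1:
                raise  # exhausted: let the caller fall back or escalate
            time.sleep(base_delay * (4 ** attempt))  # 1s, 4s, 16s
        # FixableError and FatalError propagate immediately: waiting
        # won't help, the model (or a human) must change the approach.
```

Keeping fixable and fatal errors out of the retry path is the important design choice here: retrying them burns money and steps without changing the outcome.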

Examples

Structured error return from tool
import logging

logger = logging.getLogger(__name__)

# validate_tool_args, get_tool_schema, ValidationError, and the `tools`
# registry are assumed to be defined elsewhere in the application.

def safe_tool_executor(tool_name: str, tool_args: dict) -> dict:
    try:
        # Validate arguments before calling the tool
        validated_args = validate_tool_args(tool_name, tool_args)
        result = tools[tool_name](**validated_args)
        return {'status': 'success', 'data': result}
    
    except ValidationError as e:
        return {
            'status': 'error',
            'error_type': 'INVALID_ARGUMENTS',
            'details': str(e),
            'hint': f'Valid arguments for {tool_name}: {get_tool_schema(tool_name)}'
        }
    except TimeoutError:
        return {
            'status': 'error',
            'error_type': 'TIMEOUT',
            'hint': 'The service is temporarily unavailable. Try a different approach or wait 30 seconds.'
        }
    except Exception as e:
        # Log the real error for operators; return a safe message to the model
        logger.error(f'Tool {tool_name} failed: {e}', exc_info=True)
        return {
            'status': 'error',
            'error_type': 'UNKNOWN',
            'hint': 'The tool encountered an unexpected error. Consider an alternative approach.'
        }
Output: Returns structured errors the model can reason about. The 'hint' field guides recovery. Never expose raw exceptions — they reveal implementation details and confuse the model.
Agent retry loop with escalation
MAX_ATTEMPTS = 3
MAX_STEPS = 15

def robust_agent(task: str, tools: list) -> dict:
    attempts = 0
    last_error = None
    
    while attempts < MAX_ATTEMPTS:
        try:
            result = run_agent_loop(
                task=task,
                tools=tools,
                max_steps=MAX_STEPS,
                prior_error=last_error
            )
            return {'status': 'success', 'result': result}
        
        except LoopDetectedError as e:
            last_error = f'Agent looped on: {e.repeated_action}. Try a different approach.'
            attempts += 1
        except MaxStepsError as e:
            return {'status': 'partial', 'result': e.intermediate_state, 'error': 'Task too complex'}
    
    return {
        'status': 'failed',
        'error': 'Max retries exceeded',
        'last_error': last_error,
        'escalate_to_human': True
    }
Output: A 3-attempt outer loop with loop detection and escalation. LoopDetectedError is raised when the same tool is called with identical arguments twice. Human escalation is the final fallback.

Common Mistakes

  • Returning empty tool results on error — an empty string or null as tool_result causes the model to hallucinate what the tool returned. Always return a descriptive error message that explicitly states what went wrong.
  • No loop detection — agents can enter infinite loops calling the same failing tool repeatedly. Track (tool_name, argument_hash) across steps and raise an error if the same call appears twice.
  • Silently swallowing errors — catching all exceptions and returning 'success' hides failures that need human attention. Log all tool failures with full context; return structured errors to the model.
  • Over-retrying expensive operations — retrying a $0.50 LLM call 3 times because of a transient error can get expensive in high-volume systems. Implement circuit breakers and retry only for cheap, idempotent operations.
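For the last point, a minimal circuit breaker might look like the sketch below. The class name, thresholds, and half-open behavior are illustrative assumptions, not a standard library API:

```python
import time

class CircuitBreaker:
    """Illustrative sketch: stop calling a failing tool after `threshold`
    consecutive failures, and only permit calls again after `cooldown` seconds."""
    def __init__(self, threshold=3, cooldown=30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when the breaker tripped

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown:
            # Half-open: cooldown elapsed, permit a trial call
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
```

The agent checks `allow()` before each expensive call and reports `record(...)` afterwards; while the breaker is open, the tool result should be a structured error like the TIMEOUT example above so the model can route around it.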

FAQ

How should the agent communicate errors to the user?

Distinguish between partial success ('I found some information but couldn't complete the pricing lookup due to an API error'), graceful failure ('I wasn't able to complete this task — here's what I tried and why it failed'), and hard failure ('An unexpected error occurred. Please try again'). Never expose technical error details to end users.

What's the right number of retries?

For transient errors (network timeouts, rate limits): 2-3 retries with exponential backoff. For model errors (bad JSON, wrong format): 1-2 retries with explicit correction guidance. Never retry errors that require different logic (wrong credentials, resource not found) — retrying won't help.

How do I detect when an agent is stuck in a loop?

Track a hash of (tool_name, sorted(tool_args)) for each step. If the same hash appears twice in the current run, the agent is looping. Inject a recovery message: 'You already tried [action] and it produced [result]. Please try a different approach.' If it loops a third time, escalate.
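The hash-tracking scheme above can be sketched in a few lines; the `LoopDetector` name is a hypothetical helper, but the hashing approach follows the description directly:

```python
import hashlib
import json

class LoopDetector:
    """Track a hash of (tool_name, tool_args) per run; flag an exact repeat."""
    def __init__(self):
        self.seen = set()

    def check(self, tool_name: str, tool_args: dict) -> bool:
        # sort_keys makes the hash stable regardless of argument order
        key = hashlib.sha256(
            (tool_name + json.dumps(tool_args, sort_keys=True)).encode()
        ).hexdigest()
        if key in self.seen:
            return True   # looping: inject a recovery message or escalate
        self.seen.add(key)
        return False
```

On the first repeat, inject the recovery message into the conversation as a tool result; on the second, raise something like the LoopDetectedError used in the retry-loop example above.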

How should partial results be handled?

For multi-step tasks where some steps succeed and some fail: return the successful partial result with a clear indication of what's missing. Don't discard successful work just because one step failed. Structure the response: {success_data: {...}, missing_data: ['pricing', 'inventory'], reason: 'API timeout on inventory check'}.
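Assembling that response shape from per-step outcomes can be sketched as follows; the function name and the per-step result shape (matching the safe_tool_executor example above) are assumptions for illustration:

```python
def assemble_partial_result(step_results: dict) -> dict:
    """Merge per-step outcomes into the partial-result shape described above.
    Each value is a dict like {'status': 'success', 'data': ...} or
    {'status': 'error', 'error_type': ...} (hypothetical shape)."""
    success_data = {name: r["data"] for name, r in step_results.items()
                    if r["status"] == "success"}
    missing = [name for name, r in step_results.items()
               if r["status"] != "success"]
    reason = "; ".join(f"{name}: {step_results[name].get('error_type', 'UNKNOWN')}"
                       for name in missing)
    return {"success_data": success_data,
            "missing_data": missing,
            "reason": reason or None}
```

The key point is that successful steps survive: the caller gets everything that worked, plus a machine-readable account of what didn't.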

When should I escalate to a human?

Escalate when: (1) repeated retries fail on the same step, (2) the agent's confidence in its output is explicitly low, (3) the task involves irreversible actions (delete, payment, send email) and validation failed, or (4) the user has requested human review. Build explicit escalation paths into your agent architecture, not just as an afterthought.
