LLM Function Calling & Tool Use: Production Architecture
Last updated: April 16, 2026
Quick answer
Use native function calling (Anthropic tool_use, OpenAI functions) for structured, validated tool dispatch. Design schemas with strict required/optional fields and enum constraints. Implement a max-tool-calls guard (10 per session). On tool failure, return a structured error result — never skip the tool result, or the model will hallucinate. Parallel tool calls reduce latency by 50-70% when multiple tools can run concurrently.
The problem
Naive tool-use implementations fail in production because tool schemas are underspecified (causing hallucinated arguments), error handling is absent (one tool failure breaks the entire chain), and there are no guards against the model calling tools indefinitely. Teams report 15-25% of tool-augmented LLM requests fail silently — the model calls a tool, gets an error, then either hallucinates a result or loops infinitely.
Architecture
User Request
The user query that requires tool augmentation. Can be a natural language request ('What is the weather in Tokyo?') or a structured task requiring multiple tool calls ('Compare the stock performance of AAPL and MSFT this week and summarize the news').
Alternatives: API request, Slack message, Programmatic task
Tool Schema Registry
Centralized store of all tool definitions in the provider's JSON Schema format. Versioned alongside code. At request time, selects the relevant subset of tools to include in the LLM context (never include all tools — inject only what's needed for the task to reduce context size).
Alternatives: OpenAPI spec (converted to tool schemas), Custom YAML registry, LangChain tool definitions
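A minimal sketch of the registry-with-subsetting idea. The tool names, schema shape, and `topics` tagging scheme are illustrative assumptions, not any SDK's API:

```typescript
// Hypothetical tool schema registry: tools are tagged with topics, and only
// the subset relevant to the current request is injected into the LLM context.
type ToolSchema = {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
};

const registry = new Map<string, { schema: ToolSchema; topics: string[] }>([
  ["search_web", {
    schema: { name: "search_web", description: "Web search", input_schema: { type: "object" } },
    topics: ["search", "news"],
  }],
  ["get_stock_price", {
    schema: { name: "get_stock_price", description: "Quote lookup", input_schema: { type: "object" } },
    topics: ["finance"],
  }],
  ["run_python", {
    schema: { name: "run_python", description: "Sandboxed code execution", input_schema: { type: "object" } },
    topics: ["analysis"],
  }],
]);

// Inject only what the task needs, never the whole registry.
function selectTools(topics: string[]): ToolSchema[] {
  return [...registry.values()]
    .filter((t) => t.topics.some((topic) => topics.includes(topic)))
    .map((t) => t.schema);
}
```

How you derive the topic list (keyword routing, a cheap classifier model, or per-endpoint configuration) is a separate design choice.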
LLM with Tool Use
The LLM that decides which tools to call, with what arguments, and in what order. Modern LLMs support parallel tool calls — multiple tool_use blocks in a single response. The model continues calling tools until it decides to generate a final text response.
Alternatives: gpt-4o, gemini-2-flash, claude-haiku-4 (for simpler tool decisions), Mistral Large 2
Tool Call Router
Receives the LLM's tool_use blocks, validates arguments against the schema, routes each call to the appropriate tool executor, and runs parallel calls concurrently. Enforces argument validation before execution to catch schema violations before they cause side effects.
Alternatives: LangChain AgentExecutor, LlamaIndex tool routing, OpenAI Assistants thread runner
Search Tool
Web or database search tool. Takes a query string, returns structured results. Common: Tavily Search, Brave Search, or custom vector DB search. Return format matters — LLMs perform better with concise, structured results vs raw HTML.
Alternatives: Brave Search API, Exa AI, SerpAPI, Custom Postgres FTS
Code Execution Tool
Runs Python or JavaScript in a sandboxed environment. Returns stdout, stderr, and generated files. Critical: always sandbox — never execute model-generated code on your production server.
Alternatives: Modal sandboxes, Pyodide (browser-based), AWS Lambda + containers
External API Tool
Calls external REST APIs (weather, stocks, CRM, calendar). Arguments from the LLM are validated against the schema, then injected into the API request. Auth credentials are never exposed in tool schemas — injected server-side.
Alternatives: Zapier (no-code), Make.com, Composio (managed tools)
Max Tool Calls Guard
Tracks the number of tool calls in the current session. If the model exceeds the limit (default: 10 calls per user request), abort the loop and return the partial result with a 'max tools reached' message. Prevents runaway costs from models stuck in tool-call loops.
Alternatives: LangGraph recursion_limit, Custom middleware counter
Tool Result Injector
Formats each tool execution result as a `tool_result` message block and injects it back into the conversation thread. CRITICAL: every tool_use block MUST have a corresponding tool_result — if you omit any, the API returns a validation error. Failed tools return error results, not empty results.
Alternatives: Anthropic SDK message building helpers, OpenAI function result objects
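A sketch of the injector's core invariant, every tool_use id gets exactly one tool_result. The block shapes follow Anthropic's content-block format; the outcome map is an assumed internal representation:

```typescript
type ToolUse = { type: "tool_use"; id: string; name: string; input: unknown };
type ToolResult = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
};

function buildToolResults(
  uses: ToolUse[],
  outcomes: Map<string, { ok: boolean; output: string }>
): ToolResult[] {
  return uses.map((use) => {
    const outcome = outcomes.get(use.id);
    // Never drop a block: a missing executor outcome becomes an error result,
    // so the API never sees an unmatched tool_use.
    if (!outcome) {
      return {
        type: "tool_result",
        tool_use_id: use.id,
        content: "Error: tool executor returned no result",
        is_error: true,
      };
    }
    return {
      type: "tool_result",
      tool_use_id: use.id,
      content: outcome.output,
      ...(outcome.ok ? {} : { is_error: true }),
    };
  });
}
```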
Final LLM Response
After all tools have completed, the LLM synthesizes a final text response incorporating tool results. This response should be streamed to the client for UX. May include formatted data, summaries, or structured outputs.
Alternatives: Streamed text response, Structured JSON (via response_format), Rendered markdown
The stack
Claude Sonnet 4 has the strongest tool schema adherence in benchmarks — 95%+ schema compliance vs 88-92% for GPT-4o on complex nested schemas. It also supports parallel tool calls natively and has a 200K context window for injecting large tool results.
Alternatives: OpenAI GPT-4o, Google Gemini 2.5 Pro, Mistral Large 2
Define tool schemas in TypeScript/Zod and compile to JSON Schema at build time. This gives you: type safety in your tool executor (validate args before execution), auto-generated JSON Schema for the LLM API, and runtime validation that catches schema drift. Never write JSON Schema by hand — one typo makes the schema invalid and the LLM silently falls back to hallucinated args.
Alternatives: JSON Schema directly, OpenAI function definitions, Pydantic (Python)
When the LLM returns multiple tool_use blocks, execute them concurrently. For 3 tools that each take 500ms (1500ms run sequentially), parallel execution takes ~500ms — a 3x latency reduction. Use Promise.allSettled() rather than Promise.all() so a single tool failure doesn't abort all parallel calls.
Alternatives: Promise.allSettled() for error isolation, Sequential execution (fallback), Worker threads for CPU-intensive tools
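A minimal sketch of the parallel dispatch with failure isolation; the executor map and call shape are illustrative assumptions:

```typescript
type Call = { id: string; name: string; input: unknown };

async function executeParallel(
  calls: Call[],
  executors: Record<string, (input: unknown) => Promise<string>>
): Promise<{ id: string; ok: boolean; output: string }[]> {
  // allSettled, not all: one rejected tool must not abort its siblings.
  const settled = await Promise.allSettled(
    calls.map((c) => {
      const exec = executors[c.name];
      return exec ? exec(c.input) : Promise.reject(new Error(`unknown tool: ${c.name}`));
    })
  );
  return settled.map((r, i) => ({
    id: calls[i].id,
    ok: r.status === "fulfilled",
    output:
      r.status === "fulfilled"
        ? r.value
        : `Error: ${r.reason instanceof Error ? r.reason.message : String(r.reason)}`,
  }));
}
```

Each entry in the returned array maps back to a tool_use id, so failed calls become error tool_results rather than dropped blocks.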
Validate all LLM-supplied arguments before execution — models hallucinate invalid argument types or values ~5-10% of the time. `safeParse()` returns an error result (which gets returned as a tool_result error) instead of throwing and crashing your server. Validation cost: <0.1ms per call.
Alternatives: JSON Schema Validator (AJV), Manual argument checking
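A hand-rolled stand-in for the validate-before-execute step, returning a result object instead of throwing, the same contract as Zod's `safeParse()`. In production you would write `z.object({...}).safeParse(args)`; the weather-tool fields here are hypothetical:

```typescript
type WeatherArgs = { city: string; unit: "celsius" | "fahrenheit" };
type ParseResult =
  | { success: true; data: WeatherArgs }
  | { success: false; error: string };

function safeParseWeatherArgs(args: unknown): ParseResult {
  if (typeof args !== "object" || args === null) {
    return { success: false, error: "args must be an object" };
  }
  const { city, unit } = args as Record<string, unknown>;
  if (typeof city !== "string" || city.length === 0) {
    return { success: false, error: "city: expected a non-empty string" };
  }
  if (unit !== "celsius" && unit !== "fahrenheit") {
    return { success: false, error: "unit: expected 'celsius' | 'fahrenheit'" };
  }
  // Validation passed: safe to hand to the tool executor.
  return { success: true, data: { city, unit } };
}
```

On failure, the `error` string goes back to the model as a tool_result error, which lets it correct its own arguments on the next turn.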
Anthropic's API requires a tool_result for every tool_use block. If you omit it, you get a 400 error. If a tool fails, return: `{type: 'tool_result', tool_use_id: '...', is_error: true, content: 'Error: API returned 404'}`. The model will then either try a different approach or explain to the user what went wrong — much better than a crash.
Alternatives: Retry the tool call once, Fallback to alternative tool
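A sketch of the always-return-a-result wrapper: a failing executor yields an `is_error` tool_result instead of crashing the loop. The block shape follows Anthropic's tool_result format; the executor is a placeholder:

```typescript
type ToolResult = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
};

async function runTool(
  toolUseId: string,
  exec: () => Promise<string>
): Promise<ToolResult> {
  try {
    return { type: "tool_result", tool_use_id: toolUseId, content: await exec() };
  } catch (err) {
    // The model sees the error text and can try another approach or explain
    // the failure to the user, instead of the whole request 500ing.
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      content: `Error: ${err instanceof Error ? err.message : String(err)}`,
      is_error: true,
    };
  }
}
```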
Without a loop guard, a model stuck in a tool-calling loop (e.g., search → not found → search → not found...) can run up $5-50 in a single user session. 10 tool calls covers 99% of legitimate use cases. At call 10, return whatever partial result you have with a note that the tool limit was reached.
Alternatives: LangGraph recursion_limit, Time-based session timeout, Token budget tracking
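The guard itself is a few lines of state. A minimal sketch, with the default of 10 from above; the class and method names are illustrative:

```typescript
class ToolCallGuard {
  private count = 0;
  constructor(private readonly limit = 10) {}

  // Returns false once the budget is exhausted; the agent loop should then
  // stop executing tools and ask the model for a best-effort final answer.
  tryConsume(): boolean {
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }

  get exhausted(): boolean {
    return this.count >= this.limit;
  }
}
```

Instantiate one guard per user request, not per process, so concurrent sessions don't share a budget.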
Tool results consume input tokens — at $3/M for Claude Sonnet 4, a 10K token tool result costs $0.03 per call, and with 10 tool calls per session that's $0.30 in tool-result tokens alone. Compress results: for web search, return {title, url, snippet} not full HTML. For API responses, extract only the fields specified in the tool's output_schema.
Alternatives: Raw HTML (for web search — avoid this), Full API response payload
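A sketch of the compression step for search results. The raw-hit field names are assumptions about a generic search API response, not any specific provider's shape:

```typescript
type RawHit = { title: string; url: string; content: string; rawHtml?: string };
type CompactHit = { title: string; url: string; snippet: string };

function compressHits(hits: RawHit[], snippetChars = 300): CompactHit[] {
  return hits.map((h) => ({
    title: h.title,
    url: h.url,
    // Drop HTML entirely; keep only a short plain-text snippet.
    snippet: h.content.slice(0, snippetChars),
  }));
}
```

At $3/M input tokens, shrinking a 10K-token result to ~500 tokens cuts that tool-result's cost from $0.03 to under $0.002 per injection.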
Cost at each scale
Prototype: 1,000 requests/mo, avg 3 tool calls each — $35/mo
Growth: 50,000 requests/mo, avg 4 tool calls each — $1,850/mo
Scale: 1M requests/mo, avg 5 tool calls each — $32,000/mo
Failure modes & guardrails
Failure mode: the model repeatedly calls the same tool with slightly different arguments hoping for a different result. Mitigation: implement a session-level counter (max 10 calls) and a deduplication check: if the same (tool_name, args_hash) pair appears twice in a session, skip execution and return 'duplicate call detected'. Break the loop by returning 'max tools reached' and asking the model to synthesize a best-effort answer from what it has.
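The deduplication check is a session-scoped set keyed on a hash of (tool_name, args). A minimal sketch:

```typescript
import { createHash } from "node:crypto";

// One Set per session; skip execution when the same (tool, args) pair repeats.
const seen = new Set<string>();

function callKey(toolName: string, args: unknown): string {
  // JSON.stringify is key-order-sensitive; a canonicalizing serializer would
  // be more robust, but this catches verbatim retries.
  return createHash("sha256")
    .update(toolName + "\u0000" + JSON.stringify(args))
    .digest("hex");
}

function isDuplicateCall(toolName: string, args: unknown): boolean {
  const key = callKey(toolName, args);
  if (seen.has(key)) return true;
  seen.add(key);
  return false;
}
```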
Failure mode: omitting a tool_result for any tool_use block causes Anthropic API 400 errors ('all tool_use blocks must have a corresponding tool_result'). Mitigation: wrap every tool executor in try-catch so it always returns a result — either success or error. Track outstanding tool_use IDs in a Map and assert all are fulfilled before sending the next message.
Failure mode: the LLM invents argument values not present in the user's input (e.g., making up a zip code for a weather lookup). Mitigation: add explicit descriptions to schema properties explaining where each value should come from. If required arguments cannot be inferred from context, use the `ask_user` tool pattern — define a tool that signals 'I need more information from the user' instead of hallucinating.
Failure mode: external tools (search, weather, CRM) have their own rate limits, and when a tool returns a rate-limit error the model often retries the same call — compounding the problem. Mitigation: implement exponential backoff in the tool executor (not visible to the model), cache identical tool calls within a session (same args → same result for 60 seconds), and return rate-limit errors as tool_result errors with a `retry_after` hint in the message.
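The per-session cache can be sketched as a TTL map keyed on (tool, args); names are illustrative, and the injectable clock exists only to make the TTL testable:

```typescript
type CacheEntry = { result: string; expiresAt: number };
const cache = new Map<string, CacheEntry>();

async function cachedCall(
  toolName: string,
  args: unknown,
  exec: () => Promise<string>,
  ttlMs = 60_000,
  now: () => number = Date.now
): Promise<string> {
  const key = toolName + ":" + JSON.stringify(args);
  const hit = cache.get(key);
  // A retry of an identical call within the TTL hits the cache instead of
  // the upstream API, so it cannot compound a rate-limit error.
  if (hit && hit.expiresAt > now()) return hit.result;
  const result = await exec();
  cache.set(key, { result, expiresAt: now() + ttlMs });
  return result;
}
```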
Frequently asked questions
How do I design tool schemas that the LLM fills correctly?
Write tool descriptions as if explaining to a junior developer: specify what the tool does, what each argument represents, what valid values look like, and what the tool returns. Use enum constraints wherever possible (don't let the model guess the unit system — specify `'unit': {'enum': ['celsius', 'fahrenheit']}`). Add a `usage_examples` field (non-standard but models read it) with 1-2 example invocations. Test schemas by calling your LLM with adversarial inputs and checking argument quality.
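Putting that advice together, an illustrative tool definition might look like this. The tool itself is hypothetical, and `usage_examples` is, as noted, a non-standard field:

```typescript
const getWeatherTool = {
  name: "get_weather",
  description:
    "Get the current weather for a city. Use only when the user asks about " +
    "weather. Returns temperature, conditions, and humidity.",
  input_schema: {
    type: "object",
    properties: {
      city: {
        type: "string",
        description:
          "City name exactly as the user wrote it, e.g. 'Tokyo'. " +
          "Do not guess a city the user did not mention.",
      },
      unit: {
        type: "string",
        // Enum constraint: never let the model guess the unit system.
        enum: ["celsius", "fahrenheit"],
        description: "Temperature unit. Default to 'celsius' unless the user asks otherwise.",
      },
    },
    required: ["city"],
  },
  // Non-standard field, but models read it.
  usage_examples: [{ city: "Tokyo", unit: "celsius" }],
};
```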
Should I use native function calling or ask the model to output JSON?
Use native function calling for production systems. Native tool_use gives you: (1) provider-level schema validation before the API returns, catching format errors before they reach your code; (2) clear separation between reasoning (text) and action (tool_use blocks); (3) parallel tool call support; (4) explicit tool_result injection pattern. Manual JSON parsing works for simple cases but is fragile — models sometimes wrap JSON in markdown code blocks, add extra fields, or produce trailing commas that break JSON.parse().
How do I handle a tool call that requires user confirmation?
Implement a 'pause_for_confirmation' tool that the model calls instead of directly executing a destructive action (send email, delete record, make purchase). This tool returns a confirmation prompt to the client. The client displays the confirmation UI, and the user's approval/rejection is injected as the tool_result. This pattern keeps the model in the loop about user decisions without giving it unilateral authority over destructive actions.
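A sketch of the confirmation gate, the tool definition plus the tool_result that carries the user's decision back to the model. Names and shapes are illustrative:

```typescript
const pauseForConfirmationTool = {
  name: "pause_for_confirmation",
  description:
    "Call this BEFORE any destructive action (send email, delete record, make purchase). " +
    "Describe the pending action; the user will approve or reject it.",
  input_schema: {
    type: "object",
    properties: {
      action_summary: {
        type: "string",
        description: "One sentence describing exactly what will happen if approved.",
      },
    },
    required: ["action_summary"],
  },
};

// The client UI resolves approval; this helper turns the decision into the
// tool_result that re-enters the conversation thread.
function confirmationResult(toolUseId: string, approved: boolean) {
  return {
    type: "tool_result" as const,
    tool_use_id: toolUseId,
    content: approved
      ? "User approved the action."
      : "User rejected the action. Do not proceed.",
  };
}
```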
When should I use Claude's tool use vs OpenAI's function calling?
Both are functionally equivalent for most use cases; choose based on your primary model preference. Key differences: Anthropic requires tool_results for ALL tool_use blocks (OpenAI is more lenient), and Claude tends to have better schema adherence on complex nested types. Both providers return parallel tool calls as multiple blocks in a single response turn. If you need to support multiple providers, use the Vercel AI SDK or LiteLLM, which abstract the provider-specific formats.