LLM Function Calling & Tool Use: Production Architecture
Last updated: April 16, 2026
Quick answer
Use native function calling (Anthropic tool_use, OpenAI functions) for structured, validated tool dispatch. Design schemas with strict required/optional fields and enum constraints. Implement a max-tool-calls guard (10 per session). On tool failure, return a structured error result — never skip the tool result, or the model will hallucinate. Parallel tool calls reduce latency by 50-70% when multiple tools can run concurrently.
The problem
Naive tool-use implementations fail in production because tool schemas are underspecified (causing hallucinated arguments), error handling is absent (one tool failure breaks the entire chain), and there are no guards against the model calling tools indefinitely. Teams report 15-25% of tool-augmented LLM requests fail silently — the model calls a tool, gets an error, then either hallucinates a result or loops infinitely.
Architecture
User Request
The user query that requires tool augmentation. Can be a natural language request ('What is the weather in Tokyo?') or a structured task requiring multiple tool calls ('Compare the stock performance of AAPL and MSFT this week and summarize the news').
Alternatives: API request, Slack message, Programmatic task
Tool Schema Registry
Centralized store of all tool definitions in the provider's JSON Schema format. Versioned alongside code. At request time, selects the relevant subset of tools to include in the LLM context (never include all tools — inject only what's needed for the task to reduce context size).
Alternatives: OpenAPI spec (converted to tool schemas), Custom YAML registry, LangChain tool definitions
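A minimal sketch of the registry-with-subsetting idea. The tool names, schema shape, and `topics` tagging scheme are illustrative assumptions, not any SDK's API:

```typescript
// Hypothetical tool schema registry: tools are tagged with topics, and only
// the subset relevant to the current request is injected into the LLM context.
type ToolSchema = {
  name: string;
  description: string;
  input_schema: Record<string, unknown>;
};

const registry = new Map<string, { schema: ToolSchema; topics: string[] }>([
  ["search_web", {
    schema: { name: "search_web", description: "Web search", input_schema: { type: "object" } },
    topics: ["search", "news"],
  }],
  ["get_stock_price", {
    schema: { name: "get_stock_price", description: "Quote lookup", input_schema: { type: "object" } },
    topics: ["finance"],
  }],
  ["run_python", {
    schema: { name: "run_python", description: "Sandboxed code execution", input_schema: { type: "object" } },
    topics: ["analysis"],
  }],
]);

// Inject only what the task needs, never the whole registry.
function selectTools(topics: string[]): ToolSchema[] {
  return [...registry.values()]
    .filter((t) => t.topics.some((topic) => topics.includes(topic)))
    .map((t) => t.schema);
}
```

How you derive the topic list (keyword routing, a cheap classifier model, or per-endpoint configuration) is a separate design choice.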
LLM with Tool Use
The LLM that decides which tools to call, with what arguments, and in what order. Modern LLMs support parallel tool calls — multiple tool_use blocks in a single response. The model continues calling tools until it decides to generate a final text response.
Alternatives: gpt-4o, gemini-2-flash, claude-haiku-4 (for simpler tool decisions), Mistral Large 2
Tool Call Router
Receives the LLM's tool_use blocks, validates arguments against the schema, routes each call to the appropriate tool executor, and runs parallel calls concurrently. Enforces argument validation before execution to catch schema violations before they cause side effects.
Alternatives: LangChain AgentExecutor, LlamaIndex tool routing, OpenAI Assistants thread runner
Search Tool
Web or database search tool. Takes a query string, returns structured results. Common: Tavily Search, Brave Search, or custom vector DB search. Return format matters — LLMs perform better with concise, structured results vs raw HTML.
Alternatives: Brave Search API, Exa AI, SerpAPI, Custom Postgres FTS
Code Execution Tool
Runs Python or JavaScript in a sandboxed environment. Returns stdout, stderr, and generated files. Critical: always sandbox — never execute model-generated code on your production server.
Alternatives: Modal sandboxes, Pyodide (browser-based), AWS Lambda + containers
External API Tool
Calls external REST APIs (weather, stocks, CRM, calendar). Arguments from the LLM are validated against the schema, then injected into the API request. Auth credentials are never exposed in tool schemas — injected server-side.
Alternatives: Zapier (no-code), Make.com, Composio (managed tools)
Max Tool Calls Guard
Tracks the number of tool calls in the current session. If the model exceeds the limit (default: 10 calls per user request), abort the loop and return the partial result with a 'max tools reached' message. Prevents runaway costs from models stuck in tool-call loops.
Alternatives: LangGraph recursion_limit, Custom middleware counter
Tool Result Injector
Formats each tool execution result as a `tool_result` message block and injects it back into the conversation thread. CRITICAL: every tool_use block MUST have a corresponding tool_result — if you omit any, the API returns a validation error. Failed tools return error results, not empty results.
Alternatives: Anthropic SDK message building helpers, OpenAI function result objects
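A sketch of the injector's core invariant, every tool_use id gets exactly one tool_result. The block shapes follow Anthropic's content-block format; the outcome map is an assumed internal representation:

```typescript
type ToolUse = { type: "tool_use"; id: string; name: string; input: unknown };
type ToolResult = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
};

function buildToolResults(
  uses: ToolUse[],
  outcomes: Map<string, { ok: boolean; output: string }>
): ToolResult[] {
  return uses.map((use) => {
    const outcome = outcomes.get(use.id);
    // Never drop a block: a missing executor outcome becomes an error result,
    // so the API never sees an unmatched tool_use.
    if (!outcome) {
      return {
        type: "tool_result",
        tool_use_id: use.id,
        content: "Error: tool executor returned no result",
        is_error: true,
      };
    }
    return {
      type: "tool_result",
      tool_use_id: use.id,
      content: outcome.output,
      ...(outcome.ok ? {} : { is_error: true }),
    };
  });
}
```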
Final LLM Response
After all tools have completed, the LLM synthesizes a final text response incorporating tool results. This response should be streamed to the client for UX. May include formatted data, summaries, or structured outputs.
Alternatives: Streamed text response, Structured JSON (via response_format), Rendered markdown
The stack
Claude Sonnet 4 has the strongest tool schema adherence in benchmarks — 95%+ schema compliance vs 88-92% for GPT-4o on complex nested schemas. It also supports parallel tool calls natively and has a 200K context window for injecting large tool results.
Alternatives: OpenAI GPT-4o, Google Gemini 2.5 Pro, Mistral Large 2
Define tool schemas in TypeScript/Zod and compile to JSON Schema at build time. This gives you: type safety in your tool executor (validate args before execution), auto-generated JSON Schema for the LLM API, and runtime validation that catches schema drift. Never write JSON Schema by hand — one typo makes the schema invalid and the LLM silently falls back to hallucinated args.
Alternatives: JSON Schema directly, OpenAI function definitions, Pydantic (Python)
When the LLM returns multiple tool_use blocks, execute them concurrently. For 3 tools that each take 500ms (1500ms run sequentially), parallel execution takes ~500ms — a 3x latency reduction. Use Promise.allSettled() rather than Promise.all() so a single tool failure doesn't abort all parallel calls.
Alternatives: Promise.allSettled() for error isolation, Sequential execution (fallback), Worker threads for CPU-intensive tools
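A minimal sketch of the parallel dispatch with failure isolation; the executor map and call shape are illustrative assumptions:

```typescript
type Call = { id: string; name: string; input: unknown };

async function executeParallel(
  calls: Call[],
  executors: Record<string, (input: unknown) => Promise<string>>
): Promise<{ id: string; ok: boolean; output: string }[]> {
  // allSettled, not all: one rejected tool must not abort its siblings.
  const settled = await Promise.allSettled(
    calls.map((c) => {
      const exec = executors[c.name];
      return exec ? exec(c.input) : Promise.reject(new Error(`unknown tool: ${c.name}`));
    })
  );
  return settled.map((r, i) => ({
    id: calls[i].id,
    ok: r.status === "fulfilled",
    output:
      r.status === "fulfilled"
        ? r.value
        : `Error: ${r.reason instanceof Error ? r.reason.message : String(r.reason)}`,
  }));
}
```

Each entry in the returned array maps back to a tool_use id, so failed calls become error tool_results rather than dropped blocks.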
Validate all LLM-supplied arguments before execution — models hallucinate invalid argument types or values ~5-10% of the time. `safeParse()` returns an error result (which gets returned as a tool_result error) instead of throwing and crashing your server. Validation cost: <0.1ms per call.
Alternatives: JSON Schema Validator (AJV), Manual argument checking
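A hand-rolled stand-in for the validate-before-execute step, returning a result object instead of throwing, the same contract as Zod's `safeParse()`. In production you would write `z.object({...}).safeParse(args)`; the weather-tool fields here are hypothetical:

```typescript
type WeatherArgs = { city: string; unit: "celsius" | "fahrenheit" };
type ParseResult =
  | { success: true; data: WeatherArgs }
  | { success: false; error: string };

function safeParseWeatherArgs(args: unknown): ParseResult {
  if (typeof args !== "object" || args === null) {
    return { success: false, error: "args must be an object" };
  }
  const { city, unit } = args as Record<string, unknown>;
  if (typeof city !== "string" || city.length === 0) {
    return { success: false, error: "city: expected a non-empty string" };
  }
  if (unit !== "celsius" && unit !== "fahrenheit") {
    return { success: false, error: "unit: expected 'celsius' | 'fahrenheit'" };
  }
  // Validation passed: safe to hand to the tool executor.
  return { success: true, data: { city, unit } };
}
```

On failure, the `error` string goes back to the model as a tool_result error, which lets it correct its own arguments on the next turn.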
Anthropic's API requires a tool_result for every tool_use block. If you omit it, you get a 400 error. If a tool fails, return: `{type: 'tool_result', tool_use_id: '...', is_error: true, content: 'Error: API returned 404'}`. The model will then either try a different approach or explain to the user what went wrong — much better than a crash.
Alternatives: Retry the tool call once, Fallback to alternative tool
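A sketch of the always-return-a-result wrapper: a failing executor yields an `is_error` tool_result instead of crashing the loop. The block shape follows Anthropic's tool_result format; the executor is a placeholder:

```typescript
type ToolResult = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
};

async function runTool(
  toolUseId: string,
  exec: () => Promise<string>
): Promise<ToolResult> {
  try {
    return { type: "tool_result", tool_use_id: toolUseId, content: await exec() };
  } catch (err) {
    // The model sees the error text and can try another approach or explain
    // the failure to the user, instead of the whole request 500ing.
    return {
      type: "tool_result",
      tool_use_id: toolUseId,
      content: `Error: ${err instanceof Error ? err.message : String(err)}`,
      is_error: true,
    };
  }
}
```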
Without a loop guard, a model stuck in a tool-calling loop (e.g., search → not found → search → not found...) can run up $5-50 in a single user session. 10 tool calls covers 99% of legitimate use cases. At call 10, return whatever partial result you have with a note that the tool limit was reached.
Alternatives: LangGraph recursion_limit, Time-based session timeout, Token budget tracking
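The guard itself is a few lines of state. A minimal sketch, with the default of 10 from above; the class and method names are illustrative:

```typescript
class ToolCallGuard {
  private count = 0;
  constructor(private readonly limit = 10) {}

  // Returns false once the budget is exhausted; the agent loop should then
  // stop executing tools and ask the model for a best-effort final answer.
  tryConsume(): boolean {
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }

  get exhausted(): boolean {
    return this.count >= this.limit;
  }
}
```

Instantiate one guard per user request, not per process, so concurrent sessions don't share a budget.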
Tool results consume input tokens — at $3/M for Claude Sonnet 4, a 10K token tool result costs $0.03 per call, and with 10 tool calls per session that's $0.30 in tool-result tokens alone. Compress results: for web search, return {title, url, snippet} not full HTML. For API responses, extract only the fields specified in the tool's output_schema.
Alternatives: Raw HTML (for web search — avoid this), Full API response payload
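A sketch of the compression step for search results. The raw-hit field names are assumptions about a generic search API response, not any specific provider's shape:

```typescript
type RawHit = { title: string; url: string; content: string; rawHtml?: string };
type CompactHit = { title: string; url: string; snippet: string };

function compressHits(hits: RawHit[], snippetChars = 300): CompactHit[] {
  return hits.map((h) => ({
    title: h.title,
    url: h.url,
    // Drop HTML entirely; keep only a short plain-text snippet.
    snippet: h.content.slice(0, snippetChars),
  }));
}
```

At $3/M input tokens, shrinking a 10K-token result to ~500 tokens cuts that tool-result's cost from $0.03 to under $0.002 per injection.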
Cost at each scale
Prototype: 1,000 requests/mo, avg 3 tool calls each — $35/mo
Growth: 50,000 requests/mo, avg 4 tool calls each — $1,850/mo
Scale: 1M requests/mo, avg 5 tool calls each — $32,000/mo
Failure modes & guardrails
Failure mode: the model repeatedly calls the same tool with slightly different arguments hoping for a different result. Mitigation: implement a session-level counter (max 10 calls) and a deduplication check: if the same (tool_name, args_hash) pair appears twice in a session, skip execution and return 'duplicate call detected'. Break the loop by returning 'max tools reached' and asking the model to synthesize a best-effort answer from what it has.
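The deduplication check is a session-scoped set keyed on a hash of (tool_name, args). A minimal sketch:

```typescript
import { createHash } from "node:crypto";

// One Set per session; skip execution when the same (tool, args) pair repeats.
const seen = new Set<string>();

function callKey(toolName: string, args: unknown): string {
  // JSON.stringify is key-order-sensitive; a canonicalizing serializer would
  // be more robust, but this catches verbatim retries.
  return createHash("sha256")
    .update(toolName + "\u0000" + JSON.stringify(args))
    .digest("hex");
}

function isDuplicateCall(toolName: string, args: unknown): boolean {
  const key = callKey(toolName, args);
  if (seen.has(key)) return true;
  seen.add(key);
  return false;
}
```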
Failure mode: omitting a tool_result for any tool_use block causes Anthropic API 400 errors ('all tool_use blocks must have a corresponding tool_result'). Mitigation: wrap every tool executor in try-catch so it always returns a result — either success or error. Track outstanding tool_use IDs in a Map and assert all are fulfilled before sending the next message.
Failure mode: the LLM invents argument values not present in the user's input (e.g., making up a zip code for a weather lookup). Mitigation: add explicit descriptions to schema properties explaining where each value should come from. If required arguments cannot be inferred from context, use the `ask_user` tool pattern — define a tool that signals 'I need more information from the user' instead of hallucinating.
Failure mode: external tools (search, weather, CRM) have their own rate limits, and when a tool returns a rate-limit error the model often retries the same call — compounding the problem. Mitigation: implement exponential backoff in the tool executor (not visible to the model), cache identical tool calls within a session (same args → same result for 60 seconds), and return rate-limit errors as tool_result errors with a `retry_after` hint in the message.
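The per-session cache can be sketched as a TTL map keyed on (tool, args); names are illustrative, and the injectable clock exists only to make the TTL testable:

```typescript
type CacheEntry = { result: string; expiresAt: number };
const cache = new Map<string, CacheEntry>();

async function cachedCall(
  toolName: string,
  args: unknown,
  exec: () => Promise<string>,
  ttlMs = 60_000,
  now: () => number = Date.now
): Promise<string> {
  const key = toolName + ":" + JSON.stringify(args);
  const hit = cache.get(key);
  // A retry of an identical call within the TTL hits the cache instead of
  // the upstream API, so it cannot compound a rate-limit error.
  if (hit && hit.expiresAt > now()) return hit.result;
  const result = await exec();
  cache.set(key, { result, expiresAt: now() + ttlMs });
  return result;
}
```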
Frequently asked questions
How do I design tool schemas that the LLM fills correctly?
Write tool descriptions as if explaining to a junior developer: specify what the tool does, what each argument represents, what valid values look like, and what the tool returns. Use enum constraints wherever possible (don't let the model guess the unit system — specify `'unit': {'enum': ['celsius', 'fahrenheit']}`). Add a `usage_examples` field (non-standard but models read it) with 1-2 example invocations. Test schemas by calling your LLM with adversarial inputs and checking argument quality.
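Putting that advice together, an illustrative tool definition might look like this. The tool itself is hypothetical, and `usage_examples` is, as noted, a non-standard field:

```typescript
const getWeatherTool = {
  name: "get_weather",
  description:
    "Get the current weather for a city. Use only when the user asks about " +
    "weather. Returns temperature, conditions, and humidity.",
  input_schema: {
    type: "object",
    properties: {
      city: {
        type: "string",
        description:
          "City name exactly as the user wrote it, e.g. 'Tokyo'. " +
          "Do not guess a city the user did not mention.",
      },
      unit: {
        type: "string",
        // Enum constraint: never let the model guess the unit system.
        enum: ["celsius", "fahrenheit"],
        description: "Temperature unit. Default to 'celsius' unless the user asks otherwise.",
      },
    },
    required: ["city"],
  },
  // Non-standard field, but models read it.
  usage_examples: [{ city: "Tokyo", unit: "celsius" }],
};
```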
Should I use native function calling or ask the model to output JSON?
Use native function calling for production systems. Native tool_use gives you: (1) provider-level schema validation before the API returns, catching format errors before they reach your code; (2) clear separation between reasoning (text) and action (tool_use blocks); (3) parallel tool call support; (4) explicit tool_result injection pattern. Manual JSON parsing works for simple cases but is fragile — models sometimes wrap JSON in markdown code blocks, add extra fields, or produce trailing commas that break JSON.parse().
How do I handle a tool call that requires user confirmation?
Implement a 'pause_for_confirmation' tool that the model calls instead of directly executing a destructive action (send email, delete record, make purchase). This tool returns a confirmation prompt to the client. The client displays the confirmation UI, and the user's approval/rejection is injected as the tool_result. This pattern keeps the model in the loop about user decisions without giving it unilateral authority over destructive actions.
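A sketch of the confirmation gate, the tool definition plus the tool_result that carries the user's decision back to the model. Names and shapes are illustrative:

```typescript
const pauseForConfirmationTool = {
  name: "pause_for_confirmation",
  description:
    "Call this BEFORE any destructive action (send email, delete record, make purchase). " +
    "Describe the pending action; the user will approve or reject it.",
  input_schema: {
    type: "object",
    properties: {
      action_summary: {
        type: "string",
        description: "One sentence describing exactly what will happen if approved.",
      },
    },
    required: ["action_summary"],
  },
};

// The client UI resolves approval; this helper turns the decision into the
// tool_result that re-enters the conversation thread.
function confirmationResult(toolUseId: string, approved: boolean) {
  return {
    type: "tool_result" as const,
    tool_use_id: toolUseId,
    content: approved
      ? "User approved the action."
      : "User rejected the action. Do not proceed.",
  };
}
```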
When should I use Claude's tool use vs OpenAI's function calling?
Both are functionally equivalent for most use cases; choose based on your primary model preference. Key differences: Anthropic requires tool_results for ALL tool_use blocks (OpenAI is more lenient), and Claude tends to have better schema adherence on complex nested types. Both providers return parallel tool calls as multiple blocks in a single response turn. If you need to support multiple providers, use the Vercel AI SDK or LiteLLM, which abstract the provider-specific formats.