Multi-Agent Orchestration: Supervisor-Worker Architecture
Last updated: April 16, 2026
Quick answer
Use a supervisor LLM (Claude Sonnet 4 or GPT-4o) that decomposes tasks and dispatches them to specialized sub-agents via a task queue. Each sub-agent runs in isolation with its own context. Total latency: 5-30s with parallel workers vs 60-120s sequential. Cost at 10K tasks/mo: roughly $680 all-in, varying with task complexity and model tier.
The problem
Single-agent LLM systems hit context window limits (~200K tokens) and quality ceilings when tasks require diverse expertise — e.g., a research task that needs web search, code execution, and document synthesis simultaneously. Naive chaining creates brittle sequential pipelines that fail when any step errors. Teams report 40-60% of complex agentic tasks fail silently in single-agent setups due to missed subtasks or hallucinated tool calls.
Architecture
User / Client Request
The initial task or query that kicks off the orchestration. Can be a natural-language instruction, structured JSON task, or API call.
Alternatives: Slack message, API webhook, Scheduled cron job
Supervisor LLM
The coordinator model that decomposes the user request into subtasks, assigns them to specialized workers, monitors completion, and synthesizes final output. Must have tool-use capability to call the task dispatcher.
Alternatives: gpt-4o, gemini-2.0-flash, claude-opus-4
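A minimal sketch of the dispatch contract, assuming the supervisor is prompted to emit a JSON array of {worker, instruction} objects (the field names and `parse_dispatch` helper here are illustrative, not a fixed API):

```python
import json
from dataclasses import dataclass

# Fixed set of worker types the supervisor may dispatch to.
WORKER_TYPES = {"search", "code", "doc"}

@dataclass
class Subtask:
    worker: str
    instruction: str

def parse_dispatch(raw: str) -> list[Subtask]:
    """Parse and validate the supervisor's JSON dispatch output."""
    subtasks = []
    for item in json.loads(raw):
        if item["worker"] not in WORKER_TYPES:
            raise ValueError(f"unknown worker type: {item['worker']!r}")
        subtasks.append(Subtask(item["worker"], item["instruction"]))
    return subtasks

dispatch = '[{"worker": "search", "instruction": "find recent GPU prices"}]'
print(parse_dispatch(dispatch)[0].worker)  # search
```

Rejecting unknown worker types at parse time is what keeps the supervisor from inventing agents that do not exist.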
Task Queue / Dispatcher
Receives subtask assignments from the supervisor and routes them to the appropriate worker agent. Tracks task status, handles retries on failure, and enforces max concurrency limits.
Alternatives: Celery + RabbitMQ, AWS SQS + Lambda, Inngest, Temporal
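One way to enforce a max-concurrency limit is a semaphore around worker calls. This sketch uses `asyncio`, with a stand-in `run_worker` coroutine in place of a real agent call:

```python
import asyncio

async def run_worker(task_id: int) -> str:
    """Stand-in for a real sub-agent call."""
    await asyncio.sleep(0.01)
    return f"result-{task_id}"

async def dispatch_all(task_ids, max_concurrency: int = 3) -> list[str]:
    """Fan out subtasks while never running more than max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(tid: int) -> str:
        async with sem:
            return await run_worker(tid)

    return await asyncio.gather(*(guarded(t) for t in task_ids))

results = asyncio.run(dispatch_all(range(5)))
print(results)  # ['result-0', 'result-1', 'result-2', 'result-3', 'result-4']
```

A production queue (BullMQ, SQS, Temporal) adds persistence and retries on top of the same fan-out pattern.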
Search Worker Agent
Specialized sub-agent for web search and information retrieval. Given a specific query, it uses Tavily or Brave Search and returns structured results to the supervisor.
Alternatives: gpt-4o-mini, Brave Search, Exa AI
Code Execution Worker
Runs Python/JS code in a sandboxed environment (E2B or Pyodide). Returns stdout, stderr, and file artifacts back to the supervisor.
Alternatives: Modal, Daytona, Replit Agent
Document Analysis Worker
Processes PDFs, spreadsheets, or structured data. Extracts key information and returns structured JSON summaries to the supervisor.
Alternatives: gpt-4o, Gemini 2.5 Pro
Shared Memory / State Store
Centralized store where all agents read and write intermediate results. Prevents redundant work and enables the supervisor to track global state. TTL-based expiry cleans up completed task state.
Alternatives: Supabase (Postgres), Upstash Redis, DynamoDB
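The store's read/write-with-TTL behavior can be sketched with an in-memory dict standing in for Redis (a real deployment would use `SET key value EX ttl` instead):

```python
import time

class StateStore:
    """In-memory stand-in for Redis with per-key TTL expiry."""

    def __init__(self):
        self._data: dict = {}

    def set(self, key, value, ttl_s: float = 3600):
        self._data[key] = (value, time.monotonic() + ttl_s)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._data[key]  # lazy expiry on read
            return None
        return value

store = StateStore()
store.set("session:42:search", {"results": ["..."]}, ttl_s=3600)
print(store.get("session:42:search"))  # {'results': ['...']}
```

Namespacing keys by session id (as in `session:42:search`) is what lets the TTL sweep clean up a whole session's state at once.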
Loop Guard / Cycle Detector
Tracks the graph of agent calls in the current session. Detects when agent A → agent B → agent A cycles form and aborts with an error. Enforces a max-depth limit (default: 5 hops) and max-turns per session (default: 20).
Alternatives: LangGraph checkpointing, Temporal workflow limits
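The max-depth and max-turns limits can be sketched as a small guard object checked before every dispatch — a simplified stand-in for LangGraph's recursion limit or Temporal's workflow limits:

```python
class LoopGuard:
    """Aborts a session that exceeds its depth or turn limits."""

    def __init__(self, max_depth: int = 5, max_turns: int = 20):
        self.max_depth = max_depth
        self.max_turns = max_turns
        self.turns = 0

    def check(self, depth: int) -> None:
        """Call once before every agent dispatch; raises on limit breach."""
        self.turns += 1
        if depth > self.max_depth:
            raise RuntimeError(f"max depth {self.max_depth} exceeded")
        if self.turns > self.max_turns:
            raise RuntimeError(f"max turns {self.max_turns} exceeded")

guard = LoopGuard()
guard.check(depth=1)  # ok; raises once depth exceeds 5 or 20 turns elapse
```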
Result Aggregator
Collects completed subtask results, passes them back to the supervisor for synthesis, and formats the final response for the user.
Alternatives: Direct supervisor synthesis, Template-based merge
The stack
Supervisor model
Claude Sonnet 4 has 200K context, strong instruction-following for structured JSON dispatch, and costs ~$3/$15 per M tokens — 5x cheaper than Opus while matching it on orchestration tasks in benchmarks.
Alternatives: GPT-4o, Gemini 2.5 Pro, Claude Opus 4 (for complex reasoning)
Worker models
Haiku 4 at $0.80/$4 per M tokens handles 70% of worker tasks at roughly 3x lower cost. Reserve Sonnet for workers that need multi-step reasoning or code generation.
Alternatives: GPT-4o-mini ($0.15/$0.60 per M), Gemini 2.0 Flash
Orchestration framework
LangGraph provides built-in cycle detection, checkpointing, and branching that prevent infinite loops. Custom state machines offer lower latency (no framework overhead) but require 2-3x more code.
Alternatives: CrewAI, AutoGen, Temporal, Inngest
Task queue
BullMQ handles retries, concurrency limits, and job TTL out of the box. Upstash Redis is serverless at $0.20/100K commands — a 10K tasks/mo workload costs under $5/mo on the queue layer alone.
Alternatives: AWS SQS + Lambda, Inngest, Celery + RabbitMQ
State store
Redis JSON enables sub-millisecond reads and writes for inter-agent state. Average task state is 2-10KB, so at 10K tasks/mo total data is trivial. Postgres works but adds 5-20ms per query vs <1ms for Redis.
Alternatives: Supabase Postgres, DynamoDB, Firestore
Code sandbox
E2B spins up a sandboxed Python/JS environment in ~300ms and charges $0.000014/compute-second, so a typical 5s code execution task costs $0.00007. Modal is cheaper at scale but has a steeper setup curve.
Alternatives: Modal, Daytona, AWS Lambda with layers
Observability
Multi-agent systems fail in non-obvious ways — you need per-agent traces with parent-child span relationships. LangSmith integrates natively with LangGraph; Langfuse is self-hostable and GDPR-friendly.
Alternatives: Helicone, OpenTelemetry + Jaeger, Braintrust
Cost at each scale
Prototype: 500 tasks/mo at $45/mo
Growth: 10,000 tasks/mo at $680/mo
Scale: 200,000 tasks/mo at $9,800/mo
Latency budget
Tradeoffs
Failure modes & guardrails
Infinite agent loops. Mitigation: implement a cycle detector that hashes (agent-id, task-signature) tuples. If the same (agent, task) pair appears twice in a session, abort and return partial results. Set a hard limit: max 20 LLM calls per user request.
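A sketch of the hash-based detector, using an in-process set; in production the set would live in the session's shared state:

```python
import hashlib

def task_signature(agent_id: str, instruction: str) -> str:
    """Hash an (agent-id, task) pair so repeats are cheap to detect."""
    return hashlib.sha256(f"{agent_id}:{instruction}".encode()).hexdigest()

class CycleDetector:
    def __init__(self, max_llm_calls: int = 20):
        self.seen: set[str] = set()
        self.calls = 0
        self.max_llm_calls = max_llm_calls

    def allow(self, agent_id: str, instruction: str) -> bool:
        """Return False when this dispatch should be aborted."""
        self.calls += 1
        if self.calls > self.max_llm_calls:
            return False
        sig = task_signature(agent_id, instruction)
        if sig in self.seen:
            return False  # same (agent, task) seen before: likely a cycle
        self.seen.add(sig)
        return True

detector = CycleDetector()
print(detector.allow("search", "find GPU prices"))  # True
print(detector.allow("search", "find GPU prices"))  # False (repeat)
```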
Supervisor context bloat. Mitigation: worker results can be verbose, so compress them before they return to the supervisor: summarize any result over 2K tokens with Haiku and pass back only the structured JSON summary. Cap supervisor context at 50K tokens.
Silent worker failures. Mitigation: all workers must return a typed result (success | error) — never swallow exceptions. The task queue should retry failed workers up to 3 times with exponential backoff before marking the subtask as failed and returning a partial result to the supervisor.
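The retry-with-backoff wrapper might look like this sketch, where `worker` is any callable and the typed result is a plain dict (a stand-in for whatever result schema you standardize on):

```python
import time

def run_with_retries(worker, max_retries: int = 3, base_delay_s: float = 0.01) -> dict:
    """Run a worker callable; always return a typed result, never raise."""
    for attempt in range(max_retries + 1):
        try:
            return {"status": "success", "value": worker()}
        except Exception as exc:
            if attempt == max_retries:
                return {"status": "error", "detail": str(exc)}
            time.sleep(base_delay_s * 2 ** attempt)  # exponential backoff

calls = {"n": 0}
def flaky():
    """Fails twice, then succeeds -- simulates a transient worker error."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky))  # {'status': 'success', 'value': 'ok'}
```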
Unbounded agent spawning. Mitigation: constrain the supervisor to a fixed set of worker types (search, code, doc, etc.) via structured output schemas. Prevent open-ended 'create a new agent' instructions by validating the dispatch JSON against a Zod/Pydantic schema before execution.
Runaway cost. Mitigation: implement per-session token budgets tracked in the shared state store. Before each LLM call, check whether the session has exceeded its budget (default: $2/session); abort with a partial result if it has.
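A budget-tracker sketch, using the Sonnet prices quoted earlier ($3/$15 per M tokens) as illustrative defaults; a real implementation would persist `spent_usd` in the shared state store rather than in memory:

```python
class SessionBudget:
    """Tracks estimated spend per session against a hard cap."""

    def __init__(self, cap_usd: float = 2.00):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               in_price_per_m: float = 3.0, out_price_per_m: float = 15.0) -> bool:
        """Record a call's cost; return False once the session is over budget."""
        self.spent_usd += input_tokens / 1e6 * in_price_per_m
        self.spent_usd += output_tokens / 1e6 * out_price_per_m
        return self.spent_usd <= self.cap_usd

budget = SessionBudget(cap_usd=2.00)
print(budget.charge(100_000, 10_000))  # True: ~$0.45 spent so far
```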
Frequently asked questions
How do I prevent agents from calling each other in infinite loops?
Track every (caller-agent-id, task-hash) pair in a session-scoped set stored in Redis. Before any agent dispatches a subtask, compute the hash of the task instruction and check if this (caller, task-hash) pair already exists. If yes, return an error instead of dispatching. Additionally, enforce a global max-turns limit (20 is a safe default) and a max-depth limit (5 hops). LangGraph has built-in recursion limits via `recursion_limit` config.
Should I use one big LLM or many small specialized agents?
Benchmark first. For tasks under 50K tokens with no tool use, a single large model (Claude Sonnet 4 with 200K context) often outperforms multi-agent architectures because it avoids coordination overhead. Switch to multi-agent when: (1) tasks regularly exceed context limits, (2) different subtasks need fundamentally different capabilities (code vs. search vs. image analysis), or (3) parallelism is required for latency.
How do I pass context between agents without hitting token limits?
Use a structured intermediate representation (IR) — a JSON schema that captures only the information the next agent needs, not the full conversation history. For example, the search worker returns {query, results: [{title, snippet, url}], confidence}, not its full chain-of-thought. Store IRs in Redis with a 1-hour TTL. Never pass full conversation history between agents — it grows quadratically with each hop.
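The search worker's IR from the example above can be sketched with dataclasses (field names follow the example; `SearchHit` is an illustrative helper):

```python
from dataclasses import dataclass, asdict

@dataclass
class SearchHit:
    title: str
    snippet: str
    url: str

@dataclass
class SearchIR:
    """Only what the next agent needs -- never the worker's full transcript."""
    query: str
    results: list[SearchHit]
    confidence: float

ir = SearchIR(
    query="gpu prices 2025",
    results=[SearchHit("Example page", "short snippet", "https://example.com")],
    confidence=0.8,
)
payload = asdict(ir)  # JSON-safe dict, ready to store in Redis under a TTL key
print(sorted(payload))  # ['confidence', 'query', 'results']
```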
What's the right max concurrency for parallel workers?
Start with 3-5 concurrent workers per user request. More parallelism reduces latency but increases API rate-limit pressure — most LLM providers allow 500-3000 RPM per key. At scale, implement a token-bucket rate limiter in the task queue. Anthropic's Batch API can handle up to 100K requests at 50% cost discount if real-time latency is not required.
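A minimal token-bucket sketch for the queue layer; `rate` is requests refilled per second and `capacity` is the burst allowance:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilling `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A ~500 RPM provider limit works out to a bit over 8 tokens/second.
limiter = TokenBucket(rate=8.0, capacity=16)
print(limiter.try_acquire())  # True (bucket starts full)
```

Dispatches that fail `try_acquire` should be requeued with a short delay rather than dropped.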
How do I debug a multi-agent system when something goes wrong?
Use a tracing tool like LangSmith or Langfuse that captures parent-child span relationships. Every agent call should be tagged with (session-id, agent-type, task-hash, parent-agent-id). With this, you can reconstruct the full execution tree for any session. Without traces, debugging multi-agent failures is essentially impossible.
Related
Architectures
Research Agent
Data Analyst Agent
QA Testing Agent
Code Review Agent
LLM Function Calling & Tool Use: Production Architecture