Multi-Agent Orchestration: Supervisor-Worker Architecture
Last updated: April 16, 2026
Quick answer
Use a supervisor LLM (Claude Sonnet 4 or GPT-4o) that decomposes tasks and dispatches them to specialized sub-agents via a task queue. Each sub-agent runs in isolation with its own context. Total latency: 5-30s with parallel workers vs 60-120s sequential. Cost at 10K tasks/mo: roughly $680 all-in, varying with task complexity and model tier.
The problem
Single-agent LLM systems hit context window limits (~200K tokens) and quality ceilings when tasks require diverse expertise — e.g., a research task that needs web search, code execution, and document synthesis simultaneously. Naive chaining creates brittle sequential pipelines that fail when any step errors. Teams report 40-60% of complex agentic tasks fail silently in single-agent setups due to missed subtasks or hallucinated tool calls.
Architecture
User / Client Request
The initial task or query that kicks off the orchestration. Can be a natural-language instruction, structured JSON task, or API call.
Alternatives: Slack message, API webhook, Scheduled cron job
Supervisor LLM
The coordinator model that decomposes the user request into subtasks, assigns them to specialized workers, monitors completion, and synthesizes final output. Must have tool-use capability to call the task dispatcher.
Alternatives: gpt-4o, gemini-2.0-flash, claude-opus-4
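A minimal sketch of the dispatch contract, assuming the supervisor is prompted to emit a JSON array of {worker, instruction} objects (the field names and `parse_dispatch` helper here are illustrative, not a fixed API):

```python
import json
from dataclasses import dataclass

# Fixed set of worker types the supervisor may dispatch to.
WORKER_TYPES = {"search", "code", "doc"}

@dataclass
class Subtask:
    worker: str
    instruction: str

def parse_dispatch(raw: str) -> list[Subtask]:
    """Parse and validate the supervisor's JSON dispatch output."""
    subtasks = []
    for item in json.loads(raw):
        if item["worker"] not in WORKER_TYPES:
            raise ValueError(f"unknown worker type: {item['worker']!r}")
        subtasks.append(Subtask(item["worker"], item["instruction"]))
    return subtasks

dispatch = '[{"worker": "search", "instruction": "find recent GPU prices"}]'
print(parse_dispatch(dispatch)[0].worker)  # search
```

Rejecting unknown worker types at parse time is what keeps the supervisor from inventing agents that do not exist.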
Task Queue / Dispatcher
Receives subtask assignments from the supervisor and routes them to the appropriate worker agent. Tracks task status, handles retries on failure, and enforces max concurrency limits.
Alternatives: Celery + RabbitMQ, AWS SQS + Lambda, Inngest, Temporal
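One way to enforce a max-concurrency limit is a semaphore around worker calls. This sketch uses `asyncio`, with a stand-in `run_worker` coroutine in place of a real agent call:

```python
import asyncio

async def run_worker(task_id: int) -> str:
    """Stand-in for a real sub-agent call."""
    await asyncio.sleep(0.01)
    return f"result-{task_id}"

async def dispatch_all(task_ids, max_concurrency: int = 3) -> list[str]:
    """Fan out subtasks while never running more than max_concurrency at once."""
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(tid: int) -> str:
        async with sem:
            return await run_worker(tid)

    return await asyncio.gather(*(guarded(t) for t in task_ids))

results = asyncio.run(dispatch_all(range(5)))
print(results)  # ['result-0', 'result-1', 'result-2', 'result-3', 'result-4']
```

A production queue (BullMQ, SQS, Temporal) adds persistence and retries on top of the same fan-out pattern.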
Search Worker Agent
Specialized sub-agent for web search and information retrieval. Given a specific query, it uses Tavily or Brave Search and returns structured results to the supervisor.
Alternatives: gpt-4o-mini, Brave Search, Exa AI
Code Execution Worker
Runs Python/JS code in a sandboxed environment (E2B or Pyodide). Returns stdout, stderr, and file artifacts back to the supervisor.
Alternatives: Modal, Daytona, Replit Agent
Document Analysis Worker
Processes PDFs, spreadsheets, or structured data. Extracts key information and returns structured JSON summaries to the supervisor.
Alternatives: gpt-4o, Gemini 2.5 Pro
Shared Memory / State Store
Centralized store where all agents read and write intermediate results. Prevents redundant work and enables the supervisor to track global state. TTL-based expiry cleans up completed task state.
Alternatives: Supabase (Postgres), Upstash Redis, DynamoDB
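The store's read/write-with-TTL behavior can be sketched with an in-memory dict standing in for Redis (a real deployment would use `SET key value EX ttl` instead):

```python
import time

class StateStore:
    """In-memory stand-in for Redis with per-key TTL expiry."""

    def __init__(self):
        self._data: dict = {}

    def set(self, key, value, ttl_s: float = 3600):
        self._data[key] = (value, time.monotonic() + ttl_s)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            del self._data[key]  # lazy expiry on read
            return None
        return value

store = StateStore()
store.set("session:42:search", {"results": ["..."]}, ttl_s=3600)
print(store.get("session:42:search"))  # {'results': ['...']}
```

Namespacing keys by session id (as in `session:42:search`) is what lets the TTL sweep clean up a whole session's state at once.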
Loop Guard / Cycle Detector
Tracks the graph of agent calls in the current session. Detects when agent A → agent B → agent A cycles form and aborts with an error. Enforces a max-depth limit (default: 5 hops) and max-turns per session (default: 20).
Alternatives: LangGraph checkpointing, Temporal workflow limits
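The max-depth and max-turns limits can be sketched as a small guard object checked before every dispatch — a simplified stand-in for LangGraph's recursion limit or Temporal's workflow limits:

```python
class LoopGuard:
    """Aborts a session that exceeds its depth or turn limits."""

    def __init__(self, max_depth: int = 5, max_turns: int = 20):
        self.max_depth = max_depth
        self.max_turns = max_turns
        self.turns = 0

    def check(self, depth: int) -> None:
        """Call once before every agent dispatch; raises on limit breach."""
        self.turns += 1
        if depth > self.max_depth:
            raise RuntimeError(f"max depth {self.max_depth} exceeded")
        if self.turns > self.max_turns:
            raise RuntimeError(f"max turns {self.max_turns} exceeded")

guard = LoopGuard()
guard.check(depth=1)  # ok; raises once depth exceeds 5 or 20 turns elapse
```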
Result Aggregator
Collects completed subtask results, passes them back to the supervisor for synthesis, and formats the final response for the user.
Alternatives: Direct supervisor synthesis, Template-based merge
The stack
Supervisor model
Claude Sonnet 4 has 200K context, strong instruction-following for structured JSON dispatch, and costs ~$3/$15 per M tokens — 5x cheaper than Opus while matching it on orchestration tasks in benchmarks.
Alternatives: GPT-4o, Gemini 2.5 Pro, Claude Opus 4 (for complex reasoning)
Worker models
Haiku 4 at $0.80/$4 per M tokens handles 70% of worker tasks at roughly 3x lower cost. Reserve Sonnet for workers that need multi-step reasoning or code generation.
Alternatives: GPT-4o-mini ($0.15/$0.60 per M), Gemini 2.0 Flash
Orchestration framework
LangGraph provides built-in cycle detection, checkpointing, and branching that prevent infinite loops. Custom state machines offer lower latency (no framework overhead) but require 2-3x more code.
Alternatives: CrewAI, AutoGen, Temporal, Inngest
Task queue
BullMQ handles retries, concurrency limits, and job TTL out of the box. Upstash Redis is serverless at $0.20/100K commands — a 10K tasks/mo workload costs under $5/mo on the queue layer alone.
Alternatives: AWS SQS + Lambda, Inngest, Celery + RabbitMQ
State store
Redis JSON enables sub-millisecond reads and writes for inter-agent state. Average task state is 2-10KB, so at 10K tasks/mo total data is trivial. Postgres works but adds 5-20ms per query vs <1ms for Redis.
Alternatives: Supabase Postgres, DynamoDB, Firestore
Code sandbox
E2B spins up a sandboxed Python/JS environment in ~300ms and charges $0.000014/compute-second, so a typical 5s code execution task costs $0.00007. Modal is cheaper at scale but has a steeper setup curve.
Alternatives: Modal, Daytona, AWS Lambda with layers
Observability
Multi-agent systems fail in non-obvious ways — you need per-agent traces with parent-child span relationships. LangSmith integrates natively with LangGraph; Langfuse is self-hostable and GDPR-friendly.
Alternatives: Helicone, OpenTelemetry + Jaeger, Braintrust
Cost at each scale
Prototype: 500 tasks/mo at $45/mo
Growth: 10,000 tasks/mo at $680/mo
Scale: 200,000 tasks/mo at $9,800/mo
Latency budget
Tradeoffs
Failure modes & guardrails
Infinite agent loops. Mitigation: implement a cycle detector that hashes (agent-id, task-signature) tuples. If the same (agent, task) pair appears twice in a session, abort and return partial results. Set a hard limit: max 20 LLM calls per user request.
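A sketch of the hash-based detector, using an in-process set; in production the set would live in the session's shared state:

```python
import hashlib

def task_signature(agent_id: str, instruction: str) -> str:
    """Hash an (agent-id, task) pair so repeats are cheap to detect."""
    return hashlib.sha256(f"{agent_id}:{instruction}".encode()).hexdigest()

class CycleDetector:
    def __init__(self, max_llm_calls: int = 20):
        self.seen: set[str] = set()
        self.calls = 0
        self.max_llm_calls = max_llm_calls

    def allow(self, agent_id: str, instruction: str) -> bool:
        """Return False when this dispatch should be aborted."""
        self.calls += 1
        if self.calls > self.max_llm_calls:
            return False
        sig = task_signature(agent_id, instruction)
        if sig in self.seen:
            return False  # same (agent, task) seen before: likely a cycle
        self.seen.add(sig)
        return True

detector = CycleDetector()
print(detector.allow("search", "find GPU prices"))  # True
print(detector.allow("search", "find GPU prices"))  # False (repeat)
```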
Supervisor context bloat. Mitigation: worker results can be verbose, so compress them before they return to the supervisor: summarize any result over 2K tokens with Haiku and pass back only the structured JSON summary. Cap supervisor context at 50K tokens.
Silent worker failures. Mitigation: all workers must return a typed result (success | error) — never swallow exceptions. The task queue should retry failed workers up to 3 times with exponential backoff before marking the subtask as failed and returning a partial result to the supervisor.
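The retry-with-backoff wrapper might look like this sketch, where `worker` is any callable and the typed result is a plain dict (a stand-in for whatever result schema you standardize on):

```python
import time

def run_with_retries(worker, max_retries: int = 3, base_delay_s: float = 0.01) -> dict:
    """Run a worker callable; always return a typed result, never raise."""
    for attempt in range(max_retries + 1):
        try:
            return {"status": "success", "value": worker()}
        except Exception as exc:
            if attempt == max_retries:
                return {"status": "error", "detail": str(exc)}
            time.sleep(base_delay_s * 2 ** attempt)  # exponential backoff

calls = {"n": 0}
def flaky():
    """Fails twice, then succeeds -- simulates a transient worker error."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "ok"

print(run_with_retries(flaky))  # {'status': 'success', 'value': 'ok'}
```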
Unbounded agent spawning. Mitigation: constrain the supervisor to a fixed set of worker types (search, code, doc, etc.) via structured output schemas. Prevent open-ended 'create a new agent' instructions by validating the dispatch JSON against a Zod/Pydantic schema before execution.
Runaway cost. Mitigation: implement per-session token budgets tracked in the shared state store. Before each LLM call, check whether the session has exceeded its budget (default: $2/session); abort with a partial result if it has.
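A budget-tracker sketch, using the Sonnet prices quoted earlier ($3/$15 per M tokens) as illustrative defaults; a real implementation would persist `spent_usd` in the shared state store rather than in memory:

```python
class SessionBudget:
    """Tracks estimated spend per session against a hard cap."""

    def __init__(self, cap_usd: float = 2.00):
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, input_tokens: int, output_tokens: int,
               in_price_per_m: float = 3.0, out_price_per_m: float = 15.0) -> bool:
        """Record a call's cost; return False once the session is over budget."""
        self.spent_usd += input_tokens / 1e6 * in_price_per_m
        self.spent_usd += output_tokens / 1e6 * out_price_per_m
        return self.spent_usd <= self.cap_usd

budget = SessionBudget(cap_usd=2.00)
print(budget.charge(100_000, 10_000))  # True: ~$0.45 spent so far
```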
Frequently asked questions
How do I prevent agents from calling each other in infinite loops?
Track every (caller-agent-id, task-hash) pair in a session-scoped set stored in Redis. Before any agent dispatches a subtask, compute the hash of the task instruction and check if this (caller, task-hash) pair already exists. If yes, return an error instead of dispatching. Additionally, enforce a global max-turns limit (20 is a safe default) and a max-depth limit (5 hops). LangGraph has built-in recursion limits via `recursion_limit` config.
Should I use one big LLM or many small specialized agents?
Benchmark first. For tasks under 50K tokens with no tool use, a single large model (Claude Sonnet 4 with 200K context) often outperforms multi-agent architectures because it avoids coordination overhead. Switch to multi-agent when: (1) tasks regularly exceed context limits, (2) different subtasks need fundamentally different capabilities (code vs. search vs. image analysis), or (3) parallelism is required for latency.
How do I pass context between agents without hitting token limits?
Use a structured intermediate representation (IR) — a JSON schema that captures only the information the next agent needs, not the full conversation history. For example, the search worker returns {query, results: [{title, snippet, url}], confidence}, not its full chain-of-thought. Store IRs in Redis with a 1-hour TTL. Never pass full conversation history between agents — it grows quadratically with each hop.
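The search worker's IR from the example above can be sketched with dataclasses (field names follow the example; `SearchHit` is an illustrative helper):

```python
from dataclasses import dataclass, asdict

@dataclass
class SearchHit:
    title: str
    snippet: str
    url: str

@dataclass
class SearchIR:
    """Only what the next agent needs -- never the worker's full transcript."""
    query: str
    results: list[SearchHit]
    confidence: float

ir = SearchIR(
    query="gpu prices 2025",
    results=[SearchHit("Example page", "short snippet", "https://example.com")],
    confidence=0.8,
)
payload = asdict(ir)  # JSON-safe dict, ready to store in Redis under a TTL key
print(sorted(payload))  # ['confidence', 'query', 'results']
```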
What's the right max concurrency for parallel workers?
Start with 3-5 concurrent workers per user request. More parallelism reduces latency but increases API rate-limit pressure — most LLM providers allow 500-3000 RPM per key. At scale, implement a token-bucket rate limiter in the task queue. Anthropic's Batch API can handle up to 100K requests at 50% cost discount if real-time latency is not required.
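A minimal token-bucket sketch for the queue layer; `rate` is requests refilled per second and `capacity` is the burst allowance:

```python
import time

class TokenBucket:
    """Allows bursts up to `capacity`, refilling `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# A ~500 RPM provider limit works out to a bit over 8 tokens/second.
limiter = TokenBucket(rate=8.0, capacity=16)
print(limiter.try_acquire())  # True (bucket starts full)
```

Dispatches that fail `try_acquire` should be requeued with a short delay rather than dropped.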
How do I debug a multi-agent system when something goes wrong?
Use a tracing tool like LangSmith or Langfuse that captures parent-child span relationships. Every agent call should be tagged with (session-id, agent-type, task-hash, parent-agent-id). With this, you can reconstruct the full execution tree for any session. Without traces, debugging multi-agent failures is essentially impossible.
Related
Architectures
Research Agent
Data Analyst Agent
QA Testing Agent
Code Review Agent
LLM Function Calling & Tool Use: Production Architecture