Sales Outreach Agent
Last updated: April 16, 2026
Quick answer
The production stack uses Claude Sonnet 4 to compose messages grounded in retrieved lead research, Haiku 4 for signal classification, and inbox warmup plus send-time rotation via a dedicated ESP. Expect $0.08–$0.30 per lead for research and message generation, and a 15–30% reply rate versus 1–3% for generic AI outbound. Critical: every claim in the message must cite a source pulled within the last 48h.
The problem
Outbound SDR teams need personalized messages at scale but generic AI-written emails tank reply rates and burn sender reputation. You need an agent that pulls real signal (recent job change, company news, hiring) and weaves it into a credible first line, then handles the first 2–3 rounds of replies before escalating to a human. The hard parts are deliverability (don't get flagged as spam), accuracy (don't reference a CEO who left), and handoff (hand to human before the agent over-commits).
Architecture
Lead CRM + ICP Filter
Pulls leads from HubSpot/Salesforce that match ICP (role, company size, tech stack). Dedupes against past outreach.
Alternatives: Apollo.io, Clay, Custom Postgres
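A minimal sketch of the ICP filter and dedupe step described above, in Python. The `Lead` fields, the ICP constants, and the function names are illustrative assumptions, not HubSpot or Salesforce APIs; in practice the leads would come from your CRM export.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lead:
    email: str
    role: str
    company_size: int
    tech_stack: frozenset[str]


# Hypothetical ICP definition; real criteria come from your own CRM fields.
ICP_ROLES = {"cto", "vp engineering"}
ICP_SIZE_RANGE = (50, 1000)
ICP_REQUIRED_TECH = {"kubernetes"}


def matches_icp(lead: Lead) -> bool:
    """True when role, company size, and tech stack all match the ICP."""
    low, high = ICP_SIZE_RANGE
    return (
        lead.role.lower() in ICP_ROLES
        and low <= lead.company_size <= high
        and ICP_REQUIRED_TECH <= lead.tech_stack  # subset check
    )


def select_leads(leads: list[Lead], already_contacted: set[str]) -> list[Lead]:
    """Keep ICP matches, dropping past outreach and in-batch duplicates."""
    seen = {email.lower() for email in already_contacted}
    selected = []
    for lead in leads:
        key = lead.email.lower()
        if key in seen or not matches_icp(lead):
            continue
        seen.add(key)
        selected.append(lead)
    return selected
```

Deduping on a lowercased email is the simplest key; a real pipeline might also dedupe on CRM contact ID or company domain.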
Lead Research Worker
Parallel agent that scrapes LinkedIn, company site, recent news, GitHub, and funding announcements within the last 60 days.
Alternatives: Exa, Serper API, Apify scrapers
Signal Classifier
Extracts and ranks actionable signals (new job, funding round, product launch, pain post on X). Scores relevance to your offer.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
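One way to rank signals is to combine the model's relevance score with freshness. The sketch below assumes the classifier returns a `relevance` value between 0 and 1 and applies a linear freshness decay over the 60-day research window; the decay shape and all names here are assumptions for illustration, not part of any vendor API.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Signal:
    kind: str          # e.g. "new_job", "funding", "product_launch"
    relevance: float   # 0..1, assumed to come from the classifier model
    observed_on: date
    source_url: str


MAX_AGE_DAYS = 60  # matches the research window used by the research worker


def rank_signals(signals: list[Signal], today: date) -> list[Signal]:
    """Drop signals outside the window, then sort by relevance * freshness."""

    def age_days(signal: Signal) -> int:
        return (today - signal.observed_on).days

    def score(signal: Signal) -> float:
        # Linear decay from 1.0 (today) to 0.0 (60 days old): an assumption.
        freshness = 1 - age_days(signal) / MAX_AGE_DAYS
        return signal.relevance * freshness

    in_window = [s for s in signals if 0 <= age_days(s) <= MAX_AGE_DAYS]
    return sorted(in_window, key=score, reverse=True)
```

The message writer would then ground the opener in the first element of the returned list, if any.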
Message Writer
Composes first-touch email grounded in the top signal, matching your brand voice from approved past messages.
Alternatives: GPT-4o, Gemini 2.5 Pro
Deliverability Gate
Checks spam score (SpamAssassin), link count, sender warmup status, time-of-day for recipient timezone.
Alternatives: Instantly, Smartlead, Lemwarm
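The gate can be expressed as a pure function that returns the reasons a message is blocked. This is a sketch under stated assumptions: the spam score is computed elsewhere (for example by SpamAssassin, whose default `required_score` is 5.0), and the link limit and 08:00–17:00 local send window are illustrative thresholds, not recommendations from any tool.

```python
import re
from datetime import datetime, timedelta, timezone, tzinfo

MAX_SPAM_SCORE = 5.0   # SpamAssassin's default threshold
MAX_LINKS = 2          # assumed limit for a first-touch email
SEND_WINDOW = (8, 17)  # assumed local business hours, end exclusive


def gate(
    body: str,
    spam_score: float,
    inbox_warmed: bool,
    recipient_tz: tzinfo,
    now_utc: datetime,
) -> list[str]:
    """Return the list of failed checks; an empty list means OK to send."""
    reasons = []
    if spam_score >= MAX_SPAM_SCORE:
        reasons.append("spam_score")
    if len(re.findall(r"https?://", body)) > MAX_LINKS:
        reasons.append("too_many_links")
    if not inbox_warmed:
        reasons.append("inbox_not_warmed")
    local_hour = now_utc.astimezone(recipient_tz).hour
    if not (SEND_WINDOW[0] <= local_hour < SEND_WINDOW[1]):
        reasons.append("outside_send_window")
    return reasons


# Usage: 15:00 UTC is 11:00 for a recipient at UTC-4, so this passes.
now = datetime(2026, 4, 16, 15, 0, tzinfo=timezone.utc)
print(gate("Hi, see https://example.com", 1.2, True,
           timezone(timedelta(hours=-4)), now))
```

Any `tzinfo` works for `recipient_tz`, including `zoneinfo.ZoneInfo("America/New_York")` where the system has time zone data installed.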
Email Service Provider
Sends via warmed-up inboxes rotated across domains. Rate-limited per domain.
Alternatives: SendGrid, Postmark, Google Workspace + relay
Reply Handler
Classifies reply (interested/objection/unsubscribe/out-of-office), drafts response for the next 1–2 turns, hands off to AE on meeting intent.
Alternatives: GPT-4o, Gemini 2.5 Pro
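The routing logic after classification is small enough to keep as plain code rather than a prompt. The sketch below assumes the classifier returns one of the four labels above plus a separate `meeting_intent` flag; the action names and the two-turn cap are illustrative assumptions.

```python
from enum import Enum


class ReplyClass(Enum):
    INTERESTED = "interested"
    OBJECTION = "objection"
    UNSUBSCRIBE = "unsubscribe"
    OUT_OF_OFFICE = "out_of_office"


MAX_AGENT_TURNS = 2  # the agent drafts at most the next 1-2 turns


def next_action(
    reply_class: ReplyClass, agent_turns_so_far: int, meeting_intent: bool
) -> str:
    """Decide what happens to a classified reply."""
    if reply_class is ReplyClass.UNSUBSCRIBE:
        return "suppress"             # always wins, regardless of other flags
    if meeting_intent:
        return "handoff_to_ae"        # meeting intent goes straight to a human
    if reply_class is ReplyClass.OUT_OF_OFFICE:
        return "reschedule_followup"
    if agent_turns_so_far >= MAX_AGENT_TURNS:
        return "handoff_to_ae"        # turn budget exhausted
    return "draft_reply"
```

Checking unsubscribe first means an opt-out is never overridden by a misread meeting intent in the same message.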
Human AE / Calendar
Receives handoff on qualified reply with full context. Meetings are booked via Cal.com or Chili Piper.
Alternatives: Calendly
The stack
Sonnet 4's tone control is measurably better than GPT-4o on short-form outbound — less AI-voice, better signal integration. Matters because reply rates correlate directly with perceived authenticity.
Alternatives: GPT-4o, Gemini 2.5 Pro
Haiku 4 ranks signals per lead in <600ms for a fraction of a cent. Running Sonnet here is wasteful.
Alternatives: GPT-4o-mini, DeepSeek R1
Firecrawl handles JS-rendered sites and clean markdown extraction. Exa gives semantic search for 'companies that just did X' queries. Combine them.
Alternatives: Serper, Tavily, Custom Playwright
Sender warmup and rotation are not a side project. Dedicated ESPs for outbound cost $100–500/mo and save you from domain blacklists.
Alternatives: Lemlist, Apollo, Custom + SendGrid
Attribution, deal tracking, and AE handoff live in CRM. The agent writes back activities + reply classification as custom properties.
Alternatives: Attio, Folk, Airtable
Per-lead pipeline with retries, rate limits, and per-domain send caps fits event-driven orchestration naturally. LangGraph overcomplicates a fundamentally linear flow.
Alternatives: Temporal, Plain cron + queue
You must A/B test subject lines and openers continuously. Reply rate is the only metric that matters and it varies by ICP.
Alternatives: Humanloop, Langfuse
Cost at each scale
Prototype
500 leads/mo, 1 user
$180/mo
Startup
10,000 leads/mo, 5 SDRs
$2,800/mo
Scale
200,000 leads/mo, 30 SDRs
$32,500/mo
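Dividing each tier's monthly total by its lead volume gives the all-in cost per lead. The short Python calculation below uses only the figures listed above; the variable names are illustrative.

```python
# (monthly cost in USD, leads per month) for each tier listed above
tiers = {
    "Prototype": (180, 500),
    "Startup": (2_800, 10_000),
    "Scale": (32_500, 200_000),
}

cost_per_lead = {
    name: monthly / leads for name, (monthly, leads) in tiers.items()
}

for name, cost in cost_per_lead.items():
    print(f"{name}: ${cost:.4f} per lead")
```

This works out to $0.36, $0.28, and about $0.16 per lead respectively, so the all-in unit cost falls as volume grows.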
Tradeoffs
Volume vs reply rate
Sending 10k messages/day with shallow research drops reply rates below 3% and burns domains. Sending 500/day with a real 60 seconds of research per lead gets 15–30% reply rates and preserves sender reputation. Modern outbound is depth over breadth: the agent exists to enable depth at scale, not to spray further.
Grounded vs creative writing
A strict 'cite a source' rule makes messages credible but can make them boilerplate if the signal is weak. Better: allow creative openers only when signal confidence > 0.8, otherwise fall back to a clean, short problem-statement template. Worst outcome is fabricated specifics.
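The fallback rule above reduces to a single guard in front of the message writer. A minimal sketch, assuming the signal carries a `signal_confidence` score and an optional `source_url` (both hypothetical names):

```python
from typing import Optional

CREATIVE_THRESHOLD = 0.8  # from the rule above


def opener_mode(signal_confidence: float, source_url: Optional[str]) -> str:
    """Pick 'creative' only for a sourced, high-confidence signal."""
    if source_url and signal_confidence > CREATIVE_THRESHOLD:
        return "creative"
    # Weak or unsourced signal: use the short problem-statement template.
    return "template"
```

Requiring a source URL in the same check keeps the writer from getting creative with a confident but unsourced signal.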
Auto-reply vs human handoff
Letting the agent handle 2–3 reply turns beats immediate human handoff for nurture replies. For any reply mentioning pricing, meetings, or signing, hand off immediately — over-committing on price or timeline is the #1 way an agent loses a deal.
Failure modes & guardrails
Message references a person who left the company
Mitigation: Require every named-person reference to include a source URL and a 'last_seen' date within 30 days. If LinkedIn data is older than 30 days, refetch before composing. Add a final guard: cross-check named title against current company page.
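The freshness rule can be enforced before composition with a small check. The sketch below assumes each named-person reference is stored with `source_url` and `last_seen` fields; the `PersonRef` type and the status strings are illustrative.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

MAX_AGE = timedelta(days=30)


@dataclass
class PersonRef:
    name: str
    title: str
    source_url: Optional[str]
    last_seen: Optional[date]


def person_ref_status(ref: PersonRef, today: date) -> str:
    """Return 'ok', 'refetch' (stale data), or 'reject' (no provenance)."""
    if not ref.source_url or ref.last_seen is None:
        return "reject"
    if today - ref.last_seen > MAX_AGE:
        return "refetch"
    return "ok"
```

The final cross-check of the title against the current company page is a separate fetch and is not shown here.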
Deliverability collapse from over-sending
Mitigation: Hard-cap sends at 50/day per inbox and 200/day per domain, and pause any inbox with a bounce rate above 3% or a spam-complaint rate above 0.1%. Run continuous warmup on every inbox. Rotate across 5–10 domains for volume.
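These caps are easy to enforce in code before each send. A sketch using the thresholds above; the `InboxStats` fields and the choice of a rolling window for bounce and complaint rates are assumptions.

```python
from dataclasses import dataclass

INBOX_DAILY_CAP = 50
DOMAIN_DAILY_CAP = 200
MAX_BOUNCE_RATE = 0.03      # 3%
MAX_COMPLAINT_RATE = 0.001  # 0.1%


@dataclass
class InboxStats:
    sent_today: int
    sent_in_window: int        # sends in the rolling window (assumed, e.g. 7 days)
    bounces_in_window: int
    complaints_in_window: int


def inbox_paused(stats: InboxStats) -> bool:
    """Pause when bounce or complaint rate exceeds its threshold."""
    if stats.sent_in_window == 0:
        return False
    bounce_rate = stats.bounces_in_window / stats.sent_in_window
    complaint_rate = stats.complaints_in_window / stats.sent_in_window
    return bounce_rate > MAX_BOUNCE_RATE or complaint_rate > MAX_COMPLAINT_RATE


def can_send(stats: InboxStats, domain_sent_today: int) -> bool:
    """Check inbox health plus the per-inbox and per-domain daily caps."""
    return (
        not inbox_paused(stats)
        and stats.sent_today < INBOX_DAILY_CAP
        and domain_sent_today < DOMAIN_DAILY_CAP
    )
```

At 200 sends/day per domain, a 500/day program needs at least three domains, which is why rotation matters even at modest volume.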
Agent hallucinates a joint connection or event
Mitigation: Whitelist only these claim types: job change, funding, product launch, hire, content they published. Block all 'we met at', 'our mutual friend', and 'your competitor X' — these are where hallucinations land hardest.
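A post-generation validator can enforce both halves of this rule. The sketch below assumes the writer emits a list of claim-type tags alongside the draft text; the tag names and regex patterns are illustrative and would need tuning against real drafts.

```python
import re

ALLOWED_CLAIM_TYPES = {
    "job_change", "funding", "product_launch", "hire", "published_content",
}

# Phrases that imply a relationship or comparison the agent cannot verify.
BLOCKED_PHRASES = {
    "we_met_at": re.compile(r"\bwe met at\b", re.IGNORECASE),
    "mutual_friend": re.compile(r"\bour mutual friend\b", re.IGNORECASE),
    "your_competitor": re.compile(r"\byour competitor\b", re.IGNORECASE),
}


def validate_draft(draft_text: str, claim_types: list[str]) -> list[str]:
    """Return a list of violations; an empty list means the draft passes."""
    problems = [
        f"claim_type:{claim}"
        for claim in claim_types
        if claim not in ALLOWED_CLAIM_TYPES
    ]
    problems += [
        f"phrase:{label}"
        for label, pattern in BLOCKED_PHRASES.items()
        if pattern.search(draft_text)
    ]
    return problems
```

A draft with any violation would be regenerated or routed to the template fallback rather than sent.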
Reply handler commits to pricing or contract terms
Mitigation: Maintain an explicit stoplist of topics (pricing, contracts, legal, timelines beyond 'this week'). Any reply matching these triggers immediate human handoff with the entire thread. Never allow the agent to propose a number.
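A keyword stoplist is a cheap first line of defense in front of the reply model. The patterns below are illustrative assumptions, and the timeline rule ("beyond this week") is omitted because it needs date understanding that a regex cannot provide reliably.

```python
import re

# Hypothetical stoplist; extend per product and region.
STOPLIST = {
    "pricing": re.compile(r"\b(price|prices|pricing|cost|discount|quote)\b", re.IGNORECASE),
    "contracts": re.compile(r"\b(contract|msa|terms)\b", re.IGNORECASE),
    "legal": re.compile(r"\b(legal|nda|dpa)\b", re.IGNORECASE),
}


def stoplist_hits(reply_text: str) -> list[str]:
    """Return the stoplist topics mentioned in an inbound reply."""
    return [topic for topic, pattern in STOPLIST.items() if pattern.search(reply_text)]


def must_hand_off(reply_text: str) -> bool:
    """Any hit means the agent stops and a human takes the whole thread."""
    return bool(stoplist_hits(reply_text))
```

Keyword matching over-triggers by design; a false handoff costs an AE a minute, while a missed one can cost the deal.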
GDPR / CAN-SPAM violations
Mitigation: Honor unsubscribe within 10 minutes via ESP API. Never email EU leads without documented lawful basis. Maintain a suppression list synced across all inboxes. Include physical address + one-click unsubscribe in every email — not optional.
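A single suppression list shared by every inbox can be modeled as a normalized set checked immediately before each send. This is an in-memory sketch with hypothetical names; a production version would persist the list and sync it to the ESP through that provider's own API.

```python
def normalize(email: str) -> str:
    """Normalize so 'Bob@X.com ' and 'bob@x.com' match the same entry."""
    return email.strip().lower()


class SuppressionList:
    """One global list shared by every inbox and campaign."""

    def __init__(self) -> None:
        self._emails: set[str] = set()

    def add(self, email: str) -> None:
        self._emails.add(normalize(email))

    def is_suppressed(self, email: str) -> bool:
        return normalize(email) in self._emails


def filter_queue(queue: list[str], suppression: SuppressionList) -> list[str]:
    """Drop suppressed recipients right before sending."""
    return [email for email in queue if not suppression.is_suppressed(email)]
```

Filtering at send time rather than at enqueue time is what makes a fast unsubscribe effective for messages already scheduled.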
Frequently asked questions
What reply rate should I expect from an AI sales agent in 2026?
Well-researched, signal-grounded messages get 15–30% reply rates in most B2B verticals. Generic AI spray-and-pray sits at 1–3%. The difference comes mostly from research depth and signal freshness, not model choice.
Is Claude or GPT-4o better for outbound copy?
Claude Sonnet 4 consistently writes less AI-voice-y outbound in blind tests. It's more willing to be brief and plainspoken, whereas GPT-4o tends toward polished and slightly generic. Gemini 2.5 Pro is cheaper but the register feels off.
Do I need a dedicated ESP or can I use SendGrid?
For outbound, you need Smartlead / Instantly / Lemlist: they handle inbox warmup, rotation, and reputation management. ESPs like SendGrid are built for opt-in marketing and transactional mail, and cold outbound sent through them will get flagged as spam within weeks at outbound volume.
Can the agent handle all replies or do humans need to step in?
The agent should handle objection clarifications, out-of-office autoresponders, and nurture for 2–3 turns. The moment a reply mentions pricing, a meeting, a decision maker, or a contract, it hands off to a human AE. Over-committing here is irrecoverable.
How does this compare to Clay or Apollo AI?
Clay is great at orchestrating data + small LLM operations on a table. Apollo adds a dialer and ESP. This reference architecture assembles those pieces: use Clay's data patterns for research, Smartlead for sending, and own the prompt layer directly so you can A/B test.
What's the cost per reply?
At startup scale (~15% reply rate, $0.18 per lead cost-to-send), that's ~$1.20 per reply including infra. For qualified meetings, factor in human AE cost. Target cost-per-meeting of $50–150 all-in for most B2B SaaS.
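The per-reply figure is just cost per lead divided by reply rate. A short Python check of the arithmetic, using the numbers above; the function name is illustrative.

```python
def cost_per_reply(cost_per_lead: float, reply_rate: float) -> float:
    """Cost to obtain one reply, given per-lead cost and reply rate (0..1)."""
    if not 0 < reply_rate <= 1:
        raise ValueError("reply_rate must be in (0, 1]")
    return cost_per_lead / reply_rate


# $0.18 per lead at a 15% reply rate
print(f"${cost_per_reply(0.18, 0.15):.2f} per reply")
```

Cost per meeting follows the same pattern: divide cost per reply by your reply-to-meeting conversion rate, which is not given here and has to be measured per ICP.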
How do I stay compliant?
CAN-SPAM (US) and CASL (Canada) require a working opt-out and a physical address; CASL additionally requires express or implied consent before sending. GDPR (EU) requires a lawful basis, usually legitimate interest, documented per campaign. Honor unsubscribes within 10 minutes: CAN-SPAM allows up to 10 business days, but waiting that long risks emailing people who have already opted out. Maintain a single global suppression list.