
Sales Outreach Agent

Last updated: April 16, 2026

Quick answer

The production stack uses Claude Sonnet 4 to compose messages grounded in retrieved lead research, Haiku 4 for signal classification, and a dedicated outbound ESP for inbox warmup and send-time rotation. Expect $0.08–$0.30 per lead for research + message generation and a 15–30% reply rate, versus 1–3% for generic AI outbound. Critical: every claim in the message must cite a source pulled within the last 48h.

The problem

Outbound SDR teams need personalized messages at scale, but generic AI-written emails tank reply rates and burn sender reputation. You need an agent that pulls real signal (recent job change, company news, hiring), weaves it into a credible first line, and then handles the first 2–3 rounds of replies before escalating to a human. The hard parts are deliverability (don't get flagged as spam), accuracy (don't reference a CEO who left), and handoff (hand to a human before the agent over-commits).

Architecture


Lead CRM + ICP Filter

Pulls leads from HubSpot/Salesforce that match ICP (role, company size, tech stack). Dedupes against past outreach.

Alternatives: Apollo.io, Clay, Custom Postgres
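The filter-then-dedupe step can be sketched in a few lines. This is a minimal illustration, not a HubSpot or Salesforce integration: the `Lead` fields, the `matches_icp` criteria, and the idea of keying dedupe on a lowercased email are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lead:
    # Hypothetical shape; a real CRM record carries many more fields.
    email: str
    role: str
    company_size: int
    tech_stack: frozenset


def matches_icp(lead, roles, min_size, max_size, required_tech):
    """True when role, company size, and tech stack all fit the ICP."""
    return (
        lead.role.lower() in roles
        and min_size <= lead.company_size <= max_size
        and bool(lead.tech_stack & required_tech)
    )


def select_leads(leads, already_contacted, **icp):
    """Keep ICP matches, skipping past outreach and in-batch duplicates."""
    seen = {email.lower() for email in already_contacted}
    selected = []
    for lead in leads:
        key = lead.email.lower()
        if key in seen or not matches_icp(lead, **icp):
            continue
        seen.add(key)
        selected.append(lead)
    return selected
```

Deduping before research matters for cost as well as reputation, since every lead that passes this step triggers paid scraping and model calls.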

Lead Research Worker

Parallel agent that scrapes LinkedIn, company site, recent news, GitHub, and funding announcements within the last 60 days.

Alternatives: Exa, Serper API, Apify scrapers

Signal Classifier

Extracts and ranks actionable signals (new job, funding round, product launch, pain post on X). Scores relevance to your offer.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash
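One way to turn classifier output into a ranking is to drop anything older than the 60-day research window and discount older signals. The sketch below assumes the classifier model already returns a `relevance` score between 0 and 1; the `Signal` fields and the linear freshness discount are illustrative choices, not something the architecture prescribes.

```python
from dataclasses import dataclass
from datetime import date

MAX_AGE_DAYS = 60  # matches the research window described above


@dataclass
class Signal:
    kind: str          # e.g. "new_job", "funding", "product_launch"
    relevance: float   # 0..1, assumed to come from the classifier model
    observed_on: date
    source_url: str


def rank_signals(signals, today):
    """Drop stale signals, then sort by relevance discounted for age."""
    def age(s):
        return (today - s.observed_on).days

    def score(s):
        # Assumed linear discount: a 60-day-old signal keeps half its relevance.
        return s.relevance * (1 - age(s) / (2 * MAX_AGE_DAYS))

    fresh = [s for s in signals if 0 <= age(s) <= MAX_AGE_DAYS]
    return sorted(fresh, key=score, reverse=True)
```

The writer would then receive only the first element of the returned list, together with its `source_url`.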

Message Writer

Composes first-touch email grounded in the top signal, matching your brand voice from approved past messages.

Alternatives: GPT-4o, Gemini 2.5 Pro

Deliverability Gate

Checks spam score (SpamAssassin), link count, sender warmup status, time-of-day for recipient timezone.

Alternatives: Instantly, Smartlead, Lemwarm
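A gate like this can be expressed as a function that returns the reasons a message should be held. The sketch assumes the spam score has already been computed elsewhere (for example by SpamAssassin) and is passed in as a number; the thresholds, the 08:00 to 17:00 send window, and the use of a fixed UTC offset for the recipient are assumptions for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

LINK_RE = re.compile(r"https?://\S+")


def gate(body, spam_score, inbox_warmed, send_at_utc, recipient_utc_offset_hours,
         max_links=2, max_spam_score=5.0, window=(8, 17)):
    """Return a list of blocking reasons; an empty list means OK to send."""
    reasons = []
    if spam_score >= max_spam_score:          # assumed threshold
        reasons.append("spam_score")
    if len(LINK_RE.findall(body)) > max_links:
        reasons.append("too_many_links")
    if not inbox_warmed:
        reasons.append("inbox_not_warmed")
    # Convert the planned send time into the recipient's local time.
    local = send_at_utc.astimezone(
        timezone(timedelta(hours=recipient_utc_offset_hours))
    )
    if not (window[0] <= local.hour < window[1]):
        reasons.append("outside_send_window")
    return reasons
```

Returning reasons instead of a boolean makes it easy to reschedule a message that only failed the time-of-day check while discarding one that failed on content.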

Email Service Provider

Sends via warmed-up inboxes rotated across domains. Rate-limited per domain.

Alternatives: SendGrid, Postmark, Google Workspace + relay

Reply Handler

Classifies each reply (interested, objection, unsubscribe, out-of-office), drafts responses for the next 1–2 turns, and hands off to an AE on meeting intent.

Alternatives: GPT-4o, Gemini 2.5 Pro

Human AE / Calendar

Receives the handoff on a qualified reply with full context. Meetings are booked via Cal.com or Chili Piper.

Alternatives: Chili Piper, Cal.com, Calendly

The stack

Writer LLM: Claude Sonnet 4

Sonnet 4's tone control is measurably better than GPT-4o on short-form outbound — less AI-voice, better signal integration. Matters because reply rates correlate directly with perceived authenticity.

Alternatives: GPT-4o, Gemini 2.5 Pro

Signal classifier: Claude Haiku 4

Haiku 4 ranks signals per lead in about 600ms at the median for a fraction of a cent. Running Sonnet here is wasteful.

Alternatives: GPT-4o-mini, DeepSeek R1

Web research: Firecrawl + Exa

Firecrawl handles JS-rendered sites and clean markdown extraction. Exa gives semantic search for 'companies that just did X' queries. Combine them.

Alternatives: Serper, Tavily, Custom Playwright

Deliverability: Smartlead or Instantly

Sender warmup and rotation are not a side project. Dedicated ESPs for outbound cost $100–500/mo and save you from domain blacklists.

Alternatives: Lemlist, Apollo, Custom + SendGrid

CRM: HubSpot or Salesforce

Attribution, deal tracking, and AE handoff live in CRM. The agent writes back activities + reply classification as custom properties.

Alternatives: Attio, Folk, Airtable

Orchestration: Inngest or Trigger.dev

Per-lead pipeline with retries, rate limits, and per-domain send caps fits event-driven orchestration naturally. LangGraph overcomplicates a fundamentally linear flow.

Alternatives: Temporal, Plain cron + queue
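The "linear flow with retries" idea can be shown without any orchestration framework. The sketch below is plain Python, not the Inngest or Trigger.dev API: `run_with_retries` and `run_pipeline` are hypothetical helper names, and the exponential backoff schedule is an assumption.

```python
import time


def run_with_retries(step, payload, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call one pipeline step, retrying with exponential backoff on failure."""
    for attempt in range(1, attempts + 1):
        try:
            return step(payload)
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))


def run_pipeline(lead, steps, **retry_opts):
    """Run the steps in order, feeding each step's output into the next."""
    state = lead
    for step in steps:
        state = run_with_retries(step, state, **retry_opts)
    return state
```

In practice `steps` would be the research, classify, write, gate, and send functions in that order. A hosted orchestrator adds durable state and per-domain rate limiting on top of this shape.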

Evals: Braintrust + A/B framework

You must A/B test subject lines and openers continuously. Reply rate is the only metric that matters and it varies by ICP.

Alternatives: Humanloop, Langfuse
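When comparing two openers on reply rate, a two-proportion z-test is one common way to check that a difference is not just noise. The sketch below is a generic statistics helper, not part of Braintrust or any eval tool, and the sample counts in the usage note are invented for illustration.

```python
import math


def two_proportion_z(replies_a, sent_a, replies_b, sent_b):
    """Return (z, two-sided p-value) for the difference in reply rates."""
    p_a = replies_a / sent_a
    p_b = replies_b / sent_b
    pooled = (replies_a + replies_b) / (sent_a + sent_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

For example, `two_proportion_z(75, 500, 105, 500)` compares a 15% variant against a 21% variant at 500 sends each. Because reply rate varies by ICP, run the comparison within one ICP segment and not across a mixed list.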

Cost at each scale

Prototype

500 leads/mo, 1 user

$180/mo

Sonnet 4 writer + reply: $40

Haiku 4 signal classifier: $5

Firecrawl + Exa: $50

Smartlead starter: $60

Infra + observability: $25

Startup

10,000 leads/mo, 5 SDRs

$2,800/mo

Sonnet 4 writer + reply (cached): $900

Haiku 4 classifier: $80

Firecrawl + Exa: $700

Smartlead growth + warmup inboxes: $600

HubSpot Pro: $400

Infra + Braintrust evals: $120

Scale

200,000 leads/mo, 30 SDRs

$32,500/mo

Sonnet 4 writer + reply (heavy caching): $13,000

Haiku 4 classifier: $1,200

Firecrawl + Exa enterprise: $6,000

Smartlead scale + 200 inboxes: $4,800

Salesforce Enterprise + integrations: $5,500

Infra + evals + observability: $2,000
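The per-lead figures implied by these tables can be checked with a few lines of arithmetic. The dollar amounts below are copied from the three tiers above; the dictionary keys and the choice to count the writer, classifier, and research lines as "research + generation" are assumptions of this sketch (the writer line also covers reply handling, so that figure runs slightly high).

```python
# (leads per month, monthly line items in USD), taken from the tiers above.
TIERS = {
    "prototype": (500, {"writer_and_reply": 40, "classifier": 5,
                        "research": 50, "esp": 60, "infra": 25}),
    "startup": (10_000, {"writer_and_reply": 900, "classifier": 80,
                         "research": 700, "esp": 600, "crm": 400, "infra": 120}),
    "scale": (200_000, {"writer_and_reply": 13_000, "classifier": 1_200,
                        "research": 6_000, "esp": 4_800, "crm": 5_500,
                        "infra": 2_000}),
}
GENERATION_KEYS = {"writer_and_reply", "classifier", "research"}


def per_lead(tier):
    """Return the monthly total and two per-lead unit costs for a tier."""
    leads, items = TIERS[tier]
    total = sum(items.values())
    generation = sum(v for k, v in items.items() if k in GENERATION_KEYS)
    return {
        "total_usd": total,
        "all_in_per_lead": total / leads,
        "research_and_generation_per_lead": generation / leads,
    }
```

Under these assumptions the research-and-generation cost works out to about $0.19, $0.17, and $0.10 per lead across the three tiers, while the all-in cost falls from $0.36 to roughly $0.16 per lead.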

Latency budget

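Total P50: 9,180ms

Total P95: 19,350ms

Both totals are straight sums of the per-stage figures below, so treat the P95 total as a rough planning number, not a measured end-to-end percentile.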

Lead pull from CRM

400ms · 1200ms p95

Research (Firecrawl, 3 sources in parallel)

4500ms · 9000ms p95

Signal classification

600ms · 1400ms p95

Sonnet message generation

2600ms · 4800ms p95

Deliverability check

180ms · 450ms p95

Send via ESP

900ms · 2500ms p95
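The budget is easy to keep honest in code. The stage figures below are copied from the list above; the tuple layout and function names are just for this sketch.

```python
# (stage, median ms, p95 ms), copied from the latency budget above.
STAGES = [
    ("Lead pull from CRM", 400, 1200),
    ("Research", 4500, 9000),
    ("Signal classification", 600, 1400),
    ("Sonnet message generation", 2600, 4800),
    ("Deliverability check", 180, 450),
    ("Send via ESP", 900, 2500),
]


def totals(stages):
    """Sum medians and p95s across sequential stages."""
    return sum(s[1] for s in stages), sum(s[2] for s in stages)


def slowest_stage(stages):
    """Return the stage name with the largest median latency."""
    return max(stages, key=lambda s: s[1])[0]
```

Research accounts for roughly half of the median total, so it is the first place to look when the pipeline needs to be faster.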


Tradeoffs

Volume vs reply rate

Sending 10k messages/day with shallow research drops reply rates below 3% and burns domains. Sending 500/day with about 60 seconds of real research per lead gets 15–30% reply rates and preserves sender reputation. Modern outbound favors depth over breadth: the agent exists to enable depth at scale, not to spray further.

Grounded vs creative writing

A strict 'cite a source' rule makes messages credible but can make them read as boilerplate if the signal is weak. Better: allow creative openers only when signal confidence > 0.8, otherwise fall back to a clean, short problem-statement template. The worst outcome is fabricated specifics.
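The confidence rule reduces to a small routing function. The mode names are hypothetical labels for two prompt templates; only the 0.8 threshold comes from the text above.

```python
from typing import Optional


def opener_mode(signal_confidence: Optional[float], threshold: float = 0.8) -> str:
    """Pick the prompt template based on how much the top signal is trusted."""
    if signal_confidence is not None and signal_confidence > threshold:
        return "creative_opener"
    # Weak or missing signal: no specifics, so nothing can be fabricated.
    return "problem_statement_template"
```

Treating a missing score the same as a low one keeps the safe template as the default path.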

Auto-reply vs human handoff

Letting the agent handle 2–3 reply turns beats immediate human handoff for nurture replies. For any reply mentioning pricing, meetings, or signing, hand off immediately — over-committing on price or timeline is the #1 way an agent loses a deal.

Failure modes & guardrails

Message references a person who left the company

Mitigation: Require every named-person reference to include a source URL and a 'last_seen' date within 30 days. If LinkedIn data is older than 30 days, refetch before composing. Add a final guard: cross-check named title against current company page.
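The freshness rule can be enforced with a check that runs before composing. The `PersonRef` record and its field names are assumptions for this sketch; the 30-day limit and the requirement for a source URL come from the mitigation above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class PersonRef:
    # Hypothetical record for one named person mentioned in a draft.
    name: str
    title: str
    source_url: Optional[str]
    last_seen: Optional[date]


def needs_refetch(ref, today, max_age_days=30):
    """True when the reference lacks a source or was last verified too long ago."""
    if not ref.source_url or ref.last_seen is None:
        return True
    return (today - ref.last_seen) > timedelta(days=max_age_days)
```

A draft should only proceed once `needs_refetch` is false for every person it names. The title cross-check against the current company page would be a separate step.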

Deliverability collapse from over-sending

Mitigation: Hard-cap sends at 50/day per inbox and 200/day per domain. Pause any inbox whose bounce rate exceeds 3% or whose spam-complaint rate exceeds 0.1%. Run continuous warmup on every inbox, and rotate across 5–10 domains for volume.
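These limits can be sketched as a pre-send check. The numeric thresholds come from the mitigation above; the `InboxStats` fields and the decision to measure both rates against sends in a trailing window are assumptions, since the text does not define the denominators.

```python
from dataclasses import dataclass

MAX_PER_INBOX_PER_DAY = 50
MAX_PER_DOMAIN_PER_DAY = 200
MAX_BOUNCE_RATE = 0.03
MAX_COMPLAINT_RATE = 0.001


@dataclass
class InboxStats:
    sent_today: int
    domain_sent_today: int
    window_sent: int        # sends in a trailing window (window length assumed)
    window_bounced: int
    window_complaints: int


def should_pause(stats):
    """Pause when bounce or complaint rate in the window exceeds its limit."""
    if stats.window_sent == 0:
        return False
    return (
        stats.window_bounced / stats.window_sent > MAX_BOUNCE_RATE
        or stats.window_complaints / stats.window_sent > MAX_COMPLAINT_RATE
    )


def can_send(stats):
    """True only if the inbox is healthy and both daily caps have headroom."""
    return (
        not should_pause(stats)
        and stats.sent_today < MAX_PER_INBOX_PER_DAY
        and stats.domain_sent_today < MAX_PER_DOMAIN_PER_DAY
    )
```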

Agent hallucinates a mutual connection or shared event

Mitigation: Whitelist only these claim types: job change, funding, product launch, hire, content they published. Block all 'we met at', 'our mutual friend', and 'your competitor X' — these are where hallucinations land hardest.
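A draft validator for this rule might look like the following. It assumes the writer emits a list of claim-type labels alongside the draft text, which is a design choice of this sketch; the allowed types and blocked phrases mirror the mitigation above, and the label strings are hypothetical.

```python
import re

ALLOWED_CLAIM_TYPES = {
    "job_change", "funding", "product_launch", "hire", "published_content",
}
BLOCKED_PHRASES = {
    "we_met_at": re.compile(r"\bwe met at\b", re.I),
    "mutual_friend": re.compile(r"\bour mutual friend\b", re.I),
    "your_competitor": re.compile(r"\byour competitor\b", re.I),
}


def validate_draft(text, claim_types):
    """Return a list of problems; an empty list means the draft passes."""
    problems = [
        f"claim_type:{c}" for c in claim_types if c not in ALLOWED_CLAIM_TYPES
    ]
    problems += [
        f"phrase:{name}"
        for name, pattern in BLOCKED_PHRASES.items()
        if pattern.search(text)
    ]
    return problems
```

Phrase matching only catches the literal wordings, so the claim-type allowlist is the stronger of the two checks.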

Reply handler commits to pricing or contract terms

Mitigation: Maintain an explicit stoplist of topics (pricing, contracts, legal, timelines beyond 'this week'). Any reply matching these triggers immediate human handoff with the entire thread. Never allow the agent to propose a number.
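A stoplist check can run on every inbound reply before the agent drafts anything. The keyword patterns below are a crude, illustrative stand-in for a proper classifier, and the three-turn cap reflects the 2–3 turn limit discussed under tradeoffs; function and key names are hypothetical.

```python
import re

STOPLIST = {
    "pricing": re.compile(r"\b(pric\w*|cost\w*|quote|discount\w*)\b|[$€£]\s?\d", re.I),
    "contracts": re.compile(r"\b(contract\w*|msa|sow)\b", re.I),
    "legal": re.compile(r"\b(legal|nda|dpa)\b", re.I),
    "timeline": re.compile(r"\b(next (month|quarter|year)|deadline|go[- ]live)\b", re.I),
}


def route_reply(thread, agent_turns_so_far, max_agent_turns=3):
    """Decide whether the agent may answer the latest message in the thread."""
    latest = thread[-1]
    topics = [name for name, pattern in STOPLIST.items() if pattern.search(latest)]
    if topics or agent_turns_so_far >= max_agent_turns:
        # Hand off with the entire thread so the AE has full context.
        return {"action": "handoff_to_ae", "topics": topics, "thread": list(thread)}
    return {"action": "agent_reply", "topics": [], "thread": list(thread)}
```

Erring toward false positives is the right bias here: an unnecessary handoff costs a few minutes of AE time, while a missed one can cost the deal.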

GDPR / CAN-SPAM violations

Mitigation: Honor unsubscribe within 10 minutes via ESP API. Never email EU leads without documented lawful basis. Maintain a suppression list synced across all inboxes. Include physical address + one-click unsubscribe in every email — not optional.

Frequently asked questions

What reply rate should I expect from an AI sales agent in 2026?

Well-researched, signal-grounded messages get 15–30% reply rates in most B2B verticals. Generic AI spray-and-pray sits at 1–3%. The difference is entirely in research depth and signal freshness, not model choice.

Is Claude or GPT-4o better for outbound copy?

Claude Sonnet 4 consistently writes less AI-voice-y outbound in blind tests. It's more willing to be brief and plainspoken, whereas GPT-4o tends toward polished and slightly generic. Gemini 2.5 Pro is cheaper but the register feels off.

Do I need a dedicated ESP or can I use SendGrid?

For outbound, you need Smartlead, Instantly, or Lemlist: they handle inbox warmup, rotation, and reputation management. ESPs like SendGrid are built for opt-in marketing and transactional mail, and cold outbound sent through them will get flagged as spam within weeks.

Can the agent handle all replies or do humans need to step in?

The agent should handle objection clarifications, out-of-office autoresponders, and nurture for 2–3 turns. The moment a reply mentions pricing, a meeting, a decision maker, or a contract, it hands off to a human AE. Over-committing here is irrecoverable.

How does this compare to Clay or Apollo AI?

Clay is great at orchestrating data + small LLM operations on a table. Apollo adds a dialer and ESP. This reference architecture assembles those pieces: use Clay's data patterns for research, Smartlead for sending, and own the prompt layer directly so you can A/B test.

What's the cost per reply?

At startup scale (~15% reply rate, $0.18 per lead cost-to-send), that's ~$1.20 per reply including infra. For qualified meetings, factor in human AE cost. Target cost-per-meeting of $50–150 all-in for most B2B SaaS.
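The cost-per-reply figure is a single division, shown here so the sensitivity to reply rate is easy to see. The function name is illustrative; the $0.18 and 15% inputs come from the answer above.

```python
def cost_per_reply(cost_per_lead_usd: float, reply_rate: float) -> float:
    """Cost of one reply, given per-lead cost and reply rate (0 < rate <= 1)."""
    if not 0 < reply_rate <= 1:
        raise ValueError("reply_rate must be in (0, 1]")
    return cost_per_lead_usd / reply_rate
```

At $0.18 per lead and a 15% reply rate this gives $1.20 per reply. The same per-lead cost at a 3% reply rate gives $6.00, which is why research depth matters more to unit economics than shaving model costs.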

How do I stay compliant?

CAN-SPAM (US) and CASL (Canada) require opt-out + physical address. GDPR (EU) requires lawful basis, usually legitimate interest, documented per-campaign. Honor unsubscribes within 10 minutes: CAN-SPAM allows up to 10 business days, but every extra send to someone who has opted out is a complaint and reputation risk. Maintain a single global suppression list.

