Reference Architecture · generation

Email & Message Composition

Last updated: April 16, 2026

Quick answer

The production stack retrieves recipient context (CRM, thread history, enrichment signals), classifies the message intent and stakes, derives a tone profile, generates 2-3 variants with Claude Sonnet 4 or GPT-4o, scores the variants against a rubric (tone, personalization, deliverability), and either auto-sends the top variant or queues it for human approval depending on stakes. Expect $0.002 to $0.02 per generated message, with reply rates 2-4x generic templates.

The problem

You need to generate outbound messages at scale — sales outreach, customer success check-ins, transactional notifications, support replies — that feel personal, respect tone, avoid spam filters, and do not embarrass you with hallucinated names or references. A single-shot GPT call produces generic slop that tanks reply rates. The system must personalize from CRM context, calibrate tone to the relationship, and route high-stakes messages to human review.

Architecture

Message Trigger (input) → Recipient Context Retriever (data) → Intent & Stakes Classifier (LLM) → Tone & Relationship Profiler (data) → Message Composer (LLM) → 2-3 variants → Variant Scorer (LLM) → top variant → Deliverability Guardrails (infra) → high stakes: Human Approval Queue (output); low stakes: Send & Track (output)

Message Trigger

Entry point: a CRM event (new lead, stage change), a user action (signup, churn signal), a human-initiated draft request, or a scheduled cadence.

Alternatives: Segment CDP event, HubSpot workflow trigger, Zapier webhook, Manual draft request from inbox

Recipient Context Retriever

Pulls CRM record, recent thread history, recent website/product behavior, firmographic enrichment (Clearbit, Apollo), and any explicit preferences (unsubscribe status, communication cadence).

Alternatives: Pure CRM lookup (HubSpot, Salesforce), Vector search over past interactions, Graph-based recipient memory

Intent & Stakes Classifier

Classifies message intent (cold outreach, follow-up, support reply, renewal, escalation) and stakes (low for transactional, high for executive or legal). The stakes label determines human-review routing.

Alternatives: Rule-based classifier, Claude Haiku 4, GPT-4o-mini
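The routing decision itself can stay deterministic even when an LLM supplies the intent label. A minimal sketch of that rule layer, with hypothetical intent names, titles, and thresholds:

```python
# Deterministic stakes routing behind the LLM intent classifier.
# Intent names, title keywords, and rules here are illustrative, not prescriptive.
HIGH_STAKES_INTENTS = {"escalation", "renewal", "legal"}
EXEC_TITLES = ("vp", "chief", "ceo", "cfo", "cto", "president")

def classify_stakes(intent: str, recipient_title: str, is_first_touch: bool) -> str:
    """Return 'high' (route to human approval) or 'low' (auto-send)."""
    title = recipient_title.lower()
    if intent in HIGH_STAKES_INTENTS:
        return "high"
    if any(t in title for t in EXEC_TITLES):
        return "high"
    if is_first_touch and intent == "cold_outreach":
        return "high"
    return "low"
```

Keeping stakes rules out of the LLM makes the review-routing policy auditable and cheap to change.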

Tone & Relationship Profiler

Derives the right tone (formal/casual, detailed/brief, direct/soft) from relationship stage, past thread tone, recipient seniority, and explicit sender style examples.

Alternatives: Static per-segment tone config, LLM-inferred from thread history, Sender's own writing samples (fine-tune or few-shot)
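The tone profile works best as a small structured object rather than free text, so downstream prompts and judges can reference explicit axes. A sketch with assumed axes and heuristic fallback values:

```python
from dataclasses import dataclass

@dataclass
class ToneProfile:
    formality: int   # 1 = very casual .. 5 = very formal
    directness: int  # 1 = soft .. 5 = blunt
    length: str      # "brief" | "standard" | "detailed"

def infer_tone(relationship_stage: str, exchanges: int) -> ToneProfile:
    """Heuristic fallback when the thread is too thin for LLM inference.
    Stage names and defaults are illustrative."""
    if relationship_stage == "first_touch":
        return ToneProfile(formality=4, directness=3, length="brief")
    if exchanges >= 5:
        return ToneProfile(formality=2, directness=4, length="brief")
    return ToneProfile(formality=3, directness=3, length="standard")
```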

Message Composer

Generates 2-3 variants given intent, tone profile, recipient context, and sender persona. Emits subject + body + optional CTA. Uses structured output.

Alternatives: GPT-4o, Gemini 2.5 Pro, Claude Haiku 4 for transactional high-volume
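The subject + body + optional CTA contract is easiest to enforce by validating the composer's structured output before anything downstream sees it. A minimal sketch, assuming the model returns one JSON object per variant:

```python
import json

# Field contract for one composer variant; cta may be null.
VARIANT_SCHEMA = {"subject": str, "body": str, "cta": (str, type(None))}

def parse_variant(raw: str) -> dict:
    """Validate one composer variant against the subject/body/cta contract."""
    data = json.loads(raw)
    for field, typ in VARIANT_SCHEMA.items():
        if field not in data or not isinstance(data[field], typ):
            raise ValueError(f"variant missing or mistyped field: {field}")
    return data
```

Rejecting malformed variants here means the scorer and guardrails only ever see well-shaped messages.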

Variant Scorer

Scores each variant against a rubric: personalization depth, tone match, deliverability (spam-triggering words, link quality), length appropriateness, and CTA clarity. Picks the top variant or ranks them.

Alternatives: LLM-as-judge (Haiku), Hard-rule checker + LLM judge hybrid, Learned bandit from past reply rates

Deliverability Guardrails

Deterministic checks: spam-trigger word scan, link reputation check, image-to-text ratio for HTML emails, unsubscribe link presence, DMARC/DKIM sender validity.

Alternatives: SpamAssassin integration, Litmus API, Custom regex + allowlist
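Several of these checks reduce to plain string inspection and run in microseconds. A minimal sketch (the phrase list and thresholds are illustrative, not a vetted blocklist):

```python
# Illustrative deterministic pre-send checks; extend the phrase list
# from your own spam-folder postmortems.
SPAM_PHRASES = ("act now", "100% free", "risk-free", "limited time offer")

def deliverability_check(subject: str, html_body: str) -> list:
    """Return a list of failed checks (empty list = pass)."""
    failures = []
    text = f"{subject} {html_body}".lower()
    if any(p in text for p in SPAM_PHRASES):
        failures.append("spam_phrase")
    letters = [c for c in subject if c.isalpha()]
    if letters and sum(c.isupper() for c in letters) / len(letters) > 0.5:
        failures.append("excessive_caps")
    if subject.count("!") + html_body.count("!") > 3:
        failures.append("excessive_exclamation")
    if "unsubscribe" not in html_body.lower():
        failures.append("missing_unsubscribe")
    return failures
```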

Human Approval Queue

High-stakes messages (executive recipients, legal/regulatory, first-touch enterprise outreach) route to a human to approve, edit, or reject before send.

Alternatives: Slack approval flow, In-CRM approval widget, Mobile push approval

Send & Track

Sends via provider (SendGrid, Postmark, Slack API, Twilio). Tracks open, click, reply, and downstream conversion. Feeds outcomes back as bandit signal.

Alternatives: SendGrid, Postmark, Resend, SES + custom reply parsing

The stack

Composer model: Claude Sonnet 4 for quality-critical, Claude Haiku 4 for transactional volume

Sonnet 4 preserves a distinct sender voice and personalizes without sounding robotic. Haiku 4 is 5x cheaper and good enough for transactional (receipt, notification, reminder) flows where personalization is shallow. GPT-4o is a close second for English; Sonnet edges ahead on tone preservation across follow-ups.

Alternatives: GPT-4o, Gemini 2.5 Pro

Intent and stakes classifier: Gemini 2.0 Flash or Claude Haiku 4

This runs on every message and is latency-sensitive. Flash, at 200-400ms TTFT and a fraction of a cent per call, is a good fit for the hot path. For structured intent labels (10-20 categories), a fine-tuned Flash or Haiku beats zero-shot by 5-10 points.

Alternatives: GPT-4o-mini, Rule-based classifier

Tone profiling: Few-shot examples from the past thread + sender style samples

Tone drifts with relationship stage — formal on first touch, casual after 5 exchanges. The cheapest reliable method is to include the last 2-3 messages from the thread as tone anchors plus 3-5 paragraph exemplars of the sender's writing. Fine-tune only for high-volume individual senders (sales reps, CS managers).

Alternatives: Static per-segment config, Fine-tuned per-sender model for high-volume senders
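Assembling those anchors is just prompt-context plumbing. A sketch, assuming thread messages and sender samples arrive as plain strings (the section markers are hypothetical):

```python
def build_tone_context(thread_messages: list, sender_samples: list) -> str:
    """Assemble tone anchors: last 2-3 thread messages plus up to five
    sender writing exemplars, tagged so the composer can tell them apart."""
    parts = ["## Tone anchors (match this voice)"]
    for msg in thread_messages[-3:]:
        parts.append(f"[thread] {msg}")
    for sample in sender_samples[:5]:
        parts.append(f"[sender style] {sample}")
    return "\n".join(parts)
```

The resulting string is prepended to the composer's system prompt alongside the structured tone profile.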

Variant generation: 2-3 variants at temperature 0.7

Multi-variant generation costs 2-3x more but lets the scorer pick the best and provides signal for a learned bandit. Temperature 0.7 gives genuine diversity; too low and variants collapse to the same message, too high and they drift off-topic. For transactional flows (receipts), single variant at 0.3 is fine.

Alternatives: Single variant at temperature 0.3, 5+ variants with bandit selection
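Because the variants are independent samples, they can be requested in parallel so multi-variant generation costs latency of one call, not three. A sketch where `generate` is your model client with a hypothetical `(prompt, temperature) -> dict` signature:

```python
from concurrent.futures import ThreadPoolExecutor

def generate_variants(generate, prompt: str, n: int = 3, temperature: float = 0.7):
    """Fan out n parallel composer calls and collect the variants.
    `generate` is caller-supplied; any LLM client fits behind it."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(generate, prompt, temperature) for _ in range(n)]
        return [f.result() for f in futures]
```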

Deliverability: Static spam-word scan + SendGrid or Postmark reputation monitoring

Static checks catch obvious issues (ALL CAPS, excessive exclamation, spam-trigger phrases, broken links). Provider reputation monitoring catches domain or IP issues. Both are cheap; skipping either costs you 10-30 percent of deliverability on cold outbound.

Alternatives: SpamAssassin, Litmus, Custom blocklist

Feedback loop: Open + click + reply tracking fed back as bandit signal

Reply rate is the only metric that matters for outreach; open/click are weaker proxies. Feed the outcome of every message back to the scorer as training signal. Over 3-6 months, this closes the loop and the scorer's rankings start correlating with real reply rates.

Alternatives: No feedback (pure prompt engineering), Manual A/B tests, RLHF on reply outcomes
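One concrete shape for that feedback loop is a Beta-Bernoulli bandit over message strategies, treating reply (1) vs. no reply (0) as the reward. A Thompson-sampling sketch under that assumption:

```python
import random

class ReplyRateBandit:
    """Thompson sampling over message strategies/templates ('arms'),
    with reply / no-reply as the binary reward signal."""

    def __init__(self, arms):
        # Beta(alpha, beta) prior per arm; [1, 1] is uniform.
        self.stats = {a: [1, 1] for a in arms}

    def choose(self) -> str:
        """Sample a plausible reply rate per arm; pick the best draw."""
        draws = {a: random.betavariate(s[0], s[1]) for a, s in self.stats.items()}
        return max(draws, key=draws.get)

    def record(self, arm: str, replied: bool) -> None:
        """Update the chosen arm's posterior with the observed outcome."""
        self.stats[arm][0 if replied else 1] += 1
```

Over months of outcomes, the posterior concentrates on the strategies that actually earn replies, which is the "learned bandit" alternative named above.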

Cost at each scale

Prototype

2k messages/mo

$40/mo

Composition (Sonnet 4 mixed with Haiku): $15
Intent + tone classifier (Flash): $3
Variant scoring (Haiku): $4
CRM + enrichment APIs: $8
Sender (SendGrid free tier): $0
Observability: $10

Startup

100k messages/mo

$980/mo

Composition (Sonnet 4 for outreach, Haiku 4 for transactional): $450
Classifier + tone profiler (Flash): $80
Variant scoring: $120
CRM context retrieval + enrichment: $100
Deliverability (Postmark): $100
Infra (Vercel Pro): $30
Observability (Langfuse): $100

Scale

5M messages/mo

$28,500/mo

Composition (heavily Haiku 4, Sonnet 4 for top-of-funnel): $12,000
Classifier + tone profiler (Flash fine-tuned): $1,800
Variant scoring + bandit: $2,400
CRM + enrichment at scale: $3,500
Sender (SendGrid Enterprise + dedicated IPs): $4,000
Infra + queue (Vercel Enterprise + Inngest): $1,800
Observability + evals: $1,500
Human approval tooling: $1,500

Latency budget

Total P50: 3,090ms · Total P95: 6,670ms

Context retrieval (CRM + thread): 250ms median · 650ms p95
Intent + tone classification: 300ms median · 600ms p95
Composer (streamed, 3 variants in parallel): 1,800ms median · 3,500ms p95
Variant scoring (Haiku judge): 400ms median · 900ms p95
Deliverability check: 40ms median · 120ms p95
Send (provider API): 300ms median · 900ms p95

Tradeoffs

Personalization depth vs generation cost

Heavy personalization (recent product behavior, mutual connections, firmographic context) lifts reply rates 2-4x on cold outreach but triples the context retrieval cost and increases input tokens. For transactional messages (receipts, reminders) skip deep personalization — it does not move the needle. For high-value outreach (enterprise, >$50k ACV), personalize hard; for long-tail nurture, template-heavy with light personalization is more efficient.

Multi-variant vs single-variant generation

Generating 3 variants at temperature 0.7 and scoring picks the best message but costs 3x as much. For cold outreach where reply rates are the KPI, the extra cost is easily justified. For transactional flows with low variance in outcome, single variant at low temperature wins. Start multi-variant, measure whether the scorer's 'best' actually wins on reply rate; graduate to single variant when it does not.

Human approval vs fully automated send

Human approval catches embarrassing hallucinations (wrong company name, wrong exec title, off-tone) but adds hours-to-days of latency and caps volume at human throughput. Route by stakes: auto-send for low-stakes transactional, human-approve for executive recipients, regulated industries, and first-touch enterprise. A 5-10 percent sample of auto-sent messages should still route through a human for ongoing QA.

Failure modes & guardrails

Hallucinated names, titles, or company references

Mitigation: Restrict the composer to use only entities explicitly present in the retrieved context. Post-generation, extract every proper noun and verify against the CRM record and retrieval context. Reject messages with unverifiable entities. Track hallucination rate per model version; a spike usually means a bad context retrieval.
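The proper-noun verification step can be approximated with a naive capitalized-span scan against the retrieved context. A sketch (real systems should use NER; sentence-initial capitals will false-positive here):

```python
import re

def unverified_entities(message: str, context: str) -> set:
    """Flag multi-word capitalized spans in the draft that do not appear
    verbatim in the retrieved CRM/enrichment context. Deliberately naive:
    a proper NER pass is the production-grade version of this check."""
    candidates = set(re.findall(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b", message))
    return {e for e in candidates if e not in context}
```

A non-empty return set rejects the variant before it reaches the send path.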

Tone mismatch — cold message reads like a warm follow-up or vice versa

Mitigation: Make the tone profile explicit and structured (formality 1-5, directness 1-5, length preference). Include 2-3 example messages at the target tone in the system prompt. Run a tone-consistency LLM judge; reject variants scoring below threshold. Fine-tune per high-volume sender when scale justifies it.

Sender reputation damage from low-quality generated outreach

Mitigation: Warm up new domains with transactional volume before cold outreach. Monitor bounce, complaint, and unsubscribe rates per domain and per sender. If complaint rate exceeds 0.1 percent or bounce rate exceeds 2 percent, pause that sender and investigate. Never send generated cold outreach from a domain also used for transactional.
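The pause thresholds above reduce to a two-line check per sender. A sketch using exactly the 0.1 percent complaint and 2 percent bounce limits from the mitigation:

```python
def should_pause_sender(sent: int, complaints: int, bounces: int) -> bool:
    """Pause a sender when complaint rate exceeds 0.1% or bounce rate
    exceeds 2% over the measurement window."""
    if sent == 0:
        return False
    return complaints / sent > 0.001 or bounces / sent > 0.02
```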

Prompt injection via thread history (recipient includes instructions in their reply)

Mitigation: Treat thread history as untrusted input. Strip or neutralize patterns that look like instructions. Use structured output so the composer can only emit subject/body/cta fields. Run a prompt-injection classifier on inbound thread content before including it in the composer prompt.
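A simple version of the strip/neutralize step is pattern redaction over inbound thread text. A sketch with an illustrative (deliberately incomplete) pattern list; a trained injection classifier should back this up:

```python
import re

# Instruction-shaped phrases commonly seen in injection attempts.
# This list is illustrative, not exhaustive.
INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"system prompt",
    r"you are now",
]

def neutralize_thread(text: str) -> str:
    """Redact instruction-shaped spans before thread history enters
    the composer prompt as untrusted context."""
    for pat in INJECTION_PATTERNS:
        text = re.sub(pat, "[redacted]", text, flags=re.IGNORECASE)
    return text
```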

Generated message triggers spam filters despite passing static checks

Mitigation: Monitor provider-side reputation metrics (SendGrid, Postmark dashboards). Track inbox placement using Litmus or seed-list testing. Feed back deliverability outcomes to the scorer. Maintain a deny-list of phrases that correlate with spam-folder placement for your specific sending pattern.

Frequently asked questions

Which LLM is best for email and message composition in 2026?

Claude Sonnet 4 for outreach, sales, and customer success where tone and personalization matter most. Haiku 4 or Gemini 2.0 Flash for transactional flows where personalization is shallow and volume is high. GPT-4o is close to Sonnet on English; Sonnet holds a small edge on tone preservation across multi-touch sequences.

How much lift does personalization actually provide?

On cold outreach, deep personalization (recent product behavior, mutual connections, firmographic context) lifts reply rates 2-4x versus generic templates in 2026 benchmarks. On warm follow-ups it lifts 1.3-1.8x. On transactional messages (receipts, reminders) it is roughly neutral. Invest heavily where the lift is there; save the tokens where it is not.

Should I generate multiple variants or a single message?

Multi-variant (3 at temperature 0.7) with a scorer picks better messages at 3x the composition cost — usually worth it for outreach where reply rates are the KPI. For transactional flows, single variant at low temperature is cheaper and equally effective. Start multi-variant, measure whether the scorer's picks beat a random variant in production; graduate to single variant if not.

How do I prevent hallucinated names and titles in generated messages?

Pass only verified CRM/enrichment data into the prompt, and post-generation extract every proper noun and verify it exists in the input. Reject messages with unverifiable entities. This catches 95+ percent of embarrassing errors. Also keep a deny-list of entity shapes (company name + random title) that should never co-occur.

When do I need human approval before sending?

Always for: executive recipients (VP and above at target accounts), regulated industries (healthcare, financial services, legal), first-touch enterprise outreach above $50k ACV target, and anything with pricing or legal claims. For low-stakes transactional and nurture, auto-send with a 5-10 percent sample to human review for ongoing QA is the production pattern.

How do I stop generated emails from going to spam?

Three layers: static checks for spam-trigger words and broken links before send, provider reputation monitoring (SendGrid, Postmark), and seed-list testing (Litmus, GlockApps). Warm up new sending domains with transactional volume before cold outreach. Monitor complaint rate (<0.1 percent) and bounce rate (<2 percent); pause any sender breaching thresholds.

How much does generated outbound messaging cost at scale?

At 5M messages/month, expect $25-35k total cost: roughly 40 percent on LLM calls (composition, classification, scoring), 15 percent on CRM + enrichment, 15 percent on sending provider, rest on infra, observability, and human review tooling. Per-message cost lands at $0.005-$0.01 all-in.

Can I fine-tune on my own best-performing emails?

Yes, and it is worth it above ~100k messages/month. Fine-tune Haiku 4 on your top-performing outreach (measured by reply rate) to capture voice and patterns. Fine-tuned Haiku matches zero-shot Sonnet 4 quality at 5x lower cost for your specific use case. Below that volume, few-shot prompting with 3-5 exemplars is easier to maintain and nearly as good.
