Reference Architecture · agent

Email Triage Agent

Last updated: April 16, 2026

Quick answer

The production stack uses Claude Haiku 4 as the classifier on every inbound email, Claude Sonnet 4 as the drafter for replies, and a strict human-in-the-loop review before anything sends. Expect $0.002–$0.01 per email processed and $0.02–$0.05 per drafted reply, with 1–3s latency for classification and 3–6s for a draft. Never run auto-send without confidence thresholds and allow-lists.

The problem

Knowledge workers get 100–300 emails a day and spend 2+ hours on inbox triage. You need an agent that classifies priority, extracts action items, drafts replies for common threads, and never, ever auto-sends something embarrassing. The hard parts are latency on long threads, handling reply-all etiquette, and avoiding hallucinated commitments.

Architecture

[Architecture diagram. Nodes: Mail Ingest (input), Thread Normalizer (infra), Priority Classifier (LLM), User Context Store (data), Reply Drafter (LLM), Action Extractor (LLM), Human Review Panel (output), Send/Label/Archive (output). Edge labels: "if reply needed", "if task-bearing", "tone + past replies".]

Mail Ingest

Watches Gmail or Outlook via push subscription (Pub/Sub or Graph webhook). Pulls full thread context per message.

Alternatives: IMAP polling, Nylas API, Microsoft Graph

Thread Normalizer

Strips quoted replies, signatures, legal footers; resolves sender identity against CRM; extracts attachments.

Alternatives: mailparser, EmailReplyParser, Custom regex + LLM fallback
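
As a rough illustration of the normalizer's first pass, the sketch below strips inline quoted lines, the "On ... wrote:" attribution line, and a signature introduced by the conventional "-- " delimiter. The function name and both patterns are assumptions made for this example; real mail varies far more than this, which is why a dedicated parser with an LLM fallback is listed above.

```python
import re

# Attribution line most clients insert above a quoted reply (assumed format).
QUOTE_HEADER = re.compile(r"^On .+ wrote:\s*$")
# Conventional signature delimiter: two dashes, optionally followed by a space.
SIGNATURE_DELIMITER = re.compile(r"^--\s?$")


def strip_quotes_and_signature(body: str) -> str:
    """Return only the newest message text from a plain-text email body."""
    kept = []
    for line in body.splitlines():
        if QUOTE_HEADER.match(line) or SIGNATURE_DELIMITER.match(line):
            break  # everything below is quoted history or a signature
        if line.lstrip().startswith(">"):
            continue  # inline quoted line
        kept.append(line)
    return "\n".join(kept).strip()
```

Legal footers and HTML bodies are not handled here; they would need their own rules or the LLM fallback.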

Priority Classifier

Fast model scores each email on priority (P0–P3), intent (ask/info/spam/newsletter), and action required.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash
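
One way to keep the classifier's output usable downstream is to validate it against a fixed schema before anything else acts on it. The sketch below assumes the model is prompted to return JSON with the three fields described above; the field names and the `parse_classification` helper are hypothetical.

```python
import json
from dataclasses import dataclass

PRIORITIES = ("P0", "P1", "P2", "P3")
INTENTS = ("ask", "info", "spam", "newsletter")


@dataclass(frozen=True)
class Classification:
    priority: str
    intent: str
    action_required: bool


def parse_classification(raw_json: str) -> Classification:
    """Validate the classifier's JSON output; fail loudly on unknown labels."""
    data = json.loads(raw_json)
    if data["priority"] not in PRIORITIES:
        raise ValueError(f"unknown priority: {data['priority']!r}")
    if data["intent"] not in INTENTS:
        raise ValueError(f"unknown intent: {data['intent']!r}")
    return Classification(data["priority"], data["intent"], bool(data["action_required"]))
```

Rejecting unknown labels, rather than guessing, lets the orchestrator retry the call or route the email to review.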

User Context Store

Stores user VIP list, project context, tone examples, past replies for few-shot prompting.

Alternatives: Supabase, Pinecone namespace per user

Reply Drafter

On P0/P1 that needs reply, generates a draft in the user's tone using retrieved past replies.

Alternatives: GPT-4o, Gemini 2.5 Pro

Action Extractor

Pulls commitments, deadlines, and tasks into a structured list for a todo system.

Alternatives: GPT-4o-mini, Structured-output call on Sonnet

Human Review Panel

Web/mobile UI where user approves drafts, sees priority queue, and can send or edit.

Alternatives: Slack review bot, iMessage approval flow

Send/Label/Archive

Once approved, writes back: sends reply, applies labels, archives newsletters, snoozes low priority.

Alternatives: Gmail API, Outlook Graph, Superhuman integration

The stack

Priority classifier: Claude Haiku 4

Runs on every inbound email, so cost matters. Haiku 4 classifies priority and intent in under 1s at the median for about $0.001 per email.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash

Reply drafter: Claude Sonnet 4

Tone matching and multi-turn context are where Sonnet 4 excels. Worth the cost only on drafts, not every email.

Alternatives: GPT-4o, Gemini 2.5 Pro

Mail API: Gmail API + Microsoft Graph

Native APIs give push notifications and label/thread primitives. Nylas is a nice abstraction if you need both clouds.

Alternatives: Nylas, IMAP/SMTP

User memory store: Supabase + pgvector

Per-user isolation via row-level security (RLS) matters for privacy. pgvector is plenty fast for <10k vectors per user.

Alternatives: Pinecone, Qdrant

Orchestration: Custom event-driven (Inngest or Trigger.dev)

Email is naturally event-driven with retries. Inngest handles dedup, retries, and fanout cleanly — better fit than a chat-style agent loop.

Alternatives: Temporal, LangGraph

Human review UI: Next.js + shadcn

Batched review once or twice a day beats per-email notification spam. Keyboard-driven review UI (j/k/enter) is the magic.

Alternatives: iOS Shortcut, Slack bot

Cost at each scale

Prototype

1 user, ~3,000 emails/mo

$12/mo

Haiku 4 classification: $4
Sonnet 4 drafts (~300 replies): $6
Supabase (free tier): $0
Inngest (free tier): $0
Hosting (Vercel Hobby): $0
Observability: $2

Startup

500 users, ~1.5M emails/mo

$4,800/mo

Haiku 4 classification: $1,800
Sonnet 4 drafts (~150k drafts): $2,100
Supabase Pro: $120
Inngest Pro: $200
Infra (Vercel Pro): $180
Observability (Langfuse): $400

Scale

25,000 users, ~75M emails/mo

$182,000/mo

Haiku 4 classification (cached): $75,000
Sonnet 4 drafts (~7M, cached): $85,000
Supabase Enterprise: $8,000
Orchestration (Temporal Cloud): $4,500
Infra: $6,000
Observability + evals: $3,500
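
To compare the three tiers on a common basis, the totals can be divided by user and email counts. The sketch below only restates the tier figures listed above; the `unit_costs` helper is a name chosen for this example.

```python
# Users, monthly email volume, and monthly total copied from the three tiers.
TIERS = {
    "prototype": {"users": 1, "emails": 3_000, "monthly_cost": 12},
    "startup": {"users": 500, "emails": 1_500_000, "monthly_cost": 4_800},
    "scale": {"users": 25_000, "emails": 75_000_000, "monthly_cost": 182_000},
}


def unit_costs(tier: dict) -> tuple[float, float]:
    """Return (cost per user per month, cost per email) for one tier."""
    return tier["monthly_cost"] / tier["users"], tier["monthly_cost"] / tier["emails"]


for name, tier in TIERS.items():
    per_user, per_email = unit_costs(tier)
    print(f"{name}: ${per_user:.2f}/user/mo, ${per_email:.4f}/email")
```

On these figures the all-in cost falls from $12.00 to $9.60 to $7.28 per user per month, or roughly $0.004 to $0.0024 per email, with every tier assuming about 3,000 emails per user.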

Latency budget

Total P50: 4,900ms
Total P95: 9,670ms

Ingest + normalize: 200ms median · 700ms p95
Haiku classification: 700ms median · 1,400ms p95
Memory lookup (tone + VIP): 80ms median · 220ms p95
Sonnet draft generation: 3,200ms median · 5,800ms p95
Action extraction: 600ms median · 1,200ms p95
Write to review queue: 120ms median · 350ms p95
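
The totals in this budget are straight sums of the per-stage figures, which treats the stages as strictly sequential. A short check, with stage keys invented for this example, makes that explicit and shows where the time goes:

```python
# (median_ms, p95_ms) per stage, copied from the latency budget.
STAGES = {
    "ingest_normalize": (200, 700),
    "haiku_classification": (700, 1400),
    "memory_lookup": (80, 220),
    "sonnet_draft": (3200, 5800),
    "action_extraction": (600, 1200),
    "write_review_queue": (120, 350),
}

total_p50 = sum(median for median, _ in STAGES.values())
total_p95 = sum(p95 for _, p95 in STAGES.values())
draft_share = STAGES["sonnet_draft"][0] / total_p50

print(total_p50, total_p95, round(draft_share, 2))  # prints: 4900 9670 0.65
```

Draft generation accounts for about 65% of the median, so it is the stage to move off the critical path first. Adding per-stage p95 values is a planning figure, not the same thing as the measured p95 of the whole pipeline.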

Tradeoffs

Per-email vs batched drafting

Drafting on every inbound email feels magical but costs 3–5x as much. Batched drafting (once the user opens the review UI, draft the queue in parallel) cuts cost without hurting UX: users don't notice a 4s draft time when they're already in review mode.
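
A minimal sketch of drafting the queue in parallel when the review UI opens, using asyncio with a concurrency cap. The `draft_reply` function is a stand-in that simulates latency; a real version would call the drafter model, and the cap of 5 is an arbitrary assumption.

```python
import asyncio


async def draft_reply(email_id: str) -> str:
    """Stand-in for the drafter call; a real version would call the LLM API."""
    await asyncio.sleep(0.01)  # simulated model latency
    return f"draft for {email_id}"


async def draft_queue(email_ids: list[str], max_concurrency: int = 5) -> dict[str, str]:
    """Draft every queued email concurrently, capped by a semaphore."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(email_id: str) -> tuple[str, str]:
        async with semaphore:
            return email_id, await draft_reply(email_id)

    pairs = await asyncio.gather(*(bounded(e) for e in email_ids))
    return dict(pairs)


drafts = asyncio.run(draft_queue(["m1", "m2", "m3"]))
```

With the drafts running concurrently, wall-clock time for the batch is closer to one draft than to the sum of all of them, up to the concurrency cap and any provider rate limits.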

Personalization vs privacy

The more past replies you feed as examples, the better the tone match — but you're also sending that content to the LLM provider. Keep a local summary profile (user's tone, common phrases) and only send 3–5 example replies, not the full corpus.

Auto-send vs always-review

Auto-send on high-confidence responses (<5% of drafts) saves time but one public embarrassment kills trust forever. Stick with review-first until you have 100k+ approved drafts of training data and an allow-list of recipients.

Failure modes & guardrails

Agent drafts a commitment the user can't keep

Mitigation: Run a dedicated 'commitment extractor' pass on every draft. Any draft containing a date, number, or promise-to-do gets a badge in the review UI and never auto-sends regardless of confidence.
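
A cheap first layer for that pass could be pattern-based, as sketched below. The patterns and function names are illustrative assumptions and will miss many phrasings; they are meant to sit in front of an LLM-based extractor, not replace it.

```python
import re

# Illustrative patterns only; a production pass would pair these with an LLM check.
COMMITMENT_PATTERNS = {
    "date": re.compile(
        r"\b(?:mon|tues|wednes|thurs|fri|satur|sun)day\b|\btomorrow\b|\b\d{1,2}/\d{1,2}\b",
        re.IGNORECASE,
    ),
    "number": re.compile(r"\d"),
    "promise": re.compile(r"\b(?:i will|i'll|we will|we'll|i can|i promise)\b", re.IGNORECASE),
}


def commitment_flags(draft: str) -> list[str]:
    """Return the name of every commitment pattern the draft triggers."""
    return [name for name, pattern in COMMITMENT_PATTERNS.items() if pattern.search(draft)]


def needs_review_badge(draft: str) -> bool:
    """Any flag means the draft gets a badge and is excluded from auto-send."""
    return bool(commitment_flags(draft))
```

For example, `commitment_flags("I'll send the report by Friday.")` returns `["date", "promise"]`, so that draft would be badged.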

Wrong tone — too formal to a friend or too casual to a client

Mitigation: Cluster past replies by recipient into 3–5 tone buckets per user (friend/colleague/client/vendor). Classify recipient first, then prompt the drafter with examples from that bucket only.

Confidential info leaks into logs or LLM provider

Mitigation: Redact attachments and quoted replies below the current message. Use a zero-data-retention arrangement with the LLM provider (Anthropic's, in this stack). Keep a per-user PII blocklist: never send SSNs, card numbers, or health info, and block them with a regex plus a classifier.
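
The regex half of that guardrail might look like the sketch below. The two patterns are deliberately simple assumptions (US-style SSNs and 13 to 16 digit card numbers) and will both over- and under-match; health information has no regex shape and is left to the classifier.

```python
import re

# Deliberately simple patterns; they over- and under-match real data.
PII_PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Replace matches with a placeholder and report which kinds were found."""
    found = []
    for kind, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{kind} REDACTED]", text)
        if count:
            found.append(kind)
    return text, found
```

Running this before both logging and the LLM call keeps the two leak paths behind the same check.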

Classifier labels a critical email as spam

Mitigation: Maintain a VIP allowlist (CRM contacts, boss, past reply recipients) that forces P0. Run a nightly audit job that samples 50 archived-by-agent emails and flags any that got human replies later — retrain thresholds monthly.
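
Both halves of this mitigation are small enough to sketch. The record shape (a dict with a `human_replied_later` flag) and the function names are assumptions for illustration; the sample size of 50 comes from the text.

```python
import random
from typing import Optional


def apply_vip_override(priority: str, sender: str, vip_senders: set[str]) -> str:
    """Force P0 for allowlisted senders, whatever the classifier said."""
    return "P0" if sender.lower() in vip_senders else priority


def audit_sample(
    archived: list[dict], sample_size: int = 50, seed: Optional[int] = None
) -> list[dict]:
    """Sample agent-archived emails and return those the user later replied to."""
    rng = random.Random(seed)
    sample = rng.sample(archived, min(sample_size, len(archived)))
    return [email for email in sample if email["human_replied_later"]]
```

The share of flagged emails in each nightly sample gives a rough false-archive rate to track when adjusting thresholds.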

Agent replies to an unsubscribe or automated message

Mitigation: Hard-block replies to addresses matching no-reply patterns and to any message carrying a List-Unsubscribe header. Check the sender domain against a list of known automation platforms (Mailchimp, Intercom, etc.) before any draft is created.
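
A sketch of that pre-draft check follows. The no-reply pattern is an assumption, and the domain set uses placeholder `.example` names because the actual sending domains of automation platforms have to be collected and maintained separately.

```python
import re

# Assumed local-part patterns for automated senders.
NO_REPLY = re.compile(
    r"^(?:no[-_.]?reply|do[-_.]?not[-_.]?reply|mailer-daemon)@", re.IGNORECASE
)
# Placeholder domains, not a verified list of any platform's sending domains.
AUTOMATION_DOMAINS = {"bulk-mailer.example", "notifications.example"}


def is_automated(sender: str, headers: dict[str, str]) -> bool:
    """True if no draft should be created for this message."""
    header_names = {name.lower() for name in headers}
    domain = sender.rsplit("@", 1)[-1].lower()
    return (
        bool(NO_REPLY.match(sender))
        or "list-unsubscribe" in header_names
        or domain in AUTOMATION_DOMAINS
    )
```

Header names are compared case-insensitively because mail headers are case-insensitive.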

Frequently asked questions

How much does an email triage agent cost per user per month?

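Going by the cost tables above, roughly $7–$10 per user per month at about 3,000 emails per user: $182,000 across 25k users at scale is about $7.30 each, and $4,800 across 500 users at startup is $9.60. Classification alone runs about $0.001 per email, so a heavy user with 10k inbound emails a month costs around $10 for classification only and more with drafting. Most users don't need drafts on every email.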

Claude or GPT-4o for tone matching?

Claude Sonnet 4 wins for tone consistency across threads — it's noticeably better at not shifting register mid-reply. GPT-4o drafts feel slightly 'cleaner' but also more generic. Gemini 2.5 Pro lags on tone but is cheaper.

Can the agent actually auto-send replies safely?

Rarely. In production, roughly 95% of deployments are drafts-only with a batched review UI. If you must auto-send, restrict to: (1) recipients on a user-specific allowlist, (2) no numbers/dates/commitments in the draft, (3) confidence > 0.9, (4) similarity to a past user-approved reply > 0.85.
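
The four conditions combine into a single gate in which any failure falls back to human review. In the sketch below the thresholds are the ones stated above; the `DraftCandidate` fields and how confidence and similarity are computed are assumptions.

```python
from dataclasses import dataclass


@dataclass
class DraftCandidate:
    recipient: str
    confidence: float              # model confidence in the draft, 0..1
    similarity_to_approved: float  # similarity to nearest user-approved reply, 0..1
    has_commitment: bool           # any number, date, or promise detected


def may_auto_send(draft: DraftCandidate, allowlist: set[str]) -> bool:
    """All four conditions must hold; otherwise the draft goes to review."""
    return (
        draft.recipient.lower() in allowlist
        and not draft.has_commitment
        and draft.confidence > 0.9
        and draft.similarity_to_approved > 0.85
    )
```

Because the conditions are joined with `and`, adding a further condition can only make the gate stricter.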

How do you handle long email threads that exceed context?

Two-pass: first a summarization pass on the oldest messages (Haiku 4), then include the summary plus the last 3 messages verbatim. Don't rely on 1M-context models for this — thread coherence matters more than full recall.
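
A minimal sketch of the two-pass context builder. The `summarize` callable stands in for the Haiku 4 summarization call and is injected so the function can be tested without a model; the separator and the "Summary of earlier messages" label are arbitrary choices.

```python
from typing import Callable


def build_thread_context(
    messages: list[str],
    summarize: Callable[[list[str]], str],
    keep_verbatim: int = 3,
) -> str:
    """Summarize older messages and keep the newest ones word for word.

    Assumes messages are ordered oldest first and keep_verbatim >= 1.
    """
    older, recent = messages[:-keep_verbatim], messages[-keep_verbatim:]
    parts = []
    if older:
        parts.append("Summary of earlier messages:\n" + summarize(older))
    parts.extend(recent)
    return "\n\n---\n\n".join(parts)
```

Threads of three messages or fewer skip the summarization call entirely, so short threads pay no extra cost.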

Does it integrate with Gmail and Outlook both?

Yes — Gmail API and Microsoft Graph cover ~95% of business mail. Nylas is a worthwhile abstraction if you want unified primitives but adds latency and cost. IMAP is a last resort for niche providers.

What happens on a flood of emails (e.g. mailing list thread)?

Dedupe classification by thread — only classify the newest message in a thread per 10-minute window. Rate-limit drafts to N per user per minute. This prevents a reply-all chain from costing $20 in one afternoon.
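
The per-thread window can be implemented as a debounce: hold the newest message for each thread and release it for classification only once the window has elapsed. The class below is a sketch with invented names and an in-memory dict; a production version would keep this state in the orchestrator or a shared store, and the per-user draft rate limit would be a separate check.

```python
class ThreadDebouncer:
    """Release only the newest message per thread once its window has elapsed."""

    def __init__(self, window_seconds: float = 600.0):
        self.window_seconds = window_seconds
        # thread_id -> (window start time, newest message id seen so far)
        self._pending: dict[str, tuple[float, str]] = {}

    def add(self, thread_id: str, message_id: str, now: float) -> None:
        """Record an inbound message; later messages overwrite earlier ones."""
        start, _ = self._pending.get(thread_id, (now, message_id))
        self._pending[thread_id] = (start, message_id)

    def flush(self, now: float) -> list[str]:
        """Return the message ids that are due for classification."""
        due = [
            thread_id
            for thread_id, (start, _) in self._pending.items()
            if now - start >= self.window_seconds
        ]
        return [self._pending.pop(thread_id)[1] for thread_id in due]
```

A reply-all chain of twenty messages inside ten minutes then produces one classification call instead of twenty.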

Is there a real product using this pattern?

Superhuman AI, Shortwave, and 21st.dev all ship variants. The differences are UI (keyboard-driven vs chat) and how aggressively they auto-execute actions. Most settle on drafts + one-key approval as the sweet spot.
