Reference Architecture · agent
Email Triage Agent
Last updated: April 16, 2026
Quick answer
The production stack uses Claude Haiku 4 as the classifier on every inbound email, Claude Sonnet 4 as the drafter for replies, and a strict human-in-the-loop before anything sends. Expect $0.002–$0.01 per email processed and $0.02–$0.05 per drafted reply, with 1–3s latency for classification and 3–6s for a draft. Never run auto-send without confidence thresholds and allow-lists.
The problem
Knowledge workers get 100–300 emails a day and spend 2+ hours on inbox triage. You need an agent that classifies priority, extracts action items, drafts replies for common threads, and never, ever auto-sends something embarrassing. The hard parts are latency on long threads, handling reply-all etiquette, and avoiding hallucinated commitments.
Architecture
Mail Ingest
Watches Gmail or Outlook via push subscription (Pub/Sub or Graph webhook). Pulls full thread context per message.
Alternatives: IMAP polling, Nylas API, Microsoft Graph
Thread Normalizer
Strips quoted replies, signatures, legal footers; resolves sender identity against CRM; extracts attachments.
Alternatives: mailparser, EmailReplyParser, Custom regex + LLM fallback
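As a rough illustration of the "custom regex" option for the normalizer, the sketch below drops quoted history and a trailing signature from a plain-text body. The patterns and the function name `strip_quoted_and_signature` are assumptions for illustration; real mail clients format quotes in many more ways, which is why an LLM fallback is listed alongside.

```python
import re

# Heuristic patterns only; mail clients vary, so treat these as a starting point.
QUOTE_HEADER = re.compile(r"^On .+ wrote:\s*$")   # e.g. "On Mon, ... Ann wrote:"
SIGNATURE_DELIM = re.compile(r"^-- ?$")           # conventional "-- " signature marker


def strip_quoted_and_signature(body: str) -> str:
    """Keep only the newest message text from a plain-text email body."""
    kept = []
    for line in body.splitlines():
        if QUOTE_HEADER.match(line) or SIGNATURE_DELIM.match(line):
            break  # everything below is quoted history or a signature
        if line.lstrip().startswith(">"):
            continue  # skip inline quoted lines
        kept.append(line)
    return "\n".join(kept).strip()
```

Legal footers and HTML bodies are not handled here; those would need their own rules or the fallback path.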
Priority Classifier
Fast model scores each email on priority (P0–P3), intent (ask/info/spam/newsletter), and action required.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
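One way to keep the classifier's output safe to act on is to validate it before anything downstream reads it. The sketch below assumes the model is prompted to return a JSON object with `priority`, `intent`, and `action_required` keys; those field names and the `Classification` type are hypothetical, chosen to mirror the three scores described above.

```python
import json
from dataclasses import dataclass

PRIORITIES = {"P0", "P1", "P2", "P3"}
INTENTS = {"ask", "info", "spam", "newsletter"}


@dataclass(frozen=True)
class Classification:
    priority: str
    intent: str
    action_required: bool


def parse_classification(raw: str) -> Classification:
    """Parse and validate the classifier's JSON reply; raise ValueError if malformed."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    priority = data.get("priority")
    intent = data.get("intent")
    action = data.get("action_required")
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    if intent not in INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")
    if not isinstance(action, bool):
        raise ValueError("action_required must be true or false")
    return Classification(priority, intent, action)
```

Rejecting anything outside the known label sets means a malformed reply can be retried or routed to review rather than silently mislabeled.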
User Context Store
Stores user VIP list, project context, tone examples, past replies for few-shot prompting.
Alternatives: Supabase, Pinecone namespace per user
Reply Drafter
On P0/P1 that needs reply, generates a draft in the user's tone using retrieved past replies.
Alternatives: GPT-4o, Gemini 2.5 Pro
Action Extractor
Pulls commitments, deadlines, and tasks into a structured list for a todo system.
Alternatives: GPT-4o-mini, Structured-output call on Sonnet
Human Review Panel
Web/mobile UI where user approves drafts, sees priority queue, and can send or edit.
Alternatives: Slack review bot, iMessage approval flow
Send/Label/Archive
Once approved, writes back: sends reply, applies labels, archives newsletters, snoozes low priority.
Alternatives: Gmail API, Outlook Graph, Superhuman integration
The stack
Runs on every inbound email — cost matters. Haiku 4 classifies priority + intent in <1s for $0.001.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
Tone matching and multi-turn context are where Sonnet 4 excels. Worth the cost only on drafts, not every email.
Alternatives: GPT-4o, Gemini 2.5 Pro
Native APIs give push notifications and label/thread primitives. Nylas is a nice abstraction if you need both clouds.
Alternatives: Nylas, IMAP/SMTP
Per-user isolation via RLS matters for privacy. pgvector is plenty fast for <10k vectors per user.
Alternatives: Pinecone, Qdrant
Email is naturally event-driven with retries. Inngest handles dedup, retries, and fanout cleanly — better fit than a chat-style agent loop.
Alternatives: Temporal, LangGraph
Batched review once or twice a day beats per-email notification spam. Keyboard-driven review UI (j/k/enter) is the magic.
Alternatives: iOS Shortcut, Slack bot
Cost at each scale
Prototype
1 user, ~3,000 emails/mo
$12/mo
Startup
500 users, ~1.5M emails/mo
$4,800/mo
Scale
25,000 users, ~75M emails/mo
$182,000/mo
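To compare the tiers, it can help to reduce each one to a per-email and per-user figure. The snippet below only does arithmetic on the numbers listed above; the `TIERS` table and `unit_costs` helper are illustrative names.

```python
# (users, emails per month, monthly cost in USD), copied from the tiers above.
TIERS = {
    "Prototype": (1, 3_000, 12),
    "Startup": (500, 1_500_000, 4_800),
    "Scale": (25_000, 75_000_000, 182_000),
}


def unit_costs(users: int, emails: int, monthly_cost: float) -> tuple[float, float]:
    """Return (cost per email, cost per user per month)."""
    return monthly_cost / emails, monthly_cost / users


for name, tier in TIERS.items():
    per_email, per_user = unit_costs(*tier)
    print(f"{name}: ${per_email:.4f}/email, ${per_user:.2f}/user/mo")
```

Every tier assumes about 3,000 emails per user per month, so the per-user figure falls from $12.00 to $9.60 to $7.28 as the per-email cost drops from $0.0040 to about $0.0024.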
Tradeoffs
Per-email vs batched drafting
Drafting on every inbound email feels magical but costs 3–5x as much. Batched drafting (drafting the queue in parallel once the user opens the review UI) cuts cost without hurting UX: users don't notice a 4s draft time when they're already in review mode.
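A minimal sketch of the batched approach, assuming an async `draft_one` callable that wraps the drafter model call (hypothetical; it is passed in so the sketch stays provider-agnostic). A semaphore caps concurrency so a long queue does not trip provider rate limits.

```python
import asyncio
from typing import Awaitable, Callable


async def draft_queue(
    email_ids: list[str],
    draft_one: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> dict[str, str]:
    """Draft replies for every queued email in parallel, bounded by a semaphore."""
    sem = asyncio.Semaphore(max_concurrency)

    async def _guarded(email_id: str) -> tuple[str, str]:
        async with sem:  # at most max_concurrency drafts in flight
            return email_id, await draft_one(email_id)

    results = await asyncio.gather(*(_guarded(e) for e in email_ids))
    return dict(results)
```

With drafts running concurrently, the wall-clock wait for the whole queue is closer to a few draft latencies than to the sum of all of them.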
Personalization vs privacy
The more past replies you feed as examples, the better the tone match — but you're also sending that content to the LLM provider. Keep a local summary profile (user's tone, common phrases) and only send 3–5 example replies, not the full corpus.
Auto-send vs always-review
Auto-send on high-confidence responses (<5% of drafts) saves time but one public embarrassment kills trust forever. Stick with review-first until you have 100k+ approved drafts of training data and an allow-list of recipients.
Failure modes & guardrails
Agent drafts a commitment the user can't keep
Mitigation: Run a dedicated 'commitment extractor' pass on every draft. Any draft containing a date, number, or promise-to-do gets a badge in the review UI and never auto-sends regardless of confidence.
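A cheap rule-based pre-filter can catch the obvious cases before (or alongside) a model-based commitment extractor. The sketch below is that pre-filter only, not the extractor itself; the patterns and the `commitment_flags` name are assumptions, and they deliberately over-flag because a false badge is cheaper than a missed promise.

```python
import re

NUMBER = re.compile(r"\d")
DATE_WORD = re.compile(
    r"\b(?:monday|tuesday|wednesday|thursday|friday|saturday|sunday|"
    r"today|tonight|tomorrow|next week|next month|"
    r"january|february|march|april|may|june|july|august|"
    r"september|october|november|december)\b",
    re.IGNORECASE,
)
# Straight apostrophes only; normalize curly quotes before calling.
PROMISE = re.compile(r"\b(?:i will|i'll|we will|we'll|i promise|i can|we can)\b", re.IGNORECASE)


def commitment_flags(draft: str) -> list[str]:
    """Return which commitment signals appear in a draft (empty list means none)."""
    flags = []
    if NUMBER.search(draft):
        flags.append("number")
    if DATE_WORD.search(draft):
        flags.append("date")
    if PROMISE.search(draft):
        flags.append("promise")
    return flags
```

Any non-empty result would add the review badge and disqualify the draft from auto-send.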
Wrong tone — too formal to a friend or too casual to a client
Mitigation: Cluster past replies by recipient into 3–5 tone buckets per user (friend/colleague/client/vendor). Classify recipient first, then prompt the drafter with examples from that bucket only.
Confidential info leaks into logs or LLM provider
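Mitigation: Redact attachments and quoted replies below the current message. Request zero data retention from Anthropic (typically an account-level arrangement rather than a separate endpoint, so confirm it is enabled). Keep a per-user PII blocklist: never send SSNs, card numbers, or health info, and block them with regex plus a classifier.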
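The regex half of that PII block might look like the sketch below, which masks US-style SSNs and card-like digit runs that pass a Luhn check. The patterns and placeholder strings are assumptions; health information has no reliable regex and is left to the classifier.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum over a string of digits."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact_pii(text: str) -> str:
    """Mask SSNs and Luhn-valid card numbers before text leaves the system."""
    text = SSN.sub("[REDACTED-SSN]", text)

    def _mask_card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED-CARD]" if luhn_valid(digits) else match.group()

    return CARD_CANDIDATE.sub(_mask_card, text)
```

The Luhn check keeps order numbers and phone-like digit runs from being masked unnecessarily.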
Classifier labels a critical email as spam
Mitigation: Maintain a VIP allowlist (CRM contacts, boss, past reply recipients) that forces P0. Run a nightly audit job that samples 50 archived-by-agent emails and flags any that later got human replies, then retune thresholds monthly based on those flags.
Agent replies to an unsubscribe or automated message
Mitigation: Hard-block replies to addresses matching no-reply patterns and to any message carrying a List-Unsubscribe header. Check the sender domain against a list of known automation platforms (Mailchimp, Intercom, etc.) before any draft is created.
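A sketch of that pre-draft check using the standard library's email types. The no-reply pattern and the entries in `AUTOMATION_DOMAINS` are placeholders for illustration, not a vetted list of platform sending domains.

```python
import re
from email.message import EmailMessage
from email.utils import parseaddr

NO_REPLY = re.compile(r"^(?:no[-_.]?reply|do[-_.]?not[-_.]?reply)", re.IGNORECASE)
# Placeholder entries; a real list would hold the platforms' actual sending domains.
AUTOMATION_DOMAINS = {"mailchimp.com", "intercom.io"}


def is_automated(msg: EmailMessage) -> bool:
    """Return True if no reply draft should be created for this message."""
    _, address = parseaddr(msg.get("From", ""))
    local, _, domain = address.lower().partition("@")
    if NO_REPLY.match(local):
        return True
    if msg.get("List-Unsubscribe") is not None:  # bulk or list mail
        return True
    return domain in AUTOMATION_DOMAINS or any(
        domain.endswith("." + d) for d in AUTOMATION_DOMAINS
    )
```

Running this before the drafter is invoked means automated mail never incurs draft cost at all.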
Frequently asked questions
How much does an email triage agent cost per user per month?
The cost tiers above work out to roughly $7–$12 per user per month at about 3,000 emails per user: $182,000 across 25k users is about $7.28 at scale, and the single-user prototype is $12. Full drafting costs several times more than classification-only, and most users don't need drafts on every email.
Claude or GPT-4o for tone matching?
Claude Sonnet 4 wins for tone consistency across threads — it's noticeably better at not shifting register mid-reply. GPT-4o drafts feel slightly 'cleaner' but also more generic. Gemini 2.5 Pro lags on tone but is cheaper.
Can the agent actually auto-send replies safely?
Rarely. The 95% pattern in production is drafts-only with a batched review UI. If you must auto-send, restrict to: (1) recipients on a user-specific allowlist, (2) no numbers/dates/commitments in the draft, (3) confidence > 0.9, (4) similarity to a past user-approved reply > 0.85.
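The four conditions combine into a single gate that must pass in full. The sketch below assumes the confidence and similarity scores are computed elsewhere; the `DraftSignals` type and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftSignals:
    recipient: str
    has_commitment: bool           # numbers, dates, or promises found in the draft
    confidence: float              # drafter or classifier confidence, 0 to 1
    similarity_to_approved: float  # similarity to a past user-approved reply, 0 to 1


def may_auto_send(
    signals: DraftSignals,
    allowlist: set[str],
    min_confidence: float = 0.9,
    min_similarity: float = 0.85,
) -> bool:
    """All four conditions must hold; any failure routes the draft to review."""
    return (
        signals.recipient.lower() in allowlist
        and not signals.has_commitment
        and signals.confidence > min_confidence
        and signals.similarity_to_approved > min_similarity
    )
```

The thresholds are strict inequalities, matching the "greater than" wording above, so a draft at exactly 0.9 confidence still goes to review.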
How do you handle long email threads that exceed context?
Two-pass: first a summarization pass on the oldest messages (Haiku 4), then include the summary plus the last 3 messages verbatim. Don't rely on 1M-context models for this — thread coherence matters more than full recall.
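The two-pass approach can be sketched as a small context builder. The `summarize` callable stands in for the Haiku summarization call and is an assumption, as are the "Message N" labels used in the assembled prompt text.

```python
from typing import Callable


def build_thread_context(
    messages: list[str],
    summarize: Callable[[list[str]], str],
    keep_verbatim: int = 3,
) -> str:
    """Summarize older messages and append the most recent ones verbatim.

    `messages` is ordered oldest to newest.
    """
    if keep_verbatim < 1:
        raise ValueError("keep_verbatim must be at least 1")
    older, recent = messages[:-keep_verbatim], messages[-keep_verbatim:]
    parts = []
    if older:  # the summarization pass only runs when the thread is long enough
        parts.append("Summary of earlier messages:\n" + summarize(older))
    for offset, text in enumerate(recent, start=len(older) + 1):
        parts.append(f"Message {offset}:\n{text}")
    return "\n\n".join(parts)
```

Short threads skip the summarization call entirely, so the extra pass only costs anything on threads longer than three messages.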
Does it integrate with Gmail and Outlook both?
Yes — Gmail API and Microsoft Graph cover ~95% of business mail. Nylas is a worthwhile abstraction if you want unified primitives but adds latency and cost. IMAP is a last resort for niche providers.
What happens on a flood of emails (e.g. mailing list thread)?
Dedupe classification by thread — only classify the newest message in a thread per 10-minute window. Rate-limit drafts to N per user per minute. This prevents a reply-all chain from costing $20 in one afternoon.
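One reading of the per-thread rule is a debounce: remember only the newest message per thread and classify it when the thread's window closes. The sketch below takes that interpretation; the class name and in-memory state are assumptions, and a production version would keep this state in the workflow engine or a shared store.

```python
class ThreadDebouncer:
    """Classify at most one message (the newest) per thread per time window."""

    def __init__(self, window_seconds: float = 600.0) -> None:
        self.window = window_seconds
        # thread_id -> (window start time, newest message id seen so far)
        self._pending: dict[str, tuple[float, str]] = {}

    def add(self, thread_id: str, message_id: str, now: float) -> None:
        """Record an arrival; later messages replace earlier ones in the window."""
        start, _ = self._pending.get(thread_id, (now, message_id))
        self._pending[thread_id] = (start, message_id)

    def flush(self, now: float) -> list[str]:
        """Return the newest message id for every thread whose window has elapsed."""
        due = [t for t, (start, _) in self._pending.items() if now - start >= self.window]
        return [self._pending.pop(t)[1] for t in due]
```

The cost of this design is that classification is delayed by up to the window length, so senders on the VIP list may warrant bypassing the debouncer.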
Is there a real product using this pattern?
Superhuman AI and Shortwave both ship variants. The differences are UI (keyboard-driven vs chat) and how aggressively they auto-execute actions. Most settle on drafts + one-key approval as the sweet spot.