Reference Architecture · agent

Email Triage Agent

Last updated: April 16, 2026

Quick answer

The production stack uses Claude Haiku 4 as the classifier on every inbound email, Claude Sonnet 4 as the drafter for replies, and a strict human-in-the-loop review before anything sends. Expect $0.002–$0.01 per email processed and $0.02–$0.05 per drafted reply, with roughly 1–2s to ingest and classify an email and 3–6s to generate a draft. Never run auto-send without confidence thresholds and allow-lists.

The problem

Knowledge workers get 100–300 emails a day and spend 2+ hours on inbox triage. You need an agent that classifies priority, extracts action items, drafts replies for common threads, and never, ever auto-sends something embarrassing. The hard parts are latency on long threads, handling reply-all etiquette, and avoiding hallucinated commitments.

Architecture

Mail Ingest

Watches Gmail or Outlook via push subscription (Pub/Sub or Graph webhook). Pulls full thread context per message.

Alternatives: IMAP polling, Nylas API, Microsoft Graph

Thread Normalizer

Strips quoted replies, signatures, legal footers; resolves sender identity against CRM; extracts attachments.

Alternatives: mailparser, EmailReplyParser, Custom regex + LLM fallback
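
As a rough illustration of the normalizer's first step, the sketch below strips quoted history and a signature from a plain-text body. The two regex markers and the function name `strip_quoted_and_signature` are assumptions made for this example; real mail is messier, which is why the alternatives above pair a parser with an LLM fallback.

```python
import re

# Heuristic markers (assumed for illustration, not exhaustive).
QUOTE_HEADER = re.compile(r"^On .+ wrote:\s*$")   # e.g. "On Mon, ... Bob wrote:"
SIGNATURE_DELIM = re.compile(r"^-- ?$")           # conventional "-- " delimiter


def strip_quoted_and_signature(body: str) -> str:
    """Return only the newest message text from a plain-text email body."""
    kept = []
    for line in body.splitlines():
        if QUOTE_HEADER.match(line) or SIGNATURE_DELIM.match(line):
            break  # everything below is quoted history or a signature
        if line.lstrip().startswith(">"):
            continue  # drop inline quoted lines
        kept.append(line)
    return "\n".join(kept).strip()
```

Legal footers and HTML bodies are not handled here; those usually need provider-specific rules.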

Priority Classifier

Fast model scores each email on priority (P0–P3), intent (ask/info/spam/newsletter), and action required.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash
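
One way to keep the classifier's output trustworthy is to validate it against a fixed schema before anything downstream reads it. The sketch below assumes the model is prompted to return JSON with `priority`, `intent`, and `action_required` keys; those key names and the `Classification` type are hypothetical, chosen to mirror the fields described above.

```python
import json
from dataclasses import dataclass

PRIORITIES = {"P0", "P1", "P2", "P3"}
INTENTS = {"ask", "info", "spam", "newsletter"}


@dataclass(frozen=True)
class Classification:
    priority: str
    intent: str
    action_required: bool


def parse_classification(raw: str) -> Classification:
    """Parse and validate the classifier's JSON output; raise on anything unexpected."""
    data = json.loads(raw)
    priority = data.get("priority")
    intent = data.get("intent")
    action = data.get("action_required")
    if priority not in PRIORITIES:
        raise ValueError(f"unknown priority: {priority!r}")
    if intent not in INTENTS:
        raise ValueError(f"unknown intent: {intent!r}")
    if not isinstance(action, bool):
        raise ValueError("action_required must be a JSON boolean")
    return Classification(priority, intent, action)
```

Rejecting malformed output outright (rather than guessing a default) lets the orchestrator retry the call.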

User Context Store

Stores user VIP list, project context, tone examples, past replies for few-shot prompting.

Alternatives: Supabase, Pinecone namespace per user

Reply Drafter

For P0/P1 emails that need a reply, generates a draft in the user's tone using retrieved past replies.

Alternatives: GPT-4o, Gemini 2.5 Pro

Action Extractor

Pulls commitments, deadlines, and tasks into a structured list for a todo system.

Alternatives: GPT-4o-mini, Structured-output call on Sonnet

Human Review Panel

Web/mobile UI where user approves drafts, sees priority queue, and can send or edit.

Alternatives: Slack review bot, iMessage approval flow

Send/Label/Archive

Once approved, writes back: sends reply, applies labels, archives newsletters, snoozes low priority.

Alternatives: Gmail API, Outlook Graph, Superhuman integration

The stack

Priority classifier: Claude Haiku 4

Runs on every inbound email, so cost matters. Haiku 4 classifies priority + intent in under 1s at the median for about $0.001 per email.

Alternatives: GPT-4o-mini, Gemini 2.0 Flash

Reply drafter: Claude Sonnet 4

Tone matching and multi-turn context are where Sonnet 4 excels. Worth the cost only on drafts, not every email.

Alternatives: GPT-4o, Gemini 2.5 Pro

Mail API: Gmail API + Microsoft Graph

Native APIs give push notifications and label/thread primitives. Nylas is a nice abstraction if you need both clouds.

Alternatives: Nylas, IMAP/SMTP

User memory store: Supabase + pgvector

Per-user isolation via RLS matters for privacy. pgvector is plenty fast for <10k vectors per user.

Alternatives: Pinecone, Qdrant

Orchestration: Custom event-driven (Inngest or Trigger.dev)

Email is naturally event-driven with retries. Inngest handles dedup, retries, and fanout cleanly — better fit than a chat-style agent loop.

Alternatives: Temporal, LangGraph

Human review UI: Next.js + shadcn

Batched review once or twice a day beats per-email notification spam. Keyboard-driven review UI (j/k/enter) is the magic.

Alternatives: iOS Shortcut, Slack bot

Cost at each scale

Prototype

1 user, ~3,000 emails/mo

$12/mo

Haiku 4 classification: $4

Sonnet 4 drafts (~300 replies): $6

Supabase (free tier): $0

Inngest (free tier): $0

Hosting (Vercel Hobby): $0

Observability: $2

Startup

500 users, ~1.5M emails/mo

$4,800/mo

Haiku 4 classification: $1,800

Sonnet 4 drafts (~150k drafts): $2,100

Supabase Pro: $120

Inngest Pro: $200

Infra (Vercel Pro): $180

Observability (Langfuse): $400

Scale

25,000 users, ~75M emails/mo

$182,000/mo

Haiku 4 classification (cached): $75,000

Sonnet 4 drafts (~7M, cached): $85,000

Supabase Enterprise: $8,000

Orchestration (Temporal Cloud): $4,500

Infra: $6,000

Observability + evals: $3,500
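
To compare the tiers on equal footing, it can help to reduce each monthly total to a per-user and per-email figure. The short calculation below uses only the totals, user counts, and email volumes listed above; the names `TIERS` and `unit_costs` are made up for this sketch.

```python
# Monthly totals, user counts, and email volumes from the three tiers above.
TIERS = {
    "prototype": {"users": 1, "emails": 3_000, "total_usd": 12},
    "startup": {"users": 500, "emails": 1_500_000, "total_usd": 4_800},
    "scale": {"users": 25_000, "emails": 75_000_000, "total_usd": 182_000},
}


def unit_costs(tier: str) -> tuple[float, float]:
    """Return (USD per user per month, USD per email) for a tier."""
    t = TIERS[tier]
    return t["total_usd"] / t["users"], t["total_usd"] / t["emails"]


for name in TIERS:
    per_user, per_email = unit_costs(name)
    print(f"{name}: ${per_user:.2f}/user/mo, ${per_email:.4f}/email")
```

On these figures the all-in cost per email falls from $0.0040 at the prototype tier to about $0.0024 at scale, and the per-user cost falls from $12.00 to $7.28.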

Latency budget

Total P50: 4,900ms

Total P95: 9,670ms

Ingest + normalize

200ms · 700ms p95

Haiku classification

700ms · 1400ms p95

Memory lookup (tone + VIP)

80ms · 220ms p95

Sonnet draft generation

3200ms · 5800ms p95

Action extraction

600ms · 1200ms p95

Write to review queue

120ms · 350ms p95
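
The totals are simply the per-stage figures added up, which the sketch below reproduces. Note that adding per-stage p95 values gives a planning budget rather than a measured end-to-end p95; the measured tail is usually lower unless the stages slow down together. The names `STAGES_MS` and `totals` are illustrative.

```python
# (stage, median ms, p95 ms) copied from the latency budget above.
STAGES_MS = [
    ("Ingest + normalize", 200, 700),
    ("Haiku classification", 700, 1400),
    ("Memory lookup (tone + VIP)", 80, 220),
    ("Sonnet draft generation", 3200, 5800),
    ("Action extraction", 600, 1200),
    ("Write to review queue", 120, 350),
]


def totals(stages):
    """Sum the median and p95 columns across all stages."""
    p50 = sum(median for _, median, _ in stages)
    p95 = sum(tail for _, _, tail in stages)
    return p50, p95


print(totals(STAGES_MS))  # (4900, 9670)
```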

Tradeoffs

Per-email vs batched drafting

Drafting on every inbound email feels magical but costs 3–5x as much. Batched drafting (drafting the queue in parallel once the user opens the review UI) cuts cost without hurting UX: users don't notice a 4s draft time when they're already in review mode.
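
A minimal sketch of drafting the queue in parallel might look like the following. `draft_reply` is a stand-in for the real drafter call and simply sleeps; the concurrency cap of 5 is an arbitrary assumption to avoid hammering the provider's rate limits.

```python
import asyncio


async def draft_reply(email_id: str) -> str:
    """Stand-in for the drafter model call (hypothetical; sleeps instead)."""
    await asyncio.sleep(0.01)
    return f"draft for {email_id}"


async def draft_queue(email_ids: list[str], max_concurrency: int = 5) -> dict[str, str]:
    """Draft every queued email concurrently, capped at max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def one(email_id: str) -> tuple[str, str]:
        async with sem:
            return email_id, await draft_reply(email_id)

    pairs = await asyncio.gather(*(one(e) for e in email_ids))
    return dict(pairs)


if __name__ == "__main__":
    print(asyncio.run(draft_queue(["a", "b", "c"])))
```

Because the drafts run concurrently, the wall-clock wait is closer to one draft's latency than to the sum across the queue, up to the concurrency cap.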

Personalization vs privacy

The more past replies you feed as examples, the better the tone match — but you're also sending that content to the LLM provider. Keep a local summary profile (user's tone, common phrases) and only send 3–5 example replies, not the full corpus.

Auto-send vs always-review

Auto-send on high-confidence responses (<5% of drafts) saves time but one public embarrassment kills trust forever. Stick with review-first until you have 100k+ approved drafts of training data and an allow-list of recipients.

Failure modes & guardrails

Agent drafts a commitment the user can't keep

Mitigation: Run a dedicated 'commitment extractor' pass on every draft. Any draft containing a date, number, or promise-to-do gets a badge in the review UI and never auto-sends regardless of confidence.
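
A cheap first layer of such a pass can be pattern-based, with a model call behind it for anything subtler. The patterns below are deliberately crude assumptions (English only, a handful of date and promise phrasings); `commitment_flags` and `can_auto_send` are hypothetical names.

```python
import re

# Crude English-only heuristics, assumed for illustration.
COMMITMENT_PATTERNS = {
    "date": re.compile(
        r"\b(?:mon|tues|wednes|thurs|fri|satur|sun)day\b"
        r"|\btomorrow\b"
        r"|\b\d{1,2}/\d{1,2}(?:/\d{2,4})?\b"
        r"|\b(?:jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.? \d{1,2}\b",
        re.IGNORECASE,
    ),
    "number": re.compile(r"\d"),
    "promise": re.compile(r"\b(?:i|we)(?:'ll| will| can| promise)\b", re.IGNORECASE),
}


def commitment_flags(draft: str) -> list[str]:
    """Return the names of every commitment pattern found in the draft."""
    return [name for name, pattern in COMMITMENT_PATTERNS.items() if pattern.search(draft)]


def can_auto_send(draft: str) -> bool:
    """A flagged draft never auto-sends, regardless of model confidence."""
    return not commitment_flags(draft)
```

The returned flag names can drive the badge shown in the review UI.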

Wrong tone — too formal to a friend or too casual to a client

Mitigation: Cluster past replies by recipient into 3–5 tone buckets per user (friend/colleague/client/vendor). Classify recipient first, then prompt the drafter with examples from that bucket only.
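
Once the recipient's bucket is known, selecting few-shot examples is a simple filter. The sketch below assumes past replies are stored oldest-first as dicts with `bucket` and `text` keys; that record shape and the name `pick_examples` are assumptions for illustration.

```python
TONE_BUCKETS = ("friend", "colleague", "client", "vendor")


def pick_examples(recipient_bucket: str, past_replies: list[dict], k: int = 3) -> list[str]:
    """Return up to k of the most recent past replies from the recipient's tone bucket."""
    if recipient_bucket not in TONE_BUCKETS:
        raise ValueError(f"unknown tone bucket: {recipient_bucket!r}")
    matching = [r["text"] for r in past_replies if r["bucket"] == recipient_bucket]
    return matching[-k:]  # list is assumed to be in chronological order
```

Keeping `k` small also limits how much past correspondence is sent to the model provider.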

Confidential info leaks into logs or LLM provider

Mitigation: Redact attachments and any quoted replies below the current message. Use Anthropic's zero-data-retention option. Keep a per-user PII allowlist, and never send SSNs, card numbers, or health info: block them with regex plus a classifier.
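
The regex half of that block could be sketched as follows. The SSN pattern covers only the dashed US format, and card candidates are confirmed with a Luhn checksum to cut false positives on order numbers and the like. Health information has no reliable regex and is left to the classifier. The placeholder strings and function names are assumptions.

```python
import re

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def redact(text: str) -> str:
    """Replace SSNs and Luhn-valid card numbers before text leaves the system."""
    text = SSN.sub("[REDACTED-SSN]", text)

    def _card(match: re.Match) -> str:
        digits = re.sub(r"\D", "", match.group())
        return "[REDACTED-CARD]" if luhn_ok(digits) else match.group()

    return CARD.sub(_card, text)
```

The same function should run on anything written to logs, not only on prompts.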

Classifier labels a critical email as spam

Mitigation: Maintain a VIP allowlist (CRM contacts, boss, past reply recipients) that forces P0. Run a nightly audit job that samples 50 archived-by-agent emails and flags any that got human replies later — retrain thresholds monthly.

Agent replies to an unsubscribe or automated message

Mitigation: Hard-block replies to addresses matching no-reply patterns and to messages carrying a List-Unsubscribe header. Check the sender domain against a list of known automation platforms (Mailchimp, Intercom, etc.) before any draft is created.
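
A pre-draft gate along these lines might look like the sketch below. The local-part patterns and the two domains in `AUTOMATION_DOMAINS` are illustrative placeholders, not a vetted list, and `should_skip_draft` is a hypothetical name.

```python
import re

# Illustrative patterns and domains only; a real list needs ongoing curation.
NO_REPLY_LOCAL = re.compile(
    r"^(?:no[-_.]?reply|do[-_.]?not[-_.]?reply|mailer-daemon|bounces?)", re.IGNORECASE
)
AUTOMATION_DOMAINS = {"mailchimp.com", "intercom.io"}


def should_skip_draft(sender: str, headers: dict[str, str]) -> bool:
    """Return True if no draft should ever be created for this message."""
    local, _, domain = sender.lower().partition("@")
    if NO_REPLY_LOCAL.match(local):
        return True
    # Header names are case-insensitive, so compare in lowercase.
    if any(name.lower() == "list-unsubscribe" for name in headers):
        return True
    return any(domain == d or domain.endswith("." + d) for d in AUTOMATION_DOMAINS)
```

Running this check before the drafter is invoked also saves the cost of a draft that would be discarded anyway.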

Frequently asked questions

How much does an email triage agent cost per user per month?

About $7 per user per month at scale, going by the cost table above: $182,000/mo across 25,000 users (75M emails) works out to $7.28 per user, and the 500-user tier to $9.60. Classification is the cheap part at about $0.001 per email on Haiku 4; drafting drives the bill, and most users don't need drafts on every email.

Claude or GPT-4o for tone matching?

Claude Sonnet 4 wins for tone consistency across threads — it's noticeably better at not shifting register mid-reply. GPT-4o drafts feel slightly 'cleaner' but also more generic. Gemini 2.5 Pro lags on tone but is cheaper.

Can the agent actually auto-send replies safely?

Rarely. The 95% pattern in production is drafts-only with a batched review UI. If you must auto-send, restrict to: (1) recipients on a user-specific allowlist, (2) no numbers/dates/commitments in the draft, (3) confidence > 0.9, (4) similarity to a past user-approved reply > 0.85.
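
The four conditions combine into a single conjunctive gate, sketched below. The thresholds (0.9 and 0.85) come from the answer above; the `DraftSignals` fields and how each signal is computed are assumptions left to the surrounding system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftSignals:
    recipient: str
    has_commitment: bool          # any number, date, or promise found in the draft
    confidence: float             # drafter/classifier confidence, 0..1
    similarity_to_approved: float  # best match against past user-approved replies, 0..1


def may_auto_send(signals: DraftSignals, allowlist: set[str]) -> bool:
    """All four conditions must hold; failing any one falls back to human review."""
    return (
        signals.recipient.lower() in allowlist
        and not signals.has_commitment
        and signals.confidence > 0.9
        and signals.similarity_to_approved > 0.85
    )
```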

How do you handle long email threads that exceed context?

Two-pass: first a summarization pass on the oldest messages (Haiku 4), then include the summary plus the last 3 messages verbatim. Don't rely on 1M-context models for this — thread coherence matters more than full recall.
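
The two-pass assembly can be sketched as below. `summarize` is passed in as a callable so the example stays independent of any model SDK; in practice it would wrap the cheap summarization call. The function name and the separator string are assumptions.

```python
from typing import Callable


def build_thread_context(
    messages: list[str],
    summarize: Callable[[list[str]], str],
    keep_last: int = 3,
) -> str:
    """Summarize older messages and append the last keep_last messages verbatim."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    older, recent = messages[:-keep_last], messages[-keep_last:]
    parts = []
    if older:  # short threads skip the summarization pass entirely
        parts.append("Summary of earlier messages:\n" + summarize(older))
    parts.extend(recent)
    return "\n\n---\n\n".join(parts)
```

Threads of three messages or fewer pass through untouched, so the extra model call is only paid on long threads.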

Does it integrate with Gmail and Outlook both?

Yes — Gmail API and Microsoft Graph cover ~95% of business mail. Nylas is a worthwhile abstraction if you want unified primitives but adds latency and cost. IMAP is a last resort for niche providers.

What happens on a flood of emails (e.g. mailing list thread)?

Dedupe classification by thread — only classify the newest message in a thread per 10-minute window. Rate-limit drafts to N per user per minute. This prevents a reply-all chain from costing $20 in one afternoon.
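
Both controls fit in a few lines of in-memory state, sketched below with hypothetical class names. The debouncer holds the newest message per thread and releases it once the 10-minute window has elapsed, which delays classification of busy threads by up to that window. The rate limiter is a sliding one-minute window with an assumed default of 5 drafts. A production version would keep this state in a shared store.

```python
from collections import defaultdict, deque


class ThreadDebouncer:
    """Release only the newest message per thread once per window."""

    def __init__(self, window_s: float = 600):
        self.window_s = window_s
        self._pending: dict[str, tuple[float, str]] = {}  # thread -> (window start, newest id)

    def offer(self, thread_id: str, message_id: str, now: float) -> None:
        start, _ = self._pending.get(thread_id, (now, ""))
        self._pending[thread_id] = (start, message_id)  # keep start, overwrite newest

    def due(self, now: float) -> dict[str, str]:
        ready = {
            t: mid for t, (start, mid) in self._pending.items()
            if now - start >= self.window_s
        }
        for t in ready:
            del self._pending[t]
        return ready


class DraftRateLimiter:
    """Allow at most max_per_minute drafts per user in any 60-second span."""

    def __init__(self, max_per_minute: int = 5):
        self.max_per_minute = max_per_minute
        self._times: dict[str, deque] = defaultdict(deque)

    def allow(self, user_id: str, now: float) -> bool:
        times = self._times[user_id]
        while times and now - times[0] >= 60:
            times.popleft()  # drop timestamps that fell out of the window
        if len(times) >= self.max_per_minute:
            return False
        times.append(now)
        return True
```

With this in place, a 40-message reply-all chain inside one window costs a single classification instead of 40.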

Is there a real product using this pattern?

Superhuman AI, Shortwave, and 21st.dev all ship variants. The differences are UI (keyboard-driven vs chat) and how aggressively they auto-execute actions. Most settle on drafts + one-key approval as the sweet spot.

Models mentioned

claude-sonnet-4 claude-haiku-4 gpt-4o-mini

Tools mentioned

supabase pgvector langfuse