
AI for Email Classification and Routing

Automatically classify and route incoming emails — support tickets, sales inquiries, spam, complaints, and billing questions — using multi-label classification with confidence thresholds and human escalation workflows.

Updated Apr 16, 2026 · 6 workflows · ~$0.2–$5 per 1,000 requests

Quick answer

The best email classification stack runs a fast LLM (Claude Haiku or GPT-4o-mini) on each incoming email to classify intent, urgency, and department, then routes to the appropriate queue or agent. Multi-label classification handles emails that span categories (billing complaint + churn risk). Confidence thresholds below 0.75 route to a human triager. Cost runs $0.20-1.50 per 1,000 emails; accuracy on well-defined categories exceeds 92% with few-shot examples.
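The flow above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the classifier call itself is stubbed out (in practice it would be an LLM request to Claude Haiku or GPT-4o-mini returning a label and confidence), and the queue names are assumptions.

```python
CONFIDENCE_THRESHOLD = 0.75  # below this, route to a human triager

def route_email(label: str, confidence: float) -> str:
    """Map a classified email to a queue, escalating low-confidence cases."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "human-triage"
    queues = {
        "support": "support-queue",
        "billing": "billing-queue",
        "sales": "sales-queue",
        "spam": "discard",
    }
    return queues.get(label, "human-triage")  # unknown labels also escalate
```

Note that unrecognized labels fall through to human triage as well, so a model that hallucinates a category name never silently mis-routes.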

The problem

Support and sales teams at mid-size companies receive 2,000-15,000 emails per month and spend 15-20% of agent time reading and manually routing messages before any actual work begins. Misrouted emails add an average of 4-8 hours to resolution time and represent 30% of SLA breaches in enterprise support. Spam and low-priority emails consume 20-35% of a support team's queue, crowding out high-value urgent tickets that generate churn and escalations.

Core workflows

Intent Classification and Department Routing

Classify each email into a primary intent (support, billing, sales, legal, press) and route to the appropriate team queue. Reduces manual triage from 3-5 minutes per email to under 2 seconds. Handles 85%+ of emails without human routing.

Stack: claude-haiku-3-5 · langchain
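A sketch of what the classification prompt for this workflow might look like. The category list and the JSON response format here are assumptions for illustration, not a fixed API contract:

```python
CATEGORIES = ["support", "billing", "sales", "legal", "press"]

def build_intent_prompt(subject: str, email_body: str) -> str:
    """Assemble a single-intent classification prompt for an LLM call."""
    cats = ", ".join(CATEGORIES)
    return (
        f"Classify this email into exactly one intent: {cats}.\n"
        f'Reply as JSON: {{"intent": "<label>", "confidence": <0-1>}}.\n\n'
        f"Subject: {subject}\n\nBody:\n{email_body}"
    )
```

Keeping the response format machine-parseable (strict JSON, one label, one confidence) is what makes the 2-second routing decision possible downstream.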

Multi-Label Classification with Urgency Scoring

Assign multiple labels simultaneously (billing + churn-risk + angry, or sales + enterprise + high-value) and generate an urgency score (1-10). High-urgency emails surface to senior agents first. Reduces high-value email response time by 60%.

Stack: claude-haiku-3-5 · humanloop
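Parsing a multi-label response with an urgency score might look like the sketch below. The JSON schema (a `labels` array with per-label confidence, plus a 1-10 `urgency` field) is an assumption about how you would structure the model's output:

```python
import json

def parse_labels(raw: str, min_conf: float = 0.5):
    """Extract confident labels and a clamped 1-10 urgency from raw JSON."""
    data = json.loads(raw)
    labels = [l["name"] for l in data["labels"] if l["confidence"] >= min_conf]
    urgency = max(1, min(10, int(data["urgency"])))  # clamp to 1-10
    return labels, urgency

# Example response an LLM might return for an angry billing email:
raw = ('{"labels": [{"name": "billing", "confidence": 0.91},'
       ' {"name": "churn-risk", "confidence": 0.78},'
       ' {"name": "angry", "confidence": 0.42}], "urgency": 8}')
```

Low-confidence labels (here `angry` at 0.42) are dropped rather than routed on, which keeps noisy secondary labels from triggering spurious escalations.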

Spam and Abuse Filtering

Classify and filter spam, phishing, and policy-violating emails before they reach human agents. LLMs catch sophisticated social-engineering and context-aware spam that rule-based filters miss. Reduces spam in agent queues by 95%+.

Stack: gpt-4o-mini · aws-ses

Churn Risk and VIP Flagging

Detect early signals of customer churn (repeated complaints, downgrade intent, competitor mentions) or identify VIP customers (enterprise contracts, high LTV) for priority escalation. Enables proactive retention outreach.

Stack: claude-sonnet-4 · zapier

Auto-Draft Response Generation

After classification, generate a draft response appropriate to the email's intent and customer history. Agents review and send with one click. Reduces average handle time by 40% while maintaining personalization.

Stack: claude-sonnet-4 · intercom

Confidence-Threshold Human Escalation

When classification confidence falls below 0.75, or when emails match escalation triggers (legal threats, regulatory complaints, media inquiries), automatically flag for senior human review before any automated action is taken.

Stack: claude-haiku-3-5 · humanloop
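The escalation gate above combines two independent checks: a confidence floor and a trigger match. A minimal sketch, assuming keyword triggers for illustration (a production system would lean on the classifier's own escalation labels rather than substring matching alone):

```python
ESCALATION_TRIGGERS = ("legal threat", "regulatory complaint", "media inquiry")

def needs_senior_review(label_confidence: float, email_text: str) -> bool:
    """Flag for human review on low confidence OR any escalation trigger."""
    if label_confidence < 0.75:
        return True
    text = email_text.lower()
    return any(trigger in text for trigger in ESCALATION_TRIGGERS)
```

Because the two checks are ORed, a trigger match escalates even a high-confidence classification, which is the point: automation never acts first on legal or regulatory content.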

Top tools

  • langchain
  • humanloop
  • intercom
  • zendesk
  • front-app
  • zapier

Top models

  • claude-haiku-3-5
  • gpt-4o-mini
  • claude-sonnet-4
  • gpt-4o

FAQs

What accuracy can I expect from LLM-based email classification?

With 5-10 few-shot examples per category and well-defined category descriptions, Claude Haiku and GPT-4o-mini typically achieve 92-96% accuracy on clean, well-defined categories (billing, support, sales inquiry, spam). Ambiguous categories (e.g., distinguishing 'feature request' from 'bug report', or 'pricing question' from 'cancellation threat') drop accuracy to 80-88%. Multi-label accuracy — where each email can have multiple correct labels — is lower: 85-92% precision/recall on individual label predictions. The biggest accuracy gains come from investing in clear category definitions with 3-5 bullet points of criteria, rather than just providing examples.

Should I use an LLM or a traditional text classifier (BERT, FastText)?

Use a traditional classifier (fine-tuned BERT, DistilBERT, or even FastText) when: you have 1,000+ labeled training examples per category; your email volume exceeds 1 million/month, where LLM costs become significant; categories are stable and well-defined; or latency under 50ms is required. Use an LLM when: you have fewer than 200 examples per category or are doing zero-shot classification; you need to classify into 20+ fine-grained categories; your categories change frequently; or you want combined classification + summarization + draft response in a single API call. The economic crossover point is roughly 500,000 emails/month — above that, a fine-tuned BERT model saves significant cost even accounting for fine-tuning and hosting.
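The crossover arithmetic is simple. A back-of-envelope sketch under illustrative assumptions (LLM classification at $0.50 per 1,000 emails, self-hosted fine-tuned BERT at a flat $250/month; both figures are assumptions, not quotes):

```python
def monthly_llm_cost(emails: int, per_1k: float = 0.50) -> float:
    """Per-request LLM cost for a month's email volume, in dollars."""
    return emails / 1000 * per_1k

def crossover_volume(hosting_per_month: float = 250.0,
                     per_1k: float = 0.50) -> int:
    """Email volume at which flat hosting beats per-request LLM pricing."""
    return int(hosting_per_month / per_1k * 1000)
```

Under these assumptions the break-even lands at 500,000 emails/month, matching the rule of thumb above; swap in your own per-1k rate and hosting bill to get your own crossover.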

How do I set confidence thresholds for human escalation?

Start by measuring your model's calibration — does a self-reported 0.85 confidence actually correspond to 85% accuracy? Most LLMs are overconfident on ambiguous inputs. A practical approach: run your classifier on 500 historical labeled emails, compute accuracy at each confidence decile, and set your escalation threshold at the point where accuracy drops below your acceptable error rate (commonly 90%). For high-stakes categories (legal threats, regulatory complaints, refund requests over $500), apply a separate higher threshold (0.90+) regardless of overall model confidence. Track false positive and false negative rates separately — missing a churn-risk email is usually more costly than mis-routing a routine support ticket.
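The calibration procedure described above can be sketched directly: bucket historical predictions by confidence decile, compute accuracy per bucket, and set the threshold at the lowest decile that meets the target. Function names and the example data are illustrative:

```python
def accuracy_by_decile(preds):
    """preds: list of (confidence, was_correct) pairs -> {decile: accuracy}."""
    buckets = {}
    for conf, correct in preds:
        decile = min(int(conf * 10), 9)  # 0.5x -> bucket 5, 1.0 -> bucket 9
        buckets.setdefault(decile, []).append(correct)
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}

def pick_threshold(preds, target_accuracy: float = 0.90) -> float:
    """Lowest confidence decile whose measured accuracy meets the target."""
    acc = accuracy_by_decile(preds)
    for decile in sorted(acc):
        if acc[decile] >= target_accuracy:
            return decile / 10  # escalate everything below this confidence
    return 1.0  # no decile meets target: escalate everything

# Toy history of (self-reported confidence, was the label correct?):
history = [(0.55, False), (0.55, True), (0.85, True), (0.85, True), (0.95, True)]
```

On the toy history, the 0.5-0.6 bucket measures only 50% accuracy while the 0.8+ buckets are clean, so the threshold lands at 0.8 — even though the model "felt" 55% confident about the misses, which is exactly the overconfidence problem calibration catches.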

How do I handle emails that span multiple categories?

Request multi-label output rather than forcing a single category. In your prompt, explicitly instruct the model to return all applicable labels as an array, with a confidence score for each. Then apply routing logic: if an email has both 'billing' and 'churn-risk' labels, route to the retention team rather than the billing team. Build a routing priority hierarchy — churn-risk > legal > billing > support > sales — so multi-label emails always go to the highest-priority queue. For categories that frequently co-occur, analyze if they should be merged into a single label or if the distinction is meaningful for routing purposes.
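The priority hierarchy described above reduces to a first-match walk over an ordered list. A minimal sketch with illustrative queue names:

```python
# Highest-priority label wins, per the hierarchy in the text above.
PRIORITY = ["churn-risk", "legal", "billing", "support", "sales"]
QUEUE = {
    "churn-risk": "retention",
    "legal": "legal-team",
    "billing": "billing-team",
    "support": "support-team",
    "sales": "sales-team",
}

def route_multilabel(labels):
    """Route a multi-label email to the queue of its highest-priority label."""
    for label in PRIORITY:  # ordered walk: first match is highest priority
        if label in labels:
            return QUEUE[label]
    return "human-triage"   # no recognized label: escalate
```

So an email tagged both `billing` and `churn-risk` lands with retention, exactly as the text prescribes, regardless of the order the model emitted the labels.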

How do I handle multilingual email queues?

LLMs (GPT-4o, Claude Sonnet 4) classify emails accurately in 30-40 languages without any additional setup — simply include the classification instruction in English and the model handles the non-English email content. For languages where LLM accuracy degrades (less common languages), translate first using a translation API (DeepL, Google Translate) then classify the English translation. Note that translation adds ~100-200ms latency and $0.20-0.50/1,000 characters cost. For high-volume non-English markets, fine-tune a multilingual model (XLM-RoBERTa) for your specific categories — this runs faster and cheaper than LLM classification at scale.

What metadata should I include alongside the email body for better classification?

Providing additional context significantly improves classification accuracy: (1) Sender email domain (gmail.com vs enterprise domain signals consumer vs B2B); (2) Email thread history (is this a follow-up to a previous complaint?); (3) Subject line (often contains the strongest intent signal); (4) Customer tier/LTV from CRM (a $100K/year customer's complaint has different routing priority); (5) Time since last interaction (first contact vs long-time customer); (6) Previous ticket categories for this customer. Including customer tier alone improves urgency scoring accuracy by 8-12%. Always include subject line — it's typically the highest-signal field for intent classification and costs minimal tokens.
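Packaging that metadata into the classification request might look like the sketch below. The field names (`subject`, `sender`, `tier`, `prior_categories`) are assumptions about your email record, not a fixed schema:

```python
def build_context(email: dict) -> str:
    """Prepend structured metadata to the email body for one LLM request."""
    prior = ", ".join(email.get("prior_categories", [])) or "none"
    lines = [
        f"Subject: {email['subject']}",
        f"Sender domain: {email['sender'].split('@')[-1]}",
        f"Customer tier: {email.get('tier', 'unknown')}",
        f"Prior ticket categories: {prior}",
        "",
        email["body"],
    ]
    return "\n".join(lines)
```

The metadata rides along as a few dozen extra tokens per request — cheap relative to the 8-12% urgency-scoring gain the customer-tier field alone provides.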
