
AI for Email Classification and Routing

Automatically classify and route incoming emails — support tickets, sales inquiries, spam, complaints, and billing questions — using multi-label classification with confidence thresholds and human escalation workflows.

Updated Apr 16, 2026 · 6 workflows · ~$0.2–$5 per 1,000 requests

Quick answer

The best email classification stack runs a fast LLM (Claude Haiku or GPT-4o-mini) on each incoming email to classify intent, urgency, and department, then routes to the appropriate queue or agent. Multi-label classification handles emails that span categories (billing complaint + churn risk). Confidence thresholds below 0.75 route to a human triager. Cost runs $0.20-1.50 per 1,000 emails; accuracy on well-defined categories exceeds 92% with few-shot examples.
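The flow above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the classifier call itself is stubbed out (in practice it would be an LLM request to Claude Haiku or GPT-4o-mini returning a label and confidence), and the queue names are assumptions.

```python
CONFIDENCE_THRESHOLD = 0.75  # below this, route to a human triager

def route_email(label: str, confidence: float) -> str:
    """Map a classified email to a queue, escalating low-confidence cases."""
    if confidence < CONFIDENCE_THRESHOLD:
        return "human-triage"
    queues = {
        "support": "support-queue",
        "billing": "billing-queue",
        "sales": "sales-queue",
        "spam": "discard",
    }
    return queues.get(label, "human-triage")  # unknown labels also escalate
```

Note that unrecognized labels fall through to human triage as well, so a model that hallucinates a category name never silently mis-routes.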

The problem

Support and sales teams at mid-size companies receive 2,000-15,000 emails per month and spend 15-20% of agent time reading and manually routing messages before any actual work begins. Misrouted emails add an average of 4-8 hours to resolution time and represent 30% of SLA breaches in enterprise support. Spam and low-priority emails consume 20-35% of a support team's queue, crowding out high-value urgent tickets that generate churn and escalations.

Core workflows

Intent Classification and Department Routing

Classify each email into a primary intent (support, billing, sales, legal, press) and route to the appropriate team queue. Reduces manual triage from 3-5 minutes per email to under 2 seconds. Handles 85%+ of emails without human routing.

Stack: claude-haiku-3-5 · langchain
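A sketch of what the classification prompt for this workflow might look like. The category list and the JSON response format here are assumptions for illustration, not a fixed API contract:

```python
CATEGORIES = ["support", "billing", "sales", "legal", "press"]

def build_intent_prompt(subject: str, email_body: str) -> str:
    """Assemble a single-intent classification prompt for an LLM call."""
    cats = ", ".join(CATEGORIES)
    return (
        f"Classify this email into exactly one intent: {cats}.\n"
        f'Reply as JSON: {{"intent": "<label>", "confidence": <0-1>}}.\n\n'
        f"Subject: {subject}\n\nBody:\n{email_body}"
    )
```

Keeping the response format machine-parseable (strict JSON, one label, one confidence) is what makes the 2-second routing decision possible downstream.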

Multi-Label Classification with Urgency Scoring

Assign multiple labels simultaneously (billing + churn-risk + angry, or sales + enterprise + high-value) and generate an urgency score (1-10). High-urgency emails surface to senior agents first. Reduces high-value email response time by 60%.

Stack: claude-haiku-3-5 · humanloop
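Parsing a multi-label response with an urgency score might look like the sketch below. The JSON schema (a `labels` array with per-label confidence, plus a 1-10 `urgency` field) is an assumption about how you would structure the model's output:

```python
import json

def parse_labels(raw: str, min_conf: float = 0.5):
    """Extract confident labels and a clamped 1-10 urgency from raw JSON."""
    data = json.loads(raw)
    labels = [l["name"] for l in data["labels"] if l["confidence"] >= min_conf]
    urgency = max(1, min(10, int(data["urgency"])))  # clamp to 1-10
    return labels, urgency

# Example response an LLM might return for an angry billing email:
raw = ('{"labels": [{"name": "billing", "confidence": 0.91},'
       ' {"name": "churn-risk", "confidence": 0.78},'
       ' {"name": "angry", "confidence": 0.42}], "urgency": 8}')
```

Low-confidence labels (here `angry` at 0.42) are dropped rather than routed on, which keeps noisy secondary labels from triggering spurious escalations.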

Spam and Abuse Filtering

Classify and filter spam, phishing, and policy-violating emails before they reach human agents. LLMs catch sophisticated social-engineering and context-aware spam that rule-based filters miss. Reduces spam in agent queues by 95%+.

Stack: gpt-4o-mini · aws-ses

Churn Risk and VIP Flagging

Detect early signals of customer churn (repeated complaints, downgrade intent, competitor mentions) or identify VIP customers (enterprise contracts, high LTV) for priority escalation. Enables proactive retention outreach.

Stack: claude-sonnet-4 · zapier

Auto-Draft Response Generation

After classification, generate a draft response appropriate to the email's intent and customer history. Agents review and send with one click. Reduces average handle time by 40% while maintaining personalization.

Stack: claude-sonnet-4 · intercom

Confidence-Threshold Human Escalation

When classification confidence falls below 0.75, or when emails match escalation triggers (legal threats, regulatory complaints, media inquiries), automatically flag for senior human review before any automated action is taken.

Stack: claude-haiku-3-5 · humanloop
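The escalation gate above combines two independent checks: a confidence floor and a trigger match. A minimal sketch, assuming keyword triggers for illustration (a production system would lean on the classifier's own escalation labels rather than substring matching alone):

```python
ESCALATION_TRIGGERS = ("legal threat", "regulatory complaint", "media inquiry")

def needs_senior_review(label_confidence: float, email_text: str) -> bool:
    """Flag for human review on low confidence OR any escalation trigger."""
    if label_confidence < 0.75:
        return True
    text = email_text.lower()
    return any(trigger in text for trigger in ESCALATION_TRIGGERS)
```

Because the two checks are ORed, a trigger match escalates even a high-confidence classification, which is the point: automation never acts first on legal or regulatory content.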

Top tools

  • langchain
  • humanloop
  • intercom
  • zendesk
  • front-app
  • zapier

Top models

  • claude-haiku-3-5
  • gpt-4o-mini
  • claude-sonnet-4
  • gpt-4o

FAQs

What accuracy can I expect from LLM-based email classification?

With 5-10 few-shot examples per category and well-defined category descriptions, Claude Haiku and GPT-4o-mini typically achieve 92-96% accuracy on clean, well-defined categories (billing, support, sales inquiry, spam). Ambiguous categories (e.g., distinguishing 'feature request' from 'bug report', or 'pricing question' from 'cancellation threat') drop accuracy to 80-88%. Multi-label accuracy — where each email can have multiple correct labels — is lower: 85-92% precision/recall on individual label predictions. The biggest accuracy gains come from investing in clear category definitions with 3-5 bullet points of criteria, rather than just providing examples.

Should I use an LLM or a traditional text classifier (BERT, FastText)?

Use a traditional classifier (fine-tuned BERT, DistilBERT, or even FastText) when: you have 1,000+ labeled training examples per category; your email volume exceeds 1 million/month, where LLM costs become significant; categories are stable and well-defined; or latency under 50ms is required. Use an LLM when: you have fewer than 200 examples per category or are doing zero-shot classification; you need to classify into 20+ fine-grained categories; your categories change frequently; or you want combined classification + summarization + draft response in a single API call. The economic crossover point is roughly 500,000 emails/month — above that, a fine-tuned BERT model saves significant cost even accounting for fine-tuning and hosting.
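The crossover arithmetic is simple. A back-of-envelope sketch under illustrative assumptions (LLM classification at $0.50 per 1,000 emails, self-hosted fine-tuned BERT at a flat $250/month; both figures are assumptions, not quotes):

```python
def monthly_llm_cost(emails: int, per_1k: float = 0.50) -> float:
    """Per-request LLM cost for a month's email volume, in dollars."""
    return emails / 1000 * per_1k

def crossover_volume(hosting_per_month: float = 250.0,
                     per_1k: float = 0.50) -> int:
    """Email volume at which flat hosting beats per-request LLM pricing."""
    return int(hosting_per_month / per_1k * 1000)
```

Under these assumptions the break-even lands at 500,000 emails/month, matching the rule of thumb above; swap in your own per-1k rate and hosting bill to get your own crossover.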

How do I set confidence thresholds for human escalation?

Start by measuring your model's calibration — does a self-reported 0.85 confidence actually correspond to 85% accuracy? Most LLMs are overconfident on ambiguous inputs. A practical approach: run your classifier on 500 historical labeled emails, compute accuracy at each confidence decile, and set your escalation threshold at the point where accuracy drops below your acceptable error rate (commonly 90%). For high-stakes categories (legal threats, regulatory complaints, refund requests over $500), apply a separate higher threshold (0.90+) regardless of overall model confidence. Track false positive and false negative rates separately — missing a churn-risk email is usually more costly than mis-routing a routine support ticket.
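The calibration procedure described above can be sketched directly: bucket historical predictions by confidence decile, compute accuracy per bucket, and set the threshold at the lowest decile that meets the target. Function names and the example data are illustrative:

```python
def accuracy_by_decile(preds):
    """preds: list of (confidence, was_correct) pairs -> {decile: accuracy}."""
    buckets = {}
    for conf, correct in preds:
        decile = min(int(conf * 10), 9)  # 0.5x -> bucket 5, 1.0 -> bucket 9
        buckets.setdefault(decile, []).append(correct)
    return {d: sum(v) / len(v) for d, v in sorted(buckets.items())}

def pick_threshold(preds, target_accuracy: float = 0.90) -> float:
    """Lowest confidence decile whose measured accuracy meets the target."""
    acc = accuracy_by_decile(preds)
    for decile in sorted(acc):
        if acc[decile] >= target_accuracy:
            return decile / 10  # escalate everything below this confidence
    return 1.0  # no decile meets target: escalate everything

# Toy history of (self-reported confidence, was the label correct?):
history = [(0.55, False), (0.55, True), (0.85, True), (0.85, True), (0.95, True)]
```

On the toy history, the 0.5-0.6 bucket measures only 50% accuracy while the 0.8+ buckets are clean, so the threshold lands at 0.8 — even though the model "felt" 55% confident about the misses, which is exactly the overconfidence problem calibration catches.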

How do I handle emails that span multiple categories?

Request multi-label output rather than forcing a single category. In your prompt, explicitly instruct the model to return all applicable labels as an array, with a confidence score for each. Then apply routing logic: if an email has both 'billing' and 'churn-risk' labels, route to the retention team rather than the billing team. Build a routing priority hierarchy — churn-risk > legal > billing > support > sales — so multi-label emails always go to the highest-priority queue. For categories that frequently co-occur, analyze if they should be merged into a single label or if the distinction is meaningful for routing purposes.
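The priority hierarchy described above reduces to a first-match walk over an ordered list. A minimal sketch with illustrative queue names:

```python
# Highest-priority label wins, per the hierarchy in the text above.
PRIORITY = ["churn-risk", "legal", "billing", "support", "sales"]
QUEUE = {
    "churn-risk": "retention",
    "legal": "legal-team",
    "billing": "billing-team",
    "support": "support-team",
    "sales": "sales-team",
}

def route_multilabel(labels):
    """Route a multi-label email to the queue of its highest-priority label."""
    for label in PRIORITY:  # ordered walk: first match is highest priority
        if label in labels:
            return QUEUE[label]
    return "human-triage"   # no recognized label: escalate
```

So an email tagged both `billing` and `churn-risk` lands with retention, exactly as the text prescribes, regardless of the order the model emitted the labels.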

How do I handle multilingual email queues?

LLMs (GPT-4o, Claude Sonnet 4) classify emails accurately in 30-40 languages without any additional setup — simply include the classification instruction in English and the model handles the non-English email content. For languages where LLM accuracy degrades (less common languages), translate first using a translation API (DeepL, Google Translate) then classify the English translation. Note that translation adds ~100-200ms latency and $0.20-0.50/1,000 characters cost. For high-volume non-English markets, fine-tune a multilingual model (XLM-RoBERTa) for your specific categories — this runs faster and cheaper than LLM classification at scale.

What metadata should I include alongside the email body for better classification?

Providing additional context significantly improves classification accuracy: (1) Sender email domain (gmail.com vs enterprise domain signals consumer vs B2B); (2) Email thread history (is this a follow-up to a previous complaint?); (3) Subject line (often contains the strongest intent signal); (4) Customer tier/LTV from CRM (a $100K/year customer's complaint has different routing priority); (5) Time since last interaction (first contact vs long-time customer); (6) Previous ticket categories for this customer. Including customer tier alone improves urgency scoring accuracy by 8-12%. Always include subject line — it's typically the highest-signal field for intent classification and costs minimal tokens.
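Packaging that metadata into the classification request might look like the sketch below. The field names (`subject`, `sender`, `tier`, `prior_categories`) are assumptions about your email record, not a fixed schema:

```python
def build_context(email: dict) -> str:
    """Prepend structured metadata to the email body for one LLM request."""
    prior = ", ".join(email.get("prior_categories", [])) or "none"
    lines = [
        f"Subject: {email['subject']}",
        f"Sender domain: {email['sender'].split('@')[-1]}",
        f"Customer tier: {email.get('tier', 'unknown')}",
        f"Prior ticket categories: {prior}",
        "",
        email["body"],
    ]
    return "\n".join(lines)
```

The metadata rides along as a few dozen extra tokens per request — cheap relative to the 8-12% urgency-scoring gain the customer-tier field alone provides.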
