Intent Classification for Message Routing
Last updated: April 16, 2026
Quick answer
Run a multi-label fine-tuned classifier (DistilBERT or XGBoost on Voyage-3 embeddings) with per-label confidence thresholds. Route confident labels directly; route low-confidence messages to Claude Haiku 4 for reasoning plus routing. Keep a 'human review' fallback for messages that fail all thresholds. Expect 92-96% routing accuracy and $0.0002 per message at scale, with P95 latency under 150ms.
The problem
Inbound messages land in a single inbox (email, chat, SMS, in-app). You need to route each one to the right queue - support, sales, billing, refunds, abuse, bug reports - within 200ms. Some messages have multiple intents (a billing question about a bug). Some are ambiguous and need fallback. Misrouting costs real money: sales leads go cold, abuse gets ignored, and customers rage-tweet about the wrong team responding.
Architecture
Inbound Channels
Unified ingest from email, chat widget, SMS, WhatsApp, social DMs, and app-embedded forms. Normalizes to a common message schema.
Alternatives: Front, Kustomer, Custom ingest service, Segment
Preprocess + Dedup
Strips signatures, quotes, HTML. Detects language. Dedupes repeat messages from the same sender within 60s.
Alternatives: mailparse + custom, Postmark inbound, SendGrid inbound parse
Embedding Service
Generates dense vectors for classifier input and for similar-past-message retrieval.
Alternatives: BGE-large, Cohere Embed v3, OpenAI text-embedding-3-small
Multi-Label Intent Classifier
Per-label sigmoid classifier returning calibrated probabilities for each intent (support, sales, billing, abuse, bug, partnership, press, etc.).
Alternatives: SetFit, XGBoost per-label, Claude Haiku 4 as classifier
Threshold + Routing Engine
Applies per-label thresholds and business rules. Messages passing thresholds route immediately; tied or low-confidence messages escalate.
Alternatives: Open Policy Agent, JSON rules engine, Custom Python service
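The threshold-plus-rules step can be sketched as a small pure function. Label names, threshold values, and the abuse-preempts rule below are illustrative, not prescriptive:

```python
# Per-label thresholds; rare/high-stakes labels (abuse) get a deliberately
# low bar, unknown labels fall back to a conservative default.
THRESHOLDS = {"support": 0.70, "sales": 0.65, "billing": 0.70, "abuse": 0.30}
DEFAULT_THRESHOLD = 0.75

def route(scores: dict[str, float]) -> dict:
    passing = {l: s for l, s in scores.items()
               if s >= THRESHOLDS.get(l, DEFAULT_THRESHOLD)}
    # Business rule: abuse preempts everything else.
    if "abuse" in passing:
        return {"decision": "route", "queues": ["abuse"], "priority": "urgent"}
    if passing:
        # Multi-intent: every passing queue, highest confidence first.
        return {"decision": "route",
                "queues": sorted(passing, key=passing.get, reverse=True)}
    return {"decision": "escalate_to_llm", "queues": []}
```

Keeping this a pure function of scores makes it trivial to replay historical traffic against a proposed threshold change before shipping it.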
LLM Fallback Classifier
Handles low-confidence and multi-intent messages. Reads label definitions and returns a structured decision with rationale.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash, Claude Sonnet 4 for tricky cases
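A sketch of the fallback contract: build a prompt from label definitions and validate the model's structured reply before trusting it. The model call itself is elided; `fallback_prompt`, `parse_decision`, and the label subset are assumptions for illustration:

```python
import json

# Label definitions the fallback model reads; illustrative subset.
LABEL_DEFS = {
    "billing": "Charges, invoices, refunds, payment methods.",
    "bug": "Product malfunction, crashes, error messages.",
    "support": "General how-to and account questions.",
}

def fallback_prompt(message: str) -> str:
    defs = "\n".join(f"- {name}: {desc}" for name, desc in LABEL_DEFS.items())
    return (
        "Classify the message into one or more of these intents.\n"
        f"{defs}\n"
        'Reply with JSON only: {"labels": [...], "rationale": "..."}\n\n'
        f"Message: {message}"
    )

def parse_decision(raw: str) -> dict:
    """Validate the model's structured reply; unknown labels are dropped,
    and an empty result falls through to human review."""
    decision = json.loads(raw)
    labels = [l for l in decision.get("labels", []) if l in LABEL_DEFS]
    if not labels:
        raise ValueError("no valid labels; send to human review")
    return {"labels": labels, "rationale": decision.get("rationale", "")}
```

Validating against the known label set is the guardrail: a hallucinated label becomes a human-review ticket, never a silent misroute.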
Team Queue Router
Places the message in the correct team queue (Zendesk view, Salesforce case queue, PagerDuty for abuse) with priority and SLA.
Alternatives: Zendesk, Salesforce Service Cloud, Intercom, Custom queue service
Routing Feedback Loop
Tracks re-routes by humans. When an agent reassigns a message from support to billing, that is a training signal. Surfaces weekly to retrain.
Alternatives: Label Studio, Argilla, Custom eval dashboard
The stack
Voyage-3 leads on short-text classification benchmarks in 2026. BGE is the strongest open alternative. OpenAI small is cheapest and good enough if intents are well-separated.
Alternatives: BGE-large-en-v1.5, Cohere Embed v3, OpenAI text-embedding-3-small
DistilBERT fine-tuned on 20-50k labeled messages hits 93-96% macro-F1 at 30ms inference. SetFit is excellent if you only have 100-500 labels per class. Zero-shot LLMs can bootstrap before you have training data but cost 50x more per message.
Alternatives: XGBoost per-label, SetFit few-shot, Claude Haiku 4 zero-shot
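Whatever sits behind it, the multi-label head reduces to one independent sigmoid per label over the embedding. This numpy sketch uses random weights purely to show the interface; in production `W` and `b` come from DistilBERT fine-tuning or per-label XGBoost:

```python
import numpy as np

LABELS = ["support", "sales", "billing", "bug", "abuse"]
EMBED_DIM = 1024

# Stand-in for the fine-tuned head: one weight vector and bias per label.
# Random weights here only demonstrate the interface.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(len(LABELS), EMBED_DIM))
b = np.zeros(len(LABELS))

def classify(embedding: np.ndarray) -> dict[str, float]:
    """One independent sigmoid per label (multi-label), not a softmax:
    'billing AND bug' can both score high at the same time."""
    logits = W @ embedding + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    return dict(zip(LABELS, probs))
```

The sigmoid-per-label shape is what makes the multi-intent case in the tradeoffs section work: probabilities are independent and do not have to sum to one.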
Haiku 4 follows structured label definitions better than GPT-4o-mini on ambiguous messages. GPT-4o-mini is 5x cheaper and good enough if your label set is well-defined. Avoid Opus/Sonnet on the hot path - overkill for a classification task.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
OPA lets non-engineers (support ops leads) tune per-label thresholds and route rules without a deploy. Critical when routing logic changes weekly based on team capacity or new product launches.
Alternatives: Custom Python service, AWS Step Functions, Temporal
Zendesk and Salesforce dominate enterprise customer operations. Sunshine gives you custom object routing. Intercom is better for product-led companies. Do not build your own queue - you will lose 3 months to it.
Alternatives: Intercom, Front, Kustomer
Argilla for ongoing labeling by support ops. Braintrust for eval runs on each classifier change. You need weekly evals because intents drift as product features and customer base change.
Alternatives: Label Studio, Snorkel, Scale Studio
Cost at each scale
Prototype: 50,000 messages/mo, $60/mo
Startup: 5,000,000 messages/mo, $1,400/mo
Scale: 500,000,000 messages/mo, $48,000/mo
Latency budget
The end-to-end budget is 200ms. The fast path (preprocess, embedding, ~30ms classifier inference, threshold engine) lands at P95 under 150ms; only the low-confidence tail pays the LLM fallback's additional latency, and those messages can tolerate it.
Tradeoffs
Zero-shot LLM vs fine-tuned classifier
Zero-shot Claude Haiku 4 gets 82-88% accuracy out of the box with no training data. A fine-tuned DistilBERT on 30k labels reaches 93-96%. The fine-tuned model costs 50-100x less per inference. Start zero-shot to bootstrap labels; graduate to fine-tuned once you have enough data.
Single label vs multi-label
Single-label classifiers are simpler to train and interpret but fail on messages like 'my card was charged twice AND the app crashes on checkout' (billing + bug). Multi-label sigmoid heads handle these but require labelers to mark multiple intents consistently. Most teams should go multi-label from day one.
Per-label threshold vs global threshold
A global confidence threshold over-routes common intents and under-routes rare ones (abuse, press). Per-label thresholds tuned on ROC curves give 3-5% better routing accuracy. The tradeoff is complexity - you need to eval and tune each label separately.
Failure modes & guardrails
New intent appears (new product launch) and classifier has no label for it
Mitigation: Monitor the proportion of messages routed to 'other/unknown'. When it rises above 3% for a week, that is a signal. Add the new label to training data, bootstrap with 200-500 labeled examples using Argilla, re-train, and deploy.
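The unknown-rate monitor reduces to a rolling-rate check. The 3% line comes from the mitigation above; everything else (function names, the seven-day window, the 'unknown' queue name) is an assumption:

```python
UNKNOWN_RATE_ALERT = 0.03  # the 3% line from the mitigation above

def unknown_rate(routed_labels: list[str]) -> float:
    """Share of a day's messages that fell through to 'other/unknown'."""
    if not routed_labels:
        return 0.0
    return routed_labels.count("unknown") / len(routed_labels)

def needs_new_label(daily_rates: list[float]) -> bool:
    """Fire only when every daily rate across a full week clears the alert
    line, so a one-day spike does not trigger a retrain."""
    return len(daily_rates) == 7 and all(r > UNKNOWN_RATE_ALERT for r in daily_rates)
```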
Abuse or legal-threat messages routed to regular support queue
Mitigation: Set the abuse/legal-threat threshold very low - err on false positives. A human agent quickly confirms. Add regex/keyword pre-routing for explicit threat signals ('sue', 'lawyer', 'lawsuit', slurs) as a belt-and-suspenders guardrail.
Classifier drifts as product and customer base change
Mitigation: Sample 200-500 messages per week, have support ops label them, and diff against model predictions. Re-train monthly. Show drift metrics (macro-F1 trend, per-label recall trend) on a visible dashboard.
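The per-label recall trend mentioned above is a straightforward diff between the weekly human labels and the model's predictions. A minimal sketch, with hypothetical names and multi-label sets as inputs:

```python
def per_label_recall(
    human: list[set[str]], predicted: list[set[str]], labels: list[str]
) -> dict[str, float]:
    """Recall per label on the weekly human-labeled sample. A falling trend
    for one label is the drift signal: the model is missing an intent that
    humans still see in the traffic."""
    recalls: dict[str, float] = {}
    for label in labels:
        # Only rows where humans say the label applies count toward recall.
        hits = [label in pred for truth, pred in zip(human, predicted) if label in truth]
        recalls[label] = sum(hits) / len(hits) if hits else float("nan")
    return recalls
```

Plotting these per-label values week over week gives the dashboard trend the mitigation calls for; macro-F1 alone can hide a collapse in one rare label.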
Multi-intent messages get single-routed and the second intent gets lost
Mitigation: For multi-label predictions above threshold, fan out - create linked tickets in both queues with a shared parent. Train support agents to close linked tickets as a unit.
Non-English messages routed to teams without non-English staff
Mitigation: Route by detected language first, then by intent. If a team has no speaker of that language, translate with GPT-4o-mini or Gemini 2.0 Flash and attach the translation to the ticket. Do not drop non-English messages to the bottom of the queue.
Frequently asked questions
How many intent labels should I have?
Start with 6-10 coarse labels (support, sales, billing, bug, abuse, partnership, press, other). Below 6 and you lose routing precision; above 15 and your labelers disagree, your classifier dilutes, and your accuracy plummets. Add sub-labels (billing-refund, billing-charge-dispute) only when volume in a parent label justifies a dedicated team.
How much labeled data do I need?
Fine-tuned DistilBERT needs 500-1000 labels per class for 90%+ accuracy. Setfit and few-shot LLM approaches can work with 100-200 per class for 85%. If you have zero labels, start with zero-shot Haiku 4 or Claude Sonnet 4 and backfill labels from production traffic.
Should I use an LLM as the primary classifier?
Only if volume is under 100k messages/month or you have a generous latency budget. At higher volume, a fine-tuned classifier is 50-100x cheaper per message and 5-10x faster. The LLM belongs on the low-confidence tail and the feedback loop, not the hot path.
How do I set per-label thresholds?
Plot the precision-recall curve per label on a validation set. Pick the threshold that gives the business-required precision (high for abuse/legal, moderate for support/sales). Store thresholds in config, not code, so ops can tune them without deploys.
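The threshold pick reduces to a sweep over candidate cutoffs on the validation set. A dependency-free sketch (the function name and inputs are illustrative; sklearn's `precision_recall_curve` does the same sweep):

```python
def pick_threshold(scores: list[float], truths: list[bool], min_precision: float) -> float:
    """Lowest validation-set cutoff whose precision meets the target.

    Lower cutoff means higher recall at the required precision. Returns 1.0
    (route nothing automatically) if no cutoff reaches the target.
    """
    best = 1.0
    for t in sorted(set(scores), reverse=True):
        picked = [y for s, y in zip(scores, truths) if s >= t]
        precision = sum(picked) / len(picked)
        if precision >= min_precision:
            best = min(best, t)
    return best
```

Run this per label with that label's business-required precision, then write the results into the config store the ops team owns.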
What's the biggest gotcha?
Label noise. If two support ops people disagree on whether 'my invoice looks weird' is 'billing' or 'support', your classifier will never exceed the human agreement ceiling (often 85-92%). Measure inter-annotator agreement before measuring model accuracy. If IAA is below 85%, your labels - not your model - are the problem.
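Measuring agreement is a one-function job. The FAQ's 85% figure is raw percent agreement; Cohen's kappa is the chance-corrected analog and is the better number to track. A minimal sketch for two annotators on single labels:

```python
from collections import Counter

def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two annotators' labels: discounts
    the agreement they would reach by guessing the label distribution.
    1.0 = perfect agreement, 0.0 = chance level."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

For multi-label annotation, compute this per label (present/absent) rather than on the joint label sets.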
Which cloud provider's classifier services are worth using?
AWS Comprehend Custom Classification is decent for English but falls behind fine-tuned DistilBERT on specific domains. Google Vertex AI AutoML is similar. Azure Language Studio is the weakest. Self-hosted fine-tuned DistilBERT or Setfit wins on cost and accuracy if you have in-house ML ops capability.
Related
Architectures
Realtime Content Moderation Pipeline
Reference architecture for moderating user-generated text and images in realtime.
Sentiment Analysis at Scale
Reference architecture for classifying sentiment across billions of reviews, social posts, and support messages.
Customer Support Agent
Reference architecture for an LLM-powered customer support agent handling 10k+ conversations/day.