Intent Classification for Message Routing
Last updated: April 16, 2026
Quick answer
Run a multi-label fine-tuned classifier (DistilBERT or XGBoost on Voyage-3 embeddings) with per-label confidence thresholds. Route confident labels directly; route low-confidence messages to Claude Haiku 4 for reasoning plus routing. Keep a 'human review' fallback for messages that fail all thresholds. Expect 92-96% routing accuracy and $0.0002 per message at scale, with P95 latency under 150ms.
The problem
Inbound messages land in a single inbox (email, chat, SMS, in-app). You need to route each one to the right queue - support, sales, billing, refunds, abuse, bug reports - within 200ms. Some messages have multiple intents (a billing question about a bug). Some are ambiguous and need fallback. Misrouting costs real money: sales leads go cold, abuse gets ignored, and customers rage-tweet about the wrong team responding.
Architecture
Inbound Channels
Unified ingest from email, chat widget, SMS, WhatsApp, social DMs, and app-embedded forms. Normalizes to a common message schema.
Alternatives: Front, Kustomer, Custom ingest service, Segment
Preprocess + Dedup
Strips signatures, quotes, HTML. Detects language. Dedupes repeat messages from the same sender within 60s.
Alternatives: mailparse + custom, Postmark inbound, SendGrid inbound parse
Embedding Service
Generates dense vectors for classifier input and for similar-past-message retrieval.
Alternatives: BGE-large, Cohere Embed v3, OpenAI text-embedding-3-small
Multi-Label Intent Classifier
Per-label sigmoid classifier returning calibrated probabilities for each intent (support, sales, billing, abuse, bug, partnership, press, etc.).
Alternatives: SetFit, XGBoost per-label, Claude Haiku 4 as classifier
Threshold + Routing Engine
Applies per-label thresholds and business rules. Messages passing thresholds route immediately; tied or low-confidence messages escalate.
Alternatives: Open Policy Agent, JSON rules engine, Custom Python service
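The threshold-plus-rules step can be sketched as a small pure function. Label names, threshold values, and the abuse-preempts rule below are illustrative, not prescriptive:

```python
# Per-label thresholds; rare/high-stakes labels (abuse) get a deliberately
# low bar, unknown labels fall back to a conservative default.
THRESHOLDS = {"support": 0.70, "sales": 0.65, "billing": 0.70, "abuse": 0.30}
DEFAULT_THRESHOLD = 0.75

def route(scores: dict[str, float]) -> dict:
    passing = {l: s for l, s in scores.items()
               if s >= THRESHOLDS.get(l, DEFAULT_THRESHOLD)}
    # Business rule: abuse preempts everything else.
    if "abuse" in passing:
        return {"decision": "route", "queues": ["abuse"], "priority": "urgent"}
    if passing:
        # Multi-intent: every passing queue, highest confidence first.
        return {"decision": "route",
                "queues": sorted(passing, key=passing.get, reverse=True)}
    return {"decision": "escalate_to_llm", "queues": []}
```

Keeping this a pure function of scores makes it trivial to replay historical traffic against a proposed threshold change before shipping it.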
LLM Fallback Classifier
Handles low-confidence and multi-intent messages. Reads label definitions and returns a structured decision with rationale.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash, Claude Sonnet 4 for tricky cases
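A sketch of the fallback contract: build a prompt from label definitions and validate the model's structured reply before trusting it. The model call itself is elided; `fallback_prompt`, `parse_decision`, and the label subset are assumptions for illustration:

```python
import json

# Label definitions the fallback model reads; illustrative subset.
LABEL_DEFS = {
    "billing": "Charges, invoices, refunds, payment methods.",
    "bug": "Product malfunction, crashes, error messages.",
    "support": "General how-to and account questions.",
}

def fallback_prompt(message: str) -> str:
    defs = "\n".join(f"- {name}: {desc}" for name, desc in LABEL_DEFS.items())
    return (
        "Classify the message into one or more of these intents.\n"
        f"{defs}\n"
        'Reply with JSON only: {"labels": [...], "rationale": "..."}\n\n'
        f"Message: {message}"
    )

def parse_decision(raw: str) -> dict:
    """Validate the model's structured reply; unknown labels are dropped,
    and an empty result falls through to human review."""
    decision = json.loads(raw)
    labels = [l for l in decision.get("labels", []) if l in LABEL_DEFS]
    if not labels:
        raise ValueError("no valid labels; send to human review")
    return {"labels": labels, "rationale": decision.get("rationale", "")}
```

Validating against the known label set is the guardrail: a hallucinated label becomes a human-review ticket, never a silent misroute.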
Team Queue Router
Places the message in the correct team queue (Zendesk view, Salesforce case queue, PagerDuty for abuse) with priority and SLA.
Alternatives: Zendesk, Salesforce Service Cloud, Intercom, Custom queue service
Routing Feedback Loop
Tracks re-routes by humans. When an agent reassigns a message from support to billing, that is a training signal. Surfaces weekly to retrain.
Alternatives: Label Studio, Argilla, Custom eval dashboard
The stack
Voyage-3 leads on short-text classification benchmarks in 2026. BGE is the strongest open alternative. OpenAI small is cheapest and good enough if intents are well-separated.
Alternatives: BGE-large-en-v1.5, Cohere Embed v3, OpenAI text-embedding-3-small
DistilBERT fine-tuned on 20-50k labeled messages hits 93-96% macro-F1 at 30ms inference. SetFit is excellent if you only have 100-500 labels per class. Zero-shot LLMs can bootstrap before you have training data but cost 50x more per message.
Alternatives: XGBoost per-label, SetFit few-shot, Claude Haiku 4 zero-shot
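Whatever sits behind it, the multi-label head reduces to one independent sigmoid per label over the embedding. This numpy sketch uses random weights purely to show the interface; in production `W` and `b` come from DistilBERT fine-tuning or per-label XGBoost:

```python
import numpy as np

LABELS = ["support", "sales", "billing", "bug", "abuse"]
EMBED_DIM = 1024

# Stand-in for the fine-tuned head: one weight vector and bias per label.
# Random weights here only demonstrate the interface.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(len(LABELS), EMBED_DIM))
b = np.zeros(len(LABELS))

def classify(embedding: np.ndarray) -> dict[str, float]:
    """One independent sigmoid per label (multi-label), not a softmax:
    'billing AND bug' can both score high at the same time."""
    logits = W @ embedding + b
    probs = 1.0 / (1.0 + np.exp(-logits))
    return dict(zip(LABELS, probs))
```

The sigmoid-per-label shape is what makes the multi-intent case in the tradeoffs section work: probabilities are independent and do not have to sum to one.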
Haiku 4 follows structured label definitions better than GPT-4o-mini on ambiguous messages. GPT-4o-mini is 5x cheaper and good enough if your label set is well-defined. Avoid Opus/Sonnet on the hot path - overkill for a classification task.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash
OPA lets non-engineers (support ops leads) tune per-label thresholds and route rules without a deploy. Critical when routing logic changes weekly based on team capacity or new product launches.
Alternatives: Custom Python service, AWS Step Functions, Temporal
Zendesk and Salesforce dominate enterprise customer operations. Sunshine gives you custom object routing. Intercom is better for product-led companies. Do not build your own queue - you will lose 3 months to it.
Alternatives: Intercom, Front, Kustomer
Argilla for ongoing labeling by support ops. Braintrust for eval runs on each classifier change. You need weekly evals because intents drift as product features and customer base change.
Alternatives: Label Studio, Snorkel, Scale Studio
Cost at each scale
Prototype: 50,000 messages/mo, $60/mo
Startup: 5,000,000 messages/mo, $1,400/mo
Scale: 500,000,000 messages/mo, $48,000/mo
Latency budget
The end-to-end budget is 200ms. The fast path (preprocess, embedding, ~30ms classifier inference, threshold engine) lands at P95 under 150ms; only the low-confidence tail pays the LLM fallback's additional latency, and those messages can tolerate it.
Tradeoffs
Zero-shot LLM vs fine-tuned classifier
Zero-shot Claude Haiku 4 gets 82-88% accuracy out of the box with no training data. A fine-tuned DistilBERT on 30k labels reaches 93-96%. The fine-tuned model costs 50-100x less per inference. Start zero-shot to bootstrap labels; graduate to fine-tuned once you have enough data.
Single label vs multi-label
Single-label classifiers are simpler to train and interpret but fail on messages like 'my card was charged twice AND the app crashes on checkout' (billing + bug). Multi-label sigmoid heads handle these but require labelers to mark multiple intents consistently. Most teams should go multi-label from day one.
Per-label threshold vs global threshold
A global confidence threshold over-routes common intents and under-routes rare ones (abuse, press). Per-label thresholds tuned on ROC curves give 3-5% better routing accuracy. The tradeoff is complexity - you need to eval and tune each label separately.
Failure modes & guardrails
New intent appears (new product launch) and classifier has no label for it
Mitigation: Monitor the proportion of messages routed to 'other/unknown'. When it rises above 3% for a week, that is a signal. Add the new label to training data, bootstrap with 200-500 labeled examples using Argilla, re-train, and deploy.
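The unknown-rate monitor reduces to a rolling-rate check. The 3% line comes from the mitigation above; everything else (function names, the seven-day window, the 'unknown' queue name) is an assumption:

```python
UNKNOWN_RATE_ALERT = 0.03  # the 3% line from the mitigation above

def unknown_rate(routed_labels: list[str]) -> float:
    """Share of a day's messages that fell through to 'other/unknown'."""
    if not routed_labels:
        return 0.0
    return routed_labels.count("unknown") / len(routed_labels)

def needs_new_label(daily_rates: list[float]) -> bool:
    """Fire only when every daily rate across a full week clears the alert
    line, so a one-day spike does not trigger a retrain."""
    return len(daily_rates) == 7 and all(r > UNKNOWN_RATE_ALERT for r in daily_rates)
```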
Abuse or legal-threat messages routed to regular support queue
Mitigation: Set the abuse/legal-threat threshold very low - err on false positives. A human agent quickly confirms. Add regex/keyword pre-routing for explicit threat signals ('sue', 'lawyer', 'lawsuit', slurs) as a belt-and-suspenders guardrail.
Classifier drifts as product and customer base change
Mitigation: Sample 200-500 messages per week, have support ops label them, and diff against model predictions. Re-train monthly. Show drift metrics (macro-F1 trend, per-label recall trend) on a visible dashboard.
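The per-label recall trend mentioned above is a straightforward diff between the weekly human labels and the model's predictions. A minimal sketch, with hypothetical names and multi-label sets as inputs:

```python
def per_label_recall(
    human: list[set[str]], predicted: list[set[str]], labels: list[str]
) -> dict[str, float]:
    """Recall per label on the weekly human-labeled sample. A falling trend
    for one label is the drift signal: the model is missing an intent that
    humans still see in the traffic."""
    recalls: dict[str, float] = {}
    for label in labels:
        # Only rows where humans say the label applies count toward recall.
        hits = [label in pred for truth, pred in zip(human, predicted) if label in truth]
        recalls[label] = sum(hits) / len(hits) if hits else float("nan")
    return recalls
```

Plotting these per-label values week over week gives the dashboard trend the mitigation calls for; macro-F1 alone can hide a collapse in one rare label.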
Multi-intent messages get single-routed and the second intent gets lost
Mitigation: For multi-label predictions above threshold, fan out - create linked tickets in both queues with a shared parent. Train support agents to close linked tickets as a unit.
Non-English messages routed to teams without non-English staff
Mitigation: Route by detected language first, then by intent. If a team has no speaker of that language, translate with GPT-4o-mini or Gemini 2.0 Flash and attach the translation to the ticket. Do not drop non-English messages to the bottom of the queue.
Frequently asked questions
How many intent labels should I have?
Start with 6-10 coarse labels (support, sales, billing, bug, abuse, partnership, press, other). Below 6 and you lose routing precision; above 15 and your labelers disagree, your classifier dilutes, and your accuracy plummets. Add sub-labels (billing-refund, billing-charge-dispute) only when volume in a parent label justifies a dedicated team.
How much labeled data do I need?
Fine-tuned DistilBERT needs 500-1000 labels per class for 90%+ accuracy. Setfit and few-shot LLM approaches can work with 100-200 per class for 85%. If you have zero labels, start with zero-shot Haiku 4 or Claude Sonnet 4 and backfill labels from production traffic.
Should I use an LLM as the primary classifier?
Only if volume is under 100k messages/month or you have a generous latency budget. At higher volume, a fine-tuned classifier is 50-100x cheaper per message and 5-10x faster. The LLM belongs on the low-confidence tail and the feedback loop, not the hot path.
How do I set per-label thresholds?
Plot the precision-recall curve per label on a validation set. Pick the threshold that gives the business-required precision (high for abuse/legal, moderate for support/sales). Store thresholds in config, not code, so ops can tune them without deploys.
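The threshold pick reduces to a sweep over candidate cutoffs on the validation set. A dependency-free sketch (the function name and inputs are illustrative; sklearn's `precision_recall_curve` does the same sweep):

```python
def pick_threshold(scores: list[float], truths: list[bool], min_precision: float) -> float:
    """Lowest validation-set cutoff whose precision meets the target.

    Lower cutoff means higher recall at the required precision. Returns 1.0
    (route nothing automatically) if no cutoff reaches the target.
    """
    best = 1.0
    for t in sorted(set(scores), reverse=True):
        picked = [y for s, y in zip(scores, truths) if s >= t]
        precision = sum(picked) / len(picked)
        if precision >= min_precision:
            best = min(best, t)
    return best
```

Run this per label with that label's business-required precision, then write the results into the config store the ops team owns.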
What's the biggest gotcha?
Label noise. If two support ops people disagree on whether 'my invoice looks weird' is 'billing' or 'support', your classifier will never exceed the human agreement ceiling (often 85-92%). Measure inter-annotator agreement before measuring model accuracy. If IAA is below 85%, your labels - not your model - are the problem.
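Measuring agreement is a one-function job. The FAQ's 85% figure is raw percent agreement; Cohen's kappa is the chance-corrected analog and is the better number to track. A minimal sketch for two annotators on single labels:

```python
from collections import Counter

def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two annotators' labels: discounts
    the agreement they would reach by guessing the label distribution.
    1.0 = perfect agreement, 0.0 = chance level."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

For multi-label annotation, compute this per label (present/absent) rather than on the joint label sets.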
Which cloud provider's classifier services are worth using?
AWS Comprehend Custom Classification is decent for English but falls behind fine-tuned DistilBERT on specific domains. Google Vertex AI AutoML is similar. Azure Language Studio is the weakest. Self-hosted fine-tuned DistilBERT or Setfit wins on cost and accuracy if you have in-house ML ops capability.
Related
Architectures
Realtime Content Moderation Pipeline
Reference architecture for moderating user-generated text and images in realtime.
Sentiment Analysis at Scale
Reference architecture for classifying sentiment across billions of reviews, social posts, and support messages.
Customer Support Agent
Reference architecture for an LLM-powered customer support agent handling 10k+ conversations/day.