Reference Architecture · classification
Sentiment Analysis at Scale
Last updated: April 16, 2026
Quick answer
Run a fine-tuned DistilBERT or XGBoost-on-embeddings classifier on 100% of items for 3-label sentiment at roughly $0.00001 per item. Route the 3-5% of items with low confidence or business-critical signals (1-star reviews, viral posts, churn-signal tickets) to Claude Haiku 4 or GPT-4o-mini for aspect-level sentiment and rationale. Expect $4k-$15k per billion items and a P95 latency of 80ms on the fast path.
The problem
You have billions of text items per day (product reviews, tweets, support tickets, call transcripts) and need sentiment labels - positive/negative/neutral at minimum, plus aspect-level sentiment (the user loves the camera, hates the battery). An LLM per item costs $10M+/month at this volume. Traditional ML is 100x cheaper but misses sarcasm and context. You need a tiered system that gets 99% of the value at 1% of the LLM cost.
Architecture
Text Ingest Stream
Kafka or Kinesis stream of incoming text with source metadata (product ID, user ID, channel, language).
Alternatives: Kafka, Kinesis, Pub/Sub, Batch S3 ingest
Language Detection + Preprocess
Detects language, normalizes unicode, strips boilerplate (email signatures, auto-quotes), splits into units suitable for classification.
Alternatives: fastText language ID, CLD3, Lingua
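A minimal preprocessing sketch using only the stdlib, assuming email-style boilerplate. Real deployments would plug in fastText or CLD3 for language detection; that step is omitted here, and the patterns are illustrative, not a fixed spec.

```python
import re
import unicodedata

SIGNATURE_RE = re.compile(r"(?ms)^--\s*$.*")   # "-- " email signature delimiter
QUOTE_RE = re.compile(r"(?m)^>.*$")            # quoted reply lines

def preprocess(text: str) -> list[str]:
    text = unicodedata.normalize("NFKC", text)
    text = SIGNATURE_RE.sub("", text)          # drop signature and everything after
    text = QUOTE_RE.sub("", text)              # drop quoted lines
    # Split into sentence-sized units, one classifier call each.
    units = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [u.strip() for u in units if u.strip()]
```

Splitting into sentence-sized units keeps each classifier input short, which matters for both embedding cost and label precision.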
Embedding Service
Generates dense vector embeddings for each text item. Cached by content hash to avoid recomputation.
Alternatives: BGE-large, Cohere Embed v3, OpenAI text-embedding-3-small
Fast Sentiment Classifier
Traditional ML head (XGBoost or a small neural net) on the embeddings. Outputs 3-label sentiment with calibrated confidence.
Alternatives: Fine-tuned DistilBERT, SetFit, LinearSVC on TF-IDF
Tail Router
Decides which items graduate to LLM analysis: low-confidence items, high-value signals (1-star reviews, viral posts, VIP customers), or aspect-level queries from analysts.
Alternatives: Pure threshold rule, Importance-weighted sampler, Business-rule engine
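A hedged sketch of the router: a pure confidence threshold plus business-rule overrides. The field names and the 0.85 floor are illustrative assumptions, not a fixed contract.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    text: str
    confidence: float              # calibrated probability from the fast classifier
    star_rating: Optional[int] = None
    is_viral: bool = False
    churn_signal: bool = False

CONFIDENCE_FLOOR = 0.85            # below this, graduate to the LLM tail

def route_to_llm(item: Item) -> bool:
    if item.confidence < CONFIDENCE_FLOOR:
        return True                # ambiguous: worth an LLM call
    if item.star_rating == 1 or item.is_viral or item.churn_signal:
        return True                # business-critical regardless of confidence
    return False                   # fast path: classifier label stands
```

Tune the floor against your tail budget: raising it routes more items to the LLM and buys accuracy with dollars.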
LLM Aspect + Rationale
Reads the item and returns aspect-level sentiment (camera: positive, battery: negative), sarcasm detection, and a one-sentence rationale.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash, Claude Sonnet 4 for highest-value items
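Whichever model serves the tail, force structured output and validate it before it enters the aggregation layer. This sketch checks a response against an assumed schema (aspects, sarcasm flag, rationale); the schema is illustrative.

```python
import json

ALLOWED = {"positive", "negative", "neutral"}

def parse_aspect_response(raw: str) -> dict:
    # Validate the LLM's JSON before it reaches downstream rollups.
    data = json.loads(raw)
    aspects = data.get("aspects", {})
    if not all(v in ALLOWED for v in aspects.values()):
        raise ValueError("unexpected sentiment label")
    return {
        "aspects": aspects,
        "sarcasm": bool(data.get("sarcasm", False)),
        "rationale": str(data.get("rationale", "")),
    }

example = ('{"aspects": {"camera": "positive", "battery": "negative"},'
           ' "sarcasm": false, "rationale": "Praises photos, complains about charge life."}')
```

Rejecting malformed responses at this boundary is cheaper than debugging a skewed sentiment index later.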
Aggregation Service
Rolls up item-level sentiment to product/brand/topic/time-window. Produces hourly and daily sentiment indices.
Alternatives: ClickHouse materialized views, Druid, Snowflake scheduled jobs
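The rollup logic itself is simple: map labels to {-1, 0, +1} and average per (product, hour). In production this lives in a ClickHouse materialized view; this stdlib sketch just shows the aggregation the view computes.

```python
from collections import defaultdict
from datetime import datetime

SCORE = {"negative": -1, "neutral": 0, "positive": 1}

def hourly_index(items):
    """items: iterable of (product_id, timestamp, label) tuples."""
    buckets = defaultdict(list)
    for product, ts, label in items:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(product, hour)].append(SCORE[label])
    # Mean score per (product, hour): -1 all-negative .. +1 all-positive.
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```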
Analyst Dashboard + Alerts
Trend charts, aspect breakdowns, drilldown to raw items. Alerts on sentiment shifts beyond a standard-deviation threshold.
Alternatives: Looker, Metabase, Custom React dashboard, Superset
The stack
Voyage-3 leads MTEB classification benchmarks in 2026; BGE-large is a strong open-source alternative if you need fully self-hosted, on-prem inference. text-embedding-3-small is the cheapest managed option at $0.00002 per 1k tokens.
Alternatives: BGE-large-en-v1.5, Cohere Embed v3, OpenAI text-embedding-3-small
XGBoost trains in minutes on 100k examples, runs at 0.3ms per item on CPU, and gives calibrated probabilities. DistilBERT is 2-3% better on hard cases but costs 10x more to serve. For most teams, XGBoost-on-embeddings is the 80/20 choice.
Alternatives: Fine-tuned DistilBERT, SetFit (few-shot), LinearSVC
Haiku 4 handles aspect-based sentiment and sarcasm better than the open-source small models at $0.80/$4 per MTok. GPT-4o-mini is 5x cheaper at $0.15/$0.60 but loses on Asian languages and nuanced sarcasm. Gemini 2.0 Flash is the cheapest at $0.075/$0.30 but worst on sarcasm.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash, Claude Sonnet 4 for VIP items
Flink handles the stateful windowing for rolling sentiment indices better than Lambda-based systems. Kafka is the lingua franca for streaming at this volume.
Alternatives: Kinesis + Lambda, Pub/Sub + Dataflow, Redpanda + Bytewax
ClickHouse handles billions of rows with sub-second queries and materialized views for hourly rollups. Snowflake/BigQuery cost 3-5x more at this volume but integrate better with existing BI stacks.
Alternatives: Druid, Snowflake, BigQuery
Maintain a human-labeled golden set of 5-10k items across domains and languages. Use Label Studio or Argilla for annotation. Re-evaluate any classifier change against the golden set before rollout.
Alternatives: Braintrust, Argilla, Scale Studio
Cost at each scale
Prototype: 1,000,000 items/mo - $85/mo
Startup: 1,000,000,000 items/mo - $12,000/mo
Scale: 30,000,000,000 items/mo - $180,000/mo
Latency budget
Fast path (preprocess, embed, classify): P95 around 80ms. Items routed to the LLM tail add a model round-trip on top, so process the tail asynchronously and backfill aspect labels rather than blocking ingestion.
Tradeoffs
LLM-per-item vs classifier + tail
Sending every item to Claude Sonnet 4 costs on the order of $500-$1,500 per million items at typical review lengths. A tiered pipeline lands around $4-$15 per million items - roughly a 100x difference. The classifier catches 97% of the sentiment signal; the LLM only handles the ambiguous tail where it actually adds value. Do not route every item through an LLM unless you are below 1M items/month.
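A back-of-envelope cost model makes the tradeoff concrete. The prices are list $/MTok; the token counts, tail rate, and fast-path cost per item are illustrative assumptions.

```python
SONNET_IN, SONNET_OUT = 3.00, 15.00    # Claude Sonnet 4, $/MTok in/out
HAIKU_IN, HAIKU_OUT = 0.80, 4.00       # Claude Haiku 4 (tail model)
TOKENS_IN, TOKENS_OUT = 150, 60        # assumed tokens per item
FAST_PATH_PER_ITEM = 0.00001           # assumed classifier cost per item

def llm_per_item() -> float:
    # Every item goes to Sonnet.
    return (TOKENS_IN * SONNET_IN + TOKENS_OUT * SONNET_OUT) / 1e6

def tiered_per_item(tail_rate: float = 0.04) -> float:
    # Classifier on everything, Haiku on the low-confidence tail.
    tail_cost = (TOKENS_IN * HAIKU_IN + TOKENS_OUT * HAIKU_OUT) / 1e6
    return FAST_PATH_PER_ITEM + tail_rate * tail_cost
```

Multiply either per-item figure by your monthly volume to get the bill; the gap widens further if the tail model is cheaper or the tail rate is lower.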
Aspect-based vs 3-label sentiment
3-label (positive/negative/neutral) covers most business dashboards and is cheap. Aspect-based sentiment (camera: positive, battery: negative) is 10x more valuable for product teams but requires an LLM and adds cost. Compromise: run 3-label on 100% of items, run aspect-based on the tail and on a 1% sampled stream for product-team drilldown.
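The 1% sampled stream is best done with a deterministic hash on the item ID rather than `random.random()`, so reruns and replays select the same items. A minimal sketch:

```python
import hashlib

def in_sample(item_id: str, rate_pct: int = 1) -> bool:
    # Hash the ID into 0..99; items below rate_pct are in the sample.
    digest = hashlib.md5(item_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < rate_pct
```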
Multilingual with one model vs per-language pipeline
A single multilingual embedding + classifier is simpler to operate but loses 3-8% accuracy on non-English languages compared to dedicated models. Per-language pipelines (especially for Chinese, Japanese, and Arabic) are worth the operational overhead when those markets drive business decisions; the single multilingual model is fine for English-dominated reporting.
Failure modes & guardrails
Sarcasm and irony misclassified as positive
Mitigation: Route items with strong positive words but negative context signals (1-star rating, 'NOT' patterns, laughing emoji sequences) to the LLM tail. Train a dedicated sarcasm head on 2-5k labeled sarcastic samples. LLMs catch 85%+ of sarcasm vs 40-60% for classifiers.
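The routing heuristic above can be sketched as a simple pre-filter: strong positive wording co-occurring with a negative context signal. The word lists and patterns are illustrative; matched items go to the LLM tail, which makes the final call.

```python
import re
from typing import Optional

POSITIVE = re.compile(r"\b(great|love|perfect|amazing|fantastic)\b", re.I)
NEGATORS = re.compile(r"\bNOT\b")                  # emphatic all-caps "NOT"
LAUGHTER = re.compile(r"(😂|🤣){2,}|\blol\b", re.I)

def maybe_sarcastic(text: str, star_rating: Optional[int] = None) -> bool:
    # Positive surface sentiment is a precondition for sarcasm here.
    if not POSITIVE.search(text):
        return False
    return bool(
        star_rating == 1
        or NEGATORS.search(text)
        or LAUGHTER.search(text)
    )
```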
Concept drift - sentiment targets change meaning over time
Mitigation: Re-train the classifier monthly on recent labeled data. Monitor prediction distribution: if the share of 'positive' jumps 10%+ week-over-week with no business explanation, that is drift. Alert and re-label.
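The drift check reduces to comparing this week's share of 'positive' predictions against last week's. The 10-point trigger comes from the text; the label counts in the test are illustrative.

```python
def drift_flag(last_week: dict, this_week: dict,
               threshold: float = 0.10) -> bool:
    # Each dict maps label -> prediction count for one week.
    def share(counts):
        total = sum(counts.values())
        return counts.get("positive", 0) / total if total else 0.0
    return abs(share(this_week) - share(last_week)) > threshold
```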
Coordinated review campaigns skew brand sentiment
Mitigation: Detect bursts: items from new accounts, same IP range, or same template text arriving in short windows. Flag bursts and exclude from brand sentiment index until reviewed. Do not let brigades or bot campaigns move the official number.
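Template-burst detection can be sketched by collapsing each item to a rough fingerprint (lowercase, digits and punctuation stripped) and flagging fingerprints that repeat too often inside one time window. The repeat threshold is an illustrative assumption.

```python
import re
from collections import Counter

def fingerprint(text: str) -> str:
    # Near-identical copy-paste reviews collapse to one key.
    return re.sub(r"[\d\W]+", " ", text.lower()).strip()

def burst_keys(window_texts: list, max_repeats: int = 3) -> set:
    counts = Counter(fingerprint(t) for t in window_texts)
    return {k for k, c in counts.items() if c > max_repeats}
```

IP-range and account-age signals layer on top of this; the fingerprint catches the laziest campaigns, which are also the most common.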
Non-English traffic is systematically mislabeled
Mitigation: Log accuracy per language on the golden set, so a gap - say Japanese running 8 points below English - is visible. Fix it with a Japanese-specific classifier, or route Japanese items to the LLM tail at a higher rate. Do not ship one blended accuracy number that hides language-level failures.
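Per-language accuracy on the golden set is a small groupby; a sketch:

```python
from collections import defaultdict

def accuracy_by_language(rows):
    """rows: iterable of (language, gold_label, predicted_label)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for lang, gold, pred in rows:
        totals[lang] += 1
        hits[lang] += int(gold == pred)
    return {lang: hits[lang] / totals[lang] for lang in totals}
```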
Dashboard users misinterpret sentiment movements
Mitigation: Show confidence intervals on every sentiment number. Require minimum sample sizes (500 items) before declaring a trend. Surface the top contributing items so analysts can validate. Never show a raw number without context - it will be weaponized in a meeting.
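For the confidence intervals, the Wilson score interval on the positive share behaves well at the sample sizes dashboards deal with. This sketch also enforces the document's 500-item minimum before returning anything; z=1.96 gives a 95% interval.

```python
import math

def positive_share_ci(positives: int, n: int, z: float = 1.96):
    if n < 500:
        return None                        # too few items to declare a trend
    p = positives / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin, center + margin)
```

Rendering the interval next to the point estimate is what stops a 0.3-point wiggle from becoming a meeting.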
Frequently asked questions
Do I need an LLM for sentiment analysis at all?
For coarse sentiment (positive/negative/neutral) at billion-scale, no - a fine-tuned classifier gets 88-92% accuracy at 1/100th the cost. For aspect-based sentiment, sarcasm, and rationales, yes - LLMs are markedly better. The tiered pipeline lets you get both without paying LLM prices on every item.
Which embedding model for sentiment?
Voyage-3 leads MTEB classification in 2026. BGE-large is the best open-source option. Cohere Embed v3 is solid and has a good managed API. OpenAI text-embedding-3-small is cheapest ($0.00002/1k tokens) and good enough for most sentiment work - the bottleneck is the classifier head, not the embedding.
How much training data does the fast classifier need?
XGBoost on quality embeddings reaches 85% accuracy with 5k labeled examples and 90%+ with 30-50k. DistilBERT needs 50-100k for similar gains. Budget a week of labeling with 2-3 domain experts on a platform like Label Studio to bootstrap your golden set and training data.
How often should I re-train?
Monthly for high-change domains (social media, news). Quarterly for stable domains (product reviews). Always re-train after a product launch, platform rule change, or major cultural event - these shift sentiment baselines.
Claude Haiku 4 vs GPT-4o-mini for the tail?
Haiku 4 is better at sarcasm and code-switched multilingual text but costs 5x more ($0.80/$4 vs $0.15/$0.60 per MTok). Start with GPT-4o-mini for cost. Move to Haiku 4 if sarcasm and aspect-quality matter and your tail volume justifies the spend. Gemini 2.0 Flash ($0.075/$0.30) is cheapest but worst on nuanced English.
How do I detect aspect-based sentiment cheaply?
Two options. (1) Prompt the LLM to return structured JSON with aspects + scores on the tail only. (2) Train a dedicated aspect extractor on 10-20k labeled examples (pyABSA or a fine-tuned T5) and run it on a 1-5% sampled stream. Option 2 is cheaper at scale but requires labeled data.
Related
Architectures
Realtime Content Moderation Pipeline
Reference architecture for moderating user-generated text and images in realtime. Tiered policy classifier, hu...
Intent Classification for Message Routing
Reference architecture for multi-label intent classification routing inbound customer messages to the right te...
Customer Support Agent
Reference architecture for an LLM-powered customer support agent handling 10k+ conversations/day. Models, stac...