Reference Architecture · classification
Sentiment Analysis at Scale
Last updated: April 16, 2026
Quick answer
Run a fine-tuned DistilBERT or XGBoost-on-embeddings classifier on 100% of items for 3-label sentiment at roughly $0.00001 per item. Route the 3-5% of items with low confidence or business-critical signals (1-star reviews, viral posts, churn-signal tickets) to Claude Haiku 4 or GPT-4o-mini for aspect-level sentiment and rationale. Expect $4k-$15k per billion items and a P95 latency of 80ms on the fast path.
The problem
You have billions of text items per day (product reviews, tweets, support tickets, call transcripts) and need sentiment labels - positive/negative/neutral at minimum, plus aspect-level sentiment (the user loves the camera, hates the battery). An LLM per item costs $10M+/month at this volume. Traditional ML is 100x cheaper but misses sarcasm and context. You need a tiered system that gets 99% of the value at 1% of the LLM cost.
Architecture
Text Ingest Stream
Kafka or Kinesis stream of incoming text with source metadata (product ID, user ID, channel, language).
Alternatives: Kafka, Kinesis, Pub/Sub, Batch S3 ingest
Language Detection + Preprocess
Detects language, normalizes unicode, strips boilerplate (email signatures, auto-quotes), splits into units suitable for classification.
Alternatives: fastText language ID, CLD3, Lingua
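A minimal preprocessing sketch using only the stdlib, assuming email-style boilerplate. Real deployments would plug in fastText or CLD3 for language detection; that step is omitted here, and the patterns are illustrative, not a fixed spec.

```python
import re
import unicodedata

SIGNATURE_RE = re.compile(r"(?ms)^--\s*$.*")   # "-- " email signature delimiter
QUOTE_RE = re.compile(r"(?m)^>.*$")            # quoted reply lines

def preprocess(text: str) -> list[str]:
    text = unicodedata.normalize("NFKC", text)
    text = SIGNATURE_RE.sub("", text)          # drop signature and everything after
    text = QUOTE_RE.sub("", text)              # drop quoted lines
    # Split into sentence-sized units, one classifier call each.
    units = re.split(r"(?<=[.!?])\s+|\n+", text)
    return [u.strip() for u in units if u.strip()]
```

Splitting into sentence-sized units keeps each classifier input short, which matters for both embedding cost and label precision.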
Embedding Service
Generates dense vector embeddings for each text item. Cached by content hash to avoid recomputation.
Alternatives: BGE-large, Cohere Embed v3, OpenAI text-embedding-3-small
Fast Sentiment Classifier
Traditional ML head (XGBoost or a small neural net) on the embeddings. Outputs 3-label sentiment with calibrated confidence.
Alternatives: Fine-tuned DistilBERT, SetFit, LinearSVC on TF-IDF
Tail Router
Decides which items graduate to LLM analysis: low-confidence items, high-value signals (1-star reviews, viral posts, VIP customers), or aspect-level queries from analysts.
Alternatives: Pure threshold rule, Importance-weighted sampler, Business-rule engine
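A hedged sketch of the router: a pure confidence threshold plus business-rule overrides. The field names and the 0.85 floor are illustrative assumptions, not a fixed contract.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Item:
    text: str
    confidence: float              # calibrated probability from the fast classifier
    star_rating: Optional[int] = None
    is_viral: bool = False
    churn_signal: bool = False

CONFIDENCE_FLOOR = 0.85            # below this, graduate to the LLM tail

def route_to_llm(item: Item) -> bool:
    if item.confidence < CONFIDENCE_FLOOR:
        return True                # ambiguous: worth an LLM call
    if item.star_rating == 1 or item.is_viral or item.churn_signal:
        return True                # business-critical regardless of confidence
    return False                   # fast path: classifier label stands
```

Tune the floor against your tail budget: raising it routes more items to the LLM and buys accuracy with dollars.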
LLM Aspect + Rationale
Reads the item and returns aspect-level sentiment (camera: positive, battery: negative), sarcasm detection, and a one-sentence rationale.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash, Claude Sonnet 4 for highest-value items
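Whichever model serves the tail, force structured output and validate it before it enters the aggregation layer. This sketch checks a response against an assumed schema (aspects, sarcasm flag, rationale); the schema is illustrative.

```python
import json

ALLOWED = {"positive", "negative", "neutral"}

def parse_aspect_response(raw: str) -> dict:
    # Validate the LLM's JSON before it reaches downstream rollups.
    data = json.loads(raw)
    aspects = data.get("aspects", {})
    if not all(v in ALLOWED for v in aspects.values()):
        raise ValueError("unexpected sentiment label")
    return {
        "aspects": aspects,
        "sarcasm": bool(data.get("sarcasm", False)),
        "rationale": str(data.get("rationale", "")),
    }

example = ('{"aspects": {"camera": "positive", "battery": "negative"},'
           ' "sarcasm": false, "rationale": "Praises photos, complains about charge life."}')
```

Rejecting malformed responses at this boundary is cheaper than debugging a skewed sentiment index later.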
Aggregation Service
Rolls up item-level sentiment to product/brand/topic/time-window. Produces hourly and daily sentiment indices.
Alternatives: ClickHouse materialized views, Druid, Snowflake scheduled jobs
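The rollup logic itself is simple: map labels to {-1, 0, +1} and average per (product, hour). In production this lives in a ClickHouse materialized view; this stdlib sketch just shows the aggregation the view computes.

```python
from collections import defaultdict
from datetime import datetime

SCORE = {"negative": -1, "neutral": 0, "positive": 1}

def hourly_index(items):
    """items: iterable of (product_id, timestamp, label) tuples."""
    buckets = defaultdict(list)
    for product, ts, label in items:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(product, hour)].append(SCORE[label])
    # Mean score per (product, hour): -1 all-negative .. +1 all-positive.
    return {k: sum(v) / len(v) for k, v in buckets.items()}
```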
Analyst Dashboard + Alerts
Trend charts, aspect breakdowns, drilldown to raw items. Alerts on sentiment shifts beyond a standard-deviation threshold.
Alternatives: Looker, Metabase, Custom React dashboard, Superset
The stack
Voyage-3 leads MTEB classification benchmarks in 2026; BGE-large is a strong open-source alternative if you need fully self-hosted, on-prem inference. text-embedding-3-small is the cheapest managed option at $0.00002 per 1k tokens.
Alternatives: BGE-large-en-v1.5, Cohere Embed v3, OpenAI text-embedding-3-small
XGBoost trains in minutes on 100k examples, runs at 0.3ms per item on CPU, and gives calibrated probabilities. DistilBERT is 2-3% better on hard cases but costs 10x more to serve. For most teams, XGBoost-on-embeddings is the 80/20 choice.
Alternatives: Fine-tuned DistilBERT, SetFit (few-shot), LinearSVC
Haiku 4 handles aspect-based sentiment and sarcasm better than the open-source small models at $0.80/$4 per MTok. GPT-4o-mini is 5x cheaper at $0.15/$0.60 but loses on Asian languages and nuanced sarcasm. Gemini 2.0 Flash is the cheapest at $0.075/$0.30 but worst on sarcasm.
Alternatives: GPT-4o-mini, Gemini 2.0 Flash, Claude Sonnet 4 for VIP items
Flink handles the stateful windowing for rolling sentiment indices better than Lambda-based systems. Kafka is the lingua franca for streaming at this volume.
Alternatives: Kinesis + Lambda, Pub/Sub + Dataflow, Redpanda + Bytewax
ClickHouse handles billions of rows with sub-second queries and materialized views for hourly rollups. Snowflake/BigQuery cost 3-5x more at this volume but integrate better with existing BI stacks.
Alternatives: Druid, Snowflake, BigQuery
Maintain a human-labeled golden set of 5-10k items across domains and languages. Use Label Studio or Argilla for annotation. Re-evaluate any classifier change against the golden set before rollout.
Alternatives: Braintrust, Argilla, Scale Studio
Cost at each scale
Prototype: 1,000,000 items/mo - $85/mo
Startup: 1,000,000,000 items/mo - $12,000/mo
Scale: 30,000,000,000 items/mo - $180,000/mo
Latency budget
Fast path (preprocess, embed, classify): P95 around 80ms. Items routed to the LLM tail add a model round-trip on top, so process the tail asynchronously and backfill aspect labels rather than blocking ingestion.
Tradeoffs
LLM-per-item vs classifier + tail
Sending every item to Claude Sonnet 4 costs on the order of $500-$1,500 per million items at typical review lengths. A tiered pipeline lands around $4-$15 per million items - roughly a 100x difference. The classifier catches 97% of the sentiment signal; the LLM only handles the ambiguous tail where it actually adds value. Do not route every item through an LLM unless you are below 1M items/month.
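A back-of-envelope cost model makes the tradeoff concrete. The prices are list $/MTok; the token counts, tail rate, and fast-path cost per item are illustrative assumptions.

```python
SONNET_IN, SONNET_OUT = 3.00, 15.00    # Claude Sonnet 4, $/MTok in/out
HAIKU_IN, HAIKU_OUT = 0.80, 4.00       # Claude Haiku 4 (tail model)
TOKENS_IN, TOKENS_OUT = 150, 60        # assumed tokens per item
FAST_PATH_PER_ITEM = 0.00001           # assumed classifier cost per item

def llm_per_item() -> float:
    # Every item goes to Sonnet.
    return (TOKENS_IN * SONNET_IN + TOKENS_OUT * SONNET_OUT) / 1e6

def tiered_per_item(tail_rate: float = 0.04) -> float:
    # Classifier on everything, Haiku on the low-confidence tail.
    tail_cost = (TOKENS_IN * HAIKU_IN + TOKENS_OUT * HAIKU_OUT) / 1e6
    return FAST_PATH_PER_ITEM + tail_rate * tail_cost
```

Multiply either per-item figure by your monthly volume to get the bill; the gap widens further if the tail model is cheaper or the tail rate is lower.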
Aspect-based vs 3-label sentiment
3-label (positive/negative/neutral) covers most business dashboards and is cheap. Aspect-based sentiment (camera: positive, battery: negative) is 10x more valuable for product teams but requires an LLM and adds cost. Compromise: run 3-label on 100% of items, run aspect-based on the tail and on a 1% sampled stream for product-team drilldown.
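The 1% sampled stream is best done with a deterministic hash on the item ID rather than `random.random()`, so reruns and replays select the same items. A minimal sketch:

```python
import hashlib

def in_sample(item_id: str, rate_pct: int = 1) -> bool:
    # Hash the ID into 0..99; items below rate_pct are in the sample.
    digest = hashlib.md5(item_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < rate_pct
```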
Multilingual with one model vs per-language pipeline
A single multilingual embedding + classifier is simpler to operate but loses 3-8% accuracy on non-English languages compared to dedicated models. Per-language pipelines (especially for Chinese, Japanese, and Arabic) are worth the operational overhead when those markets drive business decisions; the single multilingual model is fine for English-dominated reporting.
Failure modes & guardrails
Sarcasm and irony misclassified as positive
Mitigation: Route items with strong positive words but negative context signals (1-star rating, 'NOT' patterns, laughing emoji sequences) to the LLM tail. Train a dedicated sarcasm head on 2-5k labeled sarcastic samples. LLMs catch 85%+ of sarcasm vs 40-60% for classifiers.
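The routing heuristic above can be sketched as a simple pre-filter: strong positive wording co-occurring with a negative context signal. The word lists and patterns are illustrative; matched items go to the LLM tail, which makes the final call.

```python
import re
from typing import Optional

POSITIVE = re.compile(r"\b(great|love|perfect|amazing|fantastic)\b", re.I)
NEGATORS = re.compile(r"\bNOT\b")                  # emphatic all-caps "NOT"
LAUGHTER = re.compile(r"(😂|🤣){2,}|\blol\b", re.I)

def maybe_sarcastic(text: str, star_rating: Optional[int] = None) -> bool:
    # Positive surface sentiment is a precondition for sarcasm here.
    if not POSITIVE.search(text):
        return False
    return bool(
        star_rating == 1
        or NEGATORS.search(text)
        or LAUGHTER.search(text)
    )
```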
Concept drift - sentiment targets change meaning over time
Mitigation: Re-train the classifier monthly on recent labeled data. Monitor prediction distribution: if the share of 'positive' jumps 10%+ week-over-week with no business explanation, that is drift. Alert and re-label.
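The drift check reduces to comparing this week's share of 'positive' predictions against last week's. The 10-point trigger comes from the text; the label counts in the test are illustrative.

```python
def drift_flag(last_week: dict, this_week: dict,
               threshold: float = 0.10) -> bool:
    # Each dict maps label -> prediction count for one week.
    def share(counts):
        total = sum(counts.values())
        return counts.get("positive", 0) / total if total else 0.0
    return abs(share(this_week) - share(last_week)) > threshold
```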
Coordinated review campaigns skew brand sentiment
Mitigation: Detect bursts: items from new accounts, same IP range, or same template text arriving in short windows. Flag bursts and exclude from brand sentiment index until reviewed. Do not let brigades or bot campaigns move the official number.
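Template-burst detection can be sketched by collapsing each item to a rough fingerprint (lowercase, digits and punctuation stripped) and flagging fingerprints that repeat too often inside one time window. The repeat threshold is an illustrative assumption.

```python
import re
from collections import Counter

def fingerprint(text: str) -> str:
    # Near-identical copy-paste reviews collapse to one key.
    return re.sub(r"[\d\W]+", " ", text.lower()).strip()

def burst_keys(window_texts: list, max_repeats: int = 3) -> set:
    counts = Counter(fingerprint(t) for t in window_texts)
    return {k for k, c in counts.items() if c > max_repeats}
```

IP-range and account-age signals layer on top of this; the fingerprint catches the laziest campaigns, which are also the most common.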
Non-English traffic is systematically mislabeled
Mitigation: Log accuracy per language on the golden set, so a gap - say Japanese running 8 points below English - is visible. Fix it with a Japanese-specific classifier, or route Japanese items to the LLM tail at a higher rate. Do not ship one blended accuracy number that hides language-level failures.
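Per-language accuracy on the golden set is a small groupby; a sketch:

```python
from collections import defaultdict

def accuracy_by_language(rows):
    """rows: iterable of (language, gold_label, predicted_label)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for lang, gold, pred in rows:
        totals[lang] += 1
        hits[lang] += int(gold == pred)
    return {lang: hits[lang] / totals[lang] for lang in totals}
```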
Dashboard users misinterpret sentiment movements
Mitigation: Show confidence intervals on every sentiment number. Require minimum sample sizes (500 items) before declaring a trend. Surface the top contributing items so analysts can validate. Never show a raw number without context - it will be weaponized in a meeting.
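For the confidence intervals, the Wilson score interval on the positive share behaves well at the sample sizes dashboards deal with. This sketch also enforces the document's 500-item minimum before returning anything; z=1.96 gives a 95% interval.

```python
import math

def positive_share_ci(positives: int, n: int, z: float = 1.96):
    if n < 500:
        return None                        # too few items to declare a trend
    p = positives / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - margin, center + margin)
```

Rendering the interval next to the point estimate is what stops a 0.3-point wiggle from becoming a meeting.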
Frequently asked questions
Do I need an LLM for sentiment analysis at all?
For coarse sentiment (positive/negative/neutral) at billion-scale, no - a fine-tuned classifier gets 88-92% accuracy at 1/100th the cost. For aspect-based sentiment, sarcasm, and rationales, yes - LLMs are markedly better. The tiered pipeline lets you get both without paying LLM prices on every item.
Which embedding model for sentiment?
Voyage-3 leads MTEB classification in 2026. BGE-large is the best open-source option. Cohere Embed v3 is solid and has a good managed API. OpenAI text-embedding-3-small is cheapest ($0.00002/1k tokens) and good enough for most sentiment work - the bottleneck is the classifier head, not the embedding.
How much training data does the fast classifier need?
XGBoost on quality embeddings reaches 85% accuracy with 5k labeled examples and 90%+ with 30-50k. DistilBERT needs 50-100k for similar gains. Budget a week of labeling with 2-3 domain experts on a platform like Label Studio to bootstrap your golden set and training data.
How often should I re-train?
Monthly for high-change domains (social media, news). Quarterly for stable domains (product reviews). Always re-train after a product launch, platform rule change, or major cultural event - these shift sentiment baselines.
Claude Haiku 4 vs GPT-4o-mini for the tail?
Haiku 4 is better at sarcasm and code-switched multilingual text but costs 5x more ($0.80/$4 vs $0.15/$0.60 per MTok). Start with GPT-4o-mini for cost. Move to Haiku 4 if sarcasm and aspect-quality matter and your tail volume justifies the spend. Gemini 2.0 Flash ($0.075/$0.30) is cheapest but worst on nuanced English.
How do I detect aspect-based sentiment cheaply?
Two options. (1) Prompt the LLM to return structured JSON with aspects + scores on the tail only. (2) Train a dedicated aspect extractor on 10-20k labeled examples (pyABSA or a fine-tuned T5) and run it on a 1-5% sampled stream. Option 2 is cheaper at scale but requires labeled data.
Related
Architectures
Realtime Content Moderation Pipeline
Reference architecture for moderating user-generated text and images in realtime. Tiered policy classifier, hu...
Intent Classification for Message Routing
Reference architecture for multi-label intent classification routing inbound customer messages to the right te...
Customer Support Agent
Reference architecture for an LLM-powered customer support agent handling 10k+ conversations/day. Models, stac...