Reference Architecture · classification

Realtime Content Moderation Pipeline

Last updated: April 16, 2026

Quick answer

The production stack is a three-tier funnel: a perceptual hash check (PhotoDNA / NCMEC) for known-bad images, a small fast classifier (Perspective API or a fine-tuned DistilBERT head on Voyage-3 embeddings) for 95%+ of traffic, and GPT-4o or Claude Sonnet 4 vision only on the ambiguous tail. Route uncertain items to a human review queue with policy-specific SLAs. Expect $0.0002-$0.001 per item at startup volumes, falling below $0.0001 at billions of items per month, with P95 latency under 180ms on the fast path.

The problem

You operate a social, marketplace, or UGC product where users upload text, images, or both. You need to block clearly harmful content (CSAM, gore, credible threats, spam) inside 200ms, flag borderline content for human review, and maintain an auditable trail for every decision. The system must handle 10k-100k items per second without letting hate speech or nudity slip through, and without over-blocking legitimate speech.

Architecture

Flow (diagram summary): the UGC Ingest API feeds the Perceptual Hash Match. A known-bad hit is a hard block before any model runs; on no match, text goes to the Policy Classifier (Fast Path) and, if media is present, to the Vision Moderator. Confident verdicts go straight to the Policy Decision Engine; uncertain scores (0.4-0.85) detour through LLM Deep Review (Slow Path) first. The decision engine sends items that need review to the Human Review Queue, writes every outcome to the Decision Audit Log, and feeds the User Appeals Service.

UGC Ingest API

Accepts text, images, or video frames from the client. Attaches user trust score, geo, and content metadata before forwarding.

Alternatives: Direct SDK upload, Webhook from upload service, Kafka stream

Perceptual Hash Match

Checks image/video hashes against PhotoDNA, NCMEC, and GIFCT databases for known CSAM and terrorist content. Hard block on match before any model runs.

Alternatives: PhotoDNA, GIFCT HSP, Apple NeuralHash, Internal hash store

Policy Classifier (Fast Path)

Small fine-tuned classifier running on embeddings. Returns multi-label scores for hate, sexual, violence, self-harm, spam, PII. Handles 95%+ of traffic with confident verdicts.

Alternatives: Perspective API, OpenAI Moderation API, Azure Content Safety, AWS Rekognition Moderation

Vision Moderator

Dedicated image classifier for nudity, violence, and graphic content. Runs on every image regardless of text verdict.

Alternatives: Hive Moderation, Google Vision SafeSearch, Azure Content Safety Image, Clarifai Moderation

LLM Deep Review (Slow Path)

Invoked only when fast classifier confidence falls in the 0.4-0.85 uncertainty band. Reads the full policy document and provides a structured rationale plus label.

Alternatives: GPT-4o, Gemini 2.5 Pro, Claude Haiku 4 for lower-stakes policies
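The 0.4-0.85 uncertainty band can be expressed as a small routing function. A minimal sketch, with the band edges as tunable parameters (the function and argument names are illustrative, not a production API):

```python
def route(fast_score: float, low: float = 0.4, high: float = 0.85) -> str:
    """Route one item based on the fast classifier's violation score.

    Defaults match the article's 0.4-0.85 uncertainty band; real systems
    tune these per policy.
    """
    if fast_score < low:
        return "allow"            # confident clean: fast path only
    if fast_score > high:
        return "block"            # confident violation: fast path only
    return "llm_deep_review"      # ambiguous middle: escalate to slow path
```

Because only the middle band escalates, the expensive model's cost scales with ambiguity, not with total traffic.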

Policy Decision Engine

Combines signals (hash hit, classifier scores, LLM verdict, user trust) into a final action: allow, shadow-remove, hard-block, or queue-for-review. Applies jurisdiction rules (DSA, GDPR).

Alternatives: Rules engine (OPA), Custom Python service, Cortex XSOAR
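A minimal sketch of the signal-combination step, assuming illustrative field names and thresholds; a real deployment would encode these rules in OPA or the custom service rather than hard-coding them:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Signals:
    hash_hit: bool                       # known-bad perceptual hash match
    scores: dict                         # per-policy classifier scores
    llm_verdict: Optional[str] = None    # "violating", "clean", or None
    user_trust: float = 0.5              # 0 = untrusted, 1 = highly trusted

def decide(s: Signals, block_at: float = 0.85, review_at: float = 0.4) -> str:
    # Hash hits and confirmed LLM verdicts short-circuit everything else.
    if s.hash_hit or s.llm_verdict == "violating":
        return "hard-block"
    worst = max(s.scores.values(), default=0.0)
    # Low-trust users get a stricter (hypothetical) block threshold.
    if worst >= block_at - (0.1 if s.user_trust < 0.3 else 0.0):
        return "shadow-remove"
    # Uncertain and not yet seen by the LLM: send to humans.
    if worst >= review_at and s.llm_verdict is None:
        return "queue-for-review"
    return "allow"
```

The ordering matters: hash evidence beats model scores, and model scores beat trust adjustments.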

Human Review Queue

Priority-sorted queue for moderators. Shows content, model rationale, user history, and suggested action. Ties into Trust and Safety team tooling.

Alternatives: Checkstep, TaskUs, Accenture Trust & Safety, Zendesk

Decision Audit Log

Append-only log of every decision with model versions, scores, reviewer ID, and final action. Required for DSA/DMA transparency reports and appeals.

Alternatives: BigQuery, Snowflake, ClickHouse, Postgres partitioned tables
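One way to shape the append-only record; the field names are illustrative, and the per-record digest is an optional tamper-evidence extra, not something DSA mandates:

```python
import datetime
import hashlib
import json

def audit_record(item_id, action, scores, model_version, reviewer_id=None):
    """Serialize one decision as a JSON line for an append-only log.

    Every input to the decision is captured so appeals and transparency
    reports can reconstruct exactly what the system knew at the time.
    """
    rec = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "item_id": item_id,
        "action": action,
        "scores": scores,
        "model_version": model_version,
        "reviewer_id": reviewer_id,
    }
    # Integrity digest over the canonical serialization (tamper evidence).
    rec["digest"] = hashlib.sha256(
        json.dumps(rec, sort_keys=True).encode()
    ).hexdigest()
    return json.dumps(rec, sort_keys=True)
```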

User Appeals Service

Allows users to contest removals. Re-runs the content through a more expensive model and sends it to a senior reviewer if still borderline.

Alternatives: Zendesk appeals flow, Custom React app, Intercom

The stack

Perceptual hashing: PhotoDNA + GIFCT HSP

PhotoDNA is industry standard for CSAM detection and free for qualifying platforms through Microsoft. GIFCT HSP covers terrorist content. Both run in under 5ms per image. You want these even if you also have model-based moderation.

Alternatives: Apple NeuralHash, Internal MD5 + pHash
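To see why perceptual hashes match near-duplicates where cryptographic hashes fail, here is a toy average-hash (aHash) over a pre-downscaled 8x8 grayscale grid. This is an illustration of the principle only; PhotoDNA and production pHash implementations are far more robust:

```python
def average_hash(pixels: list) -> int:
    """Threshold each pixel of an 8x8 grayscale image against the mean,
    packing the result into a 64-bit integer."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def is_known_bad(h: int, db: set, max_dist: int = 5) -> bool:
    """Match within a Hamming-distance tolerance, so minor edits
    (brightness, small crops) still hit the known-bad database."""
    return any(hamming(h, known) <= max_dist for known in db)
```

A slightly edited copy flips only a few bits, staying inside the tolerance, while an MD5 of the same file would change completely.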

Fast text classifier: Voyage-3 embeddings + fine-tuned DistilBERT head

Self-hosted embedding + classifier gives you policy-specific control and 20-40ms latency. OpenAI's Moderation API is free and good enough to start. Switch to fine-tuned once you have 10k+ labeled examples and need custom policies.

Alternatives: OpenAI Moderation API (free), Perspective API, Azure Content Safety

Vision moderator: AWS Rekognition Content Moderation

Rekognition has the best NSFW and violence recall on real-world UGC at a reasonable $0.001 per image. Hive is better on cartoon/anime content. Azure Content Safety has the strongest CSAM detection integration for Microsoft-aligned stacks.

Alternatives: Hive Moderation, Google Vision SafeSearch, Azure Content Safety Image

LLM deep review: Claude Sonnet 4

Sonnet 4 follows complex policy documents better than GPT-4o for moderation. Haiku 4 is good enough for lower-stakes policies (spam, soft NSFW) at one-fourth the cost. Keep Opus 4 out of the hot path - use it only for appeals.

Alternatives: GPT-4o, Gemini 2.5 Pro, Claude Haiku 4

Decision orchestration: Open Policy Agent (OPA) + custom service

OPA lets non-engineers (Trust & Safety leads) edit policy rules without deploys. Critical when regulators change requirements mid-year (DSA, UK Online Safety Act).

Alternatives: Pure Python rules, AWS Step Functions, Temporal

Observability + evals: Braintrust + ClickHouse

You need precision/recall dashboards sliced by policy, geo, and model version. ClickHouse handles billions of rows for realtime dashboards. Braintrust runs the daily golden-set evals.

Alternatives: Arize Phoenix, Langfuse, Custom dashboard

Cost at each scale

Prototype

100,000 items/mo

$120/mo

OpenAI Moderation API (free): $0
Rekognition image moderation (20% images): $20
Claude Sonnet 4 deep review (5% items): $40
Hosting (Vercel + Supabase): $25
Audit log storage: $10
Observability (Langfuse free): $25

Startup

50,000,000 items/mo

$9,800/mo

Fast classifier inference (self-hosted): $1,200
Rekognition image moderation: $3,500
Claude Sonnet 4 deep review (3% tail): $2,400
Vector DB + embeddings (Voyage-3): $400
ClickHouse audit log: $600
Human reviewers (outsourced, 0.2% queue rate): $1,500
Braintrust evals + observability: $200

Scale

5,000,000,000 items/mo

$420,000/mo

Self-hosted classifier cluster (GPUs): $85,000
Rekognition / in-house vision: $140,000
Claude Sonnet 4 deep review tail (~1%): $90,000
Embeddings + vector infra: $18,000
ClickHouse + S3 audit retention: $22,000
Human moderation BPO: $55,000
Evals, compliance, DSA reporting: $10,000

Latency budget

Perceptual hash lookup: 6ms median · 15ms P95
Fast text classifier: 28ms median · 65ms P95
Vision classifier: 110ms median · 220ms P95
Policy decision engine: 12ms median · 30ms P95
LLM deep review (tail path only): 1,400ms median · 2,600ms P95

Total, when the LLM tail fires: 1,556ms P50 · 2,930ms P95

Tradeoffs

LLM-only vs tiered pipeline

Routing every item through Claude Sonnet 4 would cost 50-100x more and blow P95 latency past 1 second. A tiered pipeline (hash + small classifier + LLM on the tail) handles 95% of decisions in under 100ms at one-tenth the cost. The LLM only touches the ambiguous middle where its judgment actually improves outcomes.

Precision vs recall tradeoff

High-recall (catch everything) policies like CSAM require low thresholds and large human queues - false positives are cheap, false negatives are catastrophic. High-precision policies like misinformation need high thresholds to avoid over-blocking speech. Tune thresholds per policy, not globally.
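"Tune thresholds per policy" can be as simple as a per-policy lookup table. The numbers below are illustrative placeholders showing the recall-first vs precision-first split, not recommended values:

```python
# Recall-first policies escalate at low scores; precision-first policies
# require high scores before acting. All values here are hypothetical.
THRESHOLDS = {
    "csam":    {"review": 0.10, "block": 0.50},  # recall-first
    "threat":  {"review": 0.15, "block": 0.60},
    "hate":    {"review": 0.40, "block": 0.85},
    "misinfo": {"review": 0.70, "block": 0.95},  # precision-first
}

def action_for(policy: str, score: float) -> str:
    t = THRESHOLDS[policy]
    if score >= t["block"]:
        return "block"
    if score >= t["review"]:
        return "review"
    return "allow"
```

The same 0.2 score queues a CSAM candidate for review but passes a misinformation candidate through untouched.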

Managed API vs self-hosted classifier

Perspective API and OpenAI Moderation are free and fine at low volume. Self-hosting a fine-tuned DistilBERT becomes cheaper around 10M items/month and gives you policy-specific control. Self-hosting also lets you train on your own labeled data, which is the only way to beat generic APIs on platform-specific content.

Failure modes & guardrails

Adversarial inputs (leetspeak, homoglyphs, image perturbations) slip past classifiers

Mitigation: Run a text-normalization pass (unicode NFKC, leetspeak dictionary, homoglyph detection) before the classifier. For images, use perceptual hashes (pHash, wavelet hash) not just cryptographic hashes so minor edits still match known-bad content.
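A minimal sketch of the normalization pass. The leetspeak and homoglyph maps here are deliberately tiny stand-ins; production dictionaries cover thousands of substitutions:

```python
import unicodedata

# Hypothetical, deliberately small substitution maps for illustration.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})
HOMOGLYPHS = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c"}  # Cyrillic lookalikes

def normalize(text: str) -> str:
    # 1. NFKC folds fullwidth, styled, and compatibility characters.
    text = unicodedata.normalize("NFKC", text)
    # 2. Map common Cyrillic homoglyphs back to their Latin twins.
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
    # 3. Fold leetspeak digits/symbols, then lowercase.
    return text.translate(LEET).lower()
```

Run this before embedding so "Ｈ4Ｔ3" and "hate" land on the same vector.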

Model over-blocks legitimate speech (false positive spike after model update)

Mitigation: Shadow-deploy every model update for 24-48h before switching traffic. Compare precision/recall vs current prod on a golden set of 10k+ labeled items. Alert if precision drops more than 2% or if false-positive rate on any protected category (race, religion, LGBTQ+ terms used positively) spikes.

Human review queue overflows and SLAs slip

Mitigation: Priority-tier the queue: credible-threat and CSAM get 15-minute SLA, hate speech 2h, soft NSFW 24h. Auto-apply soft actions (shadow-remove, lowered reach) while waiting. Track queue depth per tier and auto-escalate staffing when depth grows beyond throughput.
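The tiering above amounts to an earliest-deadline-first queue. A minimal sketch using the stdlib heap; the SLA table mirrors the 15-minute/2h/24h split, and class and method names are illustrative:

```python
import heapq
import itertools
import time

# SLA per policy tier, in seconds (15 min / 2 h / 24 h as above).
SLA = {"csam": 900, "credible_threat": 900, "hate": 7200, "soft_nsfw": 86400}

class ReviewQueue:
    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # stable tie-break for equal deadlines

    def push(self, item_id, policy, now=None):
        now = time.time() if now is None else now
        deadline = now + SLA[policy]
        # Earliest deadline pops first, regardless of arrival order.
        heapq.heappush(self._heap, (deadline, next(self._tie), item_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def depth(self):
        return len(self._heap)
```

Tracking `depth()` per tier is what feeds the auto-escalation of staffing.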

Policy drift - enforcement differs from written policy

Mitigation: Treat the policy document as ground truth and run weekly evals of the full pipeline against it. Maintain a labeled golden set of 2-5k items per policy category and require precision/recall targets before deploying changes.
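The weekly eval plus deploy gate can be sketched in a few lines; the 2-point regression limit matches the guardrail described in this document, and the function names are illustrative:

```python
def precision_recall(preds, labels):
    """Compute precision/recall from parallel lists of booleans."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(l and not p for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def gate_deploy(candidate, prod, max_regression=0.02):
    """Block the deploy if the candidate's precision or recall on the
    golden set regresses more than 2 points versus production."""
    cp, cr = candidate
    pp, pr = prod
    return cp >= pp - max_regression and cr >= pr - max_regression
```

Run this per policy category, not on pooled labels, so a regression in one policy cannot hide behind gains in another.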

Regulatory exposure (DSA, UK Online Safety Act, COPPA)

Mitigation: Log every decision with model version, scores, reviewer, jurisdiction, and rationale. Produce monthly transparency reports automatically. For EU traffic, route through a DSA-compliant appeals workflow with 14-day response SLA. Treat under-13 (COPPA) and under-18 (various) users with stricter thresholds.

Frequently asked questions

Should I use OpenAI Moderation API or build my own classifier?

Start with OpenAI Moderation API or Perspective API - both are free and handle the obvious cases. Graduate to a fine-tuned classifier once you have 10k+ labeled examples from your own platform and need policy-specific control. Below 10M items/month, managed APIs are almost always cheaper.

How do I handle images? Is GPT-4o vision good enough?

GPT-4o vision is too slow and too expensive for every image ($0.005-$0.015/image, 800ms+). Use AWS Rekognition Moderation or Hive Moderation at $0.0005-$0.002/image with 100-200ms latency for the 99% case. Reserve Claude Sonnet 4 vision for the ambiguous tail that needs context (memes, satire, artistic nudity).

What precision and recall should I target?

For CSAM and credible threats: 99%+ recall, precision can be lower because humans review. For hate speech: 85-90% recall, 90%+ precision to avoid over-blocking. For spam: 95% recall, 95% precision. Publish your targets internally and eval against them weekly.

How much does moderation cost at scale?

At 5B items/month, budget $80k-$120k/month in model spend, $50k-$100k in human moderation (outsourced), and $30k in infra (audit log, observability). All-in cost per item is $0.00005-$0.0001. If you are paying more than $0.001 per item at 100M+ volume, your tiering is wrong.
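The per-item figure is just a traffic-weighted sum over tiers. A back-of-envelope sketch using rough fractions and unit costs consistent with the scale-tier budget above (all numbers are assumptions, not quotes):

```python
def cost_per_item(tiers):
    """Blended per-item cost: each tier is (fraction of traffic, unit cost)."""
    return sum(frac * unit for frac, unit in tiers.values())

# Assumed fractions and amortized unit costs, chosen to match the
# ~$420k/mo at 5B items/mo figure above (~$0.000084 per item).
tiers = {
    "fast_classifier": (1.00, 0.000017),  # every item
    "vision":          (0.20, 0.00014),   # 20% of items carry images
    "llm_tail":        (0.01, 0.0018),    # ~1% escalated to the LLM
    "infra_humans":    (1.00, 0.000021),  # audit log, evals, BPO amortized
}
```

If the blended number drifts toward $0.001, the escalation fractions (not the unit prices) are usually what broke.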

Do I need human reviewers in the loop?

Yes. No model in 2026 is good enough to enforce nuanced policy (satire vs hate, news vs gore, context-dependent harassment) without humans. Plan for 0.1-0.5% of content to reach human review, with specialized reviewers for CSAM (trained, rotating, with mental health support). Full automation is a regulatory and PR liability.

How do I prevent model-vs-policy drift?

Maintain a golden evaluation set of 2-5k human-labeled items per policy category. Re-run the full pipeline against it on every model version change, prompt tweak, or threshold adjustment. Block deploys that regress precision or recall more than 2%. Refresh the golden set quarterly to capture new attack patterns.

Which jurisdictions care about my moderation stack?

EU (Digital Services Act - transparency reports, appeals, risk assessments), UK (Online Safety Act - child safety, proactive detection), US (Section 230 coverage but state laws like Texas HB 20 matter), Germany (NetzDG - 24h removal for flagged hate speech), India (IT Rules 2021). Design the audit log and appeals flow to satisfy DSA since it is the strictest.
