Reference Architecture · classification

Realtime Content Moderation Pipeline

Last updated: April 16, 2026

Quick answer

The production stack is a three-tier funnel: a perceptual hash check (PhotoDNA / NCMEC) for known-bad images, a small fast classifier (Perspective API or a fine-tuned DistilBERT head on Voyage-3 embeddings) for 95%+ of traffic, and GPT-4o or Claude Sonnet 4 vision only on the ambiguous tail. Route uncertain items to a human review queue with policy-specific SLAs. Expect $0.0002-$0.001 per item at startup volumes, falling below $0.0001 at billions of items per month, with P95 latency under 180ms on the fast path.

The problem

You operate a social, marketplace, or UGC product where users upload text, images, or both. You need to block clearly harmful content (CSAM, gore, credible threats, spam) inside 200ms, flag borderline content for human review, and maintain an auditable trail for every decision. The system must handle 10k-100k items per second without letting hate speech or nudity slip through, and without over-blocking legitimate speech.

Architecture

Flow (diagram summary): the UGC Ingest API feeds the Perceptual Hash Match. A known-bad hit is a hard block before any model runs; on no match, text goes to the Policy Classifier (Fast Path) and, if media is present, to the Vision Moderator. Confident verdicts go straight to the Policy Decision Engine; uncertain scores (0.4-0.85) detour through LLM Deep Review (Slow Path) first. The decision engine sends items that need review to the Human Review Queue, writes every outcome to the Decision Audit Log, and feeds the User Appeals Service.

UGC Ingest API

Accepts text, images, or video frames from the client. Attaches user trust score, geo, and content metadata before forwarding.

Alternatives: Direct SDK upload, Webhook from upload service, Kafka stream

Perceptual Hash Match

Checks image/video hashes against PhotoDNA, NCMEC, and GIFCT databases for known CSAM and terrorist content. Hard block on match before any model runs.

Alternatives: PhotoDNA, GIFCT HSP, Apple NeuralHash, Internal hash store

Policy Classifier (Fast Path)

Small fine-tuned classifier running on embeddings. Returns multi-label scores for hate, sexual, violence, self-harm, spam, PII. Handles 95%+ of traffic with confident verdicts.

Alternatives: Perspective API, OpenAI Moderation API, Azure Content Safety, AWS Rekognition Moderation

Vision Moderator

Dedicated image classifier for nudity, violence, and graphic content. Runs on every image regardless of text verdict.

Alternatives: Hive Moderation, Google Vision SafeSearch, Azure Content Safety Image, Clarifai Moderation

LLM Deep Review (Slow Path)

Invoked only when fast classifier confidence falls in the 0.4-0.85 uncertainty band. Reads the full policy document and provides a structured rationale plus label.

Alternatives: GPT-4o, Gemini 2.5 Pro, Claude Haiku 4 for lower-stakes policies
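The 0.4-0.85 uncertainty band can be expressed as a small routing function. A minimal sketch, with the band edges as tunable parameters (the function and argument names are illustrative, not a production API):

```python
def route(fast_score: float, low: float = 0.4, high: float = 0.85) -> str:
    """Route one item based on the fast classifier's violation score.

    Defaults match the article's 0.4-0.85 uncertainty band; real systems
    tune these per policy.
    """
    if fast_score < low:
        return "allow"            # confident clean: fast path only
    if fast_score > high:
        return "block"            # confident violation: fast path only
    return "llm_deep_review"      # ambiguous middle: escalate to slow path
```

Because only the middle band escalates, the expensive model's cost scales with ambiguity, not with total traffic.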

Policy Decision Engine

Combines signals (hash hit, classifier scores, LLM verdict, user trust) into a final action: allow, shadow-remove, hard-block, or queue-for-review. Applies jurisdiction rules (DSA, GDPR).

Alternatives: Rules engine (OPA), Custom Python service, Cortex XSOAR
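A minimal sketch of the signal-combination step, assuming illustrative field names and thresholds; a real deployment would encode these rules in OPA or the custom service rather than hard-coding them:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Signals:
    hash_hit: bool                       # known-bad perceptual hash match
    scores: dict                         # per-policy classifier scores
    llm_verdict: Optional[str] = None    # "violating", "clean", or None
    user_trust: float = 0.5              # 0 = untrusted, 1 = highly trusted

def decide(s: Signals, block_at: float = 0.85, review_at: float = 0.4) -> str:
    # Hash hits and confirmed LLM verdicts short-circuit everything else.
    if s.hash_hit or s.llm_verdict == "violating":
        return "hard-block"
    worst = max(s.scores.values(), default=0.0)
    # Low-trust users get a stricter (hypothetical) block threshold.
    if worst >= block_at - (0.1 if s.user_trust < 0.3 else 0.0):
        return "shadow-remove"
    # Uncertain and not yet seen by the LLM: send to humans.
    if worst >= review_at and s.llm_verdict is None:
        return "queue-for-review"
    return "allow"
```

The ordering matters: hash evidence beats model scores, and model scores beat trust adjustments.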

Human Review Queue

Priority-sorted queue for moderators. Shows content, model rationale, user history, and suggested action. Ties into Trust and Safety team tooling.

Alternatives: Checkstep, TaskUs, Accenture Trust & Safety, Zendesk

Decision Audit Log

Append-only log of every decision with model versions, scores, reviewer ID, and final action. Required for DSA/DMA transparency reports and appeals.

Alternatives: BigQuery, Snowflake, ClickHouse, Postgres partitioned tables
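One way to shape the append-only record; the field names are illustrative, and the per-record digest is an optional tamper-evidence extra, not something DSA mandates:

```python
import datetime
import hashlib
import json

def audit_record(item_id, action, scores, model_version, reviewer_id=None):
    """Serialize one decision as a JSON line for an append-only log.

    Every input to the decision is captured so appeals and transparency
    reports can reconstruct exactly what the system knew at the time.
    """
    rec = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "item_id": item_id,
        "action": action,
        "scores": scores,
        "model_version": model_version,
        "reviewer_id": reviewer_id,
    }
    # Integrity digest over the canonical serialization (tamper evidence).
    rec["digest"] = hashlib.sha256(
        json.dumps(rec, sort_keys=True).encode()
    ).hexdigest()
    return json.dumps(rec, sort_keys=True)
```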

User Appeals Service

Allows users to contest removals. Re-runs the content through a more expensive model and sends it to a senior reviewer if still borderline.

Alternatives: Zendesk appeals flow, Custom React app, Intercom

The stack

Perceptual hashing: PhotoDNA + GIFCT HSP

PhotoDNA is industry standard for CSAM detection and free for qualifying platforms through Microsoft. GIFCT HSP covers terrorist content. Both run in under 5ms per image. You want these even if you also have model-based moderation.

Alternatives: Apple NeuralHash, Internal MD5 + pHash
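To see why perceptual hashes match near-duplicates where cryptographic hashes fail, here is a toy average-hash (aHash) over a pre-downscaled 8x8 grayscale grid. This is an illustration of the principle only; PhotoDNA and production pHash implementations are far more robust:

```python
def average_hash(pixels: list) -> int:
    """Threshold each pixel of an 8x8 grayscale image against the mean,
    packing the result into a 64-bit integer."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p >= mean else 0)
    return bits

def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def is_known_bad(h: int, db: set, max_dist: int = 5) -> bool:
    """Match within a Hamming-distance tolerance, so minor edits
    (brightness, small crops) still hit the known-bad database."""
    return any(hamming(h, known) <= max_dist for known in db)
```

A slightly edited copy flips only a few bits, staying inside the tolerance, while an MD5 of the same file would change completely.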

Fast text classifier: Voyage-3 embeddings + fine-tuned DistilBERT head

Self-hosted embedding + classifier gives you policy-specific control and 20-40ms latency. OpenAI's Moderation API is free and good enough to start. Switch to fine-tuned once you have 10k+ labeled examples and need custom policies.

Alternatives: OpenAI Moderation API (free), Perspective API, Azure Content Safety

Vision moderator: AWS Rekognition Content Moderation

Rekognition has the best NSFW and violence recall on real-world UGC at a reasonable $0.001 per image. Hive is better on cartoon/anime content. Azure Content Safety has the strongest CSAM detection integration for Microsoft-aligned stacks.

Alternatives: Hive Moderation, Google Vision SafeSearch, Azure Content Safety Image

LLM deep review: Claude Sonnet 4

Sonnet 4 follows complex policy documents better than GPT-4o for moderation. Haiku 4 is good enough for lower-stakes policies (spam, soft NSFW) at one-fourth the cost. Keep Opus 4 out of the hot path - use it only for appeals.

Alternatives: GPT-4o, Gemini 2.5 Pro, Claude Haiku 4

Decision orchestration: Open Policy Agent (OPA) + custom service

OPA lets non-engineers (Trust & Safety leads) edit policy rules without deploys. Critical when regulators change requirements mid-year (DSA, UK Online Safety Act).

Alternatives: Pure Python rules, AWS Step Functions, Temporal

Observability + evals: Braintrust + ClickHouse

You need precision/recall dashboards sliced by policy, geo, and model version. ClickHouse handles billions of rows for realtime dashboards. Braintrust runs the daily golden-set evals.

Alternatives: Arize Phoenix, Langfuse, Custom dashboard

Cost at each scale

Prototype

100,000 items/mo

$120/mo

OpenAI Moderation API (free): $0
Rekognition image moderation (20% images): $20
Claude Sonnet 4 deep review (5% items): $40
Hosting (Vercel + Supabase): $25
Audit log storage: $10
Observability (Langfuse free): $25

Startup

50,000,000 items/mo

$9,800/mo

Fast classifier inference (self-hosted): $1,200
Rekognition image moderation: $3,500
Claude Sonnet 4 deep review (3% tail): $2,400
Vector DB + embeddings (Voyage-3): $400
ClickHouse audit log: $600
Human reviewers (outsourced, 0.2% queue rate): $1,500
Braintrust evals + observability: $200

Scale

5,000,000,000 items/mo

$420,000/mo

Self-hosted classifier cluster (GPUs): $85,000
Rekognition / in-house vision: $140,000
Claude Sonnet 4 deep review tail (~1%): $90,000
Embeddings + vector infra: $18,000
ClickHouse + S3 audit retention: $22,000
Human moderation BPO: $55,000
Evals, compliance, DSA reporting: $10,000

Latency budget

Perceptual hash lookup: 6ms median · 15ms P95
Fast text classifier: 28ms median · 65ms P95
Vision classifier: 110ms median · 220ms P95
Policy decision engine: 12ms median · 30ms P95
LLM deep review (tail path only): 1,400ms median · 2,600ms P95

Total, when the LLM tail fires: 1,556ms P50 · 2,930ms P95

Tradeoffs

LLM-only vs tiered pipeline

Routing every item through Claude Sonnet 4 would cost 50-100x more and blow P95 latency past 1 second. A tiered pipeline (hash + small classifier + LLM on the tail) handles 95% of decisions in under 100ms at one-tenth the cost. The LLM only touches the ambiguous middle where its judgment actually improves outcomes.

Precision vs recall tradeoff

High-recall (catch everything) policies like CSAM require low thresholds and large human queues - false positives are cheap, false negatives are catastrophic. High-precision policies like misinformation need high thresholds to avoid over-blocking speech. Tune thresholds per policy, not globally.
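"Tune thresholds per policy" can be as simple as a per-policy lookup table. The numbers below are illustrative placeholders showing the recall-first vs precision-first split, not recommended values:

```python
# Recall-first policies escalate at low scores; precision-first policies
# require high scores before acting. All values here are hypothetical.
THRESHOLDS = {
    "csam":    {"review": 0.10, "block": 0.50},  # recall-first
    "threat":  {"review": 0.15, "block": 0.60},
    "hate":    {"review": 0.40, "block": 0.85},
    "misinfo": {"review": 0.70, "block": 0.95},  # precision-first
}

def action_for(policy: str, score: float) -> str:
    t = THRESHOLDS[policy]
    if score >= t["block"]:
        return "block"
    if score >= t["review"]:
        return "review"
    return "allow"
```

The same 0.2 score queues a CSAM candidate for review but passes a misinformation candidate through untouched.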

Managed API vs self-hosted classifier

Perspective API and OpenAI Moderation are free and fine at low volume. Self-hosting a fine-tuned DistilBERT becomes cheaper around 10M items/month and gives you policy-specific control. Self-hosting also lets you train on your own labeled data, which is the only way to beat generic APIs on platform-specific content.

Failure modes & guardrails

Adversarial inputs (leetspeak, homoglyphs, image perturbations) slip past classifiers

Mitigation: Run a text-normalization pass (unicode NFKC, leetspeak dictionary, homoglyph detection) before the classifier. For images, use perceptual hashes (pHash, wavelet hash) not just cryptographic hashes so minor edits still match known-bad content.
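A minimal sketch of the normalization pass. The leetspeak and homoglyph maps here are deliberately tiny stand-ins; production dictionaries cover thousands of substitutions:

```python
import unicodedata

# Hypothetical, deliberately small substitution maps for illustration.
LEET = str.maketrans({"0": "o", "1": "i", "3": "e", "4": "a",
                      "5": "s", "7": "t", "@": "a", "$": "s"})
HOMOGLYPHS = {"а": "a", "е": "e", "о": "o", "р": "p", "с": "c"}  # Cyrillic lookalikes

def normalize(text: str) -> str:
    # 1. NFKC folds fullwidth, styled, and compatibility characters.
    text = unicodedata.normalize("NFKC", text)
    # 2. Map common Cyrillic homoglyphs back to their Latin twins.
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
    # 3. Fold leetspeak digits/symbols, then lowercase.
    return text.translate(LEET).lower()
```

Run this before embedding so "Ｈ4Ｔ3" and "hate" land on the same vector.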

Model over-blocks legitimate speech (false positive spike after model update)

Mitigation: Shadow-deploy every model update for 24-48h before switching traffic. Compare precision/recall vs current prod on a golden set of 10k+ labeled items. Alert if precision drops more than 2% or if false-positive rate on any protected category (race, religion, LGBTQ+ terms used positively) spikes.

Human review queue overflows and SLAs slip

Mitigation: Priority-tier the queue: credible-threat and CSAM get 15-minute SLA, hate speech 2h, soft NSFW 24h. Auto-apply soft actions (shadow-remove, lowered reach) while waiting. Track queue depth per tier and auto-escalate staffing when depth grows beyond throughput.
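The tiering above amounts to an earliest-deadline-first queue. A minimal sketch using the stdlib heap; the SLA table mirrors the 15-minute/2h/24h split, and class and method names are illustrative:

```python
import heapq
import itertools
import time

# SLA per policy tier, in seconds (15 min / 2 h / 24 h as above).
SLA = {"csam": 900, "credible_threat": 900, "hate": 7200, "soft_nsfw": 86400}

class ReviewQueue:
    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # stable tie-break for equal deadlines

    def push(self, item_id, policy, now=None):
        now = time.time() if now is None else now
        deadline = now + SLA[policy]
        # Earliest deadline pops first, regardless of arrival order.
        heapq.heappush(self._heap, (deadline, next(self._tie), item_id))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def depth(self):
        return len(self._heap)
```

Tracking `depth()` per tier is what feeds the auto-escalation of staffing.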

Policy drift - enforcement differs from written policy

Mitigation: Treat the policy document as ground truth and run weekly evals of the full pipeline against it. Maintain a labeled golden set of 2-5k items per policy category and require precision/recall targets before deploying changes.
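The weekly eval plus deploy gate can be sketched in a few lines; the 2-point regression limit matches the guardrail described in this document, and the function names are illustrative:

```python
def precision_recall(preds, labels):
    """Compute precision/recall from parallel lists of booleans."""
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(l and not p for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

def gate_deploy(candidate, prod, max_regression=0.02):
    """Block the deploy if the candidate's precision or recall on the
    golden set regresses more than 2 points versus production."""
    cp, cr = candidate
    pp, pr = prod
    return cp >= pp - max_regression and cr >= pr - max_regression
```

Run this per policy category, not on pooled labels, so a regression in one policy cannot hide behind gains in another.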

Regulatory exposure (DSA, UK Online Safety Act, COPPA)

Mitigation: Log every decision with model version, scores, reviewer, jurisdiction, and rationale. Produce monthly transparency reports automatically. For EU traffic, route through a DSA-compliant appeals workflow with 14-day response SLA. Treat under-13 (COPPA) and under-18 (various) users with stricter thresholds.

Frequently asked questions

Should I use OpenAI Moderation API or build my own classifier?

Start with OpenAI Moderation API or Perspective API - both are free and handle the obvious cases. Graduate to a fine-tuned classifier once you have 10k+ labeled examples from your own platform and need policy-specific control. Below 10M items/month, managed APIs are almost always cheaper.

How do I handle images? Is GPT-4o vision good enough?

GPT-4o vision is too slow and too expensive for every image ($0.005-$0.015/image, 800ms+). Use AWS Rekognition Moderation or Hive Moderation at $0.0005-$0.002/image with 100-200ms latency for the 99% case. Reserve Claude Sonnet 4 vision for the ambiguous tail that needs context (memes, satire, artistic nudity).

What precision and recall should I target?

For CSAM and credible threats: 99%+ recall, precision can be lower because humans review. For hate speech: 85-90% recall, 90%+ precision to avoid over-blocking. For spam: 95% recall, 95% precision. Publish your targets internally and eval against them weekly.

How much does moderation cost at scale?

At 5B items/month, budget $80k-$120k/month in model spend, $50k-$100k in human moderation (outsourced), and $30k in infra (audit log, observability). All-in cost per item is $0.00005-$0.0001. If you are paying more than $0.001 per item at 100M+ volume, your tiering is wrong.
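The per-item figure is just a traffic-weighted sum over tiers. A back-of-envelope sketch using rough fractions and unit costs consistent with the scale-tier budget above (all numbers are assumptions, not quotes):

```python
def cost_per_item(tiers):
    """Blended per-item cost: each tier is (fraction of traffic, unit cost)."""
    return sum(frac * unit for frac, unit in tiers.values())

# Assumed fractions and amortized unit costs, chosen to match the
# ~$420k/mo at 5B items/mo figure above (~$0.000084 per item).
tiers = {
    "fast_classifier": (1.00, 0.000017),  # every item
    "vision":          (0.20, 0.00014),   # 20% of items carry images
    "llm_tail":        (0.01, 0.0018),    # ~1% escalated to the LLM
    "infra_humans":    (1.00, 0.000021),  # audit log, evals, BPO amortized
}
```

If the blended number drifts toward $0.001, the escalation fractions (not the unit prices) are usually what broke.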

Do I need human reviewers in the loop?

Yes. No model in 2026 is good enough to enforce nuanced policy (satire vs hate, news vs gore, context-dependent harassment) without humans. Plan for 0.1-0.5% of content to reach human review, with specialized reviewers for CSAM (trained, rotating, with mental health support). Full automation is a regulatory and PR liability.

How do I prevent model-vs-policy drift?

Maintain a golden evaluation set of 2-5k human-labeled items per policy category. Re-run the full pipeline against it on every model version change, prompt tweak, or threshold adjustment. Block deploys that regress precision or recall more than 2%. Refresh the golden set quarterly to capture new attack patterns.

Which jurisdictions care about my moderation stack?

EU (Digital Services Act - transparency reports, appeals, risk assessments), UK (Online Safety Act - child safety, proactive detection), US (Section 230 coverage but state laws like Texas HB 20 matter), Germany (NetzDG - 24h removal for flagged hate speech), India (IT Rules 2021). Design the audit log and appeals flow to satisfy DSA since it is the strictest.
