Reference Architecture · generation

Translation Pipeline at Scale

Last updated: April 16, 2026

Quick answer

The production stack loads a translation memory and a glossary into a cached system prompt, uses Claude Sonnet 4 or GPT-4o for quality-critical content and Gemini 2.0 Flash for high-volume low-stakes strings, runs an automated QA pass (placeholder check, term consistency, back-translation diff), and surfaces anything risky to a human linguist. Route non-urgent jobs through the OpenAI or Anthropic Batch API for a 50 percent cost saving; serve realtime traffic via streaming. Expect $0.0005 to $0.003 per translated string.

The problem

You need to translate product strings, help center articles, and marketing copy into 20+ locales, keep brand terminology consistent across translators, preserve ICU message format placeholders, and support both overnight batch jobs (thousands of strings) and realtime interactive translation (chat, comments). Off-the-shelf MT sounds wooden and breaks on domain terms; sending raw strings to ChatGPT throws away your glossary on every call.

Architecture

Pipeline flow: Source Ingest (input) → Translation Memory Lookup (data) → Glossary & Style Guide Injector (data) → Content Router (llm) → Translator LLM (llm) → Automated QA (infra) → Human Linguist Review (output) → TM & Glossary Writeback (data). 100 percent TM matches bypass the LLM entirely; segments that need translation flow through the router; QA passes go straight to writeback while flagged output goes to the linguist; non-urgent work runs through the Batch Worker (infra).

Source Ingest

Accepts strings from CMS, product codebase (i18n JSON/YAML), support ticket streams, and marketing platforms. Detects source locale and content type.

Alternatives: GitHub Action on i18n file change, Contentful webhook, Direct API submit, CSV upload for campaigns

Translation Memory Lookup

For every source segment, checks translation memory for 100 percent matches (reuse directly, zero LLM cost) and fuzzy matches (75-99 percent similarity, used as reference for LLM).

Alternatives: pgvector segment store, Pinecone with sentence embeddings, Redis with normalized hash for exact match
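The exact-match layer only pays off if lookups are keyed consistently. A minimal sketch of a normalized-hash key builder, assuming SHA-256 over NFC-normalized text (the function name and key format are illustrative, not from this stack):

```python
import hashlib
import unicodedata

def tm_exact_key(source: str, src_locale: str, tgt_locale: str) -> str:
    # Normalize so trivially different versions of the same segment
    # (Unicode form, stray whitespace) still hit the cache.
    normalized = " ".join(unicodedata.normalize("NFC", source).split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return f"tm:{src_locale}:{tgt_locale}:{digest}"
```

The returned string is used directly as the Redis key: a hit returns the stored translation at zero LLM cost, a miss falls through to the embedding search.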

Glossary & Style Guide Injector

Loads the per-locale glossary (brand terms, product names, forbidden translations) and tone guide into the cached system prompt. Also injects Do-Not-Translate list.

Alternatives: Single global glossary, Per-customer glossaries for localization platforms, LLM-learned glossary with periodic human review

Content Router

Classifies each segment by content type (UI string, marketing headline, doc body, legal, casual chat) and routes to the right model. Legal and marketing go to Sonnet 4; UI strings and chat go to Flash.

Alternatives: Rule-based on content-type tag, Claude Haiku 4, GPT-4o-mini
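Once segments carry a content-type label, the routing itself can be a plain lookup table. A sketch under the routing described above (model identifier strings are illustrative):

```python
ROUTING = {
    "legal": "claude-sonnet-4",
    "marketing": "claude-sonnet-4",
    "ui-string": "gemini-2.0-flash",
    "chat": "gemini-2.0-flash",
    "doc-body": "gemini-2.0-flash",
}

def route_segment(content_type: str, default: str = "claude-sonnet-4") -> str:
    # Unknown types fall back to the quality model: mis-routing cheap
    # content to Sonnet costs cents, routing legal to Flash costs trust.
    return ROUTING.get(content_type, default)
```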

Translator LLM

Performs the actual translation with source string, glossary, TM references, and target locale. Emits structured output including translation and a confidence score.

Alternatives: GPT-4o, Gemini 2.5 Pro, DeepSeek V3 for CJK, Llama 3.3 70B for on-prem

Automated QA

Validates ICU placeholders survived round-trip, glossary terms are used, length is within target locale bounds, and back-translation (target → source) does not diverge semantically beyond threshold.

Alternatives: LanguageTool integration for grammar, Sacrebleu or COMET score against TM, LLM-as-judge quality rubric

Human Linguist Review

For content flagged as low-confidence, high-stakes (marketing, legal), or new domain terms. Linguist edits in-place; approved edits feed back to TM and glossary.

Alternatives: Smartling, Phrase (formerly Memsource), Crowdin, Custom review UI

TM & Glossary Writeback

Approved translations are written to translation memory, embedded for future fuzzy lookup, and terminology extracted for glossary updates.

Alternatives: Manual TM management, Automatic with weekly human audit (recommended)

Batch Worker

Uses OpenAI or Anthropic Batch APIs for non-urgent large jobs at 50 percent cost reduction. SLA is 24h but typically completes in 1-4 hours.

Alternatives: Inngest queue, SQS + Lambda, Streaming for realtime only

The stack

Primary translation model: Claude Sonnet 4 for quality-critical, Gemini 2.0 Flash for bulk

Sonnet 4 preserves tone and handles glossary constraints most reliably across 20+ languages in 2026. Flash is 95+ percent as good for UI strings and chat at 5 percent of the cost. DeepSeek V3 is measurably better for zh, ja, ko. For any locale, test against 200+ translated strings before committing.

Alternatives: GPT-4o, Gemini 2.5 Pro, DeepSeek V3 for CJK languages

Translation memory: pgvector with Voyage-3 multilingual embeddings

You want both exact-match (hash lookup in Redis, free) and fuzzy-match (embedding search in pgvector, dollars per thousand) layers. 100 percent TM matches should never hit the LLM — they are the single biggest cost reduction lever in localization.

Alternatives: Pinecone, Qdrant, Exact-match Redis layer in front for 100 percent matches

Glossary and style guide: JSON glossary per locale, injected via prompt caching

Glossaries typically run 500-5,000 terms per locale, too large to re-send on every call. Cache the glossary as the first prefix block; it stays in cache across a batch of calls, and a cache hit cuts input cost roughly 90 percent.

Alternatives: Fine-tuned model per locale, RAG over glossary, Inline in every prompt (no caching)
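A sketch of how the cacheable prefix can be assembled, using Anthropic-style cache_control blocks (the helper name and glossary content are illustrative; adjust the block shape for other providers):

```python
import json

def build_system_blocks(glossary: dict, tone_guide: str, target_locale: str) -> list:
    # The large, stable glossary goes first so it lands in the cacheable
    # prefix; sort_keys keeps serialization byte-stable across calls,
    # which is what makes cache hits possible.
    glossary_block = {
        "type": "text",
        "text": (
            f"Glossary for {target_locale} (source term -> required target term):\n"
            + json.dumps(glossary, ensure_ascii=False, sort_keys=True)
        ),
        "cache_control": {"type": "ephemeral"},
    }
    return [glossary_block, {"type": "text", "text": tone_guide}]
```

Anything that varies per call (the source segment, TM fuzzy references) belongs in the user message, after the cached prefix, or it will break the cache on every request.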

QA and evaluation: COMET-Kiwi reference-free + placeholder check + LLM judge for flagged items

COMET-Kiwi gives a reference-free quality score that correlates well with human judgment. Placeholder and glossary checks are deterministic must-haves. LLM judge is the final safety net for high-stakes content.

Alternatives: BLEU (less reliable in 2026), Human evaluation sample, BERTScore

Batch API: Anthropic Batch API or OpenAI Batch API

50 percent cost reduction for non-urgent jobs is a massive lever at scale. Typical completion is 1-4 hours against the 24h SLA. Use streaming for realtime traffic (chat, user comments) and the batch API for everything else.

Alternatives: In-house queue (Inngest, SQS), Streaming only

Observability: Langfuse or Braintrust with per-locale dashboards

You need per-locale quality tracking because model performance varies dramatically by language pair. Track TM hit rate, average COMET score, human edit rate, and cost per locale separately.

Alternatives: Self-hosted logging, Helicone

Cost at each scale

Prototype

10k strings/mo across 5 locales

$55/mo

Translation (mixed Sonnet 4 / Flash): $28
Embeddings for TM: $4
QA (COMET + placeholder check): $3
Hosting (Vercel Hobby + Supabase free): $0
Observability (free tier): $20

Startup

500k strings/mo across 20 locales

$1,350/mo

Translation (batch API, glossary cached): $650
Realtime translation (streaming): $240
TM embeddings + vector DB: $150
QA (COMET + LLM judge): $90
Infra (Vercel Pro + queue): $50
Observability (Langfuse): $170

Scale

20M strings/mo across 30 locales

$28,500/mo

Translation (batch + realtime, heavy caching): $17,000
TM 100 percent match deflection (Redis): $300
TM fuzzy match (pgvector): $2,500
QA + back-translation: $2,800
Infra (Vercel Enterprise + Inngest Pro): $1,500
Observability + evals: $1,400
Human linguist review platform: $3,000

Latency budget

Total P50: 1,895ms · Total P95: 3,715ms

TM exact match lookup (Redis): 5ms p50 · 15ms p95
TM fuzzy match (pgvector): 60ms p50 · 180ms p95
Realtime translation (Flash, streamed): 450ms p50 · 900ms p95
Realtime translation (Sonnet 4, streamed): 1,200ms p50 · 2,200ms p95
Automated QA (COMET + checks): 180ms p50 · 420ms p95

Tradeoffs

LLM translation vs traditional NMT

LLMs (Sonnet 4, GPT-4o) beat DeepL, Google Translate, and Amazon Translate on tone, glossary adherence, and context-aware idiom handling. NMT is still 3-5x cheaper per character and faster. The production pattern uses NMT as a first pass for TM seeding, then the LLM for anything that matters — but most teams now skip NMT entirely once they have good TM.

Single prompt for all locales vs per-locale system prompt

A single prompt with target locale as a parameter is simpler but loses locale-specific context (formal vs informal pronouns, date formats, cultural sensitivity). Per-locale system prompts with locale-specific glossaries and tone guidance produce measurably better translations and are essential for locales like ja, ko, de, and ar where register matters.

Batch vs realtime economics

Batch API gives 50 percent off but a 24h SLA. Realtime costs full price but streams in under 2 seconds. Split by use case: product strings, help docs, marketing copy all go batch overnight. Chat messages, user comments, and urgent support translations go realtime. The split typically puts 80 percent of volume on batch.

Failure modes & guardrails

ICU placeholders dropped or corrupted

Mitigation: Extract all placeholders ({name}, {{count}}, %s, etc.) from source, require they appear verbatim in output via deterministic regex check. If missing, retry with explicit instruction listing each placeholder. Reject and route to human review if still missing after retry. Track placeholder survival rate per locale.
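The placeholder check described above can be sketched deterministically. The regex below covers only the placeholder shapes named in this document (ICU braces, double braces, printf-style); a production pipeline would use a real ICU MessageFormat parser for nested plural/select forms:

```python
import re

# ICU {name}, double-brace {{count}}, and printf-style %s / %d / %1$s.
PLACEHOLDER_RE = re.compile(r"\{\{[^{}]+\}\}|\{[^{}]+\}|%\d+\$[sd]|%[sd]")

def missing_placeholders(source: str, target: str) -> list:
    # Every placeholder in the source must appear verbatim in the target;
    # anything missing triggers the retry-then-human-review path.
    return [p for p in PLACEHOLDER_RE.findall(source) if p not in target]
```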

Glossary term translated inconsistently across segments

Mitigation: After translation, scan output for all glossary source terms and verify the mapped target term is used. If wrong translation found, run a targeted rewrite pass. Track per-term adherence rate — a dropping rate means the glossary needs disambiguation (e.g., same English term maps to different target terms by context).
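A minimal sketch of that post-translation scan (function name and matching policy are illustrative — real glossaries also need lemmatization for inflected languages):

```python
def glossary_violations(source: str, target: str, glossary: dict) -> list:
    # Case-insensitive match on the source side; target terms are kept
    # case-sensitive because brand spellings usually are.
    src_lower = source.lower()
    return [
        (term, required)
        for term, required in glossary.items()
        if term.lower() in src_lower and required not in target
    ]
```

Each returned pair feeds the targeted rewrite pass; the per-term violation count is what you chart to spot glossary entries that need disambiguation.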

Length explosion in target language breaks UI

Mitigation: For UI strings, pass a max-character hint (typically source length * 1.3 for Germanic, 1.5 for Romance). Validate output length; if over budget, retry with compression instruction. For unbounded content, flag for human review rather than truncate.
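The length check is simple arithmetic; a sketch using the expansion factors quoted above (the locale table is illustrative — calibrate factors per locale from your own corpus):

```python
# Rough expansion factors (1.3 Germanic, 1.5 Romance, per the guidance above).
EXPANSION = {"de": 1.3, "nl": 1.3, "fr": 1.5, "es": 1.5, "it": 1.5}

def length_budget(source: str, target_locale: str, default: float = 1.4) -> int:
    # Max character budget for a UI string in the target locale.
    return int(len(source) * EXPANSION.get(target_locale, default))

def over_budget(source: str, target: str, target_locale: str) -> bool:
    return len(target) > length_budget(source, target_locale)
```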

Low-resource languages (Swahili, Tagalog, Bengali) produce fluent but wrong translations

Mitigation: COMET-Kiwi scores are less reliable for low-resource languages. Sample 5-10 percent of all translations for human review in these locales versus 1 percent for well-resourced pairs. Maintain locale-specific glossaries that are 2-3x larger than for well-resourced languages.

Prompt injection via user-submitted content (chat, comments)

Mitigation: Treat every source string as untrusted. Strip or neutralize patterns that look like instructions (Ignore previous..., System:, role:). Use structured output so the model can only emit a translation field; anything outside that shape is discarded. Run a prompt-injection classifier on high-volume user-generated flows.
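A sketch of the pattern screen for the instruction-like strings listed above (the pattern set is illustrative and deliberately narrow — it complements, not replaces, the structured-output constraint and classifier):

```python
import re

# Patterns that read as instructions rather than content to translate.
SUSPECT = re.compile(
    r"(?i)\bignore\s+(all\s+)?previous|\bdisregard\s+the\s+above|\b(system|assistant|role)\s*:"
)

def looks_like_injection(segment: str) -> bool:
    # Flagged segments still get translated, but only via the
    # structured-output path, with the hit logged for review.
    return bool(SUSPECT.search(segment))
```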

Frequently asked questions

Which LLM is best for translation in 2026?

Claude Sonnet 4 and GPT-4o are both strong across most language pairs. DeepSeek V3 is measurably better for zh/ja/ko. Gemini 2.5 Pro is competitive on European languages. Gemini 2.0 Flash is good enough for bulk UI strings at 5 percent of the cost. Always benchmark on 200+ strings from your actual content before committing to a model per locale.

Should I use an LLM or traditional machine translation like DeepL?

LLMs win on tone, glossary adherence, and domain-specific terminology. DeepL and Google Translate are cheaper per character and faster but wooden, and they break on product terms. Most serious localization teams in 2026 have moved to LLMs for anything customer-facing and kept MT only for TM seeding or low-stakes internal content.

How do I enforce brand terminology across translations?

Maintain a per-locale glossary as a JSON map of source term -> required target term, injected into the cached system prompt. Post-translation, scan the output for every source term present and verify the mapped target was used. Retry with a targeted rewrite if the glossary term was ignored. Track glossary adherence rate per term.

What is the right way to handle placeholders like {name} or %s?

Extract all placeholders from source with a regex before translation. After translation, verify each placeholder appears verbatim in the target. If missing, retry with an explicit list of required placeholders; if still missing, route to human review. Do not rely on the LLM to silently preserve them — validate deterministically.

How much can translation memory reduce LLM costs?

On mature localization pipelines, TM deflects 30-60 percent of requests as 100 percent matches (zero LLM cost) and provides fuzzy-match references that improve LLM quality on another 20-30 percent. A Redis-backed exact-match layer costs pennies and is usually the highest-ROI optimization in the whole stack.
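The deflection math is worth making concrete. A sketch of the blended-cost calculation, with illustrative figures inside the $0.0005-$0.003 per-string range quoted in this article (the lookup cost is an assumed nominal value):

```python
def blended_cost_per_1k(tm_hit_rate: float, llm_cost_per_string: float,
                        lookup_cost_per_string: float = 0.00001) -> float:
    # 100 percent TM matches cost only the lookup; everything else
    # pays the LLM price on top of it.
    misses = 1.0 - tm_hit_rate
    return 1000 * (lookup_cost_per_string + misses * llm_cost_per_string)
```

At a 45 percent hit rate and $0.002 per LLM-translated string, the blended cost is about $1.11 per thousand strings, versus $2.01 with no TM layer in front.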

Do I need batch APIs, or is streaming fine?

Use both. Batch (Anthropic or OpenAI) gives a flat 50 percent discount on non-urgent jobs — product strings, docs, marketing all go through batch overnight. Realtime streaming is for chat, user comments, and on-demand translation where a 24h SLA does not work. Typical split is 80 percent batch, 20 percent realtime at scale.

How do I evaluate translation quality automatically?

COMET-Kiwi is the 2026 default for reference-free quality scoring — it correlates well with human judgment across most language pairs. Combine with deterministic placeholder and glossary checks, and an LLM-as-judge pass for flagged items. For low-resource languages, sample 5-10 percent for human review.

When do I need human linguists?

Always for marketing copy and legal content — the stakes are high and the creative judgment required is still beyond LLMs. For product UI strings and help docs, sample 1-5 percent for review. The review platform should push edits back to TM and glossary so the pipeline improves monotonically.
