Reference Architecture · generation
Translation Pipeline at Scale
Last updated: April 16, 2026
Quick answer
The production stack loads a translation memory and a glossary into a cached system prompt, uses Claude Sonnet 4 or GPT-4o for quality-critical content and Gemini 2.0 Flash for high-volume low-stakes strings, runs an automated QA pass (placeholder check, term consistency, back-translation diff), and surfaces anything risky to a human linguist. Run non-urgent jobs through the OpenAI/Anthropic Batch APIs for 50 percent cost savings and serve realtime requests via streaming. Expect $0.0005 to $0.003 per translated string.
The problem
You need to translate product strings, help center articles, and marketing copy into 20+ locales, keep brand terminology consistent across translators, preserve ICU message format placeholders, and support both overnight batch jobs (thousands of strings) and realtime interactive translation (chat, comments). Off-the-shelf MT sounds wooden and breaks on domain terms; sending raw strings to ChatGPT drops your glossary on every call.
Architecture
Source Ingest
Accepts strings from CMS, product codebase (i18n JSON/YAML), support ticket streams, and marketing platforms. Detects source locale and content type.
Alternatives: GitHub Action on i18n file change, Contentful webhook, Direct API submit, CSV upload for campaigns
Translation Memory Lookup
For every source segment, checks translation memory for 100 percent matches (reuse directly, zero LLM cost) and fuzzy matches (75-99 percent similarity, used as reference for LLM).
Alternatives: pgvector segment store, Pinecone with sentence embeddings, Redis with normalized hash for exact match
Glossary & Style Guide Injector
Loads the per-locale glossary (brand terms, product names, forbidden translations) and tone guide into the cached system prompt. Also injects Do-Not-Translate list.
Alternatives: Single global glossary, Per-customer glossaries for localization platforms, LLM-learned glossary with periodic human review
Content Router
Classifies each segment by content type (UI string, marketing headline, doc body, legal, casual chat) and routes to the right model. Legal and marketing go to Sonnet 4; UI strings and chat go to Flash.
Alternatives: Rule-based on content-type tag, Claude Haiku 4, GPT-4o-mini
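A rule-based router along these lines can be sketched as below. The model identifiers and the 80-character UI-string heuristic are illustrative assumptions, not benchmarked values:

```python
# Hypothetical model tier names; map these to whatever models you benchmarked.
ROUTES = {
    "legal": "claude-sonnet",
    "marketing": "claude-sonnet",
    "doc_body": "claude-sonnet",
    "ui_string": "gemini-flash",
    "chat": "gemini-flash",
}

def route(segment: dict) -> str:
    """Prefer an explicit content-type tag from the CMS; fall back to a cheap
    heuristic (short strings containing placeholders look like UI strings)."""
    tag = segment.get("content_type")
    if tag in ROUTES:
        return ROUTES[tag]
    text = segment["text"]
    if len(text) < 80 and ("{" in text or "%s" in text):
        return ROUTES["ui_string"]
    return ROUTES["doc_body"]  # default untagged content to the careful tier
```

Defaulting unknowns to the careful tier trades a little cost for safety; an LLM classifier only earns its keep when content-type tags are missing at scale.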
Translator LLM
Performs the actual translation with source string, glossary, TM references, and target locale. Emits structured output including translation and a confidence score.
Alternatives: GPT-4o, Gemini 2.5 Pro, DeepSeek V3 for CJK, Llama 3.3 70B for on-prem
Automated QA
Validates ICU placeholders survived round-trip, glossary terms are used, length is within target locale bounds, and back-translation (target → source) does not diverge semantically beyond threshold.
Alternatives: LanguageTool integration for grammar, Sacrebleu or COMET score against TM, LLM-as-judge quality rubric
Human Linguist Review
For content flagged as low-confidence, high-stakes (marketing, legal), or new domain terms. Linguist edits in-place; approved edits feed back to TM and glossary.
Alternatives: Smartling, Phrase (formerly Memsource), Crowdin, Custom review UI
TM & Glossary Writeback
Approved translations are written to translation memory, embedded for future fuzzy lookup, and terminology extracted for glossary updates.
Alternatives: Manual TM management, Automatic with weekly human audit (recommended)
Batch Worker
Uses OpenAI or Anthropic Batch APIs for non-urgent large jobs at 50 percent cost reduction. SLA is 24h but typically completes in 1-4 hours.
Alternatives: Inngest queue, SQS + Lambda, Streaming for realtime only
The stack
Sonnet 4 preserves tone and handles glossary constraints most reliably across 20+ languages in 2026. Flash is 95+ percent as good for UI strings and chat at 5 percent of the cost. DeepSeek V3 is measurably better for zh, ja, ko. For any locale, test against 200+ translated strings before committing.
Alternatives: GPT-4o, Gemini 2.5 Pro, DeepSeek V3 for CJK languages
You want both an exact-match layer (hash lookup in Redis, effectively free) and a fuzzy-match layer (embedding search in pgvector, on the order of dollars per thousand segments). 100 percent TM matches should never hit the LLM; they are the single biggest cost-reduction lever in localization.
Alternatives: Pinecone, Qdrant, Exact-match Redis layer in front for 100 percent matches
Glossaries typically run 500-5,000 terms per locale, too large to pay full input price for on every call. Cache the glossary as the first prefix block; it stays in cache across a batch of calls, and a cache hit cuts input cost by roughly 90 percent.
Alternatives: Fine-tuned model per locale, RAG over glossary, Inline in every prompt (no caching)
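A sketch of the cached-prefix idea using the request shape from Anthropic's prompt caching (system content blocks with a `cache_control` marker on the last stable block). The glossary serialization and the helper name are assumptions:

```python
import json

def build_system_blocks(glossary: dict[str, str], tone_guide: str) -> list[dict]:
    """Put the glossary and tone guide first as a stable prefix. The
    cache_control marker on the final stable block tells the API to cache
    everything up to and including that block across calls."""
    glossary_text = "Glossary (source term -> required target term):\n" + json.dumps(
        glossary,
        ensure_ascii=False,
        sort_keys=True,  # stable ordering keeps the prefix byte-identical across calls
    )
    return [
        {"type": "text", "text": glossary_text},
        {
            "type": "text",
            "text": tone_guide,
            "cache_control": {"type": "ephemeral"},
        },
    ]

blocks = build_system_blocks(
    {"Dashboard": "Dashboard", "Workspace": "Arbeitsbereich"},
    "Use the formal Sie register for de-DE.",
)
```

The per-segment source string goes in the user message, after the cached prefix, so only the small variable suffix is billed at full input price.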
COMET-Kiwi gives a reference-free quality score that correlates well with human judgment. Placeholder and glossary checks are deterministic must-haves. LLM judge is the final safety net for high-stakes content.
Alternatives: BLEU (less reliable in 2026), Human evaluation sample, BERTScore
A 50 percent cost reduction for non-urgent jobs is a massive lever at scale. Typical completion is 1-4 hours against the 24h SLA. Use streaming for realtime traffic (chat, user comments) and the Batch API for everything else.
Alternatives: In-house queue (Inngest, SQS), Streaming only
You need per-locale quality tracking because model performance varies dramatically by language pair. Track TM hit rate, average COMET score, human edit rate, and cost per locale separately.
Alternatives: Self-hosted logging, Helicone
Cost at each scale
Prototype
10k strings/mo across 5 locales
$55/mo
Startup
500k strings/mo across 20 locales
$1,350/mo
Scale
20M strings/mo across 30 locales
$28,500/mo
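Monthly LLM spend at any tier is dominated by three knobs: TM deflection, batch share, and blended per-string price. A toy cost model makes the interaction explicit; the parameters in the example call (40 percent TM hit rate, 80 percent batch share, $0.003 per string) are illustrative assumptions, not measured values:

```python
def monthly_llm_cost(strings: int, tm_hit_rate: float,
                     batch_share: float, cost_per_string: float) -> float:
    """Strings deflected by 100 percent TM matches cost nothing; batched
    volume gets the flat 50 percent Batch API discount; the remainder
    runs realtime at full price."""
    llm_strings = strings * (1 - tm_hit_rate)
    batched = llm_strings * batch_share * cost_per_string * 0.5
    realtime = llm_strings * (1 - batch_share) * cost_per_string
    return batched + realtime

# 500k strings/mo, 40% TM hits, 80% batched, $0.003/string blended
estimate = monthly_llm_cost(500_000, 0.40, 0.80, 0.003)
```

Note this covers LLM spend only; the tier figures above also absorb embeddings, QA, and review tooling.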
Tradeoffs
LLM translation vs traditional NMT
LLMs (Sonnet 4, GPT-4o) beat DeepL, Google Translate, and Amazon Translate on tone, glossary adherence, and context-aware idiom handling. NMT is still 3-5x cheaper per character and faster. The classic production pattern uses NMT as a first pass for TM seeding, then an LLM for anything that matters; most teams now skip NMT entirely once they have a good TM.
Single prompt for all locales vs per-locale system prompt
A single prompt with target locale as a parameter is simpler but loses locale-specific context (formal vs informal pronouns, date formats, cultural sensitivity). Per-locale system prompts with locale-specific glossaries and tone guidance produce measurably better translations and are essential for locales like ja, ko, de, and ar where register matters.
Batch vs realtime economics
Batch API gives 50 percent off but a 24h SLA. Realtime costs full price but streams in under 2 seconds. Split by use case: product strings, help docs, marketing copy all go batch overnight. Chat messages, user comments, and urgent support translations go realtime. The split typically puts 80 percent of volume on batch.
Failure modes & guardrails
ICU placeholders dropped or corrupted
Mitigation: Extract all placeholders ({name}, {{count}}, %s, etc.) from the source and require that each appears verbatim in the output, enforced by a deterministic regex check. If any are missing, retry with an explicit instruction listing each placeholder; if still missing after the retry, reject and route to human review. Track placeholder survival rate per locale.
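The placeholder check is deterministic. A sketch covering ICU single-brace, double-brace, and printf-style tokens; extend the regex for whatever formats your codebase actually uses:

```python
import re

# Double-brace alternative comes first so {{count}} is not split into {count}.
PLACEHOLDER_RE = re.compile(r"\{\{[^{}]+\}\}|\{[^{}]+\}|%[sd]")

def missing_placeholders(source: str, target: str) -> list[str]:
    """Return every source placeholder that did not survive translation
    verbatim; an empty list means the segment passes this check."""
    required = PLACEHOLDER_RE.findall(source)
    return [p for p in required if p not in target]
```

A non-empty result triggers the retry-with-explicit-list step, and the list itself can be interpolated directly into the retry instruction.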
Glossary term translated inconsistently across segments
Mitigation: After translation, scan output for all glossary source terms and verify the mapped target term is used. If wrong translation found, run a targeted rewrite pass. Track per-term adherence rate — a dropping rate means the glossary needs disambiguation (e.g., same English term maps to different target terms by context).
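The glossary scan can start as a simple substring pass. A sketch, assuming a flat source-term to target-term map; real glossaries eventually need morphology-aware matching for inflected target languages:

```python
def glossary_violations(source: str, target: str,
                        glossary: dict[str, str]) -> list[tuple[str, str]]:
    """For each glossary source term present in the source segment, verify
    the mapped target term appears in the translation. Case-insensitive on
    the source side; exact match required on the target side."""
    violations = []
    src_lower = source.lower()
    for src_term, tgt_term in glossary.items():
        if src_term.lower() in src_lower and tgt_term not in target:
            violations.append((src_term, tgt_term))
    return violations
```

Each violation tuple feeds the targeted rewrite pass, and counting violations per term over time gives you the adherence-rate metric.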
Length explosion in target language breaks UI
Mitigation: For UI strings, pass a max-character hint (typically source length * 1.3 for Germanic, 1.5 for Romance). Validate output length; if over budget, retry with compression instruction. For unbounded content, flag for human review rather than truncate.
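The length check might look like this. The expansion factors mirror the heuristics above, and the 1.4 default for unlisted locales is an assumption to tune:

```python
# Rough per-locale expansion factors: ~1.3 for Germanic, ~1.5 for Romance.
EXPANSION = {"de": 1.3, "nl": 1.3, "fr": 1.5, "es": 1.5, "it": 1.5}

def over_budget(source: str, target: str, locale: str,
                default_factor: float = 1.4) -> bool:
    """True when a translated UI string exceeds its per-locale length budget,
    which should trigger a retry with a compression instruction."""
    budget = int(len(source) * EXPANSION.get(locale, default_factor))
    return len(target) > budget
```

Short source strings are the usual offenders ("Save" has almost no room to grow), so expect most retries on strings under ~15 characters.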
Low-resource languages (Swahili, Tagalog, Bengali) produce fluent but wrong translations
Mitigation: COMET-Kiwi scores are less reliable for low-resource languages. Sample 5-10 percent of all translations for human review in these locales versus 1 percent for well-resourced pairs. Maintain locale-specific glossaries that are 2-3x larger than for well-resourced languages.
Prompt injection via user-submitted content (chat, comments)
Mitigation: Treat every source string as untrusted. Strip or neutralize patterns that look like instructions (Ignore previous..., System:, role:). Use structured output so the model can only emit a translation field; anything outside that shape is discarded. Run a prompt-injection classifier on high-volume user-generated flows.
Frequently asked questions
Which LLM is best for translation in 2026?
Claude Sonnet 4 and GPT-4o are both strong across most language pairs. DeepSeek V3 is measurably better for zh/ja/ko. Gemini 2.5 Pro is competitive on European languages. Gemini 2.0 Flash is good enough for bulk UI strings at 5 percent of the cost. Always benchmark on 200+ strings from your actual content before committing to a model per locale.
Should I use an LLM or traditional machine translation like DeepL?
LLMs win on tone, glossary adherence, and domain-specific terminology. DeepL and Google Translate are cheaper per character and faster but wooden, and they break on product terms. Most serious localization teams in 2026 have moved to LLMs for anything customer-facing and kept MT only for TM seeding or low-stakes internal content.
How do I enforce brand terminology across translations?
Maintain a per-locale glossary as a JSON map of source term -> required target term, injected into the cached system prompt. Post-translation, scan the output for every source term present and verify the mapped target was used. Retry with a targeted rewrite if the glossary term was ignored. Track glossary adherence rate per term.
What is the right way to handle placeholders like {name} or %s?
Extract all placeholders from source with a regex before translation. After translation, verify each placeholder appears verbatim in the target. If missing, retry with an explicit list of required placeholders; if still missing, route to human review. Do not rely on the LLM to silently preserve them — validate deterministically.
How much can translation memory reduce LLM costs?
On mature localization pipelines, TM deflects 30-60 percent of requests as 100 percent matches (zero LLM cost) and provides fuzzy-match references that improve LLM quality on another 20-30 percent. A Redis-backed exact-match layer costs pennies and is usually the highest-ROI optimization in the whole stack.
Do I need batch APIs, or is streaming fine?
Use both. Batch (Anthropic or OpenAI) gives a flat 50 percent discount on non-urgent jobs — product strings, docs, marketing all go through batch overnight. Realtime streaming is for chat, user comments, and on-demand translation where a 24h SLA does not work. Typical split is 80 percent batch, 20 percent realtime at scale.
How do I evaluate translation quality automatically?
COMET-Kiwi is the 2026 default for reference-free quality scoring — it correlates well with human judgment across most language pairs. Combine with deterministic placeholder and glossary checks, and an LLM-as-judge pass for flagged items. For low-resource languages, sample 5-10 percent for human review.
When do I need human linguists?
Always for marketing copy and legal content — the stakes are high and the creative judgment required is still beyond LLMs. For product UI strings and help docs, sample 1-5 percent for review. The review platform should push edits back to TM and glossary so the pipeline improves monotonically.