Evaluation

BLEU Score

Quick Answer

A metric measuring similarity between machine translation output and reference translations.

BLEU (Bilingual Evaluation Understudy) measures translation quality by the n-gram overlap between a system's output and one or more reference translations: a higher score means the output is more similar to the references. It combines clipped n-gram precisions (typically for n = 1 to 4) with a brevity penalty that discourages overly short outputs. BLEU is simple, cheap, and fully automated, and it correlates reasonably well with human judgment at the corpus level, but it has well-known limitations: it penalizes valid paraphrases and synonyms, since only exact n-gram matches count. BLEU remains the standard metric for machine translation, though it is less relevant for evaluating modern LLMs on open-ended generation tasks.
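The mechanics above can be sketched in a few lines. This is a simplified, hypothetical implementation (the `bleu` helper is not a real library API): single reference, uniform weights over n = 1..4, and no smoothing, unlike production implementations such as sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU sketch: clipped n-gram precision + brevity penalty.

    candidate, reference: lists of tokens. Simplifications: one reference,
    uniform weights, no smoothing (a single zero precision zeroes the score).
    """
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clip each candidate n-gram's count by its count in the reference,
        # so repeating a matching n-gram cannot inflate precision.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if total == 0 or clipped == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Brevity penalty: < 1 when the candidate is shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat".split()
print(bleu(reference, reference))            # identical output scores 1.0
print(bleu("the cat sat on mat".split(), reference))  # partial overlap, < 1
```

An exact copy of the reference scores 1.0; dropping a word lowers every n-gram precision and triggers the brevity penalty, illustrating why BLEU punishes paraphrase even when meaning is preserved.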

Last verified: 2026-04-08
