Evaluation
ROUGE Score
Quick Answer
A family of recall-oriented metrics that score a model's summary by its n-gram (or longest-common-subsequence) overlap with reference summaries.
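ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures summarization quality by comparing model output against one or more reference summaries. ROUGE-1 and ROUGE-2 measure unigram and bigram overlap, and ROUGE-L measures the longest common subsequence shared by the output and the reference. Because ROUGE scores surface word overlap, it can penalize reasonable paraphrases that express the same content in different words. Metrics that capture meaning more directly exist, but ROUGE remains the standard for reporting summarization results. It is less relevant for open-ended generation, where many valid outputs share little wording with any single reference.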
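The overlap counts behind ROUGE-1, ROUGE-2, and ROUGE-L can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes lowercased whitespace tokenization, a single reference, and no stemming, and the function names (ngrams, prf, rouge_n, lcs_length, rouge_l) are made up for this example. Published ROUGE toolkits differ in tokenization, stemming, and multi-reference handling, so their scores will generally not match this sketch exactly.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def prf(overlap, cand_total, ref_total):
    """Turn an overlap count into (precision, recall, F1)."""
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def rouge_n(candidate, reference, n):
    """ROUGE-N sketch: clipped n-gram overlap (assumes whitespace tokens)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    return prf(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L sketch: LCS length relative to candidate and reference length."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return prf(lcs_length(cand, ref), len(cand), len(ref))


reference = "the cat sat on the mat"
close = "the cat lay on the mat"            # one word changed
paraphrase = "a feline rested upon the rug"  # same idea, different words

for name, text in [("close", close), ("paraphrase", paraphrase)]:
    _, r1, _ = rouge_n(text, reference, 1)
    _, r2, _ = rouge_n(text, reference, 2)
    _, rl, _ = rouge_l(text, reference)
    print(f"{name}: ROUGE-1 R={r1:.2f}  ROUGE-2 R={r2:.2f}  ROUGE-L R={rl:.2f}")
```

In this toy case the near-copy recovers 5 of the 6 reference unigrams and 3 of the 5 reference bigrams, while the paraphrase shares only the word "the" with the reference. That gap is the paraphrase penalty described above: the second candidate may be an acceptable summary, but overlap-based scoring cannot see that.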
Last verified: 2026-04-08