HellaSwag

Quick Answer

A commonsense-reasoning benchmark in which a model picks the most plausible continuation of a short scenario description.

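HellaSwag contains about 70K multiple-choice questions. Each item gives a short context, drawn from video captions (ActivityNet) or WikiHow articles, together with four candidate endings, and the task is to choose the ending that most plausibly happens next. The incorrect endings are machine-generated and adversarially filtered, which makes them easy for humans to reject (over 95% accuracy) but hard for the models available when the benchmark was released in 2019 (below 50%). Modern models reach roughly 88% or higher. Because it targets everyday physical and social plausibility rather than factual recall or math, HellaSwag measures a different skill than knowledge-focused benchmarks, and it remains a standard benchmark for commonsense reasoning.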
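To make the multiple-choice setup concrete, here is a minimal sketch of how one item might be scored. It assumes a common convention that the page itself does not specify: the model assigns a log-likelihood to each candidate ending, the score is divided by the ending's length so longer endings are not penalized, and the highest-scoring ending is the prediction. The function names and all numbers are hypothetical, chosen only for illustration.

```python
from typing import Sequence


def pick_ending(logprobs: Sequence[float], lengths: Sequence[int]) -> int:
    """Return the index of the ending with the highest length-normalized score.

    logprobs: total log-likelihood of each candidate ending (made-up inputs here).
    lengths:  length of each ending in tokens, used for normalization.
    """
    normalized = [lp / n for lp, n in zip(logprobs, lengths)]
    return max(range(len(normalized)), key=normalized.__getitem__)


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of items where the predicted ending matches the gold label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


# One hypothetical item with four endings (illustrative values, not real data).
logprobs = [-12.0, -9.0, -20.0, -15.0]
lengths = [6, 3, 8, 5]
choice = pick_ending(logprobs, lengths)  # normalized scores: -2.0, -3.0, -2.5, -3.0
```

In this toy item the raw log-likelihood favors the second ending, but after length normalization the first ending wins. This is why evaluation tools often report both a raw and a normalized accuracy.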

Last verified: 2026-04-08
