Evaluation

Pass@K

Quick Answer

For code generation, the probability that at least one of K sampled outputs passes the tests.

Pass@K measures code generation quality: sample K outputs per problem and count the problem as solved if any of them passes the tests. Pass@1 uses one sample and Pass@10 uses ten. Because sampled outputs vary from run to run, the metric accounts for sampling variability, and even a weak model may produce correct code given enough samples. A higher K can therefore mask model weaknesses, while Pass@1 is the most realistic measure when only a single output is used. Pass@K is the standard metric for coding benchmarks.
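The definition above can be turned into a small calculation. The sketch below is one common way to estimate Pass@K, not something this entry prescribes: it assumes you draw n samples per problem (n at least K), count the c samples that pass, and compute the chance that a random subset of K samples contains at least one pass, which is 1 - C(n - c, K) / C(n, K). The function names pass_at_k and mean_pass_at_k are hypothetical, chosen for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@K for one problem.

    Assumed setup (not from the glossary entry): n samples were drawn,
    c of them passed the tests, and 1 <= k <= n.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every k-subset contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k chosen samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average Pass@K over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a problem where 3 of 20 samples pass gives pass_at_k(20, 3, 1) = 0.15 but pass_at_k(20, 3, 10) of roughly 0.89, which illustrates how a higher K can mask a weak per-sample success rate.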

Last verified: 2026-04-08
