
Multi-Head Attention

Quick Answer

Running the attention mechanism several times in parallel, each head with its own learned query, key, and value projections.

Multi-head attention runs the attention mechanism several times in parallel (commonly 8 or 16 heads), each head with its own learned query, key, and value projections. Because the projections differ, each head can attend to different aspects of the input: one head might capture semantic relationships while another captures syntactic structure. The outputs of all heads are concatenated and passed through a final linear projection to produce the layer's output. This makes multi-head attention more expressive than single-head attention at the same model dimensionality. The number of heads is a design choice that balances computational cost against expressiveness; empirically, adding heads generally helps but with diminishing returns.
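The following is a minimal, illustrative sketch of the mechanism described above in PyTorch; the class name, default sizes, and layer layout are assumptions for demonstration, not the implementation of any particular model.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (illustrative only)."""
    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        # Separate learned projections for queries, keys, and values,
        # plus the final output projection applied after concatenation.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # Reshape (batch, seq, d_model) -> (batch, heads, seq, d_head)
            return t.view(batch, seq_len, self.num_heads, self.d_head).transpose(1, 2)

        q = split_heads(self.q_proj(x))
        k = split_heads(self.k_proj(x))
        v = split_heads(self.v_proj(x))

        # Scaled dot-product attention, computed independently per head.
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # (batch, heads, seq, seq)
        weights = scores.softmax(dim=-1)
        context = weights @ v                                    # (batch, heads, seq, d_head)

        # Concatenate heads back into d_model, then apply the output projection.
        context = context.transpose(1, 2).contiguous().view(batch, seq_len, d_model)
        return self.out_proj(context)

# Example: 8 heads over a 512-dimensional model.
mha = MultiHeadAttention(d_model=512, num_heads=8)
out = mha(torch.randn(2, 10, 512))   # output shape: (2, 10, 512)
```

Note that each head works in a smaller subspace (d_model / num_heads dimensions), so the total cost is comparable to a single full-width attention while allowing the heads to specialize.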

Last verified: 2026-04-08
