Architecture

Self-Attention

Quick Answer

Attention where tokens attend to other tokens in the same sequence.

Self-attention lets each token in a sequence attend to every other token in the same sequence, allowing the model to capture relationships and dependencies within the input. It is the fundamental mechanism that lets transformers process sequences in parallel while still modeling context: for each token, the model learns what to look for (queries), what to match against (keys), and what information to extract (values). Multi-head self-attention runs this mechanism several times in parallel, each head with its own learned projections, so different heads can capture different kinds of relationships. Because every token can attend directly to every other token, self-attention also handles long-range dependencies that RNNs struggle with.
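The query/key/value mechanism described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a production implementation: the projection matrices are random stand-ins for learned weights, and the shapes (sequence length 4, model dimension 8) are arbitrary choices for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    # Project the SAME sequence into queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.shape[-1]
    # Every token scores every other token in the sequence,
    # scaled by sqrt(d_k) to keep the dot products well-behaved.
    scores = q @ k.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # Output per token is a weighted mix of all value vectors.
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8  # example sizes, chosen arbitrarily
x = rng.normal(size=(seq_len, d_model))            # token embeddings
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8): one contextualized vector per token
```

A multi-head version would repeat this with separate `w_q`, `w_k`, `w_v` per head and concatenate the per-head outputs; note that the attention matrix here is `seq_len × seq_len`, which is why every token can draw on every other token in one step.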

Last verified: 2026-04-08
