Architecture
Self-Attention
Quick Answer
Attention in which each token attends to the other tokens in the same sequence.
Self-attention allows each token to attend to every other token in the same sequence, letting the model capture relationships and dependencies within the input. It is the fundamental mechanism that lets transformers process sequences in parallel while still modeling context: each token is projected into a query (what to look for), a key (what to match against), and a value (what information to extract). Multi-head self-attention applies this mechanism several times in parallel with independent projections. Because every token can attend directly to any other, self-attention handles long-range dependencies that RNNs struggle with.
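The query/key/value mechanics can be sketched in a few lines of NumPy. This is a minimal single-head scaled dot-product self-attention, not a production implementation; the shapes, random weights, and function name are illustrative assumptions:

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a sequence.

    x:              (seq_len, d_model) token embeddings
    w_q, w_k, w_v:  (d_model, d_k) learned projection matrices
    """
    q = x @ w_q                      # queries: what each token looks for
    k = x @ w_k                      # keys: what each token offers to match
    v = x @ w_v                      # values: information to extract
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # (seq_len, seq_len) query-key similarities
    # Row-wise softmax turns scores into attention weights over all tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v               # each output mixes values from every token

# Toy example: 4 tokens, d_model = 8, d_k = 4 (sizes chosen for illustration)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 4)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 4): one d_k-dimensional output per token
```

Multi-head attention would simply run this routine with several independent sets of projection matrices and concatenate the per-head outputs. Note that every token attends to every other in a single step, which is how long-range dependencies are captured without recurrence.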
Last verified: 2026-04-08