Architecture
Attention Mechanism
Quick Answer
A neural network component that weights the importance of different input tokens.
Attention lets a model focus on the relevant parts of its input when making predictions. It computes similarity scores between all pairs of tokens, then uses those scores to weight how much each token contributes to the output. Because every token can attend directly to every other token, attention handles long-range dependencies well, and it is the core innovation behind transformers.

Multiple attention heads let the model focus on different aspects of the input simultaneously. Attention appears in three common forms: self-attention over the input (encoder), causal self-attention over previous outputs (autoregressive decoding), and cross-attention, where decoder queries attend to encoder outputs. Computational cost scales quadratically with sequence length, which limits context windows.
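The score-then-weight computation above can be sketched as scaled dot-product attention. This is a minimal NumPy illustration, not a production implementation; the function name and toy dimensions are chosen here for the example.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (seq_q, d_k), K: (seq_k, d_k), V: (seq_k, d_v)."""
    d_k = Q.shape[-1]
    # Similarity score between every query/key pair, scaled by sqrt(d_k)
    scores = Q @ K.T / np.sqrt(d_k)          # shape (seq_q, seq_k)
    # Softmax over key positions turns scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of the value vectors
    return weights @ V                       # shape (seq_q, d_v)

# Self-attention on 3 toy token vectors of dimension 4
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(x, x, x)
print(out.shape)  # (3, 4)
```

Passing the same matrix as queries, keys, and values (as above) gives self-attention; cross-attention would instead take Q from one sequence and K, V from another. The quadratic cost noted above is visible in the `scores` matrix, whose size grows as the square of sequence length.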
Last verified: 2026-04-08