Architecture
Grouped-Query Attention
Quick Answer
An attention variant in which groups of query heads share a single key/value head, shrinking the KV cache.
Grouped-query attention (GQA) reduces KV cache memory by having groups of query heads share a smaller number of key and value (KV) heads. Standard multi-head attention has one KV head per query head; a GQA model might instead assign 8 query heads to each KV head. Because KV cache size is proportional to the number of KV heads, that configuration shrinks the cache by a factor of 8 without severely impacting quality. GQA is increasingly standard in modern efficient models and is particularly valuable for long-context and high-throughput scenarios, where the KV cache is a major share of memory use.
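The proportionality between KV cache size and the number of KV heads can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names, the model dimensions (32 layers, 32 query heads, head dimension 128, 4096-token context, 2-byte elements), and the contiguous grouping of query heads are all assumptions chosen for the example, not values taken from any particular model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Approximate KV cache size: one K and one V tensor per layer."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


def kv_head_for_query(q_head, num_q_heads, num_kv_heads):
    """Map a query head index to the KV head it shares.

    Assumes contiguous groups: heads 0..g-1 use KV head 0, and so on.
    """
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be divisible by num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


# Hypothetical configuration (assumed for illustration only).
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=4, head_dim=128, seq_len=4096)

print(mha // 2**20, "MiB with one KV head per query head")  # 2048 MiB
print(gqa // 2**20, "MiB with 8 query heads per KV head")   # 256 MiB
print(mha // gqa, "x smaller")                              # 8
```

Under these assumptions, going from 32 KV heads to 4 cuts the cache from 2048 MiB to 256 MiB per sequence, and the saving scales linearly with context length and batch size.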
Last verified: 2026-04-08