Architecture

Flash Attention

Quick Answer

An optimized attention algorithm that reduces memory I/O and increases GPU utilization.

Flash Attention is an efficient attention implementation that reduces memory I/O. Standard attention materializes the full n×n attention matrix in GPU high-bandwidth memory, which is slow to read and write. Flash Attention instead tiles the computation: it processes queries, keys, and values in blocks that fit in fast on-chip SRAM, combining partial results with an online softmax so the full matrix is never written out. The result is faster, more memory-efficient attention, making long contexts and long-sequence training practical. It is now standard in modern implementations, and successors (Flash Attention 2 and 3) continue to improve efficiency, making the attention bottleneck less severe.
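The tiling idea can be sketched in NumPy: the loop below processes keys and values one block at a time, keeping only a running row-max and softmax denominator (the online-softmax trick), so no n×n matrix is ever formed. This is a simplified single-head sketch for illustration, not the actual kernel-level implementation; the function and variable names are illustrative.

```python
import numpy as np

def standard_attention(Q, K, V):
    # Materializes the full (n, n) attention matrix.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return P @ V

def tiled_attention(Q, K, V, block=16):
    # Flash-style sketch: stream K/V in blocks, maintaining a running
    # row max `m` and normalizer `l` (online softmax), so only an
    # (n, block) tile of scores exists at any time.
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((n, d))          # unnormalized output accumulator
    m = np.full(n, -np.inf)       # running row max of scores
    l = np.zeros(n)               # running softmax denominator
    for j in range(0, K.shape[0], block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = Q @ Kj.T * scale                  # (n, block) score tile
        m_new = np.maximum(m, S.max(axis=-1))
        P = np.exp(S - m_new[:, None])
        alpha = np.exp(m - m_new)             # rescale old accumulators
        l = l * alpha + P.sum(axis=-1)
        O = O * alpha[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 32)) for _ in range(3))
# The tiled version matches the naive one without materializing S.
assert np.allclose(standard_attention(Q, K, V), tiled_attention(Q, K, V))
```

The real kernels fuse these steps on-chip and also tile over queries; the point of the sketch is that the rescaling by `alpha` lets each block's contribution be folded in exactly, trading a small amount of recomputation for far fewer memory accesses.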

Last verified: 2026-04-08
