Architecture

Sparse Mixture of Experts

Quick Answer

A MoE variant where only a small number of experts activate per token.

Sparse MoE activates only a small number k of experts (typically 2-8) out of a much larger pool (e.g., 64) for each token. Because only the selected experts run, compute per token stays low even though the total parameter count is large, which lets models scale far beyond what could be trained densely. A learned router decides which experts each token is sent to, and load balancing (spreading tokens evenly across experts) is crucial so that no expert is starved or overloaded. At inference, a sparse MoE requires less compute than a dense model of comparable quality.
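
For concreteness, here is a minimal sketch of top-k routing in PyTorch. The names (SparseMoE, load_balancing_loss), the layer sizes, and the per-expert loop are illustrative assumptions, not any particular model's implementation; the auxiliary loss follows a common Switch-Transformer-style formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Sketch of a sparse MoE layer: route each token to its top-k experts."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: one logit per expert for each token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                       # x: (n_tokens, d_model)
        logits = self.router(x)                 # (n_tokens, n_experts)
        # Keep only the k highest-scoring experts per token.
        top_vals, top_idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)   # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e in range(len(self.experts)):
                mask = top_idx[:, slot] == e    # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[e](x[mask])
        return out, logits, top_idx

def load_balancing_loss(logits, top_idx, n_experts):
    # One common auxiliary loss (assumed here): encourage the fraction of tokens
    # dispatched to each expert to match the mean router probability for that expert.
    probs = F.softmax(logits, dim=-1)                        # (n_tokens, n_experts)
    dispatch = F.one_hot(top_idx[:, 0], n_experts).float()   # top-1 dispatch indicator
    return n_experts * (dispatch.mean(0) * probs.mean(0)).sum()

tokens = torch.randn(10, 512)
out, logits, top_idx = SparseMoE()(tokens)
print(out.shape, load_balancing_loss(logits, top_idx, 64).item())
```

In a real implementation the double loop over slots and experts is replaced by batched gather/scatter dispatch (with capacity limits per expert), but the routing logic is the same: score all experts, keep the top k, and mix their outputs with the renormalized router weights.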

Last verified: 2026-04-08
