Architecture

Mixture of Experts (MoE)

Quick Answer

An architecture in which a router conditionally sends each input to a small subset of specialized expert networks, so different experts handle different parts of the input.

Mixture of Experts (MoE) replaces each dense feed-forward network with multiple expert networks. A small router network decides which experts to activate for each input, and only the activated experts process each token. This decouples parameter count from per-token compute: an MoE model can have far more parameters than a dense model while spending similar compute per token. The trade-offs are that MoE is harder to train, requires more memory (all expert weights must be resident), and needs load balancing so the router does not collapse onto a few experts. Sparse MoE activates only a few experts per token; Mixtral 8x7B, for example, routes each token to 2 of its 8 experts.
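A minimal sketch of the routing idea, using toy numpy weights (all sizes and weight matrices here are hypothetical, not from any real model): the router scores every expert per token, but only the top-k experts actually run.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 8, 4, 2

# Hypothetical toy parameters: one router matrix, one linear "expert" each.
W_router = rng.normal(size=(d_model, n_experts))
W_experts = rng.normal(size=(n_experts, d_model, d_model))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(tokens):
    """Sparse MoE layer: each token is processed by its top-k experts only."""
    probs = softmax(tokens @ W_router)              # (n_tokens, n_experts)
    chosen = np.argsort(-probs, axis=-1)[:, :top_k] # expert ids per token
    out = np.zeros_like(tokens)
    for i, token in enumerate(tokens):
        gate = probs[i, chosen[i]]
        gate = gate / gate.sum()                    # renormalize over chosen experts
        for g, e in zip(gate, chosen[i]):
            out[i] += g * (token @ W_experts[e])    # only k of n_experts run
    return out

tokens = rng.normal(size=(3, d_model))
y = moe_forward(tokens)  # same shape as the input: (3, d_model)
```

Note that compute per token scales with `top_k`, not `n_experts` — this is the capacity-without-proportional-compute property the paragraph describes. Real implementations batch tokens by expert rather than looping, and add an auxiliary load-balancing loss.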

Last verified: 2026-04-08
