A Mixture of Experts (MoE) model contains many "expert" sub-networks; a lightweight routing layer (the gate) selects which experts process each token. The total parameter count can be enormous (Mixtral 8x7B totals ~47B parameters), but only a fraction is active at each inference step: Mixtral's router picks 2 of the 8 experts in each layer, activating ~13B parameters per token.
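A minimal sketch may make the routing mechanics concrete. The PyTorch code below is an illustrative top-k MoE layer, not Mixtral's actual implementation; the `MoELayer` name, the layer sizes, and the `num_experts=8` / `top_k=2` defaults are assumptions chosen to mirror a Mixtral-style configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse MoE feed-forward block: each token is routed to its top-k experts.

    Illustrative sketch only; real implementations add load balancing and
    batched expert dispatch.
    """

    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The gate is a single linear layer producing one logit per expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Each expert is an independent feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        logits = self.gate(x)                              # (num_tokens, num_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = F.softmax(weights, dim=-1)               # renormalise over the chosen experts
        out = torch.zeros_like(x)
        # Only the chosen experts run for each token; the rest stay idle,
        # which is why active parameters are far fewer than total parameters.
        for i, expert in enumerate(self.experts):
            token_idx, slot = (chosen == i).nonzero(as_tuple=True)
            if token_idx.numel():
                out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

moe = MoELayer()
tokens = torch.randn(10, 64)
print(moe(tokens).shape)  # torch.Size([10, 64])
```

Note that every expert's weights must still sit in memory even though only two run per token; the saving is in compute (FLOPs per token), not in model footprint.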
The appeal is the quality of a large model at roughly the inference cost of a smaller one. Mixtral, DeepSeek's models, and (by widespread rumour) GPT-4 all use MoE architectures. The trade-off is more complex training and serving infrastructure: routing must be load-balanced during training so all experts learn, and the full parameter set still has to fit in memory at serving time.