Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints, i.e., when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value. In this paper, we propose a novel perspective and methodological framework to study it thoroughly. First, we comprehensively investigate MoE architectures and arrive at an optimal model design that maximizes performance. Building on this, we find that an MoE model whose activation rate lies in an optimal region can outperform its dense counterpart under the same total parameters, training compute, and data budget. More importantly, this optimal region remains consistent across model sizes. Although the improved performance comes at the cost of a larger data requirement, we show that this can be resolved by reusing data. We validate our findings through extensive experiments, training nearly 200 language models at the 2B scale and over 50 at the 7B scale, cumulatively processing 50 trillion tokens.
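To make the equal-resource constraint concrete, the sketch below (our own illustration, using the standard $C \approx 6\,N_\text{act}\,D$ approximation rather than a formula from the paper) shows how an MoE with activation rate $r_\text{a}$ converts the same compute budget into roughly $1/r_\text{a}$ times more training tokens than its dense counterpart, which is exactly the additional data demand that data reuse later addresses.

```python
# Illustrative resource accounting (our own sketch; the C ~= 6*N*D rule is the
# standard approximation, not a formula taken from this work).

def tokens_under_budget(activated_params: float, compute_budget: float) -> float:
    """Tokens a model can be trained on, assuming C ~= 6 * N_activated * D."""
    return compute_budget / (6.0 * activated_params)

N = 7e9                  # total non-vocabulary parameters, shared by MoE and dense
C = 6.0 * N * 1e12       # compute budget sized so the dense model sees 1T tokens
r_a = 0.20               # MoE activation rate

dense_tokens = tokens_under_budget(N, C)        # dense: all parameters active
moe_tokens = tokens_under_budget(N * r_a, C)    # MoE: only a fraction r_a is active

print(f"dense tokens: {dense_tokens:.2e}")  # ~1.00e+12
print(f"MoE tokens:   {moe_tokens:.2e}")    # ~5.00e+12, i.e. 1/r_a times more data
```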
Drawing insights from a unified parameterization framework for model architecture, we propose a three-step experimental methodology (see our paper for details).
Based on our optimized model backbones (see our paper for details), we build a series of MoE models with non-vocabulary parameters $N \approx 2\text{B}$ (Figure 1) and $N \approx 7\text{B}$ (Figure 2). Our experimental results identify an optimal activation-rate (AR) region in which MoE models outperform their dense counterparts under the same training budget. Furthermore, the optimal AR point within this region, $r_\text{a}^{**} \approx 20\%$, remains consistent across model sizes.
Figure 1: (a) fixed $D$ (solid) or fixed $r_\text{a}$ (dashed); (b) fixed $C$.
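For intuition, the activation rate of a standard top-$k$ routed MoE can be estimated from the share of expert parameters used per token. The configuration below is a hypothetical 2B-scale parameter split chosen only to land near the $r_\text{a}^{**} \approx 20\%$ point; it is not the exact architecture reported in our paper.

```python
# Rough activation-rate estimate for a top-k routed MoE.
# The parameter split below is hypothetical, chosen to land near r_a ~ 20%;
# it is not the exact architecture reported in our paper.

def activation_rate(always_active: float, expert_params: float,
                    num_experts: int, top_k: int) -> float:
    """Fraction of total non-vocabulary parameters activated per token."""
    total = always_active + expert_params
    activated = always_active + expert_params * (top_k / num_experts)
    return activated / total

# ~2B total non-vocabulary parameters: 0.3B always active (attention and other
# shared components), 1.7B spread across 64 routed experts with top-4 routing.
r_a = activation_rate(always_active=0.3e9, expert_params=1.7e9,
                      num_experts=64, top_k=4)
print(f"activation rate ~ {r_a:.0%}")  # ~20%
```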
To eliminate the extra data that MoEs require to outperform their dense counterparts, we investigate data reusability by training models for multiple epochs on a fixed, smaller dataset. We explore two distinct data reuse schemes: the strict scheme ensures that both MoE and dense models are trained under completely identical conditions, while the loose scheme relaxes the requirement of identical data volume by fixing the number of training epochs to two.
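The sketch below illustrates how the two schemes translate into epoch counts, under the simplifying assumption (ours, for illustration only) that the MoE's token budget is roughly $1/r_\text{a}$ times the dense baseline's; the exact protocol is described in our paper.

```python
# Illustrative epoch accounting for the two data-reuse schemes.
# The numbers are our own reading of the setup, for illustration only.

def epochs_needed(token_budget: float, unique_tokens: float) -> float:
    """Passes over a fixed dataset required to consume a token budget."""
    return token_budget / unique_tokens

dense_budget = 1.0e12            # tokens the dense baseline trains on (assumed)
r_a = 0.20                       # MoE activation rate
moe_budget = dense_budget / r_a  # equal compute & total params -> 1/r_a more tokens

# Strict scheme: the MoE is restricted to the dense model's unique dataset.
print(f"strict: {epochs_needed(moe_budget, dense_budget):.1f} epochs "
      f"over {dense_budget:.2e} unique tokens")

# Loose scheme: epochs are fixed to 2, so the unique dataset shrinks instead.
print(f"loose:  2 epochs over {moe_budget / 2:.2e} unique tokens")
```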
As shown in Figure 2b, the strict data reuse scheme (blue dashed line) only marginally diminishes performance, and MoE models continue to outperform dense baselines. Surprisingly, the loose scheme (green dashed line) often outperforms training on the unique dataset for a single epoch.
Figure 2: (a) fixed $D$; (b) fixed $C$ with data reuse.
To assess whether the optimal ARs generalize to downstream tasks, we conduct supervised fine-tuning (SFT) on our 7B pre-trained models (trained with and without strict data reuse) and evaluate both the pre-trained and SFT models on a total of 29 benchmarks (Figure 3; Table 2). The results demonstrate the universality of the optimal AR point ($r_\text{a}^{**} = 20\%$) across training phases and data domains. See our paper for more analyses.
Mixture-of-Experts can surpass dense LLMs under equal total parameters, compute, and data constraints, provided that the backbones are optimized and the activation rates reside in the optimal region.
@misc{li2025mixtureofexpertssurpassdensellms,
      title         = {Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources?},
      author        = {Houyi Li and Ka Man Lo and Ziqi Wang and Zili Wang and Wenzhen Zheng and Shuigeng Zhou and Xiangyu Zhang and Daxin Jiang},
      year          = {2025},
      eprint        = {2506.12119},
      archivePrefix = {arXiv},
      primaryClass  = {cs.CL},
      url           = {https://arxiv.org/abs/2506.12119},
}