Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources?

Houyi Li1,2* Ka Man Lo1* Ziqi Wang1 Zili Wang1 Wenzhen Zheng1
Shuigeng Zhou2 Xiangyu Zhang1,3 Daxin Jiang1

1StepFun 2Fudan University 3Megvii Technology
*Equal contribution

Abstract

Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints, i.e., with identical total parameter count, training compute, and data budget? This question remains under-explored despite its significant practical value. In this paper, we propose a novel perspective and methodological framework to study it thoroughly. First, we comprehensively investigate MoE architectures and derive an optimal model design that maximizes performance. Building on this, we find that an MoE model whose activation rate lies in an optimal region can outperform its dense counterpart under the same total parameter, training compute, and data budget. Moreover, this optimal region remains consistent across model sizes. Although the improved performance comes at the cost of additional data, we show that this trade-off can be resolved by reusing data. We validate our findings through extensive experiments, training nearly 200 language models at the 2B scale and over 50 at the 7B scale, cumulatively processing 50 trillion tokens.

Three-step Experimental Methodology

Drawing insights from a unified parameterization framework for model architecture (see our paper for details), we propose a three-step experimental methodology:

  1. Search for an optimized architecture design to ensure each model candidate achieves its (near-)optimal performance.
  2. Explore the optimal activation rate (AR) based on the optimized model architecture, keeping the total parameters and compute budget fixed.
  3. Present a data reuse strategy to address the additional data demand of MoE models, thereby equating data resources.

We also analyze the efficacy of this framework on downstream tasks.
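
To make the resource accounting behind steps 1 and 2 concrete, the following is a minimal sketch, not the paper's code: the layer shapes, the simplified FFN and attention parameter counts, and the omission of router weights are all illustrative assumptions. It shows how the total non-vocabulary parameters $N$, the activated parameters, and the activation rate $r_\text{a}$ relate for a top-$k$ MoE.

```python
from dataclasses import dataclass

@dataclass
class MoEConfig:
    d_model: int    # hidden size
    n_layers: int   # number of transformer blocks
    n_experts: int  # total experts per MoE layer
    top_k: int      # experts activated per token

    def ffn_params_per_expert(self, expansion: float = 4.0) -> int:
        # Simplified expert FFN: one up-projection and one down-projection.
        d_ff = int(self.d_model * expansion)
        return 2 * self.d_model * d_ff

    def attn_params(self) -> int:
        # Q, K, V and output projections.
        return 4 * self.d_model * self.d_model

    def total_params(self) -> int:
        # Total non-vocabulary parameters N (router weights ignored for simplicity).
        per_layer = self.attn_params() + self.n_experts * self.ffn_params_per_expert()
        return self.n_layers * per_layer

    def activated_params(self) -> int:
        # Parameters actually used per token: attention plus the top-k experts.
        per_layer = self.attn_params() + self.top_k * self.ffn_params_per_expert()
        return self.n_layers * per_layer

    def activation_rate(self) -> float:
        return self.activated_params() / self.total_params()

# Illustrative configuration landing near N ≈ 2B (not the paper's exact architecture).
cfg = MoEConfig(d_model=1024, n_layers=16, n_experts=16, top_k=3)
print(f"N ≈ {cfg.total_params() / 1e9:.2f}B, "
      f"N_act ≈ {cfg.activated_params() / 1e9:.2f}B, "
      f"r_a ≈ {cfg.activation_rate():.1%}")
# -> N ≈ 2.21B, N_act ≈ 0.47B, r_a ≈ 21.2%
```

Comparing MoE and dense models at equal resources then amounts to sweeping the expert count and top-$k$ (and hence $r_\text{a}$) while holding `total_params()`, the compute budget, and the data budget fixed.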

Optimal Activation Rate

Based on our optimized model backbones (see our paper for details), we build a series of MoE models with non-vocabulary parameters $N \approx 2\text{B}$ (Figure 1) and $N \approx 7\text{B}$ (Figure 2). Our experimental results identify the existence of an optimal AR region, where MoE models can outperform their dense counterparts under the same training budget. Furthermore, the optimal AR point within this region, $r_\text{a}^{**} \approx 20\%$, remains consistent across different model sizes.
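
The fixed-compute comparison also makes the data side of the trade-off explicit. Under the common approximation $C \approx 6\,N_\text{act}\,D$, used here purely for illustration rather than as the paper's exact cost model, fixing $N$ and $C$ while lowering $r_\text{a}$ shrinks the activated parameters $N_\text{act}$ and therefore increases the token budget $D$:

```python
# Fixed-compute trade-off, assuming C ≈ 6 * N_act * D (illustrative cost model).
def tokens_under_fixed_compute(compute: float, n_total: float, r_a: float) -> float:
    """Token budget D reachable with `compute` at activation rate `r_a`."""
    n_act = r_a * n_total            # activated (per-token) parameters
    return compute / (6.0 * n_act)   # D = C / (6 * N_act)

n_total = 2e9                        # ~2B non-vocabulary parameters
compute = 6.0 * n_total * 100e9      # compute of a dense 2B model trained on 100B tokens

for r_a in (1.00, 0.50, 0.20, 0.10): # r_a = 100% is the dense baseline
    d = tokens_under_fixed_compute(compute, n_total, r_a)
    print(f"r_a = {r_a:4.0%} -> D ≈ {d / 1e9:5.0f}B tokens")
# -> 100B, 200B, 500B and 1000B tokens, respectively
```

At $r_\text{a} = 20\%$, an MoE under this approximation consumes roughly five times as many tokens as its dense counterpart for the same compute, which is the additional data demand that the data reuse strategy below is designed to remove.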

[Figure 1: (a) 2B models with fixed data $D$ (solid) or activation rate $r_\text{a}$ (dashed); (b) 2B models with fixed compute $C$]

Figure 1: Performance of $N \approx 2\text{B}$ models trained with varying data sizes $D$ and activation rates $r_\text{a}$. (a) With a fixed $D$, the performance gain exhibits a non-linear dependence on the training budget $C$. Conversely, with a fixed $r_\text{a}$, increasing $D$ results in a linear performance gain. These findings identify an optimal activation rate, $r_\text{a}^{**} = 20\%$, that remains consistent across different values of $D$ when $N$ is constant. (b) From the perspective of a fixed training compute $C$, the optimal activation rate $r_\text{a}^{**} = 20\%$ can be clearly observed.

Data Reuse Strategy

To eliminate the additional data demand for MoEs to outperform their dense counterparts, we investigate data reusability by training models for multiple epochs using a fixed, smaller dataset. We explore two distinct data reuse schemes: the strict scheme ensures that both MoE and dense models are trained under completely equal conditions; the loose scheme relaxes the constraint of identical data volume by fixing the number of training epochs to 2.

As shown in Figure 2b, the strict data reuse scheme (blue dashed line) only marginally diminishes performance, and MoE models continue to outperform dense baselines. Surprisingly, the loose scheme (green dashed line) often outperforms training on the unique dataset for a single epoch.
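
The two schemes can be read as simple data-pipeline choices. Below is a minimal sketch under our own assumptions: `token_stream`, `cycle_tokens`, and the budget arithmetic are illustrative and not the paper's training code.

```python
from itertools import chain, islice

def token_stream(dataset):
    # Hypothetical helper: yields training tokens from an already-tokenized
    # dataset (represented here as a flat list of token ids).
    yield from dataset

def cycle_tokens(dataset):
    # Endlessly re-iterates the dataset, epoch after epoch.
    while True:
        yield from token_stream(dataset)

def strict_reuse(shared_dataset, moe_token_budget):
    # Strict scheme: the MoE cycles the same unique dataset as the dense model,
    # so both see identical data; repetition supplies the MoE's larger token
    # budget under equal compute.
    epochs = moe_token_budget / len(shared_dataset)
    return islice(cycle_tokens(shared_dataset), moe_token_budget), epochs

def loose_reuse(shared_dataset, num_epochs=2):
    # Loose scheme: fix the number of epochs to 2 and let the total token
    # count follow from the dataset size.
    stream = chain.from_iterable(token_stream(shared_dataset) for _ in range(num_epochs))
    return stream, num_epochs

# Toy usage: a "dataset" of 10 tokens and an MoE token budget of 25 tokens.
toy = list(range(10))
strict_tokens, strict_epochs = strict_reuse(toy, moe_token_budget=25)
loose_tokens, loose_epochs = loose_reuse(toy)
print(len(list(strict_tokens)), strict_epochs)  # 25 tokens over 2.5 epochs
print(len(list(loose_tokens)), loose_epochs)    # 20 tokens over 2 epochs
```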

[Figure 2: (a) 7B models with fixed data $D$; (b) 7B models with fixed compute $C$ and reusing data]

Figure 2: Performance of $N \approx 7\text{B}$ models trained with varying data sizes $D$ and activation rates $r_\text{a}$. The optimal activation rate, $r_\text{a}^{**}=20\%$, aligns with the findings for the 2B models (Figure 1). Additionally, compared to training on the unique dataset, the strict data reuse scheme shows only a slight performance reduction, while the loose scheme often yields better performance.

Analysis of Downstream Performance

To assess whether the optimal ARs generalize to downstream tasks, we conduct SFT on our 7B pre-trained models (trained w/ and w/o strict data reuse) and evaluate both the pre-trained and SFT-ed models on a total of 29 benchmarks (Figure 3; Table 2). The results demonstrate the universality of the optimal AR point ($r_\text{a}^{**} = 20\%$) across different training phases and data domains. See our paper for more analyses.

[Figure 3: Downstream performance of 7B pre-trained and SFT-ed models]
Figure 3: Downstream performance of 7B models: pre-trained (top) and SFT-ed (middle and bottom) versions. Across all benchmark types, MoE models with $r_\text{a}=20\%$ outperform the dense model trained with twice the compute, aligning with the upstream observation that the optimal AR is 20%.
Table 2: Accuracy of 7B SFT-ed models across different benchmarks.

In conclusion...

Mixture-of-Experts can surpass dense LLMs under equal total parameters, compute, and data constraints, provided that the backbones are optimized and the activation rates reside in the optimal region.

BibTeX


      @misc{li2025mixtureofexpertssurpassdensellms,
        title  = {Can Mixture-of-Experts Surpass Dense LLMs Under Strictly Equal Resources?}, 
        author = {Houyi Li and Ka Man Lo and Ziqi Wang and Zili Wang and Wenzhen Zheng and Shuigeng Zhou and Xiangyu Zhang and Daxin Jiang},
        year   = {2025},
        eprint = {2506.12119},
        archivePrefix = {arXiv},
        primaryClass  = {cs.CL},
        url    = {https://arxiv.org/abs/2506.12119}, 
      }