DS-MoE: Making MoE Models More Efficient and Less Memory-Intensive

Community Article Published April 9, 2024

Estimated reading time: 4 minutes

Mixture-of-Experts (MoE) language models are known for reducing compute by 2 to 4 times compared to traditional dense models without sacrificing performance, which makes them especially useful when computing resources are limited. However, an MoE model typically needs 2 to 4 times more parameters to perform as well as its dense counterpart. For example, DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B each carry roughly 16B parameters in order to match the performance of a 7B dense model. This large parameter count inflates GPU memory requirements, which makes MoE models less efficient in I/O-bounded scenarios such as autoregressive generation.

Figure 1: Decoding Throughput of Dense Models and SMoE Models with Similar Performance. We measure throughput with an input length of 1 and an output length of 512. The results show that conventional Sparse Mixture-of-Experts (SMoE) models deliver lower output throughput in I/O-bounded settings despite their lower computational demands. All models are run with Hugging Face Transformers.

Is it necessary for MoE models to be so large to achieve high performance? Can we create an MoE model that maintains performance but uses fewer parameters and less computational power? Enter DS-MoE. This model achieves similar performance to dense models but uses about one-third of the computational resources and only half as many parameters as other MoE models.

Figure 2: Number of Parameters for Performance-Matched Models. We plot the size and computational profiles of the Dense-3B, SMoE-5B, and DS-MoE-3B models, each trained on 100B tokens and achieving comparable average task performance. DS-MoE demonstrates both computational efficiency and parameter efficiency, where computational cost is quantified by the number of active parameters engaged during inference.

The core idea of DS-MoE is to train the experts densely while pushing the model's routers to gradually ignore unnecessary experts for a given token. To this end, we add a Mutual Information (MI) loss to the training objective, which not only balances the load of each expert across the entire batch but also encourages each input token to concentrate its gating probability on fewer experts.

Figure 3: Subfigure (a) illustrates the conventional sparse training method in MoE models, characterized by sparse gradient propagation in both the router and the experts. Subfigure (b) details the dense training strategy in DS-MoE, which involves dense propagation of gradients for both routers and experts.
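
To make the dense training of subfigure (b) concrete, here is a minimal PyTorch sketch of a densely trained MoE layer in which every expert processes every token and the router's softmax probabilities weight the expert outputs, so gradients reach all routers and all experts. The layer class, expert architecture, and dimensions are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseMoELayer(nn.Module):
    """Illustrative densely trained MoE layer: every expert processes every
    token, and the router's softmax probabilities weight the expert outputs,
    so gradients propagate to all routers and all experts (cf. Figure 3b)."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)                   # (tokens, experts)
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)   # (tokens, experts, d_model)
        out = torch.einsum("te,ted->td", gate_probs, expert_outs)        # weighted sum over experts
        return out, gate_probs
```

Note that nothing is pruned during training; sparsity is introduced only at inference time, as described below.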

The MI loss is defined as

$$
L_{\mathrm{MI}} = -H(e) + \frac{1}{|X|}\sum_{x\in X} H(e \mid x), \qquad H(e) = -\sum_{i=1}^{N} p(e_i)\log p(e_i),
$$

where $X$ denotes the tokens in a minibatch and $e$ denotes the experts. Intuitively, maximizing $H(e)$ balances the load of each expert across the entire batch, while minimizing $H(e \mid x)$ encourages each input $x$ to concentrate its gating probability on fewer experts.
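
As a rough illustration, the MI loss can be computed from the router's softmax outputs as in the sketch below. Treating the marginal $p(e)$ as the batch average of the per-token gating probabilities is our assumption here, and the function name and epsilon are illustrative.

```python
import torch

def mutual_information_loss(gate_probs: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Sketch of L_MI = -H(e) + (1/|X|) * sum_x H(e|x).

    gate_probs: (num_tokens, num_experts) router softmax outputs p(e|x).
    The marginal p(e) is approximated as the batch average of p(e|x),
    which is an assumption of this sketch.
    """
    # H(e|x): per-token routing entropy, averaged over the minibatch X
    cond_entropy = -(gate_probs * (gate_probs + eps).log()).sum(dim=-1).mean()

    # H(e): entropy of the marginal expert distribution over the batch
    marginal = gate_probs.mean(dim=0)
    marginal_entropy = -(marginal * (marginal + eps).log()).sum()

    # Maximizing H(e) balances expert load; minimizing H(e|x) sharpens routing
    return -marginal_entropy + cond_entropy
```

In training, this auxiliary term would be added to the language-modeling loss with a weighting coefficient, which we do not specify here.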

During inference, DS-MoE activates only the top-K experts for each token based on their router scores. K can either be a fixed, predefined value or be chosen adaptively, based on how many experts score above a given threshold. As a result, DS-MoE performs as well as similarly sized dense models while using far fewer active parameters, as demonstrated in the table below.
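
As a rough sketch of this selection step (the function, its signature, and the thresholding rule are illustrative assumptions, not the released code), per-token expert selection might look like:

```python
from typing import Optional
import torch

def select_experts(gate_probs: torch.Tensor,
                   top_k: Optional[int] = None,
                   threshold: Optional[float] = None):
    """Pick the experts to run for each token at inference time.

    gate_probs: (num_tokens, num_experts) router scores after softmax.
    Either keep a fixed top_k per token, or keep every expert whose score
    exceeds `threshold` (adaptive variant).
    """
    if top_k is not None:
        # Fixed-K variant: keep the K highest-scoring experts per token.
        scores, indices = gate_probs.topk(top_k, dim=-1)
        return indices, scores
    # Adaptive variant: the number of active experts varies per token.
    mask = gate_probs > threshold
    active = [row.nonzero(as_tuple=True)[0] for row in mask]
    return active, gate_probs * mask
```

Only the selected experts are executed, which is what keeps the number of active parameters low.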

| Model | HellaSwag | PIQA | WinoGrande | SciQ | Arc-e | Arc-c | Avg. Perf. | Active Params |
|---|---|---|---|---|---|---|---|---|
| Dense-3B | 40.4 | 71.4 | 58.7 | 86.0 | 59.6 | 26.1 | 57.0 | 2705M |
| SMoE-5B | 40.1 | 70.7 | 56.5 | 85.6 | 58.4 | 24.8 | 56.0 | 1212M |
| DS-MoE-3B | 39.3 | 71.6 | 57.9 | 85.6 | 57.7 | 24.9 | 56.2 | 934M |
| Dense-6B | 44.3 | 72.2 | 59.9 | 88.0 | 62.9 | 27.9 | 59.2 | 6186M |
| DS-MoE-6B | 43.5 | 73.0 | 57.9 | 86.9 | 61.9 | 27.9 | 58.5 | 1813M |

We also tested DS-MoE with vLLM to compare its processing speed and memory usage against other models at the 7B performance tier. We measured requests per second and tokens per second in a setup where each input and each output consisted of 1,000 tokens, with GPU memory usage capped at 90%.
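
For reference, a minimal sketch of this kind of measurement with vLLM's Python API is shown below; the checkpoint path, request count, and prompt contents are placeholders, and running DS-MoE this way assumes the model has been integrated into vLLM. This is not the exact benchmark harness used for the numbers in the table.

```python
import time
from vllm import LLM, SamplingParams

# Hypothetical checkpoint path; serving DS-MoE assumes vLLM support for the architecture.
llm = LLM(model="path/to/checkpoint", gpu_memory_utilization=0.9)

prompts = ["..."] * 64                  # requests with ~1,000-token inputs (placeholder text)
params = SamplingParams(max_tokens=1000, ignore_eos=True)  # force ~1,000-token outputs

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{len(prompts) / elapsed:.2f} req/s, {generated / elapsed:.1f} generated tokens/s")
```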

| Model | Total Params | Active Params | Model Memory | A100 Throughput (req/s) | A100 TPS (tokens/s) | H100 Throughput (req/s) | H100 TPS (tokens/s) |
|---|---|---|---|---|---|---|---|
| Dense-6B | 6.4B | 6.4B | 12.3 GiB | 1.04 | 2079.8 | 1.40 | 2808.7 |
| Mistral-7B | 7.2B | 7.2B | 13.5 GiB | 1.07 | 2140.8 | 1.52 | 3047.4 |
| DeepSeekMoE | 17.3B | 2.8B | 30.5 GiB | 1.17 | 2330.1 | 1.57 | 3144.1 |
| Qwen1.5-MoE | 16.4B | 2.7B | 26.7 GiB | 1.33 | 2665.7 | 1.81 | 3616.9 |
| DS-MoE-6B | 6.5B | 2.2B | 12.6 GiB | 2.00 | 3992.8 | 2.30 | 4603.9 |

The test shows that DS-MoE beats dense models on computational cost and sparsely trained MoEs on model memory, which translates into faster processing in both computation-bounded and I/O-bounded scenarios. Note that DS-MoE-6B is not yet comparable with the other models on downstream performance, because it was trained on only 100 billion tokens (versus trillions for the others). Nevertheless, DS-MoE shows significant promise for reaching dense-model performance given a comparable volume of training data.

Read More in the Paper