They may have a better approach to MoE expert selection during training:
> The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.
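Concretely, the aux-loss-free trick is just a per-expert bias that only affects which experts get picked, nudged up or down depending on how loaded each expert was in the last batch. Here is a rough PyTorch sketch of the idea; the function names, shapes, and update details are my own paraphrase, not their code:

```python
import torch

def aux_loss_free_route(scores, bias, top_k):
    """Pick top_k experts per token using bias-adjusted affinities.

    The bias only influences *which* experts are selected; the gating
    weights used to mix expert outputs come from the raw scores.
    """
    # scores: [num_tokens, num_experts] router affinities
    # bias:   [num_experts] load-balancing bias, kept outside the autograd graph
    _, topk_idx = torch.topk(scores + bias, top_k, dim=-1)   # selection uses the bias
    gate = torch.gather(scores, -1, topk_idx)                 # mixing weights do not
    gate = gate / gate.sum(dim=-1, keepdim=True)
    return topk_idx, gate

@torch.no_grad()
def update_bias(bias, topk_idx, num_experts, gamma=1e-3):
    """After each step: push overloaded experts' bias down, underloaded up."""
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    bias += gamma * torch.sign(load.mean() - load)   # batch-wise signal, no per-sequence loss
    return bias
```

Because the correction comes from batch-level statistics rather than a loss term applied to every sequence, nothing forces each individual sequence to spread its tokens evenly across experts, which is exactly the flexibility the quote is pointing at.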
And they have shared experts always present:
> Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones.
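In other words, every token passes through a small set of always-on shared experts plus a top-k pick from many fine-grained routed experts. A minimal sketch of that forward pass, with illustrative module names and sizes (not the paper's configuration), dense evaluation for readability rather than a real dispatch kernel:

```python
import torch
import torch.nn as nn

class SharedPlusRoutedMoE(nn.Module):
    """Sketch of a DeepSeekMoE-style layer: a couple of shared experts that see
    every token, plus many small routed experts chosen per token by a router."""

    def __init__(self, dim=512, n_shared=2, n_routed=64, top_k=6, hidden=256):
        super().__init__()
        def make_ffn():
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.shared = nn.ModuleList([make_ffn() for _ in range(n_shared)])
        self.routed = nn.ModuleList([make_ffn() for _ in range(n_routed)])
        self.router = nn.Linear(dim, n_routed)
        self.top_k = top_k

    def forward(self, x):                                   # x: [num_tokens, dim]
        shared_out = sum(e(x) for e in self.shared)         # shared experts are always active
        scores = self.router(x).softmax(dim=-1)             # [num_tokens, n_routed]
        topv, topi = torch.topk(scores, self.top_k, dim=-1)
        gate = torch.zeros_like(scores).scatter(-1, topi, topv)
        gate = gate / gate.sum(dim=-1, keepdim=True)        # renormalize over chosen experts
        # Dense evaluation for clarity; real implementations dispatch tokens to experts.
        routed_out = torch.stack([e(x) for e in self.routed], dim=1)   # [tokens, n_routed, dim]
        routed_out = (gate.unsqueeze(-1) * routed_out).sum(dim=1)
        return x + shared_out + routed_out                  # residual connection
```

The shared experts soak up whatever every token needs (common syntax, boilerplate), which frees the routed experts to specialize.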
I'm old enough to remember when everyone outside of a few weirdos thought that a single hidden layer was enough because you could show that type of neural network was a universal approximator.
The same thing is happening with the wide MoE models. They are easier to train and sound a lot smarter than the deep models, but fall on their faces when they need to figure out deep chains of reasoning.