They may have a better approach to MoE expert selection during training:
> The key distinction between auxiliary-loss-free balancing and sequence-wise auxiliary loss lies in their balancing scope: batch-wise versus sequence-wise. Compared with the sequence-wise auxiliary loss, batch-wise balancing imposes a more flexible constraint, as it does not enforce in-domain balance on each sequence. This flexibility allows experts to better specialize in different domains. To validate this, we record and analyze the expert load of a 16B auxiliary-loss-based baseline and a 16B auxiliary-loss-free model on different domains in the Pile test set. As illustrated in Figure 9, we observe that the auxiliary-loss-free model demonstrates greater expert specialization patterns as expected.
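Concretely, the aux-loss-free trick is just a per-expert bias that only affects which experts get picked, nudged up or down depending on how loaded each expert was in the last batch. Here is a rough PyTorch sketch of the idea; the function names, shapes, and update details are my own paraphrase, not their code:

```python
import torch

def aux_loss_free_route(scores, bias, top_k):
    """Pick top_k experts per token using bias-adjusted affinities.

    The bias only influences *which* experts are selected; the gating
    weights used to mix expert outputs come from the raw scores.
    """
    # scores: [num_tokens, num_experts] router affinities
    # bias:   [num_experts] load-balancing bias, kept outside the autograd graph
    _, topk_idx = torch.topk(scores + bias, top_k, dim=-1)   # selection uses the bias
    gate = torch.gather(scores, -1, topk_idx)                 # mixing weights do not
    gate = gate / gate.sum(dim=-1, keepdim=True)
    return topk_idx, gate

@torch.no_grad()
def update_bias(bias, topk_idx, num_experts, gamma=1e-3):
    """After each step: push overloaded experts' bias down, underloaded up."""
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    bias += gamma * torch.sign(load.mean() - load)   # batch-wise signal, no per-sequence loss
    return bias
```

Because the correction comes from batch-level statistics rather than a loss term applied to every sequence, nothing forces each individual sequence to spread its tokens evenly across experts, which is exactly the flexibility the quote is pointing at.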
And they have shared experts always present:
> Compared with traditional MoE architectures like GShard (Lepikhin et al., 2021), DeepSeekMoE uses finer-grained experts and isolates some experts as shared ones.
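In other words, every token passes through a small set of always-on shared experts plus a top-k pick from many fine-grained routed experts. A minimal sketch of that forward pass, with illustrative module names and sizes (not the paper's configuration), dense evaluation for readability rather than a real dispatch kernel:

```python
import torch
import torch.nn as nn

class SharedPlusRoutedMoE(nn.Module):
    """Sketch of a DeepSeekMoE-style layer: a couple of shared experts that see
    every token, plus many small routed experts chosen per token by a router."""

    def __init__(self, dim=512, n_shared=2, n_routed=64, top_k=6, hidden=256):
        super().__init__()
        def make_ffn():
            return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.shared = nn.ModuleList([make_ffn() for _ in range(n_shared)])
        self.routed = nn.ModuleList([make_ffn() for _ in range(n_routed)])
        self.router = nn.Linear(dim, n_routed)
        self.top_k = top_k

    def forward(self, x):                                   # x: [num_tokens, dim]
        shared_out = sum(e(x) for e in self.shared)         # shared experts are always active
        scores = self.router(x).softmax(dim=-1)             # [num_tokens, n_routed]
        topv, topi = torch.topk(scores, self.top_k, dim=-1)
        gate = torch.zeros_like(scores).scatter(-1, topi, topv)
        gate = gate / gate.sum(dim=-1, keepdim=True)        # renormalize over chosen experts
        # Dense evaluation for clarity; real implementations dispatch tokens to experts.
        routed_out = torch.stack([e(x) for e in self.routed], dim=1)   # [tokens, n_routed, dim]
        routed_out = (gate.unsqueeze(-1) * routed_out).sum(dim=1)
        return x + shared_out + routed_out                  # residual connection
```

The shared experts soak up whatever every token needs (common syntax, boilerplate), which frees the routed experts to specialize.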
I'm old enough to remember when everyone outside of a few weirdos thought that a single hidden layer was enough because you could show that type of neural network was a universal approximator.
The same thing is happening with the wide MoE models. They are easier to train and sound a lot smarter than the deep models, but fall on their faces when they need to figure out deep chains of reasoning.