
That's just making the architecture work better.

I'm old enough to remember when everyone outside of a few weirdos thought that a single hidden layer was enough because you could show that type of neural network was a universal approximator.
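To make that old argument concrete, here is a minimal sketch of the universal-approximation idea (my own toy example, not anything from the thread): one tanh hidden layer fit by plain gradient descent to a 1-D target, where width is the only knob you turn. The target function, width, and learning rate are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 200).reshape(-1, 1)
    y = np.sin(2 * x)                      # toy target function to approximate

    hidden = 50                            # width is the only capacity knob; depth stays at 1
    W1 = rng.normal(0, 1, (1, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(0, 1, (hidden, 1))
    b2 = np.zeros(1)
    lr = 0.02

    for step in range(5000):
        h = np.tanh(x @ W1 + b1)           # the single hidden layer
        pred = h @ W2 + b2
        err = pred - y
        # backprop through the two layers
        gW2 = h.T @ err / len(x)
        gb2 = err.mean(axis=0)
        dh = (err @ W2.T) * (1 - h ** 2)
        gW1 = x.T @ dh / len(x)
        gb1 = dh.mean(axis=0)
        W2 -= lr * gW2; b2 -= lr * gb2
        W1 -= lr * gW1; b1 -= lr * gb1

    print("final MSE:", float((err ** 2).mean()))

Make the hidden layer wide enough and the fit gets arbitrarily good on a bounded interval, which is exactly why the single-layer story sounded like the end of the architecture discussion at the time.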

The same thing is happening with the wide mixture-of-experts (MoE) models. They are easier to train and sound a lot smarter than the deep models, but fall on their faces when they need to work through deep chains of reasoning.
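Rough back-of-the-envelope numbers to show the width-vs-depth trade (the configs below are made up for illustration, not any specific model): at roughly the same active FFN parameters per token, the MoE stack gets its capacity from many experts per layer, while the dense stack gets it from more sequential layers.

    def ffn_params(d_model, d_ff):
        return 2 * d_model * d_ff          # up- and down-projection, biases ignored

    d_model = 4096
    moe = dict(layers=24, experts=64, active=2, d_ff=14336)   # hypothetical wide MoE
    dense = dict(layers=48, d_ff=14336)                       # hypothetical deep dense model

    moe_total   = moe["layers"] * moe["experts"] * ffn_params(d_model, moe["d_ff"])
    moe_active  = moe["layers"] * moe["active"]  * ffn_params(d_model, moe["d_ff"])
    dense_total = dense["layers"] * ffn_params(d_model, dense["d_ff"])

    print(f"MoE:   {moe_total/1e9:.0f}B total FFN params, "
          f"{moe_active/1e9:.1f}B active, {moe['layers']} sequential layers")
    print(f"Dense: {dense_total/1e9:.0f}B total FFN params, "
          f"{dense_total/1e9:.1f}B active, {dense['layers']} sequential layers")

Same active compute per token, but every token in the MoE config passes through half as many layers in sequence, which is the kind of shallow path the parent is worried about for long reasoning chains.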



