I'm old enough to remember when everyone outside of a few weirdos thought a single hidden layer was enough, because you could show that kind of network is a universal approximator.
The same thing is happening with the wide MoE models. They are easier to train and sound a lot smarter than the deep models, but they fall on their faces when they have to follow deep chains of reasoning.
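For anyone who hasn't run into it, the universal approximation result being alluded to says, roughly (a minimal statement, not the sharpest known form):

$$
\forall f \in C(K),\ K \subset \mathbb{R}^n \text{ compact},\ \forall \varepsilon > 0,\ \exists N, \alpha_i, w_i, b_i:\quad \sup_{x \in K} \Big| f(x) - \sum_{i=1}^{N} \alpha_i\, \sigma(w_i^\top x + b_i) \Big| < \varepsilon
$$

for a suitable non-polynomial activation $\sigma$. The catch, then as now, is that the theorem says nothing about how large $N$ has to be or how hard that network is to find by training, which is why "one hidden layer is enough" turned out to be misleading in practice.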