Architecture doesn't make a difference. Big enough models trained on big enough data tend to give the same results regardless of architecture. So yes, most advances in AI come down to the fact that we can now multiply matrices very fast.
That's not completely true. The architecture has to behave well under scaling, which is not trivial. Basic multi-layer perceptrons don't scale well, for example: the gradient vanishes or explodes the deeper you go in the network.
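If you want to see what "vanish" means concretely, here's a quick PyTorch sketch (toy setup, default init, nothing tuned): the gradient that reaches the first layer of a plain sigmoid MLP shrinks fast as you stack more layers. Exact numbers depend on the seed, the downward trend is the point.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(depth, width=64):
    """Build a plain MLP of the given depth and return the gradient norm
    that reaches its first layer after one backward pass."""
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.Sigmoid()]
    net = nn.Sequential(*layers)

    x = torch.randn(8, width)
    net(x).pow(2).mean().backward()          # dummy loss, just to get gradients
    return net[0].weight.grad.norm().item()  # gradient at the very first layer

torch.manual_seed(0)
for depth in (2, 8, 32):
    print(depth, first_layer_grad_norm(depth))
# The printed norms shrink rapidly as depth grows -- that's the vanishing gradient.
```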
How do modern foundation models avoid the multi-layer perceptron scaling issues? Don't they have big feed-forward components in addition to the attention layers?
They rely heavily on what we call residual or skip connections. This means each layer does something like x = x + f(x). This helps training a lot, ensuring the gradient can flow nicely through the whole network.
This is heavily used in ResNets (residual networks) for computer vision, and is what allows training much deeper convolutional networks. And transformers use the same trick.
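In PyTorch-flavored code, a residual block is basically this; a minimal sketch, not any particular model's actual code:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + f(x): the identity path gives gradients a straight route back."""
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.f(x)   # the skip connection

# Stacking many of these stays trainable, unlike a plain MLP of the same depth.
deep_net = nn.Sequential(*[ResidualBlock(64) for _ in range(50)])
out = deep_net(torch.randn(8, 64))
```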
I'm in the industry and nobody has done that in over ten years. There was a short phase after Bengio et al. published "Greedy layer-wise training of deep networks" in 2007 when people did it, for a few years at most. But already with the rise of LSTMs in the 2010s this wasn't done anymore, and now with transformers it isn't either. Would you care to share how you reached your conclusion? It matches none of my experience over the last 15 years, and we also train large-scale LLMs at our company. There's just not much point to it when gradients don't vanish.
Not easy to give a concise answer here, but let me try:
The problem mainly occurs in networks with recurrent connections or in very deep architectures. In recurrent architectures it was solved by LSTMs with their gating mechanism. In very deep networks, e.g. ResNet, it was solved by residual connections, i.e. skip connections over layers. There were also other advances, such as replacing sigmoid activations with the simpler ReLU.
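For the LSTM part, the key detail is that the cell state is updated additively, gated but not squashed at every step, which is what keeps the gradient alive over long sequences. A textbook-style sketch (biases dropped for brevity, weight shapes made up):

```python
import torch

def lstm_step(x, h, c, W):
    """One textbook LSTM step. W is a dict of weight matrices; biases omitted."""
    i = torch.sigmoid(x @ W["xi"] + h @ W["hi"])   # input gate
    f = torch.sigmoid(x @ W["xf"] + h @ W["hf"])   # forget gate
    o = torch.sigmoid(x @ W["xo"] + h @ W["ho"])   # output gate
    g = torch.tanh(x @ W["xg"] + h @ W["hg"])      # candidate cell values
    c_new = f * c + i * g            # additive cell update: the "gradient highway"
    h_new = o * torch.tanh(c_new)
    return h_new, c_new

# Tiny usage example with random weights, processing a short sequence step by step.
dim = 16
W = {k: 0.1 * torch.randn(dim, dim)
     for k in ("xi", "hi", "xf", "hf", "xo", "ho", "xg", "hg")}
h, c = torch.zeros(1, dim), torch.zeros(1, dim)
for x in torch.randn(5, 1, dim):     # an RNN still goes token by token
    h, c = lstm_step(x, h, c, W)
```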
Transformers, which are the main architecture behind modern LLMs, are highly parallel and have no recurrence, i.e. at every layer you still have access to all the input tokens, whereas an RNN processes one token at a time. To handle the potential problem caused by depth, they also use skip connections.
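Roughly, a (pre-norm) transformer block looks like the sketch below; simplified, not any specific model's exact code, but it shows where the two skip connections sit and that all tokens go through together:

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Simplified pre-norm transformer block: two residual (skip) connections,
    one around attention and one around the feed-forward MLP."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):                  # x: (batch, seq_len, dim), all tokens at once
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]      # skip connection around attention
        x = x + self.mlp(self.norm2(x))    # skip connection around the feed-forward MLP
        return x

# The whole sequence is processed in parallel, no recurrence over time steps.
block = TransformerBlock(dim=64, n_heads=4)
y = block(torch.randn(2, 10, 64))
```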
idk, they do give roughly the same results, but given the memory bottleneck it feels like we're at a point where architecture innovations matter again. For example, check out the DeepSeek-V2 tech report: they modified the model architecture specifically for lower-cost inference (by making the KV cache smaller).
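To see why the KV cache is the thing worth attacking, here's a back-of-envelope calculation with made-up numbers for a generic dense model (not DeepSeek-V2's actual config):

```python
# Rough KV-cache size for vanilla multi-head attention in fp16.
# Every number here is an illustrative assumption, not DeepSeek-V2's real config.
n_layers   = 32
n_kv_heads = 32          # vanilla MHA: one K and one V head per query head
head_dim   = 128
seq_len    = 32_768      # context length
bytes_per  = 2           # fp16

kv_bytes = 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per  # K and V
print(f"{kv_bytes / 2**30:.1f} GiB per sequence")   # 16.0 GiB with these numbers

# Grouped-query attention or DeepSeek's multi-head latent attention shrink the
# cached K/V per token, which is exactly the inference cost they were targeting.
```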