
Architecture doesn't make a difference. Big enough models trained on big enough data tend to give the same results regardless of architecture. So yes, most advances in AI come down to the fact that we can now multiply matrices very fast.



That's not completely true. The architecture must behave well under scaling, which is not trivial. Basic multi-layer perceptrons, for example, do not scale well: the gradient vanishes or explodes deeper in the network.
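
For illustration, a minimal sketch of that failure mode (assuming PyTorch; the depth, width, and sigmoid activations are arbitrary choices, not anything from the thread):

    # Toy demo: gradient norms through a deep, plain MLP with sigmoid
    # activations shrink dramatically toward the early layers.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    depth, width = 50, 128
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), nn.Sigmoid()]
    net = nn.Sequential(*layers)

    x = torch.randn(8, width)
    net(x).pow(2).mean().backward()

    # First Linear vs. last Linear: the first gradient is typically
    # orders of magnitude smaller (vanishing gradient).
    print(net[0].weight.grad.norm().item())
    print(net[-2].weight.grad.norm().item())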


And data quality. Ensuring the sourcing and quality is very important to get a good model.


This. If you have money to spend on improving your model, more training data is the first thing I'd look at.


How do modern foundation models avoid multi-layer perceptron scaling issues? Don't they have big feed-forward components in addition to the attention layers?


They rely heavily on what we call residual or skip connections. This means each layer does something like x = x + f(x). This helps training a lot, ensuring the gradient can flow nicely through the whole network.

This is heavily used in ResNets (residual networks) for computer vision, and is what allows training much deeper convolutional networks. And transformers use the same trick.
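
As a minimal sketch of the idea (assuming PyTorch; the inner f is an arbitrary two-layer MLP, not any particular model's):

    # Minimal residual block: the output is x + f(x), so the identity
    # path gives the gradient a direct route through the network.
    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.f = nn.Sequential(
                nn.Linear(dim, dim),
                nn.ReLU(),
                nn.Linear(dim, dim),
            )

        def forward(self, x):
            return x + self.f(x)  # skip connection

    block = ResidualBlock(128)
    x = torch.randn(8, 128)
    y = block(x)  # same shape as x, so blocks can be stacked deeply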


They don't do global optimisation of all layers at the same time, instead training all layers independently of each other.


I'm in the industry and nobody has done that for over ten years. There was a brief phase after Bengio et al. published "Greedy Layer-Wise Training of Deep Networks" in 2007, and people did it for a few years at most. But already with the rise of LSTMs in the 2010s this wasn't done anymore, and with transformers it isn't either. Would you care to share how you reached your conclusion? It matches none of my experience over the last 15 years, and we train large-scale LLMs at our company. There's just not much point to it when gradients don't vanish.


Why don't gradients vanish in large scale LLMs?


Not easy to give a concise answer here, but let me try:

The problem mainly occurs in networks with recurrent connections or in very deep architectures. In recurrent architectures it was solved via LSTMs with their gating mechanism. In very deep networks, e.g. ResNets, it was solved via residual connections, i.e. skip connections over layers. There were also other advances, such as replacing sigmoid activations with the simpler ReLU.

Transformers, the main architecture behind modern LLMs, are highly parallel and have no recurrence, i.e. at any layer you still have access to all the input tokens, whereas an RNN processes one token at a time. To handle the potential problem that comes with depth, they also use skip connections.
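
A rough sketch of a transformer block (assuming PyTorch; the sizes and the pre-norm layout are illustrative choices, not the exact layout of any specific LLM):

    # Pre-norm transformer block: attention sees all tokens at once
    # (no recurrence), and both sublayers sit inside x = x + sublayer(x)
    # skip connections.
    import torch
    import torch.nn as nn

    class TransformerBlock(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.norm1 = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm2 = nn.LayerNorm(dim)
            self.ff = nn.Sequential(
                nn.Linear(dim, 4 * dim),
                nn.ReLU(),
                nn.Linear(4 * dim, dim),
            )

        def forward(self, x):                 # x: (batch, seq, dim)
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h)  # every token attends to every token
            x = x + attn_out                  # skip connection 1
            x = x + self.ff(self.norm2(x))    # skip connection 2
            return x

    x = torch.randn(2, 16, 256)        # batch of 2, 16 tokens each
    print(TransformerBlock()(x).shape)  # torch.Size([2, 16, 256])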


idk, they do give the same results, but given the memory bottleneck it feels like we're at a point where architecture innovations matter again. For example, check out the DeepSeek V2 tech report: they modified the model architecture specifically for lower-cost inference (by making the KV cache smaller).
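
To see why the KV cache is the lever, a back-of-the-envelope sketch (all numbers are made-up placeholders, not DeepSeek's actual configuration):

    # Rough KV-cache size for a decoder-only model. The cache grows with
    # layers * kv_heads * head_dim * context length, so shrinking the
    # per-token KV footprint cuts inference memory directly.
    layers    = 60
    kv_heads  = 32       # placeholder; schemes like GQA/MQA use fewer
    head_dim  = 128
    seq_len   = 32_000   # context length
    bytes_per = 2        # fp16/bf16
    batch     = 1

    kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per * batch
    print(f"KV cache: {kv_bytes / 1e9:.1f} GB per sequence")  # ~31.5 GB here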


A different architecture can mean hundreds of millions of dollars more in training costs, no?


Sure, but the point wasn't about costs; it was about the capability of the trained model.



