I find the entire field lacking when it comes to long-horizon problems. Our current, widely used answer is to scale, but we're nowhere near the horizon lengths even small mammal brains can handle. Our models can have trillions of parameters, yet a mouse brain would still outperform them on long-horizon tasks, and at a fraction of the energy. That brain is something small, simple, and elegant: an incredible search algorithm that not only finds near-optimal routes but also keeps learning on a fixed computational budget.
I'm honestly a bit envious of future engineers who will be tackling these kinds of problems with a 100-line Jupyter notebook on a laptop years from now. If we discovered the right method or algorithm for these long-horizon problems, a 2B-parameter model might even outperform current models on everything except short, extreme reasoning problems.
The only solution I've ever considered for this is expanding a model's dimensionality over time, rather than chasing a perfect set of weights. The higher the dimensionality you can give a model, the greater its theoretical storage capacity. This could resemble a two-layer model: one layer acting as a superposition of multiple ideal points, and the other layer knowing how to use them.
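To make that concrete, here is a minimal sketch of one way the two-layer idea could look, assuming a mixture-of-weights reading: one parameter tensor stores several candidate weight matrices in superposition, and a small router layer learns how to combine them per input. The class and parameter names are mine and purely hypothetical, not an established architecture.

```python
import torch
import torch.nn as nn


class SuperposedLinear(nn.Module):
    """One parameter tensor holding several candidate weight matrices; a
    small router layer mixes them per input (names are hypothetical)."""

    def __init__(self, in_dim, out_dim, n_components):
        super().__init__()
        # candidate weight matrices stored side by side ("superposition")
        self.bases = nn.Parameter(torch.randn(n_components, out_dim, in_dim) * 0.02)
        # the second layer: learns how to use the stored configurations
        self.router = nn.Linear(in_dim, n_components)

    def forward(self, x):
        coeffs = torch.softmax(self.router(x), dim=-1)        # (batch, n_components)
        # collapse the superposition into one effective weight per example
        w = torch.einsum("bc,coi->boi", coeffs, self.bases)   # (batch, out, in)
        return torch.einsum("boi,bi->bo", w, x)


layer = SuperposedLinear(in_dim=16, out_dim=8, n_components=4)
print(layer(torch.randn(32, 16)).shape)  # torch.Size([32, 8])
```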
When you think about the loss landscape, imagine it with many minima for a given task. If we could create a method that navigates between these minima by reconfiguring the model when needed, we could theoretically build a single model with a near-infinite number of usable local minima, and therefore a kind of higher-dimensional memory. This may sound wild, but consider that the human brain potentially forms and prunes thousands of synaptic connections in a single day. Could it be that these connections steer our internal loss landscape between the different minima we need throughout the day?
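A rough sketch of what "reconfiguring the model to move between minima" could mean in code, under one possible interpretation: a single shared weight tensor with a fixed binary mask per task, so switching masks rewires the layer without touching the weights. This borrows the flavor of supermask / lottery-ticket style work and is purely illustrative, not a proposed algorithm.

```python
import torch
import torch.nn as nn


class MaskSwitchedLinear(nn.Module):
    """A single shared weight tensor plus one fixed binary mask per task;
    switching masks rewires the layer without changing any weights."""

    def __init__(self, in_dim, out_dim, n_tasks, keep_prob=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)
        # random fixed masks purely for illustration; in practice they would
        # be learned or searched for per task
        masks = (torch.rand(n_tasks, out_dim, in_dim) < keep_prob).float()
        self.register_buffer("masks", masks)

    def forward(self, x, task_id):
        # the same weights, reconfigured by whichever mask is active
        return x @ (self.weight * self.masks[task_id]).t()


layer = MaskSwitchedLinear(in_dim=16, out_dim=8, n_tasks=3)
x = torch.randn(4, 16)
print(layer(x, task_id=0).shape, layer(x, task_id=2).shape)
```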
Yes... The field lacks the HOLY GRAIL (long-horizon problems). But we don't need a mouse brain to sort spam emails. The Hail Mary models at 2B+ parameters are still niche applications of these algorithms; they are too heavy to run practically. There is plenty of room for clever, small models running on limited hardware and datasets to solve useful problems and nothing more.
Models that change size as needed have been experimented with, but they are either too inefficient or too difficult to optimize within a limited power budget. Still, I agree that they are likely what is needed if we want to keep scaling upward in size.
I suspect the real bottleneck is a breakthrough in training itself. Backpropagation on a scalar loss is too simplistic to optimize our current models perfectly, let alone future, larger ones. But there is no guarantee a better alternative exists, which may impose a hard ceiling on current ML approaches.
I'm not proposing a change in model size; rather, I'm suggesting higher dimensionality within the current structure. There's an interesting paper on LLM explainability that found individual neurons often represent a superposition of several distinct features.
What I'm advocating is a substantial increase in this effect: keeping the model size the same while expanding its effective dimensionality. The "curse of dimensionality" works in our favor here; even a modest increase in the number of dimensions yields an exponentially larger volume to store things in.
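As a toy illustration of that superposition effect (loosely in the spirit of the interpretability work mentioned above, not a reproduction of any particular paper): squeeze 20 sparse features through a 5-dimensional bottleneck and the model still learns to reconstruct most of them, because features end up sharing neurons. All sizes and hyperparameters below are arbitrary.

```python
import torch
import torch.nn as nn

n_features, n_hidden = 20, 5            # 20 sparse features, 5 hidden dimensions
W = nn.Parameter(torch.randn(n_features, n_hidden) * 0.1)
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(2000):
    # sparse inputs: each feature is active only ~5% of the time
    active = (torch.rand(256, n_features) < 0.05).float()
    x = torch.rand(256, n_features) * active
    recon = torch.relu(x @ W @ W.t() + b)   # encode to 5 dims, decode back to 20
    loss = ((recon - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final reconstruction error: {loss.item():.4f}")
```

Pushing the ratio of features to hidden units much further is roughly what "more dimensionality without more size" would demand.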
While I agree that backpropagation isn't a complete solution, it's ultimately just a stochastic search method. The key point is that expanding the dimensionality of a model's representational space is likely the only viable long-term direction. To achieve this, backpropagation needs to work within an increasingly high-dimensional space.
A useful analogy is training a small model on random versus structured data. With structured data, the model can absorb an enormous amount; with random data, it hits a hard limit imposed by the network's size. Why is that?
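A quick sketch of that comparison, assuming a tiny fixed-size MLP memorizing labels: on structured data (labels that follow a simple rule) training accuracy stays high as the dataset grows, while on random labels it collapses once the dataset exceeds what the parameter budget can memorize. Sizes and step counts are arbitrary.

```python
import torch
import torch.nn as nn


def fit(x, y, steps=2000):
    """Train a tiny fixed-size MLP and return its training accuracy."""
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (model(x).argmax(dim=1) == y).float().mean().item()


for n in (200, 2000, 20000):
    x = torch.randn(n, 16)
    structured = (x.sum(dim=1) > 0).long()   # labels follow a simple learnable rule
    random_lbl = torch.randint(0, 2, (n,))   # labels carry no structure at all
    print(n, "structured:", fit(x, structured), "random:", fit(x, random_lbl))
```

The structured rule can be represented compactly, so the network keeps up as the data grows; random labels can only be stored point by point, which is exactly where the hard limit shows up.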