I find the entire field lacking when it comes to long-horizon problems. Our current, widely used answer is to scale, but we're nowhere near the horizon lengths even small mammal brains can handle. Our models can have trillions of parameters, yet a mouse brain would still outperform them on long-horizon tasks, and at a fraction of the energy. That brain is something small, simple, and elegant: an incredible search algorithm that not only finds near-optimal routes but also keeps learning on a fixed computational budget.
I'm honestly a bit envious of future engineers who will be tackling these kinds of problems with a 100-line Jupyter notebook on a laptop years from now. If we discovered the right method or algorithm for these long-horizon problems, a 2B-parameter model might even outperform current models on everything except short, extreme reasoning problems.
The only solution I've ever considered for this is expanding a model's dimensionality over time, rather than chasing a perfect set of weights. The higher the dimensionality you can give a model, the greater its theoretical storage capacity. This could resemble a two-layer model: one layer acting as a superposition of multiple ideal points, and the other layer knowing how to use them.
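To make that concrete, here is a minimal sketch of one way the two-layer idea could look, assuming a mixture-of-weights reading: one parameter tensor stores several candidate weight matrices in superposition, and a small router layer learns how to combine them per input. The class and parameter names are mine and purely hypothetical, not an established architecture.

```python
import torch
import torch.nn as nn


class SuperposedLinear(nn.Module):
    """One parameter tensor holding several candidate weight matrices; a
    small router layer mixes them per input (names are hypothetical)."""

    def __init__(self, in_dim, out_dim, n_components):
        super().__init__()
        # candidate weight matrices stored side by side ("superposition")
        self.bases = nn.Parameter(torch.randn(n_components, out_dim, in_dim) * 0.02)
        # the second layer: learns how to use the stored configurations
        self.router = nn.Linear(in_dim, n_components)

    def forward(self, x):
        coeffs = torch.softmax(self.router(x), dim=-1)        # (batch, n_components)
        # collapse the superposition into one effective weight per example
        w = torch.einsum("bc,coi->boi", coeffs, self.bases)   # (batch, out, in)
        return torch.einsum("boi,bi->bo", w, x)


layer = SuperposedLinear(in_dim=16, out_dim=8, n_components=4)
print(layer(torch.randn(32, 16)).shape)  # torch.Size([32, 8])
```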
When you think about the loss landscape, imagine it with many minima for a given task. If we could create a method that navigates between these minima by reconfiguring the model when needed, we could theoretically build a single model with a near-infinite number of usable local minima, and therefore a kind of higher-dimensional memory. This may sound wild, but consider that the human brain potentially forms and prunes thousands of synaptic connections in a single day. Could it be that these connections steer our internal loss landscape between the different minima we need throughout the day?
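A rough sketch of what "reconfiguring the model to move between minima" could mean in code, under one possible interpretation: a single shared weight tensor with a fixed binary mask per task, so switching masks rewires the layer without touching the weights. This borrows the flavor of supermask / lottery-ticket style work and is purely illustrative, not a proposed algorithm.

```python
import torch
import torch.nn as nn


class MaskSwitchedLinear(nn.Module):
    """A single shared weight tensor plus one fixed binary mask per task;
    switching masks rewires the layer without changing any weights."""

    def __init__(self, in_dim, out_dim, n_tasks, keep_prob=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim) * 0.02)
        # random fixed masks purely for illustration; in practice they would
        # be learned or searched for per task
        masks = (torch.rand(n_tasks, out_dim, in_dim) < keep_prob).float()
        self.register_buffer("masks", masks)

    def forward(self, x, task_id):
        # the same weights, reconfigured by whichever mask is active
        return x @ (self.weight * self.masks[task_id]).t()


layer = MaskSwitchedLinear(in_dim=16, out_dim=8, n_tasks=3)
x = torch.randn(4, 16)
print(layer(x, task_id=0).shape, layer(x, task_id=2).shape)
```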
Yes... The field lacks the HOLY GRAIL (long-horizon problems). But we don't need a mouse brain to sort spam emails. The Hail Mary models at 2B+ parameters are still niche applications of these algorithms; they are too heavy to run practically. There is plenty of room for clever, small models running on limited hardware and datasets to solve useful problems and nothing more.
Models that change size as needed have been experimented with, but they are either too inefficient or too difficult to optimize within a limited power budget. Still, I agree that they are likely what is needed if we want to keep scaling upward in size.
I suspect the real bottleneck is a breakthrough in training itself. Backpropagation on a scalar loss is too simplistic to optimize our current models perfectly, let alone future, larger ones. But there is no guarantee a better alternative exists, which may impose a hard ceiling on current ML approaches.
I'm not proposing a change in model size; rather, I'm suggesting higher dimensionality within the current structure. There's an interesting paper on LLM explainability that found individual neurons often represent a superposition of several distinct features.
What I'm advocating is a substantial increase in this effect: keeping the model size the same while expanding its effective dimensionality. The "curse of dimensionality" works in our favor here; even a modest increase in the number of dimensions yields an exponentially larger volume to store things in.
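As a toy illustration of that superposition effect (loosely in the spirit of the interpretability work mentioned above, not a reproduction of any particular paper): squeeze 20 sparse features through a 5-dimensional bottleneck and the model still learns to reconstruct most of them, because features end up sharing neurons. All sizes and hyperparameters below are arbitrary.

```python
import torch
import torch.nn as nn

n_features, n_hidden = 20, 5            # 20 sparse features, 5 hidden dimensions
W = nn.Parameter(torch.randn(n_features, n_hidden) * 0.1)
b = nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(2000):
    # sparse inputs: each feature is active only ~5% of the time
    active = (torch.rand(256, n_features) < 0.05).float()
    x = torch.rand(256, n_features) * active
    recon = torch.relu(x @ W @ W.t() + b)   # encode to 5 dims, decode back to 20
    loss = ((recon - x) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final reconstruction error: {loss.item():.4f}")
```

Pushing the ratio of features to hidden units much further is roughly what "more dimensionality without more size" would demand.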
While I agree that backpropagation isn't a complete solution, it's ultimately just a stochastic search method. The key point is that expanding the dimensionality of a model's representational space is likely the only viable long-term direction. To achieve this, backpropagation needs to work within an increasingly high-dimensional space.
A useful analogy is training a small model on random versus structured data. With structured data, the model can absorb an enormous amount; with random data, it hits a hard limit imposed by the network's size. Why is that?
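A quick sketch of that comparison, assuming a tiny fixed-size MLP memorizing labels: on structured data (labels that follow a simple rule) training accuracy stays high as the dataset grows, while on random labels it collapses once the dataset exceeds what the parameter budget can memorize. Sizes and step counts are arbitrary.

```python
import torch
import torch.nn as nn


def fit(x, y, steps=2000):
    """Train a tiny fixed-size MLP and return its training accuracy."""
    model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (model(x).argmax(dim=1) == y).float().mean().item()


for n in (200, 2000, 20000):
    x = torch.randn(n, 16)
    structured = (x.sum(dim=1) > 0).long()   # labels follow a simple learnable rule
    random_lbl = torch.randint(0, 2, (n,))   # labels carry no structure at all
    print(n, "structured:", fit(x, structured), "random:", fit(x, random_lbl))
```

The structured rule can be represented compactly, so the network keeps up as the data grows; random labels can only be stored point by point, which is exactly where the hard limit shows up.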