For a lot of the content they were trained on, it seems like the easiest way to predict the next token would be to model the world or work with axioms. So how do we know that an LLM isn't doing these things internally?
In fact, it looks like the model is doing those things internally.
We argue that representations in AI models, particularly deep networks, are converging. First, we survey many examples of convergence in the literature: over time and across multiple domains, the ways by which different neural networks represent data are becoming more aligned. Next, we demonstrate convergence across data modalities: as vision models and language models get larger, they measure distance between datapoints in a more and more alike way. We hypothesize that this convergence is driving toward a shared statistical model of reality, akin to Plato’s concept of an ideal reality. We term such a representation the platonic representation and discuss several possible selective pressures toward it. Finally, we discuss the implications of these trends, their limitations, and counterexamples to our analysis.
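The alignment measurement the abstract mentions ("measure distance between datapoints in a more and more alike way") boils down to comparing neighborhood structure across two embedding spaces. Here's a rough, hypothetical sketch of one such metric (mutual k-nearest-neighbor overlap), not the authors' actual code; the toy data and all names are made up:

```python
import numpy as np

def knn_sets(X, k=10):
    """For each row of X (one embedding per datapoint), return the index
    set of its k nearest neighbors under Euclidean distance."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    np.fill_diagonal(d2, np.inf)                    # exclude self from neighbor lists
    return [set(np.argsort(row)[:k]) for row in d2]

def knn_alignment(X_a, X_b, k=10):
    """Average fraction of k-nearest neighbors shared between two embedding
    spaces for the same datapoints. 1.0 = identical neighborhood structure,
    roughly k/N = chance level."""
    nn_a, nn_b = knn_sets(X_a, k), knn_sets(X_b, k)
    return np.mean([len(a & b) / k for a, b in zip(nn_a, nn_b)])

# Toy usage: two pretend models embedding the same 1000 inputs.
rng = np.random.default_rng(0)
shared = rng.normal(size=(1000, 32))                       # shared latent structure
emb_vision = shared @ rng.normal(size=(32, 128))           # stand-in "vision model"
emb_text = shared @ rng.normal(size=(32, 256)) \
           + 0.1 * rng.normal(size=(1000, 256))            # stand-in "language model"
print(knn_alignment(emb_vision, emb_text, k=10))
```

The idea being: if two models trained on different modalities rank the same datapoints as close together, their representations are "aligned" in the paper's sense, even though the raw vectors live in different spaces.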
Unless I misread this paper, their argument is entirely hypothetical: the LLM is still a black box, and they can only hypothesise about what is going on internally by looking at the output(s) and guessing at what it would take to get there.
There's nothing wrong with a hypothesis or that process, but it means we still don't know whether models are doing this or not.
Maybe I mixed up that paper with another, but the one I meant to post shows that you can read something like a world model out of the activations of the model's layers.
There was a paper showing that a model trained on Othello moves builds an internal model of the board, models the skill level of its opponent, and more.
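The technique in that line of work is roughly a "probe": a small classifier trained on a layer's activations to predict something about the game state. If the probe generalizes well above chance, the state is linearly readable from the activations. A minimal sketch, assuming synthetic stand-in activations so it runs end to end (this is not the paper's code, and every name here is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: `acts` would be hidden activations of a sequence model
# (one vector per game position), and `board` the true contents of one chosen
# Othello square at that position (0=empty, 1=black, 2=white).
rng = np.random.default_rng(0)
n_positions, d_model = 5000, 512
board = rng.integers(0, 3, size=n_positions)            # ground-truth square contents
encoding = rng.normal(size=(3, d_model))                # pretend the state is linearly encoded
acts = 0.5 * rng.normal(size=(n_positions, d_model))
acts += np.eye(3)[board] @ encoding                     # inject that encoding into the activations

# Train a linear probe on the first 4000 positions, test on the rest.
split = 4000
probe = LogisticRegression(max_iter=1000).fit(acts[:split], board[:split])

# Accuracy well above chance (~33% here) means the activations linearly
# encode that square's state -- the "world model" being read out.
print("probe accuracy:", probe.score(acts[split:], board[split:]))
```

With a real model you would swap the synthetic `acts` for activations captured at a chosen layer while replaying games, and fit one probe per board square.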
Well, my understanding is that there's ultimately the black box problem: we keep building these models and the output seems to get better, but we can't actually inspect how they work internally.