I agree. However, my point is that they have to compress information in nontrivial ways to achieve their goal. The typical training set of a modern LLM is about 20 trillion tokens of 3 bytes each. There is definitely some redundancy, and typically the 3rd byte is not fully used, so 19 bits per token would probably suffice; still, to fit that information into about 100 billion parameters of 2 bytes each, the model needs to somehow reduce the information content by roughly 300-fold (237.5x if you compare 19-bit tokens against 16-bit parameters, though arguably 8-bit quantization is close enough and gives another 2x, so probably 475x). A quick check for the llama3.3 models of 70B parameters gives similar or larger gaps between training tokens and parameters. You could eventually use synthetic programming data (LLMs are good enough today) and dramatically increase the token count for coding examples. More importantly, you could make it impossible to find correlation/memorization shortcuts unless the model figures out the underlying algorithmic structure, and the paper I cited is a neat and simple example of this for smaller/specialized decoder transformers.
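For what it's worth, the arithmetic behind those ratios is just the numbers quoted above, nothing measured:

```python
# Back-of-the-envelope check of the compression ratios quoted above.
tokens = 20e12          # ~20 trillion training tokens
params = 100e9          # ~100 billion parameters

naive = (tokens * 3 * 8) / (params * 16)   # 3-byte tokens vs 16-bit params
tight = (tokens * 19) / (params * 16)      # 19-bit tokens vs 16-bit params
int8  = (tokens * 19) / (params * 8)       # 19-bit tokens vs 8-bit params

print(naive, tight, int8)  # -> 300.0 237.5 475.0
```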
A transformer is not a compressor. It's a transformer/generator. It'll generate a different output for an infinite number of different inputs. Does that mean it's got an infinite storage capacity?
The trained parameters of a transformer are not a compressed version of the training set, or of its information content; they are a configuration of the transformer such that its auto-regressive generative capabilities are optimized to produce the best continuations of partial training-set samples that it is capable of.
Now, are there other architectures, other than the transformer, that might do a better or more efficient job (in terms of # of parameters) at predicting training set samples, or even at compressing the information content of the training set? Perhaps, but we're not talking hypotheticals, we're talking about transformers (or at least most of us are).
Even if a transformer were a compression engine, which it isn't, rather than a generative architecture, why would you think that the number of tokens in the training set is a meaningful measure/estimate of its information content?! Heck, you go beyond that to a specific tokenization scheme and a number of bits/bytes per token, all of which is utterly meaningless! You may as well count the number of characters, or words, or sentences for that matter, in the training set, which would all be equally bad ways to estimate its information content, other than sentences perhaps having at least some tangential relationship to it.
sigh
You've been downvoted because you're talking about straw men, and other people are talking about transformers.
I should have emphasized the words "nontrivial ways" in my previous response to you. I didn't mean to emphasize compression, and definitely not memorization, just the ability to also learn algorithms that can be expressed in the parallel decoder-transformer language (RASP-L). Other people had mentioned memorization or clustering/nearest-neighbor algorithms as the main ways that decoder transformers work, and I pointed out a paper that cannot be explained that way no matter how hard one tries. That paper is not unique, and nobody has shown that decoder transformers can memorize their training sets; they typically cannot, both because it is a numbers/compression game that is not in their favor and because typical training sets have strong correlations or hidden algorithmic structures that allow for better ways of learning. In that particular example, the training set was random data from different random functions and totally unrelated to the validation/test sets, so compressing the training set would have been close to useless anyway; the only way for the decoder transformer to learn was to figure out an algorithm that optimally approximates the function evaluations.
The paper you linked is about in-context learning, an emergent run-time (aka inference time) capability of LLMs, which has little relationship to what/how they are learning at training time.
At training time the model learns using the gradient descent algorithm to find the parameter values corresponding to the minimum of the error function. At run-time there are no more parameter updates - no learning in that sense.
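To make the distinction concrete, here is a toy sketch in plain NumPy (a linear model, not a transformer; the data and learning rate are made up for illustration): gradient descent updates the parameters during training, and at run-time those parameters are frozen and simply applied.

```python
import numpy as np

# Toy "training set": inputs X and targets y from a known linear map.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w

# Training time: gradient descent updates the parameters w
# toward the minimum of the mean-squared-error function.
w = np.zeros(3)
lr = 0.1
for _ in range(200):
    grad = 2 * X.T @ (X @ w - y) / len(y)  # gradient of the error function
    w -= lr * grad                          # parameter update = learning

# Run time: no more parameter updates; the frozen w is just applied.
x_new = np.array([0.3, -0.1, 2.0])
prediction = x_new @ w
```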
In-context "learning" refers to the ability of the trained model to utilize information (e.g. proper names, examples) from the current input, aka context, when generating - an ability that it learnt at training time pursuant to its error-minimization objective.
e.g.
There are going to be many examples in the training set where the subject of a sentence is mentioned more than once, either by name or by pronoun, and the model will have had to learn when the best prediction of a name (or gender) later in a sentence is one that was already mentioned earlier - the same person. These names may be unique to an individual training sample, and/or in any case the only predictive signal of who will be mentioned later in the sentence, so at training time the model (to minimize prediction errors) had to learn that sometimes the best word/token to predict is not one stored in its parameters, but one that it needs to copy from earlier in the context (using a key-based lookup - the attention mechanism).
If the transformer, at run-time, is fed the input "Mr. Smith received a letter addressed to Mr." [...], then the model will hopefully recognize the pattern and realize it needs to do a key-based context lookup of the name associated with "Mr.", then copy that to the output as the predicted next word (resulting in "addressed to Mr. Smith"). This is referred to as "in-context learning", although it has nothing to do with the gradient-based learning that takes place at training time. These two types of "learning" are unrelated.
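The copy behavior can be caricatured in a few lines of Python (a hypothetical toy, not actual attention math): find the most recent earlier occurrence of the current token in the context and copy whatever followed it.

```python
def induction_copy(tokens):
    """Toy induction head: predict the next token by finding the most
    recent earlier occurrence of the last token (the key-based lookup)
    and copying the token that followed it."""
    query = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == query:
            return tokens[i + 1]
    return None  # no match in context; a real model falls back on its weights

context = ["Mr.", "Smith", "received", "a", "letter", "addressed", "to", "Mr."]
print(induction_copy(context))  # -> Smith
```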
Similar to the above, another example of in-context learning is the learning of simple "functions" (mappings) from examples given in the context. Just as in the name example, the model will have seen many examples in the training set of the types of pattern/analogy it needs to learn to minimize prediction errors (e.g. "black is to white as big is to small", or black->white, big->small), and will hopefully recognize the pattern at run-time and again use an induction head to generate the expected completion.
The opening example in the paper you linked ("maison->house, chat->cat") is another example of the same kind. All that is going on is that the model learnt, at training time, when/how to use data in the context at run-time, again via the induction-head mechanism, which has the general form A':B' -> A:B. You can call this an algorithm if you want to, but it's really just a learnt mapping.
Thanks. I don't think we disagree on major points. Maybe there is a communication barrier, and it may be on me; I came to ML from a computational math/science/statistics background. These next-token prediction algorithms are of course learned mappings, and I'm not sure one needs anything else when the mappings involve reasonably powerful abilities. If you are perhaps from a pure CS background and you think about search, then yes, one could simply explore a sequence of A':B' -> A'':B'' -> ... before finding A:B, and use the conditional probability of the sequence to guide a best-first search or MCTS expansion (if the training data had a similar structure). Are there other ways to learn that type of search? Probably. But what I meant above by "algorithm" is what you correctly understood as the mapping itself: the transformer computes intermediate useful quantities distributed throughout its weights, sometimes centered at different depths, so that it can eventually produce the step mapping A':B' -> A:B. We don't yet have a clean disassembler to probe this trained "algorithm"; there are some rare efforts that map such a mapping back to conventional pseudo-code, but not in the general case (and I wouldn't even know how easy it would be for us to work with a somewhat shorter but still huge functional form that translates English to a different language, or to computer code). Part of why o1-like efforts didn't start before we had reasonably powerful architectures and the required compute is that these types of "algorithm" developments require large enough models (though we have had those for a couple of years now) and relevant training data (which are easier to procure/build/clean up with the aid of the early tools).