This is a pretty standard technical term in machine learning which isn't necessarily pejorative but a description of behavior. Memorization differs from generalization in that it doesn't reflect some sort of synthesis from learned knowledge but rather repeats something from the training data directly. It usually indicates overfitting, and it's distinct from responding appropriately to something that wasn't specifically trained on, which is generalization.
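The usual tell is the gap between performance on data the model has seen and data it hasn't. A toy sketch of that gap, where the dataset and model choices are arbitrary and just there to make the point visible:

    # Illustration of memorization vs. generalization via the train/test gap.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    # Noisy labels (flip_y) give the model something to memorize that can't generalize.
    X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # An unconstrained tree can fit the training set almost exactly, noise included.
    memorizer = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    # A depth-limited tree is forced to learn broader patterns instead.
    generalizer = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

    for name, model in [("memorizer", memorizer), ("generalizer", generalizer)]:
        print(name, "train %.2f" % model.score(X_train, y_train),
              "test %.2f" % model.score(X_test, y_test))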
They do generalize. The claim is that the fine details are not the result of generalization, but of repeating test data verbatim. That seems consistent both with my intuitive understanding of neural networks and with the behavior I've observed, so I'm inclined to agree. So what does that mean? It means that while LLMs can produce impressive output, the most impressive results that people are touting probably have a significant amount of verbatim training data in them. In other words, they're good, but not as good as they seem to be.
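One rough way to check that suspicion, if you had the relevant corpus, is to look for long verbatim n-gram overlaps between an impressive output and that corpus. A toy version with placeholder strings (the 8-word window is an arbitrary choice, and the actual training corpus obviously isn't available to outsiders):

    # Toy check: what fraction of an output's word 8-grams appear verbatim in a reference text?
    def ngrams(tokens, n=8):
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def verbatim_overlap(output_text, reference_text, n=8):
        out = ngrams(output_text.split(), n)
        if not out:
            return 0.0
        return len(out & ngrams(reference_text.split(), n)) / len(out)

    # Placeholder strings; a real check would need the actual training corpus.
    print(verbatim_overlap("the quick brown fox jumps over the lazy dog today",
                           "yesterday the quick brown fox jumps over the lazy dog slept"))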
One of the most impressive things I've seen LLMs do is take in a grammar, in a format that doesn't strictly conform to any specific variant of formal grammar notation, then generate output that conforms to that grammar, and reason about why and how it conforms.
Most people would struggle immensely with a task like that even if handed a textbook on the subject, and no training data happens to contain text in, or about, a language governed by the grammar of random nonsense I fed in.
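For anyone who wants to try the same kind of test: invent a grammar over nonsense tokens, describe it informally, then ask the model for conforming strings and an explanation of why they conform. Something like this throwaway generator, where the grammar itself is deliberately arbitrary:

    import random

    # A made-up grammar over nonsense tokens; nothing like it should exist in any corpus.
    GRAMMAR = {
        "S":  [["NP", "VP"]],
        "NP": [["blorp"], ["zib", "NP"]],
        "VP": [["quandles", "NP"], ["frims"]],
    }

    def generate(symbol="S"):
        """Expand a symbol by recursively picking one of its productions at random."""
        if symbol not in GRAMMAR:          # nonsense terminal token
            return [symbol]
        production = random.choice(GRAMMAR[symbol])
        return [token for part in production for token in generate(part)]

    print(" ".join(generate()))            # e.g. "zib blorp quandles blorp"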
There are areas where their reasoning is really awful. Ironically, that's often when they seem most human-like. E.g. I just had a lengthy "argument" with ChatGPT comparing the theoretical computational power of Markov Decision Processes vs. Turing Machines under various assumptions about the decision maker in the MDP, and its reasoning was riddled with the sort of logical fallacies I could very well see from a high school student trying to compare the two with a Wikipedia-level understanding of either, without enough grounding to reason about how aspects of one can be made to model the other.
But there are plenty of areas where you can get them to produce good results whose "fine details" could not possibly have been repeated verbatim from the test data, because they didn't exist prior to the conversation.
Mmm, the most impressive thing I see LLMs do is take a piece of unstructured input and transform it in some way - summarize it, extract information as JSON, etc. This wouldn't be possible if they were repeating training data verbatim, since it works on entirely novel inputs.
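Concretely, the pattern I mean looks roughly like this; the model name, the json_object response format, and the field names are just placeholders for illustration:

    # Sketch of the extract-as-JSON pattern; assumes the openai>=1.0 client and an
    # OPENAI_API_KEY in the environment. Model name and field names are placeholders.
    import json
    from openai import OpenAI

    client = OpenAI()

    def extract_ticket(text: str) -> dict:
        response = client.chat.completions.create(
            model="gpt-4o-mini",                        # placeholder model
            response_format={"type": "json_object"},    # ask for parseable JSON back
            messages=[
                {"role": "system",
                 "content": "Return JSON with keys: name, company, request."},
                {"role": "user", "content": text},
            ],
        )
        return json.loads(response.choices[0].message.content)

    print(extract_ticket("Hi, this is Dana from Acme - could someone call me back "
                         "about the broken invoice export?"))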