Since I first learned about grokking, I've had a strong suspicion that getting a handle on it and figuring out how to aid it should be the central question in AI. We are currently stuck in a local minimum, where memorizing circuits perform well enough to handle a whole lot of economically viable use cases. But the profit function has guided us into a valley dominated by a data- and compute-hungry architecture that isn't ideal for learning generalizing circuits (partially because the memorizing circuits are so effective! We relatively quickly reach a flat loss landscape, after which we blindly jump around for countless epochs in a kind of Brownian motion until we land in a region where regularizers can drive generalization). Research like this paper is incredibly important.
I thought this was the most interesting bit from the paper:
> Training data distribution, instead of training data size, qualitatively influences generalization behavior.
If there are many examples of X on the web, and only a few examples of related Y, would you actively prevent the additional X samples from being included in the training set, to avoid reinforcing the common case?
For the purpose of generalization, I think that's more treating a symptom than the root cause.
The better approach IMO would be finding architectures that heavily penalize the formation of memorizing and interpolating circuits, e.g., much stronger weight decay than is used today.
Kind of, but that's not inherent to the over-reliance on memorization. I suspect using the top-tier models from any of OAI, Anthropic, or Google would have produced much less embarrassing results, and I believe they're all primarily memorizers, not generalizers.
The search issue happened because Google had to use a really cheap model to power the search results, and a memorizing model that cheap is going to be highly constrained in capabilities (at least right now).
It doesn't help that they were caught with their pants down back when GPT-3 first entered the zeitgeist (and thus also the radar of Google's institutional stakeholders). Haste made waste, corners got cut, cargo went overboard.
Reminds me of that old quote about “the difference between average and state of the art is forgetting to turn it off over summer break” or similar.
I wonder if this is why smaller LLMs seem to punch above their weight: are they further along in the process of distilling the data down into understanding?
> leave it training. I’ve often seen people tempted to stop the model training when the validation loss seems to be leveling off. In my experience networks keep training for unintuitively long time. One time I accidentally left a model training during the winter break and when I got back in January it was SOTA (“state of the art”).
(This is not the oldest version, and doesn't mention the NN in question, but I believe this was about Neuraltalk, his image captioner.)
Not about that quote in particular, but the original grokking paper from Power et al. came about because they accidentally left a training job running too long (at least as related by gwern).
I'm not sure they've written that anywhere else (which is a cautionary lesson for anyone trying to understand how research happens, BTW), but as further evidence besides just 'Ethan told me so on EAI Discord [IIRC]', you can see that in the original Reddit discussion where I mention Caballero's poster conversation, the lead author comments several times and doesn't contradict that anecdote: https://www.reddit.com/r/mlscaling/comments/n78584/grokking_...
I've read research that showed that among people who were top 1% skill wise, it was very common for them to have had early instructors who focused on skill development as play relative to the population at large. Because they learned to enjoy the process, the outcome didn't matter as much, and they could stay motivated to progress. Tournaments and other competitive activities also provide a way to maintain motivation - they're often graded so you can compete against others of a similar skill level, and competitive success can provide motivation to train harder even when absolute progress slows down.
I just learned about grokking; it reminds me of double descent, and I looked up a 2022 paper called "Unifying grokking and double descent". I'm still unclear on what the difference is. My basic understanding of double descent was that, after fitting the train data, the loss becomes dominated by the regularization term, which keeps pushing the model toward simpler solutions.
Grokking is a sudden huge jump in test accuracy with increasing training steps, well after training accuracy has fully converged. Double descent is test performance increasing, decreasing, and then finally rising again as model parameters are increased.
What they share is a subversion of the naive framework that ML works simply by performing gradient descent over a loss landscape. Double descent subverts it by showing that learning isn't monotonic in parameter count; grokking subverts it by learning after training convergence.
I'd put the lottery ticket hypothesis in the same bucket of "things that may happen that don't make sense at all for a simple optimization procedure."
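For anyone who wants to see the grokking half of that concretely, here's a minimal sketch of the classic setup: a tiny transformer on modular addition with strong weight decay, trained far past the point where training accuracy saturates. All hyperparameters here are illustrative guesses, not taken from this paper.

```python
import torch
import torch.nn as nn

# Illustrative grokking setup (hyperparameters are guesses, not from the paper):
# learn (a + b) mod p from a fraction of all pairs, with heavy weight decay.
p = 97
pairs = [(a, b) for a in range(p) for b in range(p)]
torch.manual_seed(0)
perm = torch.randperm(len(pairs))
split = int(0.4 * len(pairs))                        # train on 40% of all pairs
train_idx, test_idx = perm[:split], perm[split:]

def batch(idx):
    x = torch.tensor([pairs[i] for i in idx])        # (N, 2) token ids
    y = torch.tensor([(a + b) % p for a, b in x.tolist()])
    return x, y

class TinyTransformer(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.emb = nn.Embedding(p, d)
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d, p)

    def forward(self, x):
        h = self.encoder(self.emb(x))                # (N, 2, d)
        return self.head(h.mean(dim=1))              # pool the pair, predict the sum

model = TinyTransformer()
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)  # strong decay
x_tr, y_tr = batch(train_idx)
x_te, y_te = batch(test_idx)

for step in range(100_000):   # the point: keep training long after train accuracy saturates
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x_tr), y_tr)
    loss.backward()
    opt.step()
    if step % 1000 == 0:
        with torch.no_grad():
            test_acc = (model(x_te).argmax(-1) == y_te).float().mean().item()
        print(step, round(loss.item(), 4), round(test_acc, 4))
```

In runs like this (depending on the train fraction and weight decay), test accuracy can sit near chance long after training accuracy hits 100% and then jump abruptly; that delayed jump is the grokking signature, as opposed to double descent's non-monotonicity in parameter count.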
My takeaway from the paper is that you can guide training by adding or switching to a more difficult loss function after you've got the basics right. It looks like they never got as far as overfitting/grokking with it, so maybe there's more to discover further down the training alley.
This paper feels way too abstract, to the point it makes it hard to understand what the team actually did.
For instance, the paper claims it beat GPT-4-Turbo and Gemini-Pro-1.5 on certain tasks... but it doesn't include any of the questions they asked GPT-4 or Gemini, so it's hard to guess whether these results have any value at all.
It's also unclear what they even trained their custom transformer to do. It has a custom tokenizer, but they don't give a list of tokens (aside from a few examples in the diagrams like "Barrack", "Michelle", "Trump"). They talk about in-distribution and out-of-distribution tasks, but they don't give any examples of these tasks and what they look like.
This feels like accidental complexity. It wouldn't have been hard to add a few more appendices with, e.g., a list of 20 or so in-distribution sentences they asked the model to complete and 10 out-of-distribution sentences. Instead, all they include are diagrams comparing performance for different hyperparameters and stuff, but we don't even know what the models are being tested on.
Feels like science papers need a comment section. Replace peer-review with public-review. A way for authors to interact with the larger (science) community.
OpenReview is nice. I guess it could integrate with arXiv to allow preprints, but someone needs to pay for moderation if we are to keep a high standard of comments.
You got me curious so I unzipped the linked drive files. As a taster, here's a file "gemini_retrieval_cot_3.txt" from LLM.zip:
Looking through the facts, we find the following:
* Mary is older than Kristin.
* Kristin is younger than Donya.
Since Mary is older than someone who is younger than Donya, we can conclude that
Mary is older than Donya.
Final Answer: older
Some sets of files contain just the answer "older" or "younger".
Other sets of files are as above, a text output with reasoning leading to an older/younger/cannot decide result.
Overall it looks like the knowledge graph and reasoning all used this pattern of age-comparison problems.
Another result, from "gpt4turbo_retrieval_cot_88.txt":
To determine the relative ages of Rachel and Andres, we need to find a connection or a common
reference point between them through the relationships provided. Let's analyze the information:
1. Rachel is older than Maurice. (Rachel > Maurice)
2. Maurice is older than Josephine. (Maurice > Josephine)
3. Josephine is older than Doreen. (Josephine > Doreen)
4. Doreen is younger than Andres. (Andres > Doreen)
From these relationships, we can establish a chain:
- Rachel > Maurice > Josephine > Doreen
- Andres > Doreen
Since both Rachel and Andres are older than Doreen, and Rachel is higher up in the chain above Doreen compared to Andres, we can infer:
- Rachel > Andres
Final Answer: older
EDIT:
Found the problem statements. They're too big to paste in their entirety, but roughly, from "prompt_cot_3.txt" used for the first answer above, the first line is "Hi! I have some facts for you:", then after a blank line there's a single line with thousands (not exaggerated) of age facts, either in the form "X is older than/younger than/the same age as Y." or "The age of X is N.", and finally after another blank line, "Based on these facts, is Mary younger, older or in the same age as Donya? You can think step by step through the problem. Begin your final answer by 'Final Answer: '. Your final answer should be one of ['younger', 'older', 'same age', 'cannot decide']."
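For the curious, here's a rough reconstruction of that prompt format as a sketch; the names, fact count, and helper code are mine, not the paper's, and the real prompts contain thousands of facts (including numeric ones of the form "The age of X is N.") rather than a couple dozen.

```python
import random

# Hypothetical reconstruction of the prompt format described above.
people = ["Mary", "Kristin", "Donya", "Rachel", "Andres", "Maurice"]
facts = []
for _ in range(20):
    x, y = random.sample(people, 2)
    rel = random.choice(["older than", "younger than", "the same age as"])
    facts.append(f"{x} is {rel} {y}.")

prompt = (
    "Hi! I have some facts for you:\n\n"
    + " ".join(facts)
    + "\n\nBased on these facts, is Mary younger, older or in the same age as Donya? "
    "You can think step by step through the problem. "
    "Begin your final answer by 'Final Answer: '. "
    "Your final answer should be one of ['younger', 'older', 'same age', 'cannot decide']."
)
print(prompt)
```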
> Since Mary is older than someone who is younger than Donya, we can conclude that Mary is older than Donya.
Unfortunately, though, this reasoning is just wrong. If Mary is 30, Kristin is 20, and Donya is 40, then Mary is older than Kristin and Kristin is younger than Donya, but Mary is not older than Donya.
Both answers are wrong. I didn't look at many files, and not all of them had the reasoning in them, but it was fairly easy to find examples of wrong answers based on the reasoning in the file.
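For reference, deciding these questions correctly from the older/younger facts alone is just a reachability check on a directed graph. A minimal sketch (my own illustrative code, not the paper's; it ignores "same age" facts for simplicity):

```python
from collections import defaultdict, deque

# "X is older than Y" becomes a directed edge X -> Y. X is provably older than Y
# iff Y is reachable from X, provably younger iff X is reachable from Y,
# and otherwise the facts cannot decide.
def reachable(edges, src, dst):
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for nxt in edges[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

def compare(older_than_facts, a, b):
    edges = defaultdict(set)
    for older, younger in older_than_facts:
        edges[older].add(younger)
    if reachable(edges, a, b):
        return "older"
    if reachable(edges, b, a):
        return "younger"
    return "cannot decide"

# Mary > Kristin and Donya > Kristin do not relate Mary and Donya:
print(compare([("Mary", "Kristin"), ("Donya", "Kristin")], "Mary", "Donya"))  # cannot decide
```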
I always like example success and failure prompts, too. They do say they generate a random knowledge graph and then ask for one-hop results from the graph, and they do give two examples: a Biden/Trump age comparison, and a Barack/Michelle wife-age example.
They also say that they fit all (Gemini) or 1/3 (RAG for GPT-4 and Gemini) of the knowledge graph into the prompt, so to be fair, I wouldn't say they're hiding the ball on the prompts here; it's just that the prompts are very long, and even one would significantly multiply the length of the PDF.
Again, I wouldn't mind some excerpts, just like you.
> even one would significantly multiply the length of the PDF.
That bit feels like you're playing devil's advocate. Including a prompt wouldn't significantly add to the length of the PDF unless you did it in the most obtuse, malicious-compliance-ish way possible.
And when the subject is "we got X performance on GPT-4", including (an abridged version of) the prompt isn't just a nice bonus, it's absolutely essential to judge the results. The perf data they give for GPT-4 is worthless without that information.
I sort of wish that we would move on from the "grokking" terminology in the way that the field generally uses it (a magical kind of generalization that may-or-may-not-suddenly-happen if you train for a really long time).
I generally regard grokking as a failure mode in a lot of cases -- it's oftentimes not really a good thing. It tends to indicate that the combination of your network, task, and data are poorly suited for learning {XYZ} thing. There are emergent traits which I think the network can learn in a healthy manner over training, and I think that tends to fall under the 'generalization' umbrella.
Though I'd strongly prefer to call it 'transitive' rather than 'compositional' in terms of generalization, as 'transitive' is the formal term most disciplines use for such things; 'compositional' has a different, more general meaning entirely. Similarly, I'd replace 'parametric' and 'non-parametric' with 'internal' and 'external', etc. Slogging through the definition salad of words (this paper alone takes up roughly half of the top Kagi hits for 'parametric memory') makes actually interpreting an argument more difficult.
One reinterpretation of the problem is: of course external-memory models will have trouble generalizing to certain things the way models relying on internal memory do! This is because, in part, models with internal memory will have much more 'experience' integrating the examples that they've seen, whereas for an external-memory model like a typical RAG setup, the retrieved content could be almost anything.
But, that being said, I don't think you can necessarily isolate that to the type of memory that the model has alone, i.e., I don't think you can clearly say even in a direct comparison between the two motifs that it's the kind of memory itself (internal vs. external) that is to blame for this. I think that might end up leading down some unfruitful research paths if so.
That said, one positive about this paper is that they seem to have found a general circuit that forms for their task, and they analyze it; I believe that has value. But (and I know I tend to be harsh on papers generally) the rest of the paper seems to be more of a distraction.
Definitional salad buffets and speculation about the 'in' topics are going to be the things that make the headlines, but in order to make real progress, focusing on the fundamentals is really what's necessary here, I think. They may seem 'boring' a lot of the time, but they've certainly helped me quite a bit in my research. <3 :'))))
One of the biggest bottlenecks of multi-layer transformers is that reasoning can only happen in the hidden layers. Past the final layer, the model must generate a token that conforms to the training process. This token can then be fed back into the transformer from the beginning, but since it must be in natural language, it limits the type of reasoning the model can perform to the "thoughts" it has seen in the dataset and is therefore allowed to express. If you could figure out how to have the first layer's attention mechanism take into account both the KV of the first layer and the KV of the final layers, the model would become capable of arbitrarily long reasoning.
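For what it's worth, here's a minimal sketch of one way to wire up that kind of feedback: the previous decoding step's final-layer hidden states are prepended as extra positions that the first layer can attend to. Everything here (class name, shapes, the recycling scheme) is my own illustration under those assumptions, not an established architecture and not anything from the paper.

```python
import torch
import torch.nn as nn

class FeedbackTransformer(nn.Module):
    """Sketch: let layer 0 attend to the previous step's final-layer states,
    so latent "thoughts" are not forced through the natural-language token
    bottleneck. Illustrative only; a real LM would also need causal masking."""

    def __init__(self, vocab_size=1000, d_model=128, n_layers=4, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, prev_final_state=None):
        x = self.embed(tokens)                          # (B, T, D)
        if prev_final_state is not None:
            # Prepend last step's final hidden states as extra "memory" positions.
            x = torch.cat([prev_final_state, x], dim=1)
        h = self.encoder(x)                             # final-layer hidden states
        logits = self.head(h[:, -tokens.size(1):])      # predictions for the real tokens
        memory = h[:, -tokens.size(1):].detach()        # recycle these on the next call
        return logits, memory
```

This is only meant to make the idea concrete; whether routing final-layer state back to layer 0 actually buys anything over ordinary autoregression is exactly the open question.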
The final layer of the transformer prior to the logits pretty much does what you want already. The KVs of the final layers are taken into account when generating the final hidden state for your new token. The first layers of the model are really just contextualizing the token within the sentence; forcing hidden state from a higher layer through lower layers isn't going to help you much. Replacing the last 5 or so layers with an RNN could certainly be interesting, though.
How do we actually implement this? I'm struggling to work out how I could use this instead of my stupid LangGraph, recursive-RAG-checking crap that takes too much time and never really does the job justice.
This is the interesting result where their GPT-2-sized transformer blows away GPT-4 and Gemini 1.5 at connecting facts together:
> The difficulty of such a task is two-fold. First, the search space is large. For example, on average, each query entity connects with more than 50 facts, and each bridge entity in the ground truth proof connects with more than 900 facts. Second, there are no surface form clues to exploit and bias the search towards the ground truth proof, unlike most conventional QA benchmarks where the proof steps are transparent from the query.
> To test LLMs based on non-parametric memory, we translate the facts into natural language by simple templates (Appendix F). Facts/queries for each attribute are grouped/tested separately. We test both the vanilla setup where all facts (28.2K on average) are loaded into the LLM context, and the retrieval-augmented setup (5.4K facts retrieved on average) where the two-hop neighborhoods of the two query entities are retrieved, which includes enough facts to deduce the answer. We also try both standard prompting where the model answers directly, and chain-of-thought (CoT) prompting where the model is prompted to verbalize the reasoning. We test GPT-4-Turbo and Gemini-Pro-1.5, where for GPT-4-Turbo we only test the retrieval-augmented setup due to context length limit.
> Table 1: Results on the complex reasoning task. Direct/CoT: predict the answer directly/verbalize the reasoning steps. “+R”: retrieval augmentation.
> Accuracy (%):
> - GPT-4-Turbo: Direct+R 33.3, CoT+R 31.3
> - Gemini-Pro-1.5: Direct 28.7, CoT 11.3, Direct+R 37.3, CoT+R 12.0
> - Grokked Transformer: 99.3
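To make the retrieval-augmented setup above concrete, here's a minimal sketch of two-hop neighborhood retrieval over a toy fact list; the data and helper names are illustrative, not the paper's code.

```python
from collections import defaultdict

def two_hop_facts(facts, entity):
    """Return every fact within two hops of `entity` in the fact graph."""
    adj = defaultdict(set)                        # entity -> facts touching it
    for head, rel, tail in facts:
        adj[head].add((head, rel, tail))
        adj[tail].add((head, rel, tail))
    one_hop = set(adj[entity])
    neighbors = {e for h, _, t in one_hop for e in (h, t)} - {entity}
    two_hop = set(one_hop)
    for n in neighbors:
        two_hop |= adj[n]
    return two_hop

facts = [("Mary", "older_than", "Kristin"),
         ("Kristin", "younger_than", "Donya"),
         ("Donya", "older_than", "Rachel")]

# Retrieve the two-hop neighborhoods of both query entities and put their
# union into the prompt, as in the retrieval-augmented setup described above.
context = two_hop_facts(facts, "Mary") | two_hop_facts(facts, "Donya")
print(len(context), "facts retrieved for the prompt")
```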
“Second, there are no surface form clues to exploit and bias the search towards the ground truth proof”
Foundation models are usually trained on a lot of stuff before doing these kinds of tests. Can we know the above statement is true? That something (a) wasn't in the training data and (b) didn't have surface-level clues an ML algorithm could spot but the authors didn't?
I felt like asking the latter because both GAs and NNs have found simple patterns in problems that humans had missed for a long time. They used those patterns to heuristically solve those problems. It might be hard to design tests that eliminate a factor humans can't see.
Had the exact same thought after reading the abstract… FWIW, delve only appears in the abstract. Having not read the rest of the paper yet, I might give the authors the benefit of the doubt that they used an LLM to summarize their findings for the abstract, but didn't abuse an LLM in writing the entire paper.
Putting aside the possibility that they just happened to use the word “delve,” IMO we still have to figure out the convention for this sort of thing. I don't particularly value the time scientists spend writing the prose around their ideas; the ideas themselves are the valuable part.
One possibility, for example, could be that journals allow AI-written submissions but also require and distribute the prompts. Then we could just read the prompts and be spared stuff like the passive-voice dance.
They probably abused a compiler to generate their program instead of writing it in assembly.
Soon AI will turn a chicken scratch of notes into a wonderful email. And then automatically turn it back into notes for the end reader.
We put too much emphasis on the look rather than the substance. People are afraid to send out an email with two words, "Meeting Friday", and instead pad it out with pleasantries and detail, context and importance, but none of that really matters.
It's not enough information no matter who it is. If it's someone with enough political, social, or institutional capital you might overlook the annoyance, but it still only tells you when. It doesn't say the what, the where, or the who, all of which have consequences for what I need to do to be prepared.
‘Meeting Friday’ was the message. You completely ignored the rest. It was just extra padding (intentionally so). Maybe two words is too short. But can you honestly tell me that the majority of emails you receive are succinct and to the point? Or do you simply skim them for highlights and extract what is relevant to you?
That’s really the takeaway I was trying to get at. People equate quantity with quality far too often. We send way more content than we need to out of fear that someone will equate less with bad.
No, LLMs are deterministic. What you are describing is a randomized seed, which is another input to the LLM. Some interfaces expose this input, and some do not.
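A toy illustration of that point: with greedy decoding, the output is a pure function of the logits, and with sampling it is a pure function of the logits plus the RNG seed (numbers made up, not tied to any particular model or API).

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, 0.1])
probs = torch.softmax(logits / 0.8, dim=-1)     # temperature 0.8

greedy = torch.argmax(probs).item()             # always the same token

torch.manual_seed(42)
sample_a = torch.multinomial(probs, num_samples=1).item()
torch.manual_seed(42)
sample_b = torch.multinomial(probs, num_samples=1).item()
assert sample_a == sample_b                     # same seed, same "random" token
```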
A single word is insufficient evidence to conclude that an LLM was used. "Delve" may be low frequency in naturalistic text but there are many words in an article and the chance that some of them will be low-frequency is high. I also checked in my bibliography and found that "delve" is actually not super rare in academic papers including those written before LLMs.
With a quick skim, the paper delivers on its promise. It's not a particularly long or difficult paper to follow.
> Causal tracing. The transformer could be viewed as a causal graph that propagates information from the input to the output through a grid of intermediate states, which allows for a variety of causal analyses on its internal computation
> [...] There are in total three steps:
> 1. The normal run records the model’s hidden state activations on a regular input [...]
> 2. In the perturbed run, a slightly perturbed input is fed to the model which changes the prediction, where again the hidden state activations are recorded. [...] Specifically, for the hidden state of interest, we replace the input token at the same position as the state to be a random alternative of the same type (e.g., r1 → r′1) that leads to a different target prediction (e.g., t → t′).
> 3. Intervention. During the normal run, we intervene the state of interest by replacing its activation with its activation in the perturbed run. We then run the remaining computations and measure if the target state (top-1 token through logit lens) is altered. The ratio of such alterations (between 0 and 1) quantitatively characterizes the causal strength between the state of interest and the target.
> The generalizing circuit. [...] The discovered generalizing circuit (i.e., the causal computational pathways after grokking) is illustrated in Figure 4(a). Specifically, we locate a highly interpretable causal graph consisting of states in layer 0, 5, and 8, [...]. Layer 5 splits the circuit into lower and upper layers, where 1) the lower layers retrieve the first-hop fact (h, r1, b) from the input h, r1, store the bridge entity b in S[5, r1], and “delay” the processing of r2 to S[5, r2]; 2) the upper layers retrieve the second-hop fact (b, r2, t) from S[5, r1] and S[5, r2], and store the tail t to the output state S[8, r2].
> What happens during grokking? To understand the underlying mechanism behind grokking, we track the strengths of causal connections and results from logit lens across different model checkpoints during grokking (the “start” of grokking is the point when training performance saturates). We observe two notable amplifications (within the identified graph) that happen during grokking. The first is the causal connection between S[5, r1] and the final prediction t, which is very weak before grokking and grows significantly during grokking. The second is the r2 component of S[5, r2] via logit lens, for which we plot its mean reciprocal rank (MRR).
> Additionally, we find that the state S[5, r1] has a large component of the bridge entity b throughout grokking. These observations strongly suggest that the model is gradually forming the second hop in the upper layers (5-8) during grokking. This also indicates that, before grokking, the model is very likely mostly memorizing the examples in train_inferred by directly associating (h, r1, r2) with t, without going through the first hop
> Why does grokking happen? These observations suggest a natural explanation of why grokking happens through the lens of circuit efficiency. Specifically, as illustrated above, there exist both a memorizing circuit Cmem and a generalizing circuit Cgen that can fit the training data [...]
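For anyone who wants to try the causal-tracing recipe quoted above on their own model, here's a minimal activation-patching sketch using PyTorch forward hooks. The function is mine, not the authors' code, and it assumes the chosen layer returns a plain (batch, seq, dim) tensor.

```python
import torch

def patch_hidden_state(model, layer, position, clean_input, perturbed_input):
    """Splice the perturbed run's activation at `position` into the normal run
    and return the patched logits, to compare against the unpatched run."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["perturbed"] = output[:, position].detach().clone()

    def patch_hook(module, inputs, output):
        patched = output.clone()
        patched[:, position] = cache["perturbed"]   # the intervention step
        return patched

    # Perturbed run: record the hidden state of interest.
    handle = layer.register_forward_hook(save_hook)
    with torch.no_grad():
        model(perturbed_input)
    handle.remove()

    # Normal run with intervention: replace that state, run the rest as usual.
    handle = layer.register_forward_hook(patch_hook)
    with torch.no_grad():
        patched_logits = model(clean_input)
    handle.remove()
    return patched_logits
```

Averaging how often the top-1 prediction flips under this intervention, over many examples, gives the causal-strength ratio the quoted passage describes.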