I have heard of generalization vs memorization, but the article you shared is ve...

I have heard of generalization vs memorization, but the article you shared is very high quality. Thank you.

I do not think that SOTA LLMs demonstrate grokking for most math problems. While I am a bit surprised to read how little training is necessary to achieve grokking in a toy setting (one specific math problem), the domain of all math problems is much larger. Also, the complexity of an applied mathematics problem is much higher than a simple mod problem. That seems to be what the author of the first article you quoted thinks as well.

Our public models fail in that large domain a lot. For example, with tasks like counting elements in a set (words in a paragraph). Not to mention that they fail in complex applied mathematics tasks. If they have been loss-minimized for that specific calculation to the point that they exhibit this phase change, then that would be an exception.

But in the financial statement analysis article, the author says explicitly that there isn't a limitation on the types of math problems they ask the model to perform. This is very, very irregular, and there are no guarantees that model has generalized them. In fact, it is much more likely that it hasn't, in my opinion.

In any case, thank you again for the article. It's just such a massive contrast with the MBA article above.