> There are about 936 tokens with very low L2 norm, centered at about 2. This likely means that they did not occur in the training process of GPT-oss and were thus depressed by some form of weight decay.
Afaik embedding and norm params are excluded from weight decay as standard practice. Is this no longer true?
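For reference, the measurement the quoted passage describes is easy to reproduce. A minimal sketch, assuming the embeddings can be pulled via Hugging Face transformers; the model id and the norm threshold are my guesses, not from the article:

```python
import torch
from transformers import AutoModelForCausalLM

# Load the weights only to inspect the input embedding matrix.
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype="auto")
emb = model.get_input_embeddings().weight.detach().float()  # (vocab_size, d_model)

norms = emb.norm(dim=-1)  # per-token L2 norm
low = norms < 5.0         # threshold eyeballed; the article puts the low cluster around 2
print(low.sum().item(), "tokens with unusually low embedding norm")
```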
Could it instead be the case that these tokens were initialized at some mean value across the dataset (plus a little noise) and then never changed because they were never seen in training? Not sure if that is state of the art anymore, but e.g. in Karpathy's videos he uses a trick like this to avoid the "sharp hockey stick" drop in loss in the early gradient-descent steps, which can otherwise result in undesirably big weight updates.
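The version of the trick I remember from those videos is roughly this (a sketch, not his actual code; the scale factor and the unigram counts are placeholders):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50257, 768
lm_head = nn.Linear(d_model, vocab_size, bias=True)

with torch.no_grad():
    # Scale down the output weights so initial logits are near zero
    # (roughly uniform predictions) instead of confidently wrong ones.
    lm_head.weight *= 0.01
    # Set the bias to the log of the unigram token frequencies measured on
    # the corpus, so the initial loss starts near the entropy of the data
    # rather than at a huge value that then collapses in a hockey stick.
    token_counts = torch.ones(vocab_size)  # placeholder: real counts come from the corpus
    unigram = token_counts / token_counts.sum()
    lm_head.bias.copy_(unigram.clamp_min(1e-10).log())
```

Tokens that never occur would then keep (roughly) their initial values, which could produce a cluster of near-identical rows like the one described.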
Unfortunately the article glosses over some of the practices for uncovering such patterns in the training data. It goes straight to the point, no lube needed. It didn't land well for me.
> Afaik embedding and norm params are excluded from weight decay as standard practice. Is this no longer true?
E.g., they exclude them in minGPT: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab...
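From memory, the pattern there is roughly the following (a sketch of the usual AdamW param-grouping, not the exact minGPT code):

```python
import torch
import torch.nn as nn

def build_optimizer(model: nn.Module, lr: float = 3e-4, weight_decay: float = 0.1):
    """Matmul weights go in a decayed group; biases, LayerNorm and Embedding
    parameters go in a zero-decay group."""
    decay, no_decay, seen = [], [], set()
    for module in model.modules():
        for name, param in module.named_parameters(recurse=False):
            if not param.requires_grad or id(param) in seen:
                continue  # skip frozen params and tied weights already assigned
            seen.add(id(param))
            if name.endswith("bias") or isinstance(module, (nn.LayerNorm, nn.Embedding)):
                no_decay.append(param)
            else:
                decay.append(param)
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=lr)
```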