Hacker News

There’s some pretty nice fundamental research in here, and I appreciate the publication very much. What stood out to me is their discussion of the difficulties of using a shared softmax across different tokenization spaces; super interesting analysis (they say the modalities compete by upping their own strength relative to the other modalities, leading to divergence), and the ultimate fix (which I can’t remember right now, and leave as a tease to the interested paper reader).
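The competition mechanism is easy to illustrate with softmax's shift invariance. A toy sketch (my own, not the paper's code): since adding a constant to all logits leaves the softmax output unchanged, nothing in the loss discourages one modality's logit scale from drifting ever larger, which is the kind of unchecked growth that can eventually diverge in training.

```python
# Toy illustration (assumption: not from the paper) of softmax
# shift invariance: softmax(z + c) == softmax(z) for any constant c,
# so logit magnitudes can grow freely without changing the output.
import numpy as np

def softmax(z):
    z = z - z.max()          # standard numerical stabilization
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
shifted = logits + 100.0     # one modality "upping" its overall scale

# Identical probabilities, so the drift itself is invisible to the loss:
print(np.allclose(softmax(logits), softmax(shifted)))  # True
```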

They also noted the problem was most pronounced once they got up to the 34B size. It’s a good reminder that training large-scale models surfaces new, interesting problems. I imagine a lot of techniques and know-how go unpublished; all those little bits of experience add up to a lot of competitive advantage in many venues, so once again, thanks to Zuck and co. for publishing.




the modality competition was one of my favorite insights, too!



