Can someone explain why this is better than using a larger tokenizer? To me it seems like this would just make it harder for the LLM to understand the content (when a token isn't a full word and might carry multiple meanings, it can't have a good embedding).
Sure, current tokenizers already have some of the same problem. However, I think this would take it from "some tokens aren't words" to "(almost) all tokens aren't words". Correct me if I'm missing something, though.
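To make the "some tokens aren't words" point concrete, here's a quick sketch using OpenAI's tiktoken library (assuming the cl100k_base vocabulary; the example text is just an illustration). Common words usually survive as single tokens, while rarer words get broken into sub-word fragments that aren't words on their own:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the vocabulary used by GPT-3.5/GPT-4-era models
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization of rare words like hyperparameterization"
tokens = enc.encode(text)

# Print each token's surface string: frequent words tend to map to a
# single token, while rare words are split into sub-word pieces
for tok in tokens:
    piece = enc.decode_single_token_bytes(tok).decode("utf-8", errors="replace")
    print(repr(piece))
```

With today's tokenizers only the tail of rare words ends up as non-word fragments like this; the worry above is that a scheme where tokens are even shorter would make nearly every token such a fragment.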