Training LLMs over Neurally Compressed Text (arxiv.org)
10 points by wseqyrku 8 months ago | 3 comments



Can someone explain why this is better than using a larger tokenizer? To me it seems like this would just make it harder for the LLM to understand the content (when a token can have multiple meanings and doesn't correspond to a full word, it can't have a good embedding).


A token already has multiple meanings, because words (and parts of words) can themselves have multiple meanings.


Sure, current tokenizers have some of the same problem. However, I think this would take it from "some tokens aren't words" to "(almost) all tokens aren't words". Correct me if I'm missing something, though.
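
As a rough illustration of that difference, here is a minimal Python sketch (not from the thread or the paper): zlib stands in for the paper's neural compressor, and a simple whitespace split stands in for a BPE tokenizer. The only point is that fixed-size chunks of a compressed byte stream rarely line up with word boundaries, whereas tokenizer pieces often do.

```python
# Minimal sketch, assuming zlib as a stand-in for a neural compressor
# (e.g. arithmetic coding with a small LM) and whitespace splitting as a
# stand-in for a BPE tokenizer.
import zlib

text = "Training language models over compressed text changes the token boundaries."

# Tokenizer-style segmentation: most pieces are recognizable words or fragments.
bpe_like_tokens = text.split()
print("tokenizer-style pieces:", bpe_like_tokens[:5])

# Compress the text, then cut the byte stream into equal-size chunks, the way
# compressed bits would be grouped into tokens. None of these chunks decodes
# to a word on its own.
compressed = zlib.compress(text.encode("utf-8"))
chunk_size = 4
chunks = [compressed[i:i + chunk_size] for i in range(0, len(compressed), chunk_size)]
print("compressed-stream 'tokens':", [c.hex() for c in chunks[:5]])
```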





