Can someone explain why this is better than using a larger tokenizer? To me it seems like this would just make it harder for the LLM to understand the content (when a token isn't a full word and might carry multiple meanings, it can't have a good embedding).
Sure, current tokenizers already have some of the same problem. However, I think this would take it from "some tokens aren't words" to "(almost) all tokens aren't words". Correct me if I'm missing something, though.
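To make the "some tokens aren't words" point concrete, here's a quick sketch using OpenAI's tiktoken library (assuming the cl100k_base vocabulary; the example text is just an illustration). Common words usually survive as single tokens, while rarer words get broken into sub-word fragments that aren't words on their own:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the vocabulary used by GPT-3.5/GPT-4-era models
enc = tiktoken.get_encoding("cl100k_base")

text = "Tokenization of rare words like hyperparameterization"
tokens = enc.encode(text)

# Print each token's surface string: frequent words tend to map to a
# single token, while rare words are split into sub-word pieces
for tok in tokens:
    piece = enc.decode_single_token_bytes(tok).decode("utf-8", errors="replace")
    print(repr(piece))
```

With today's tokenizers only the tail of rare words ends up as non-word fragments like this; the worry above is that a scheme where tokens are even shorter would make nearly every token such a fragment.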