An alternative approach to BPE tokenization https://arxiv.org/abs/2406.19223

vessenes · 2024-07-06T20:10:33 1720296633

T-FREE is interesting, at least in the sense that I don’t really understand it. They take successive character triples of every word, hash them, and then use the hash table slots those triples land in as the inputs to an embedding space? Am I reading that chart correctly?
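To make that reading concrete, here is a minimal sketch of what I think the chart describes. It is my guess, not the paper's code: the slot count, embedding width, number of hash functions, the space padding around each word, the use of SHA-256, and the summing of rows are all assumptions for illustration, and every name below is hypothetical.

```python
import hashlib
import random

# All three sizes are hypothetical, chosen only to keep the example small.
VOCAB_SLOTS = 1024   # number of hash-table slots, i.e. embedding rows
EMBED_DIM = 8        # width of each embedding row
NUM_HASHES = 2       # hash functions applied to each trigram


def trigrams(word):
    """Successive character triples of a word, padded with one space per side (assumed)."""
    padded = f" {word} "
    return [padded[i:i + 3] for i in range(len(padded) - 2)]


def slots_for(word):
    """Set of hash-table slots activated by the word's trigrams."""
    active = set()
    for tri in trigrams(word):
        for seed in range(NUM_HASHES):
            # Seeded SHA-256 stands in for whatever hash the paper uses.
            digest = hashlib.sha256(f"{seed}:{tri}".encode("utf-8")).digest()
            active.add(int.from_bytes(digest[:8], "big") % VOCAB_SLOTS)
    return active


# Random table standing in for a learned embedding matrix.
_rng = random.Random(0)
EMBEDDINGS = [[_rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
              for _ in range(VOCAB_SLOTS)]


def embed(word):
    """Word vector as the sum of the embedding rows at the activated slots (assumed)."""
    vec = [0.0] * EMBED_DIM
    for slot in slots_for(word):
        for d in range(EMBED_DIM):
            vec[d] += EMBEDDINGS[slot][d]
    return vec


if __name__ == "__main__":
    print(trigrams("hello"))
    print(sorted(slots_for("hello") & slots_for("hellos")))
```

In this sketch, "hello" and "hellos" share four of their trigrams, so they activate overlapping slots and their vectors share terms. If I have the scheme right, that overlap between similarly spelled words is where any benefit would have to come from.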

Can you explain this any better than the first few pages of the paper? I’d like some intuition about why T-FREE works; there are lots of reasons to prefer different tokenization schemes, but I can’t really get this one into my head from the paper, unfortunately.

amrb · 2024-07-08T08:59:58 1720429198

Can't say I've mastered the concept either. I'm waiting for the code [0] to be released so I can run some head-to-head tests.

[0] https://github.com/Aleph-Alpha/trigrams