If I understand your question right, this is one of the reasons BPE is nice and the parent liked it. For any character sequence, provided the characters are in the alphabet used to create the BPE vocab, there are no unknown words/sequences. One downside of some previous tokenization methods, e.g. dictionary-based ones, is that you could end up with unknown/UNK tokens.
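To make that concrete, here's a toy sketch in plain Python (not from any particular library) of why a dictionary lookup can hit UNK while a subword vocab with character fallback never does. The greedy longest-match below is just a stand-in for real BPE, which applies learned merges in order, but the no-UNK property only depends on every base character being in the vocab:

    word_vocab = {"the", "cat", "sat"}

    def dict_tokenize(word):
        # Dictionary-based tokenization: anything out of vocabulary becomes UNK.
        return [word] if word in word_vocab else ["<UNK>"]

    bpe_vocab = {"th", "at", "the", "c", "a", "t", "s", "h", "e"}

    def bpe_tokenize(word):
        # Greedy longest-match over a BPE-style subword vocab. Because every
        # single character is in the vocab, we can always fall back to
        # characters, so no in-alphabet input ever maps to UNK.
        tokens, i = [], 0
        while i < len(word):
            for j in range(len(word), i, -1):
                if word[i:j] in bpe_vocab:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                # Character outside the alphabet. Byte-level BPE avoids even
                # this case, since all 256 byte values are in the base vocab.
                tokens.append("<UNK_CHAR>")
                i += 1
        return tokens

    print(dict_tokenize("chats"))  # ['<UNK>']
    print(bpe_tokenize("chats"))   # ['c', 'h', 'at', 's']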
In our paper with bytes, we also avoid the UNK issue, since we can have an embedding for every possible byte; it's not that many (and for sequences of bytes we use hash embeddings, although we did test n-gram lookups for the top K most frequent byte n-grams in the training data).
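For anyone wondering what hash embeddings for byte n-grams look like in practice, here's a rough sketch of the idea (PyTorch, with made-up sizes, not the paper's exact setup): every byte gets its own embedding row, and n-grams are hashed into a fixed-size table instead of keeping an explicit n-gram vocab, so collisions simply share a row:

    import torch
    import torch.nn as nn

    NUM_BYTES = 256            # every possible byte gets its own embedding row
    NUM_HASH_BUCKETS = 50_000  # made-up size for the hashed n-gram table
    DIM = 64

    byte_emb = nn.Embedding(NUM_BYTES, DIM)
    ngram_emb = nn.Embedding(NUM_HASH_BUCKETS, DIM)

    def embed(byte_seq, n=3):
        # Per-byte embeddings, plus a hashed n-gram embedding added at the
        # position where each n-gram starts.
        byte_part = byte_emb(torch.tensor(byte_seq))   # (len, DIM)
        rows = []
        for i in range(len(byte_seq)):
            vec = byte_part[i]
            if i + n <= len(byte_seq):
                # Python's hash() is randomized per process; a real system
                # would use a stable hash so buckets persist across runs.
                bucket = hash(bytes(byte_seq[i:i + n])) % NUM_HASH_BUCKETS
                vec = vec + ngram_emb(torch.tensor(bucket))
            rows.append(vec)
        return torch.stack(rows)

    vecs = embed(list("hello".encode("utf-8")))
    print(vecs.shape)  # torch.Size([5, 64])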
I don't believe so, or at least if someone tried it, it didn't work well enough for me to remember :). Some of the motivation for the architecture changes in encoding patches stemmed from finding FLOP-efficient ways to express relationships between byte sequences. E.g., having a long context window makes sense when dealing with tokens, but you don't need as long an attention window if you're attending over byte sequences to make patch representations, since the patch representations will implicitly be part of a longer context window in terms of number of patches.
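As a back-of-the-envelope illustration (the numbers below are made up, not from the paper), a short byte-level window feeding a patch-level model is much cheaper than one long byte-level window, since attention cost grows roughly quadratically with window length:

    avg_patch_size_bytes = 4   # made-up average patch length
    byte_window = 512          # local attention window for building patch representations
    patch_window = 4096        # context window of the patch-level model

    effective_context_bytes = patch_window * avg_patch_size_bytes
    print(effective_context_bytes)  # 16384 bytes of implicit context

    # Very rough attention-cost comparison (quadratic in window, ignoring constants):
    one_long_byte_window = effective_context_bytes ** 2
    short_window_plus_patches = effective_context_bytes * byte_window + patch_window ** 2
    print(one_long_byte_window / short_window_plus_patches)  # ~10.7x cheaper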
Interesting. I would have thought one of those "minimum viable" RNNs (like https://arxiv.org/abs/2410.01201) would have been ideal for this. I might tinker a bit with this :-)
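In case anyone else wants to tinker too, here's the minGRU update from that paper as I remember it, written in a plain sequential form (part of their point is that the gates don't depend on the previous hidden state, so it can also be computed with a parallel scan):

    import torch
    import torch.nn as nn

    class MinGRU(nn.Module):
        def __init__(self, dim):
            super().__init__()
            self.to_z = nn.Linear(dim, dim)  # gate from the input only (no h_{t-1} term)
            self.to_h = nn.Linear(dim, dim)  # candidate state, also input-only

        def forward(self, x):                # x: (batch, seq_len, dim)
            h = torch.zeros(x.shape[0], x.shape[2])
            outs = []
            for t in range(x.shape[1]):
                z = torch.sigmoid(self.to_z(x[:, t]))
                h_tilde = self.to_h(x[:, t])
                h = (1 - z) * h + z * h_tilde  # convex mix of old state and candidate
                outs.append(h)
            return torch.stack(outs, dim=1)

    # e.g. running it over 16 byte embeddings to build up running states
    out = MinGRU(64)(torch.randn(2, 16, 64))
    print(out.shape)  # torch.Size([2, 16, 64])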