
Has anyone ever tried a GPT trained on, say, 256 tokens representing bytes in a byte stream or even more simply binary digits?

I imagine there are efficiency trade-offs but I just wonder if it works at all.



Sure, the concept has been explored; see for example Karpathy's classic 2015 post "The Unreasonable Effectiveness of Recurrent Neural Networks" (http://karpathy.github.io/2015/05/21/rnn-effectiveness/), a nice description of a character-level model.

IIRC the early papers on subword tokenization also sometimes included explicit comparisons with character-level models, but people don't do it nowadays because there's a clear consensus on the expected outcome - yes, it works, but it's simply worse.

Technically that's the exact outcome you get if you set a vocabulary size of 256 (and tokenize at the byte level, not the Unicode level), so it's just an extreme case of the vocabulary-size choice, and there's enough research on how vocabulary size affects quality and efficiency to assume that 256 is not an optimal size.
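
For concreteness, here's a minimal sketch of what byte-level tokenization amounts to (the model itself is omitted; the point is just that the token ids are the raw byte values):

    # Byte-level "tokenization": the vocabulary is the 256 possible byte values.
    text = "Hello, world! 你好"
    ids = list(text.encode("utf-8"))  # each token id is an int in 0..255
    # Multi-byte UTF-8 characters become several tokens each.
    assert bytes(ids).decode("utf-8") == text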

You can do it for exploring capabilities, though - see the "Bytes Are All You Need" discussion (https://news.ycombinator.com/item?id=36176756) on abstracting away complex file formats by passing the raw bytes of a file directly to the neural network - again, it obviously works worse, but it kind of works.
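
A sketch of that idea, assuming a hypothetical fixed context of 4096 bytes and an extra padding id (so a 257-entry embedding table); the paper's actual preprocessing may differ:

    import numpy as np

    def file_to_ids(path, max_len=4096, pad_id=256):
        # Read any file as raw bytes, ignoring its format entirely.
        data = open(path, "rb").read()[:max_len]
        ids = np.full(max_len, pad_id, dtype=np.int64)
        ids[:len(data)] = np.frombuffer(data, dtype=np.uint8)
        return ids  # feed into an embedding layer with 257 rows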


I'm sure it would work, but there are obvious downsides (slower, and less history fits in the same context window) and few upsides (simpler, no glitch tokens).
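
The "less history" point is easy to quantify: for typical English prose, a modern BPE vocabulary packs roughly 4 bytes into one token, so a byte-level model spends about 4x the context (and 4x the decoding steps) on the same text. A quick check, assuming tiktoken is installed and sample.txt is any English document:

    import tiktoken  # pip install tiktoken

    text = open("sample.txt").read()
    bpe = tiktoken.get_encoding("cl100k_base")
    n_subword = len(bpe.encode(text))
    n_bytes = len(text.encode("utf-8"))
    print(n_subword, n_bytes, n_bytes / n_subword)  # ratio ~4 for English prose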


Yes, "ByT5: Towards a token-free future with pre-trained byte-to-byte models" for example. https://arxiv.org/abs/2105.13626


Tangentially related to what you ask: the LLaMA tokenizer has a byte-level fallback, so any input it can't cover with subword pieces still gets encoded as raw byte tokens.
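
Roughly: the tokenizer first tries to cover the input with subword pieces, and any span it can't cover is emitted as individual byte tokens. A toy sketch of the idea (not SentencePiece's actual algorithm; the vocabulary and the greedy longest-match are stand-ins, and real byte-fallback tokens are dedicated <0xXX> pieces rather than ids 0..255):

    # Toy byte-fallback tokenizer: greedy longest match against a subword
    # vocab, falling back to raw byte tokens for anything unmatched.
    VOCAB = {"hello": 1000, "wor": 1001, "ld": 1002}  # hypothetical ids

    def tokenize(text):
        data, ids, i = text.encode("utf-8"), [], 0
        while i < len(data):
            for j in range(len(data), i, -1):  # longest candidate first
                try:
                    piece = data[i:j].decode("utf-8")
                except UnicodeDecodeError:
                    continue
                if piece in VOCAB:
                    ids.append(VOCAB[piece])
                    i = j
                    break
            else:  # no subword matched: emit the byte itself as a token
                ids.append(data[i])
                i += 1
        return ids

    # The emoji has no subword piece, so its four UTF-8 bytes come out
    # as four byte tokens:
    print(tokenize("hello wor ld 💡"))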


Not a GPT, but I think MEGABYTE does that ("MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers", https://arxiv.org/abs/2305.07185).



