
Has anyone ever tried a GPT trained on, say, 256 tokens representing bytes in a byte stream or even more simply binary digits?

I imagine there are efficiency trade-offs but I just wonder if it works at all.



Sure, the concept has been explored; see for example Karpathy's classic 2015 post "The Unreasonable Effectiveness of Recurrent Neural Networks" (http://karpathy.github.io/2015/05/21/rnn-effectiveness/), a nice description of a character-level model.

IIRC the early papers on subword tokenization also sometimes included explicit comparisons with character-level models, but people don't do it nowadays because there's a clear consensus on the expected outcome - yes, it works, but it's simply worse.

Technically that's the exact outcome you get if you set a vocabulary size of 256 (and tokenize at the byte level, not the Unicode level), so it's just an extreme case of the vocabulary-size choice, and there's enough research on how vocabulary size affects quality and efficiency to assume that 256 is not an optimal size.
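
For concreteness, here's a minimal sketch of what byte-level tokenization amounts to (the model itself is omitted; the point is just that the token ids are the raw byte values):

    # Byte-level "tokenization": the vocabulary is the 256 possible byte values.
    text = "Hello, world! 你好"
    ids = list(text.encode("utf-8"))  # each token id is an int in 0..255
    # Multi-byte UTF-8 characters become several tokens each.
    assert bytes(ids).decode("utf-8") == text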

You can do it for exploring capabilities, though - see the "Bytes Are All You Need" discussion (https://news.ycombinator.com/item?id=36176756) on abstracting away complex file formats by passing the raw bytes of a file directly to the neural network - again, it obviously works worse, but it kind of works.
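
A sketch of that idea, assuming a hypothetical fixed context of 4096 bytes and an extra padding id (so a 257-entry embedding table); the paper's actual preprocessing may differ:

    import numpy as np

    def file_to_ids(path, max_len=4096, pad_id=256):
        # Read any file as raw bytes, ignoring its format entirely.
        data = open(path, "rb").read()[:max_len]
        ids = np.full(max_len, pad_id, dtype=np.int64)
        ids[:len(data)] = np.frombuffer(data, dtype=np.uint8)
        return ids  # feed into an embedding layer with 257 rows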


I'm sure it would work, but there are obvious downsides (slower, and less history fits in the same context window) and few upsides (simpler, no glitch tokens).
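
The "less history" point is easy to quantify: for typical English prose, a modern BPE vocabulary packs roughly 4 bytes into one token, so a byte-level model spends about 4x the context (and 4x the decoding steps) on the same text. A quick check, assuming tiktoken is installed and sample.txt is any English document:

    import tiktoken  # pip install tiktoken

    text = open("sample.txt").read()
    bpe = tiktoken.get_encoding("cl100k_base")
    n_subword = len(bpe.encode(text))
    n_bytes = len(text.encode("utf-8"))
    print(n_subword, n_bytes, n_bytes / n_subword)  # ratio ~4 for English prose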


Yes, "ByT5: Towards a token-free future with pre-trained byte-to-byte models" for example. https://arxiv.org/abs/2105.13626


Tangentially related to what you ask: the LLaMA tokenizer has a byte-level fallback, so any input it can't cover with subword pieces still gets encoded as raw byte tokens.
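
Roughly: the tokenizer first tries to cover the input with subword pieces, and any span it can't cover is emitted as individual byte tokens. A toy sketch of the idea (not SentencePiece's actual algorithm; the vocabulary and the greedy longest-match are stand-ins, and real byte-fallback tokens are dedicated <0xXX> pieces rather than ids 0..255):

    # Toy byte-fallback tokenizer: greedy longest match against a subword
    # vocab, falling back to raw byte tokens for anything unmatched.
    VOCAB = {"hello": 1000, "wor": 1001, "ld": 1002}  # hypothetical ids

    def tokenize(text):
        data, ids, i = text.encode("utf-8"), [], 0
        while i < len(data):
            for j in range(len(data), i, -1):  # longest candidate first
                try:
                    piece = data[i:j].decode("utf-8")
                except UnicodeDecodeError:
                    continue
                if piece in VOCAB:
                    ids.append(VOCAB[piece])
                    i = j
                    break
            else:  # no subword matched: emit the byte itself as a token
                ids.append(data[i])
                i += 1
        return ids

    # The emoji has no subword piece, so its four UTF-8 bytes come out
    # as four byte tokens:
    print(tokenize("hello wor ld 💡"))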


Not a GPT, but I think MEGABYTE does that ("MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers", https://arxiv.org/abs/2305.07185).



