
I know we don't have access to the details at OpenAI, but it does seem like there have been significant changes to the BPE token size over time. Judging by behavior, at least, there is a push towards much larger tokens than the previous ~3-character ones.



BPE is not set to a certain token length, but to a target vocabulary size. It starts with bytes (or characters) as the basic units into which everything is split, and then iteratively merges the most frequent adjacent pair until the vocab size is reached. Even 'old' BPE models contain plenty of full-word tokens. E.g. RoBERTa:

https://huggingface.co/roberta-base/raw/main/merges.txt

(You have to scroll down a bit to get to the larger merges; imagine each line without the space, which is what the string looks like after that merge.)

Also see GPT-2:

https://huggingface.co/gpt2/raw/main/merges.txt
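
To make the merge loop concrete, here is a minimal sketch in Python, essentially the worked example from the original BPE paper; the toy corpus and merge count are made up:

    import re
    from collections import Counter

    def get_pair_stats(vocab):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_pair(pair, vocab):
        # Replace every occurrence of the pair with its concatenation.
        bigram = re.escape(' '.join(pair))
        pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

    # Toy corpus: words split into characters, with an end-of-word marker.
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

    num_merges = 10  # in practice you keep merging until the target vocab size is reached
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        print(best)

Each printed pair corresponds to one line in a merges.txt file like the ones linked above.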

I recently did some statistics. Average number of pieces per token, sampled on fairly large data; these are all models that use byte-level BPE (BBPE). (A rough sketch of how to compute such numbers is below the list.)

RoBERTa base (English): 1.08

RobBERT (Dutch): 1.21

roberta-base-ca-v2 (Catalan): 1.12

ukr-models/xlm-roberta-base-uk (Ukrainian): 1.68

In all these cases, the median token length in pieces was 1.
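
For reference, this is roughly how such numbers can be computed with the Hugging Face tokenizers. The model name and sample sentence below are just placeholders, and the statistics above were computed on much larger data, so the exact word segmentation may differ:

    from transformers import AutoTokenizer

    # Placeholder model and text; swap in the model and corpus you actually care about.
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    text = "Tokenization statistics are easy to estimate on a small sample."

    pieces_per_word = []
    for word in text.split():
        # Prepend a space so the byte-level BPE sees the word-boundary marker it expects.
        pieces = tokenizer.tokenize(" " + word)
        pieces_per_word.append(len(pieces))

    print("mean pieces per word:", sum(pieces_per_word) / len(pieces_per_word))
    print("median pieces per word:", sorted(pieces_per_word)[len(pieces_per_word) // 2])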

(Note: I am not disputing that newer OpenAI models use a larger vocab. I just want to show that older BBPE models didn't use ~3-char pieces; for most tokens it was 1 piece per token.)


OpenAI have made their tokenizers public [1].

As someone has pointed out, with BPE you specify the vocab size, not the token size. It's a relatively simple algorithm; this Hugging Face course does a nice job of explaining it [2], and the original paper has a very readable Python example [3]. (A quick tiktoken example follows the links.)

[1] https://github.com/openai/tiktoken

[2] https://huggingface.co/course/chapter6/5?fw=pt

[3] https://arxiv.org/abs/1508.07909
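
For what it's worth, [1] makes it easy to poke at the newer vocabularies directly; cl100k_base below is the encoding used by the GPT-3.5/GPT-4 chat models, pick whichever you need:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    tokens = enc.encode("BPE merges frequent byte pairs into larger tokens.")
    print(tokens)                             # token ids
    print([enc.decode([t]) for t in tokens])  # the corresponding string pieces
    print(enc.n_vocab)                        # what's fixed is the vocab size, not the token length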



