
I know we don't have access to the details at OpenAI, but it does seem like there have been significant changes to the BPE token size over time. Judging by behavior, at least, there is a push towards much larger tokens than the previous ~3-character ones.



BPE is not set to a certain token length, but to a target vocabulary size. It starts with bytes (or characters) as the basic units into which everything is split, and then iteratively merges the most frequent adjacent pair until the vocab size is reached. Even 'old' BPE models contain plenty of full-word tokens. E.g. RoBERTa:

https://huggingface.co/roberta-base/raw/main/merges.txt

(You have to scroll down a bit to get to the larger merges; imagine each line without the space, which is what the string looks like after that merge.)

Also see GPT-2:

https://huggingface.co/gpt2/raw/main/merges.txt
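
To make the merge loop concrete, here is a minimal sketch in Python, essentially the worked example from the original BPE paper; the toy corpus and merge count are made up:

    import re
    from collections import Counter

    def get_pair_stats(vocab):
        # Count how often each adjacent symbol pair occurs, weighted by word frequency.
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for i in range(len(symbols) - 1):
                pairs[(symbols[i], symbols[i + 1])] += freq
        return pairs

    def merge_pair(pair, vocab):
        # Replace every occurrence of the pair with its concatenation.
        bigram = re.escape(' '.join(pair))
        pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
        return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

    # Toy corpus: words split into characters, with an end-of-word marker.
    vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
             'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

    num_merges = 10  # in practice you keep merging until the target vocab size is reached
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        print(best)

Each printed pair corresponds to one line in a merges.txt file like the ones linked above.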

I recently did some statistics. Average number of pieces per token, sampled on fairly large data; these are all models that use byte-level BPE (BBPE). (A rough sketch of how to compute such numbers is below the list.)

RoBERTa base (English): 1.08

RobBERT (Dutch): 1.21

roberta-base-ca-v2 (Catalan): 1.12

ukr-models/xlm-roberta-base-uk (Ukrainian): 1.68

In all these cases, the median token length in pieces was 1.
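
For reference, this is roughly how such numbers can be computed with the Hugging Face tokenizers. The model name and sample sentence below are just placeholders, and the statistics above were computed on much larger data, so the exact word segmentation may differ:

    from transformers import AutoTokenizer

    # Placeholder model and text; swap in the model and corpus you actually care about.
    tokenizer = AutoTokenizer.from_pretrained("roberta-base")
    text = "Tokenization statistics are easy to estimate on a small sample."

    pieces_per_word = []
    for word in text.split():
        # Prepend a space so the byte-level BPE sees the word-boundary marker it expects.
        pieces = tokenizer.tokenize(" " + word)
        pieces_per_word.append(len(pieces))

    print("mean pieces per word:", sum(pieces_per_word) / len(pieces_per_word))
    print("median pieces per word:", sorted(pieces_per_word)[len(pieces_per_word) // 2])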

(Note: I am not disputing that newer OpenAI models use a larger vocab. I just want to show that older BBPE models didn't use ~3-char pieces; for most tokens it was 1 piece per token.)


OpenAI have made their tokenizers public [1].

As someone has pointed out, with BPE you specify the vocab size, not the token size. It's a relatively simple algorithm; this Hugging Face course does a nice job of explaining it [2], and the original paper has a very readable Python example [3]. (A quick tiktoken example follows the links.)

[1] https://github.com/openai/tiktoken

[2] https://huggingface.co/course/chapter6/5?fw=pt

[3] https://arxiv.org/abs/1508.07909
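
For what it's worth, [1] makes it easy to poke at the newer vocabularies directly; cl100k_base below is the encoding used by the GPT-3.5/GPT-4 chat models, pick whichever you need:

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    tokens = enc.encode("BPE merges frequent byte pairs into larger tokens.")
    print(tokens)                             # token ids
    print([enc.decode([t]) for t in tokens])  # the corresponding string pieces
    print(enc.n_vocab)                        # what's fixed is the vocab size, not the token length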



