As someone has pointed out, with BPE you specify the vocab size, not the token size. It's a relatively simple algorithm; this Hugging Face course does a nice job of explaining it [2], and the original paper has a very readable Python example [3]. A rough sketch of the training loop is below.
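For illustration only (not the paper's exact code), here's a minimal sketch of BPE training: start from characters, repeatedly count adjacent symbol pairs, merge the most frequent pair into a new symbol, and stop once the vocab reaches the target size.

    import collections

    def train_bpe(corpus, vocab_size):
        # Each word becomes a tuple of symbols (initially characters) with a count.
        word_freqs = collections.Counter(corpus.split())
        words = {tuple(w): c for w, c in word_freqs.items()}
        vocab = {ch for w in words for ch in w}
        merges = []
        while len(vocab) < vocab_size:
            # Count adjacent symbol pairs across all words, weighted by word frequency.
            pairs = collections.Counter()
            for w, c in words.items():
                for a, b in zip(w, w[1:]):
                    pairs[(a, b)] += c
            if not pairs:
                break
            best = max(pairs, key=pairs.get)
            merges.append(best)
            vocab.add(best[0] + best[1])
            # Replace every occurrence of the best pair with the merged symbol.
            new_words = {}
            for w, c in words.items():
                out, i = [], 0
                while i < len(w):
                    if i < len(w) - 1 and (w[i], w[i + 1]) == best:
                        out.append(w[i] + w[i + 1])
                        i += 2
                    else:
                        out.append(w[i])
                        i += 1
                new_words[tuple(out)] = new_words.get(tuple(out), 0) + c
            words = new_words
        return merges, vocab

    merges, vocab = train_bpe("low low low lower lowest newer newer wider", 20)
    print(merges)

Note the stopping condition is the vocab size, which is why that's the knob you tune; the length of individual tokens just falls out of whichever merges win.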
[1] https://github.com/openai/tiktoken
[2] https://huggingface.co/course/chapter6/5?fw=pt
[3] https://arxiv.org/abs/1508.07909