> The token lists are just text files consisting of the base64-encoded characters followed by the numeric ID. If you want to explore the list, you can just download them and decode them yourself.
If you want to look at mappings for individual tokens, sure, but if you actually want to tokenize text that contains more than one token, the process is very non-trivial. I've been writing my own JavaScript LLaMA tokenizer for several days now (just the encode and decode functions, not training). I hope to release it this weekend. It's currently 400 lines of code + data.
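Reading the list itself is the easy part. For instance, here's a quick Node.js sketch, assuming each line is the base64-encoded token bytes, a space, and the numeric ID (the file name is a placeholder for whichever list you downloaded):

```js
// Minimal sketch: parse a token list where each line is
// "<base64-encoded token bytes> <numeric id>" (assumed format).
const fs = require("fs");

const vocab = new Map(); // id -> raw token bytes
for (const line of fs.readFileSync("tokens.txt", "utf8").split("\n")) {
  if (!line) continue;
  const [b64, id] = line.split(" ");
  vocab.set(Number(id), Buffer.from(b64, "base64"));
}

console.log(vocab.get(0)?.toString("utf8")); // token 0 as text
```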
Yes, absolutely. tiktoken is quite heavily optimized. If I wanted to write a tokenizer, I'd just use its Rust backend and invoke it via FFI, or translate it mechanically into another language. Actually, GPT-4 is quite good at translating code between languages, so I'd just ask it to do the work.
That's what I thought when I started working on this, but it turns out the answer is no! This approach ("minimal mapping of character sequences to numbers") is a greedy algorithm: at each position, take the longest token that matches. It will often produce correct results, but not always. Here's an example:
Input string: " grabbed"
Tokenize that with the greedy algorithm and you get [17229, 2580] == [" grab", "bed"]
Tokenize it with the actual LLaMA tokenizer and you get [2646, 1327, 287] == [" gra", "bb", "ed"]
Note that the correct tokenizer represents this string with 3 tokens, even though it could be represented more efficiently with 2 (yes, both of those tokens exist in the vocabulary).
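To make the difference concrete, here's a toy sketch of the greedy longest-prefix-match approach. The vocabulary is cut down to just the five tokens from the example above; it grabs " grab" first and is then forced to take "bed":

```js
// Toy greedy (longest-prefix-match) tokenizer over a 5-token vocabulary.
const vocab = new Map([
  [" grab", 17229], ["bed", 2580],
  [" gra", 2646], ["bb", 1327], ["ed", 287],
]);

function greedyTokenize(text) {
  const ids = [];
  let pos = 0;
  while (pos < text.length) {
    // Try the longest possible match at the current position first.
    let len = text.length - pos;
    while (len > 0 && !vocab.has(text.slice(pos, pos + len))) len--;
    if (len === 0) throw new Error("no token matches at position " + pos);
    ids.push(vocab.get(text.slice(pos, pos + len)));
    pos += len;
  }
  return ids;
}

console.log(greedyTokenize(" grabbed")); // [ 17229, 2580 ]
```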
LLaMA uses SentencePiece Byte-Pair Encoding for tokenization, and it has many weird quirks like this.
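For intuition about why that happens, here's a heavily simplified byte-pair merge loop with a hypothetical merge table. Real SentencePiece BPE replays the merge ranks learned during training (and represents the leading space as "▁"), but rank-ordered merging is the core idea, and the toy ranks below are enough to reproduce the 3-token result:

```js
// Simplified BPE: repeatedly merge the adjacent pair with the best
// (lowest) rank. The merge table is hypothetical; note there is no
// merge that produces " grab", so the 2-token split is unreachable
// even though " grab" exists in the vocabulary.
const ranks = new Map([
  [" |g", 0],
  [" g|r", 1],
  [" gr|a", 2],
  ["b|b", 3],
  ["e|d", 4],
]);

function bpe(text) {
  let parts = [...text]; // start from individual characters
  for (;;) {
    // Find the adjacent pair with the highest-priority merge.
    let best = -1, bestRank = Infinity;
    for (let i = 0; i < parts.length - 1; i++) {
      const r = ranks.get(parts[i] + "|" + parts[i + 1]);
      if (r !== undefined && r < bestRank) { best = i; bestRank = r; }
    }
    if (best < 0) break; // no applicable merges left
    parts.splice(best, 2, parts[best] + parts[best + 1]);
  }
  return parts;
}

console.log(bpe(" grabbed")); // [ " gra", "bb", "ed" ]
```

The key point is that BPE never searches the vocabulary for the longest match; it just applies merges in rank order, so a "better" token can be unreachable.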