> The token lists are just text files consisting of the base64-encoded characters followed by the numeric ID. If you want to explore the list, you can just download them and decode them yourself.
If you want to look at mappings for individual tokens, sure, but if you actually want to tokenize text that contains more than one token, the process is very non-trivial. I've been writing my own JavaScript LLaMA tokenizer for several days now (just the encode and decode functions, not training). I hope to release it this weekend. It's currently 400 lines of code + data.
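Reading the list itself is the easy part. For instance, here's a quick Node.js sketch, assuming each line is the base64-encoded token bytes, a space, and the numeric ID (the file name is a placeholder for whichever list you downloaded):

```js
// Minimal sketch: parse a token list where each line is
// "<base64-encoded token bytes> <numeric id>" (assumed format).
const fs = require("fs");

const vocab = new Map(); // id -> raw token bytes
for (const line of fs.readFileSync("tokens.txt", "utf8").split("\n")) {
  if (!line) continue;
  const [b64, id] = line.split(" ");
  vocab.set(Number(id), Buffer.from(b64, "base64"));
}

console.log(vocab.get(0)?.toString("utf8")); // token 0 as text
```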
Yes, absolutely. tiktoken is quite heavily optimized. If I wanted to write a tokenizer, I'd just use its Rust backend and invoke it via FFI, or translate it mechanically into another language. Actually, GPT-4 is quite good at translating code between languages, so I'd just ask it to do the work.
That's what I thought when I started working on this, but it turns out the answer is no! This approach ("minimal mapping of character sequences to numbers") is a greedy algorithm: at each position, take the longest token that matches. It will often produce correct results, but not always. Here's an example:
Input string: " grabbed"
Tokenize that with the greedy algorithm and you get [17229, 2580] == [" grab", "bed"]
Tokenize it with the actual LLaMA tokenizer and you get [2646, 1327, 287] == [" gra", "bb", "ed"]
Note that the correct tokenizer represents this string with 3 tokens, even though it could be represented more efficiently with 2 (yes, both of those tokens exist in the vocabulary).
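To make the difference concrete, here's a toy sketch of the greedy longest-prefix-match approach. The vocabulary is cut down to just the five tokens from the example above; it grabs " grab" first and is then forced to take "bed":

```js
// Toy greedy (longest-prefix-match) tokenizer over a 5-token vocabulary.
const vocab = new Map([
  [" grab", 17229], ["bed", 2580],
  [" gra", 2646], ["bb", 1327], ["ed", 287],
]);

function greedyTokenize(text) {
  const ids = [];
  let pos = 0;
  while (pos < text.length) {
    // Try the longest possible match at the current position first.
    let len = text.length - pos;
    while (len > 0 && !vocab.has(text.slice(pos, pos + len))) len--;
    if (len === 0) throw new Error("no token matches at position " + pos);
    ids.push(vocab.get(text.slice(pos, pos + len)));
    pos += len;
  }
  return ids;
}

console.log(greedyTokenize(" grabbed")); // [ 17229, 2580 ]
```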
LLaMA uses SentencePiece Byte-Pair Encoding for tokenization, and it has many weird quirks like this.
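For intuition about why that happens, here's a heavily simplified byte-pair merge loop with a hypothetical merge table. Real SentencePiece BPE replays the merge ranks learned during training (and represents the leading space as "▁"), but rank-ordered merging is the core idea, and the toy ranks below are enough to reproduce the 3-token result:

```js
// Simplified BPE: repeatedly merge the adjacent pair with the best
// (lowest) rank. The merge table is hypothetical; note there is no
// merge that produces " grab", so the 2-token split is unreachable
// even though " grab" exists in the vocabulary.
const ranks = new Map([
  [" |g", 0],
  [" g|r", 1],
  [" gr|a", 2],
  ["b|b", 3],
  ["e|d", 4],
]);

function bpe(text) {
  let parts = [...text]; // start from individual characters
  for (;;) {
    // Find the adjacent pair with the highest-priority merge.
    let best = -1, bestRank = Infinity;
    for (let i = 0; i < parts.length - 1; i++) {
      const r = ranks.get(parts[i] + "|" + parts[i + 1]);
      if (r !== undefined && r < bestRank) { best = i; bestRank = r; }
    }
    if (best < 0) break; // no applicable merges left
    parts.splice(best, 2, parts[best] + parts[best + 1]);
  }
  return parts;
}

console.log(bpe(" grabbed")); // [ " gra", "bb", "ed" ]
```

The key point is that BPE never searches the vocabulary for the longest match; it just applies merges in rank order, so a "better" token can be unreachable.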