
A few extra notes on tokens.

You don't have to use tiktoken if you aren't actually tokenizing things. The token lists are just text files in which each line is a token's characters, base64 encoded, followed by its numeric ID. If you want to explore the list you can just download it and decode it yourself.

I find that sorting tokens by length makes it a bit easier to get a feel for what's in there.
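For example, a minimal Python sketch along these lines (assuming the tiktoken file format of one "<base64 token> <rank>" pair per line, and cl100k_base.tiktoken as an example filename you've already downloaded) would be:

    import base64

    # Each line of a .tiktoken file is "<base64-encoded token bytes> <rank>".
    tokens = []
    with open("cl100k_base.tiktoken", "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            b64, rank = line.split()
            tokens.append((int(rank), base64.b64decode(b64)))

    # Sorting by length makes it easier to get a feel for what's in there.
    for rank, tok in sorted(tokens, key=lambda t: len(t[1]), reverse=True)[:100]:
        print(rank, tok)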

GPT-4 has a token vocabulary about twice the size of GPT-3.5's.

The most interesting thing to me about the GPT-4 token list is how dominated it is by non-natural languages. It's not as simple as English tokenizing more efficiently than Spanish because of frequency. The most common language after English is code. A huge number of tokens are allocated even to things found in code that aren't especially common, like "ValidateAntiForgeryToken" or "_InternalArray". From eyeballing the list I'd guess about half the tokens come from source code.

My guess is that it's not a coincidence that GPT-4 both trained on a lot of code and is also the leading model. I suspect we're going to discover at some point, or maybe OpenAI already did, that training on code isn't just a neat trick to get an LLM that can knock out scripts. Maybe it's fundamentally useful to train the model to reason logically and think clearly. The highly structured and unambiguous yet also complex thought that code represents is probably a great way for the model to really level up its thought processes. Ilya Sutskever mentioned in an interview that one of the bottlenecks they face on training something smarter than GPT-4 is getting access to "more complex thought". If this is true then it's possible the Microsoft collaboration will prove an enduring competitive advantage for OpenAI, as it gives them access to the bulk GitHub corpus which is probably quite hard to scrape otherwise.



Here is the list of the 100k GPT-4 tokens as a text file.

https://gist.github.com/s-macke/ae83f6afb89794350f8d9a1ad8a0...

Yes, a lot of tokens are just for code.

Edit: Here's the raw link for those on mobile devices:

https://gist.githubusercontent.com/s-macke/ae83f6afb89794350...


Where are all the tokens for other languages though? How are these getting tokenized?


Thanks for this! That was very nice and thoughtful.

There’s something poetic about ULL being a token, but NULL not being one.


Saw many words missing their first letter. I realized it's probably because it's sometimes "Null" and sometimes "null".


ULL is also used as the suffix for unsigned long long 64-bit integer literals in C++, so it's not just a part of the NULL symbol.


>I suspect we're going to discover at some point, or maybe OpenAI already did, that training on code isn't just a neat trick to get an LLM that can knock out scripts.

This is a thing that's already fairly well known:

https://arxiv.org/abs/2210.07128


Thanks for the link. That paper seems a bit different though. They're asking the model to do reasoning by emitting serialized graphs using a custom declarative data format, which it struggles with of course because it hasn't seen any such format before. Then they switch to asking it to emit code and it does better. But what I was meaning was more that code training helps it reason and speak better even in English, where no code is being emitted at all.


To be fair, Codex was much better than GPT-3 on reasoning benchmarks like MMLU, and people have noticed that code-trained models seem to reason better. I don't know if a paper was published about that, though.


Thought can be seen as a process that encompasses both rational and irrational thinking. Rational thought, in programming languages, involves precise logic, determinism, and the ability to simulate outcomes. On the other hand, human language, like English, embraces subjective interpretation and approximations, allowing for the expression of emotions and nuanced understanding.

Thought, as a cognitive process, can bridge the gap between these two realms, enabling individuals to move back and forth between rational and irrational modes of thinking, depending on the context and objectives at hand.

With data, unstructured text could be considered "irrational" and structured text (like code or a column in a database) could be considered "rational".


I saw this when 3.5 came out: https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tr...

I haven't followed up on all the comments in it, but it speculates on why chain-of-thought improves when training on code.


> Yet whether such reasoning should be done by a language model or a symbolic system is up for discussion. For example, instead of trying hard to make GPT do three digits addition, one might simply call Python.


> The token lists are just text files that consist of the characters base64 encoded followed by the numeric ID. If you want to explore the list you can just download them and decode them yourself.

If you want to look at mappings for individual tokens, sure, but if you actually want to tokenize text that contains more than one token, the process is very non-trivial. I've been writing my own JavaScript LLaMA tokenizer for several days now (just the encode and decode functions, not training). I hope to release it this weekend. It's currently 400 lines of code plus data.


Yes, absolutely. TikToken is quite heavily optimized. If I wanted to write a tokenizer I'd just use their Rust backend and invoke it via an FFI, or translate it mechanically into another language. Actually, GPT-4 is quite good at code language translation so I'd just ask it to do the work.


TikToken doesn't provide a tokenizer that's compatible with LLaMA.


Ah interesting. What's the difference? Isn't it just finding the minimal mapping of character sequences to numbers?


That's what I thought when I started working on this, but it turns out the answer is no! This approach - "minimal mapping of character sequences to numbers" - can be described as a greedy algorithm. Using this approach will often produce correct results, but not always. I can provide an example:

Input string: " grabbed"

Tokenize that with the greedy algorithm and you get [17229, 2580] == [" grab", "bed"]

Tokenize that with the actual LLaMA tokenizer and you get [2646, 1327, 287] == [" gra", "bb", "ed"]

Note that the correct tokenizer represents this string with 3 tokens, even though it would be more efficient to represent this string with 2 tokens (yes, those 2 tokens exist in the vocabulary).

LLaMA uses SentencePiece Byte-Pair Encoding for tokenization, and it has many weird quirks like this.
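For anyone curious, here's a toy Python sketch of the difference (the merge priorities and vocabulary below are invented for illustration, not the real LLaMA ones). BPE repeatedly merges the highest-priority adjacent pair, so it never "sees" the shorter split that a greedy longest-match search would find:

    # Toy illustration of greedy longest-match vs. merge-order BPE on " grabbed".
    # The merge priorities and vocabulary are invented for this example.
    MERGES = {
        ("b", "b"): 1,
        ("e", "d"): 2,
        (" ", "g"): 3,
        (" g", "r"): 4,
        (" gr", "a"): 5,
        (" gra", "b"): 6,
        ("b", "ed"): 7,
    }
    VOCAB = {" grab", "bed", " gra", " gr", " g", "bb", "ed"} | set(" grabbed")

    def bpe(text):
        # Repeatedly merge the highest-priority adjacent pair until none applies.
        parts = list(text)
        while True:
            pairs = [(MERGES[p], i) for i, p in enumerate(zip(parts, parts[1:])) if p in MERGES]
            if not pairs:
                return parts
            _, i = min(pairs)
            parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

    def greedy(text):
        # Always take the longest vocabulary entry matching the remaining prefix.
        out, i = [], 0
        while i < len(text):
            piece = next(text[i:j] for j in range(len(text), i, -1) if text[i:j] in VOCAB)
            out.append(piece)
            i += len(piece)
        return out

    print(bpe(" grabbed"))     # [' gra', 'bb', 'ed']  -- three tokens, like the real tokenizer
    print(greedy(" grabbed"))  # [' grab', 'bed']      -- two tokens, shorter but not what BPE produces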


Would you be willing to share a GitHub link? This seems like a fun project to read through.



Sure. Check back to this comment after the weekend. I will post it here.


If anyone else is looking for the list the parent is mentioning, I assume it is: https://openaipublic.blob.core.windows.net/encodings/r50k_ba...


One thing I find fascinating about GPT-4 (and I'm curious about your take) is that it can not only generate novel, non-trivial code, but can also (upon request) output that code as a base64-encoded string... seemingly all from the model itself.


I don't have a great explanation for that, other than the obvious trivial one that it must have seen a lot of base64 encoded text alongside the decoded text in its training set, and that was sufficient for a small part of the network to learn how to decode it. If you look at visualizations of smaller RNNs trained on code then you can identify neurons that activate for things like "inside a quoted string", "inside the expression of an if statement" and so on.



