
A few extra notes on tokens.

You don't have to use tiktoken if you aren't actually tokenizing things. The token lists are just text files in which each line is a token's characters, base64 encoded, followed by its numeric ID. If you want to explore the list you can just download it and decode it yourself.

I find that sorting tokens by length makes it a bit easier to get a feel for what's in there.
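For example, a minimal Python sketch along these lines (assuming the tiktoken file format of one "<base64 token> <rank>" pair per line, and cl100k_base.tiktoken as an example filename you've already downloaded) would be:

    import base64

    # Each line of a .tiktoken file is "<base64-encoded token bytes> <rank>".
    tokens = []
    with open("cl100k_base.tiktoken", "r", encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            b64, rank = line.split()
            tokens.append((int(rank), base64.b64decode(b64)))

    # Sorting by length makes it easier to get a feel for what's in there.
    for rank, tok in sorted(tokens, key=lambda t: len(t[1]), reverse=True)[:100]:
        print(rank, tok)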

GPT-4 has a token vocabulary about twice the size of GPT-3.5's.

The most interesting thing to me about the GPT-4 token list is how dominated it is by non-natural languages. It's not as simple as English tokenizing more efficiently than Spanish because of frequency. The most common language after English is code. A huge number of tokens are allocated even to things found in code that aren't especially common, like "ValidateAntiForgeryToken" or "_InternalArray". From eyeballing the list I'd guess about half the tokens come from source code.

My guess is that it's not a coincidence that GPT-4 both trained on a lot of code and is also the leading model. I suspect we're going to discover at some point, or maybe OpenAI already did, that training on code isn't just a neat trick to get an LLM that can knock out scripts. Maybe it's fundamentally useful to train the model to reason logically and think clearly. The highly structured and unambiguous yet also complex thought that code represents is probably a great way for the model to really level up its thought processes. Ilya Sutskever mentioned in an interview that one of the bottlenecks they face on training something smarter than GPT-4 is getting access to "more complex thought". If this is true then it's possible the Microsoft collaboration will prove an enduring competitive advantage for OpenAI, as it gives them access to the bulk GitHub corpus which is probably quite hard to scrape otherwise.



Here is the list of the 100k GPT-4 tokens as a text file.

https://gist.github.com/s-macke/ae83f6afb89794350f8d9a1ad8a0...

Yes, a lot of tokens are just for code.

Edit: Here's the raw link for those on mobile devices:

https://gist.githubusercontent.com/s-macke/ae83f6afb89794350...


Where are all the tokens for other languages though? How are these getting tokenized?


Thanks for this! That was very nice and thoughtful.

There’s something poetic about ULL being a token, but NULL not being one.


Saw many words missing their first letter. I realized it's probably because it's sometimes "Null" and sometimes "null".


ULL is also used as the suffix for unsigned long long 64-bit integer literals in C++, so it's not just a part of the NULL symbol.


>I suspect we're going to discover at some point, or maybe OpenAI already did, that training on code isn't just a neat trick to get an LLM that can knock out scripts.

This is a thing that's already fairly well known:

https://arxiv.org/abs/2210.07128


Thanks for the link. That paper seems a bit different though. They're asking the model to do reasoning by emitting serialized graphs using a custom declarative data format, which it struggles with of course because it hasn't seen any such format before. Then they switch to asking it to emit code and it does better. But what I was meaning was more that code training helps it reason and speak better even in English, where no code is being emitted at all.


To be fair, Codex was much better than GPT-3 on reasoning benchmarks like MMLU, and people have noticed that code-trained models seem to reason better. I don't know if a paper was published about that, though.


Thought can be seen as a process that encompasses both rational and irrational thinking. Rational thought, in programming languages, involves precise logic, determinism, and the ability to simulate outcomes. On the other hand, human language, like English, embraces subjective interpretation and approximations, allowing for the expression of emotions and nuanced understanding.

Thought, as a cognitive process, can bridge the gap between these two realms, enabling individuals to move back and forth between rational and irrational modes of thinking, depending on the context and objectives at hand.

With data, unstructured text could be considered "irrational" and structured text (like code or a column in a database) could be considered "rational".


I saw this when 3.5 came out: https://yaofu.notion.site/How-does-GPT-Obtain-its-Ability-Tr...

I haven't followed up on all the comments in it, but it speculates on why chain-of-thought improves when training on code.


> Yet whether such reasoning should be done by a language model or a symbolic system is up for discussion. For example, instead of trying hard to make GPT do three digits addition, one might simply call Python.


> The token lists are just text files that consist of the characters base64 encoded followed by the numeric ID. If you want to explore the list you can just download them and decode them yourself.

If you want to look at mappings for individual tokens, sure, but if you actually want to tokenize text that contains more than one token, the process is very non-trivial. I've been writing my own JavaScript LLaMA tokenizer for several days now (just the encode and decode functions, not training). I hope to release it this weekend. It's currently 400 lines of code plus data.


Yes, absolutely. TikToken is quite heavily optimized. If I wanted to write a tokenizer I'd just use their Rust backend and invoke it via an FFI, or translate it mechanically into another language. Actually, GPT-4 is quite good at code language translation so I'd just ask it to do the work.


TikToken doesn't provide a tokenizer that's compatible with LLaMA.


Ah interesting. What's the difference? Isn't it just finding the minimal mapping of character sequences to numbers?


That's what I thought when I started working on this, but it turns out the answer is no! This approach - "minimal mapping of character sequences to numbers" - can be described as a greedy algorithm. Using this approach will often produce correct results, but not always. I can provide an example:

Input string: " grabbed"

Tokenize that with the greedy algorithm and you get [17229, 2580] == [" grab", "bed"]

Tokenize that with the actual LLaMA tokenizer and you get [2646, 1327, 287] == [" gra", "bb", "ed"]

Note that the correct tokenizer represents this string with 3 tokens, even though it would be more efficient to represent this string with 2 tokens (yes, those 2 tokens exist in the vocabulary).

LLaMA uses SentencePiece Byte-Pair Encoding for tokenization, and it has many weird quirks like this.
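For anyone curious, here's a toy Python sketch of the difference (the merge priorities and vocabulary below are invented for illustration, not the real LLaMA ones). BPE repeatedly merges the highest-priority adjacent pair, so it never "sees" the shorter split that a greedy longest-match search would find:

    # Toy illustration of greedy longest-match vs. merge-order BPE on " grabbed".
    # The merge priorities and vocabulary are invented for this example.
    MERGES = {
        ("b", "b"): 1,
        ("e", "d"): 2,
        (" ", "g"): 3,
        (" g", "r"): 4,
        (" gr", "a"): 5,
        (" gra", "b"): 6,
        ("b", "ed"): 7,
    }
    VOCAB = {" grab", "bed", " gra", " gr", " g", "bb", "ed"} | set(" grabbed")

    def bpe(text):
        # Repeatedly merge the highest-priority adjacent pair until none applies.
        parts = list(text)
        while True:
            pairs = [(MERGES[p], i) for i, p in enumerate(zip(parts, parts[1:])) if p in MERGES]
            if not pairs:
                return parts
            _, i = min(pairs)
            parts = parts[:i] + [parts[i] + parts[i + 1]] + parts[i + 2:]

    def greedy(text):
        # Always take the longest vocabulary entry matching the remaining prefix.
        out, i = [], 0
        while i < len(text):
            piece = next(text[i:j] for j in range(len(text), i, -1) if text[i:j] in VOCAB)
            out.append(piece)
            i += len(piece)
        return out

    print(bpe(" grabbed"))     # [' gra', 'bb', 'ed']  -- three tokens, like the real tokenizer
    print(greedy(" grabbed"))  # [' grab', 'bed']      -- two tokens, shorter but not what BPE produces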


Would you be willing to share a GitHub link? This seems like a fun project to read through.



Sure. Check back to this comment after the weekend. I will post it here.


If anyone else is looking for the list the parent is mentioning, I assume it is: https://openaipublic.blob.core.windows.net/encodings/r50k_ba...


One thing I find fascinating about GPT-4 (and I'm curious about your take) is that it can not only generate novel, non-trivial code, but can also (upon request) output that code as a base64-encoded string... seemingly all from the model itself.


I don't have a great explanation for that, other than the obvious trivial one that it must have seen a lot of base64 encoded text alongside the decoded text in its training set, and that was sufficient for a small part of the network to learn how to decode it. If you look at visualizations of smaller RNNs trained on code then you can identify neurons that activate for things like "inside a quoted string", "inside the expression of an if statement" and so on.



