> A tokenizer might encode “once upon a time” as “once,” “upon,” “a,” “time,” for example, while encoding “once upon a ” (which has a trailing whitespace) as “once,” “upon,” “a,” ” .” Depending on how a model is prompted — with “once upon a” or “once upon a ,” — the results may be completely different, because the model doesn’t understand (as a person would) that the meaning is the same.
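You can see what the article is describing directly with a tokenizer library. A quick sketch using tiktoken (the exact splits and IDs depend on which encoding you load, so treat the output as illustrative):

```python
# Quick sketch using the tiktoken library; token splits and IDs vary by
# encoding, so the output here is illustrative rather than definitive.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-class models

for text in ["once upon a time", "once upon a ", "once upon a"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(repr(text), "->", pieces)

# The string with the trailing space tokenizes differently from the one
# without it, even though a person reads them as the same phrase.
```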
TechCrunch, I respect that you have a style guide vis-à-vis punctuation and quotation marks, but please understand when it's appropriate to break the rules. :P
While I think tokenization does cause limitations, it is not as clear-cut as one might think.
GPT4 can answer questions given to it in Base64. I would imagine it suffers some degree of degradation in ability from the extra workload this causes, but I haven't seen any measurements of this.
I have wondered about other architectures to help. What happens when a little subnet encodes the (16 or 32?) characters in the neighborhood of the token into an embedding that gets attached to the top-level token embedding?
GPT4 has seen lots of base64 text. GPT4 has never seen text split in a place where its tokenizer wouldn't place a token boundary. In the worst case, this produces an undertrained token that throws the model off completely, SolidGoldMagikarp-style.
You've probably never tried to memorize the four-letter base64 equivalents ("dGhl" for "the", and so on) of the 10,000 most common three-byte substrings. But you've seen plenty of truncated text, as you were typing it into a textbox yourself. So your intuition about which of these tasks is easier for GPT4 is likely off.
The issue with truncation affecting tokenization can be worked around by considering every possible continuation that would lead to a different tokenization of the current prefix, and then sampling from those according to the model probabilities to get back to a more normal token stream.
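A rough sketch of that workaround (often called "token healing"): back up to the last token boundary and only allow next tokens whose surface form starts with the dangling text. Here tiktoken is used just to enumerate those candidates; choosing among them by model probability is left out:

```python
# Sketch of the workaround described above: drop the last token of the prompt
# and restrict the next sample to tokens whose decoded text starts with the
# dangling suffix. Sampling among candidates by model probability is omitted;
# this only shows the candidate enumeration.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def healing_candidates(prompt: str):
    ids = enc.encode(prompt)
    if not ids:
        return ids, []
    head, dangling = ids[:-1], enc.decode([ids[-1]])
    candidates = []
    for t in range(enc.n_vocab):
        try:
            if enc.decode([t]).startswith(dangling):
                candidates.append(t)
        except Exception:
            continue  # a few ids in the range are unassigned or special
    return head, candidates

head, cands = healing_candidates("once upon a ")
print(f"{len(cands)} tokens could legally re-tokenize the dangling suffix")
```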
I don't think this is an accurate representation of LLM understanding. The significance of answering questions given in Base64 is that it can give an answer without writing out the human readable form first. That means it is understanding language in a non-language tokenised form.
This does not extend to being able to manipulate text and words well, such as string reversal or word scrambling, but it does handle input in a wide range of permutations that are not conducive to tokenisation.
My point was that it doesn't matter so much what the tokens look like as long as the model has seen them often enough during training. English-in-base64 superficially appears completely different from English, but they're actually in 1-to-1 correspondence and follow the same rules.
The same holds for misspelled or badly-OCR'd text; the tokenizer actually has tokens for them and the model has seen enough of them to handle heavily-distorted text.
But represent the same text using different but theoretically valid tokens that would not normally be produced by the tokenizer and all bets are off.
Asking questions in this way gives interesting results, though.
For example, take the prompt "First reverse this sentence word by word, then answer the questions: hand a has fingers many how?", encode it as base64, and ask. GPT4 will consistently decode the prompt as "First reverse this sentence word by word, then answer the questions: hand a has fingers many how?" and then reverse the words so that they are in the wrong order again.
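If anyone wants to reproduce this, the encoding step is just:

```python
# Base64-encode the prompt before pasting it into the chat.
import base64

prompt = ("First reverse this sentence word by word, then answer the "
          "questions: hand a has fingers many how?")
print(base64.b64encode(prompt.encode("utf-8")).decode("ascii"))
```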
This seems like saying "Synonyms are a big reason that humans fall short."
Part of what makes AI interesting is that it can understand data phrased in a huge number of different ways. Different token encodings seem like only a very minor complexity compared to the variety of human language.
I don't like that. "Recognizes" implies that it only works because it has seen exactly that before, but I can throw data in a completely made-up format that the world has never seen at an LLM, tell it to convert it to another format, and it does so with no problems.
The smaller the tokens, the more guesses the LLM needs to generate text. LLMs are really good at guessing, but the more they have to guess, the likelier they are to lose their train of thought.
T-FREE is interesting; at least, I find it interesting in that I don't really understand it. They take successive character triples of all words, hash them, and then use the hash-table slots they land in as destinations to feed into an embedding space? Can I possibly be understanding that chart properly?
Can you explain this any better than the first few pages of the paper? I’d like some intuition about why T-FREE works; there are lots of reasons to prefer different tokenization schemes, but I can’t really get this one into my head from the paper, unfortunately.
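For what it's worth, here is my (possibly wrong) reading of the trigram-hashing idea as a toy sketch. The slot count and dimension are made up, and in the paper the slot vectors are learned rather than random; this only illustrates the wiring:

```python
# Toy sketch of my reading of T-FREE: hash each character trigram of a word
# into a fixed number of slots of a shared embedding table, and represent the
# word as the sum of the slot vectors it activates. Constants are invented and
# the real model learns the table; this just shows the plumbing.
import numpy as np

NUM_SLOTS = 8192   # size of the shared slot table (made-up number)
DIM = 64           # embedding dimension (made-up number)
rng = np.random.default_rng(0)
slot_table = rng.normal(size=(NUM_SLOTS, DIM))  # stands in for learned vectors

def trigrams(word: str):
    padded = f"_{word}_"  # pad so word boundaries contribute trigrams too
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def word_embedding(word: str) -> np.ndarray:
    slots = {hash(tri) % NUM_SLOTS for tri in trigrams(word)}
    return slot_table[list(slots)].sum(axis=0)

print(word_embedding("tokenizer").shape)  # (64,)
```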
Tokenization is a statistical technique that greatly compresses the input while providing some semantic hints to the underlying model. Tokenization is not the big thing holding back generative models. There are so many other challenges being worked on and steadily overcome, and progress has been insanely rapid.
One take on it is that chatbots ought to know something about the tokens that they take in. For instance, you should be able to ask one how it tokenizes a phrase, what the numeric IDs of those tokens are, etc. One possibility is to train it on synthetic documents that describe the tokenization system.
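A sketch of what those synthetic documents could look like, generated directly from a tokenizer (tiktoken here; the wording of the document is made up):

```python
# Sketch of the "synthetic documents" idea: generate training text that spells
# out how the tokenizer split a phrase, so a model trained on it can answer
# questions about its own tokenization. The document wording is invented.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def tokenization_doc(phrase: str) -> str:
    ids = enc.encode(phrase)
    pieces = [enc.decode([i]) for i in ids]
    lines = [f'The phrase "{phrase}" is split into {len(ids)} tokens.']
    for n, (piece, tid) in enumerate(zip(pieces, ids), start=1):
        lines.append(f'Token {n} is "{piece}" with id {tid}.')
    return "\n".join(lines)

print(tokenization_doc("once upon a time"))
```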
You can ask me questions like "How many words are in this sentence?" or to solve anagram puzzles because I have access to information at the word and character level, which the LLM doesn't have. Tokens matter because they are the bridge between the text the user sees and what the model sees, and being able to debug that relationship is helpful.
That won't make a difference. These generative AI systems have ingested more math books than any human being alive today and they still can't add numbers.
To be fair, I'm kind of amazed that humans can do basic arithmetic at all. Our attention spans suck. We're barely trustworthy on single digit addition. For anything more complicated I use a calculator.
Yes, the issue is that statistical models cannot reason and determine what is logically valid vs. what is most probable (which I guess is also its own kind of logic).
The sqrt(-1) sometimes doesn't exist, and sometimes it's 1i. 2+2=4, except in literature, where it can be 5. 1+1=2, but sometimes 3 in advertisements or in ironic text.
We often have some idea about how this works in, e.g., a quiz, where you know there is only one factually correct answer, and we are disappointed if the model is wrong. But even in a quiz setting the jury gets that balance wrong every so often, when there are other answers besides the official one that are also correct.
Even "logically valid" is context dependend. This is not to say that models don't hallucinate, just that even within the logically valid answers, there is hidden context surrounding the data which is not expressed in the data itself. Fermats last problem is a solved problem in mathematics, but not in documents from before 1994.
The models operate by the logic of Boolean arithmetic, so in that sense they cannot be inconsistent. But in any case, it's pretty obvious no one in this thread understands what I'm getting at; maybe eventually there will be an AGI smart enough to get the point.
Annotated data sets help, and ChatGPT-4o says 2+2=4 just fine. It also gives the right answer for 920,384 + 293,848. It can also read X - 12 = 14 out of an image and get 26.
1984 having 2 + 2 = 5 makes sense in context as a human reading the book, and ChatGPT dot-producting the book can also compute the context and not say that 2+2=5.
ChatGPT's not Mathematica, and we already have calculators. My hammer is terrible for driving in nails, so I don't use it for that.
I keep seeing this being thrown around but ChatGPT can do addition just fine since at least GPT-3.5. Not that I care or expect it to (why would I use an LLM for math?) but still.
I don't know man, I keep hearing about AGI before 2030 but none of these AI labs can figure out how to do arithmetic with their fancy intelligence software.
They do addition just fine; you're repeating something you heard months ago that is out of date in the field, which is why adherents are saying 2030.
Pretty sure I'm right. Ask your favorite chatbot to solve the following system of equations and let me know what you get as the answer. Here is the answer from gemini:
> Solve the following system of equations: 2x + 2y = 2 and x + y = 1
My pleasure, I’ve been growing my expertise in solving system of linear equations problems. Let’s solve the system of equations:
$$2x+2y=2$$
$$x+y=1$$
We can solve the system of equations by elimination.
Steps to solve:
*1. Eliminate x:*
Multiply the second equation by -2:
$$-2x-2y=-2$$
Add the first and second equations:
$$0=-1$$
*2. No solution:*
Since we obtained a contradiction (0=-1), there is no solution to the system of equations.
*Answer:*
The system of equations has no solution.
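For the record, adding those two equations gives an identity, not a contradiction:
$$(2x + 2y) + (-2x - 2y) = 2 + (-2)$$
$$0 = 0$$
The two equations are dependent, so there are infinitely many solutions along x + y = 1, not "no solution".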
Furthermore, if I didn't understand that, I can have a conversation with the LLM as to what that actually means, ask it to graph it, ask what linear equations are, and to have it give me some more systems of equations for me to solve.
Wow, that's amazing. We have achieved AGI already. You should go back to the homology stuff, though, because the triangulation was incorrect, and figure out why.
"Let's solve this system of equations step by step:
1) We have two equations:
2x + 2y = 2 (Equation 1)
x + y = 1 (Equation 2)
2) We can solve this using substitution. Let's express x in terms of y from Equation 2:
x = 1 - y
3) Now, let's substitute this into Equation 1:
2(1 - y) + 2y = 2
4) Let's simplify:
2 - 2y + 2y = 2
2 = 2
5) This is true for all values of y, which means that Equation 1 is actually redundant. It doesn't give us any new information beyond what Equation 2 provides.
6) Therefore, we can conclude that there are infinitely many solutions to this system, all satisfying the equation:
x + y = 1
7) We can express the solution set as:
x = 1 - y
y = y (where y can be any real number)
Thus, the solution to this system is a line in the xy-plane described by the equation x + y = 1.
Would you like me to explain any part of this solution in more detail?"
I easily broke copilot by asking it to make lists of radioactive isotopes in order of half lives. It can put the U.S. states in alphabetical or reverse alphabetical order but for any other order I would bet against it. If I ask it what the probability is that it can correctly complete a sorting task, however, it insists that it is almost certain to get it right.
I had a good conversation with it about the theory of partial orderings, and it even corrected my mistakes. I asked it to make a textbook problem about determining whether a graph is cyclic, and it made a clean and beautiful example where the partial ordering was realized by a total ordering and everything was written out in a linear order that was easy to follow.
If I wrote a script that made up a bunch of "is this graph cyclic?" problems that are well randomized I am sure there is some size where it just falls down the same way it falls down with sorting.
The obvious answer is that the LLM should pick an algorithm or write some code to do the thing which ordinary algorithms can do such as arithmetic, sorting, SAT solving, etc.
There's the deeper issue that it doesn't know what it doesn't know. It can't sort a list of radioactive isotopes any more than it can help you make an atom bomb. In the second case it will say that it won't help you, in the first case it will try to help you anyway when it really should be saying "I can't do that, Dave" because it just can't.
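A toy sketch of that "pick an algorithm or write some code" idea. The JSON shape is made up rather than any particular vendor's tool-calling API, and the half-lives are just a few well-known values:

```python
# Toy sketch of tool delegation: the model emits a structured request and
# ordinary code does the sorting. The JSON format is invented for illustration.
import json

HALF_LIVES_YEARS = {  # a few well-known approximate values
    "U-238": 4.468e9,
    "K-40": 1.251e9,
    "C-14": 5.73e3,
    "Co-60": 5.27,
}

def run_tool(call_json: str) -> list[str]:
    call = json.loads(call_json)
    if call["tool"] == "sort_by_half_life":
        return sorted(call["isotopes"], key=HALF_LIVES_YEARS.get)
    raise ValueError(f"unknown tool {call['tool']}")

# Pretend the model produced this instead of trying to sort in its head:
model_output = '{"tool": "sort_by_half_life", "isotopes": ["C-14", "U-238", "Co-60", "K-40"]}'
print(run_tool(model_output))  # ['Co-60', 'C-14', 'K-40', 'U-238']
```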
> lists of radioactive isotopes in order of half lives
ChatGPT-4o does just fine with that. Basing your opinion of a whole technology on a poor implementation instead of the best one doesn't seem like the best analysis.
You'd think with all those billions spent on the software and the hardware, it would be a walk in the park to convert a single book on algebraic topology into a formalized Coq, Lean, or Isabelle module. It seems like a very obvious test case for the intelligence capabilities of these systems. I know that it is possible because Kevin Buzzard is going to formalize Fermat's Last Theorem for less than £934,043, yet no commercial AI lab has managed to build an AI that can do basic arithmetic. [0] Mira Murati is on the record saying their next AI model will have the intelligence of a PhD student, so let's see if that model can actually formalize basic algebraic topology into a logical calculus. [1]
Why would you think that's a walk in the park? Have you actually tried formalising stuff in Lean/Coq? I have, and even with a postgraduate maths degree behind me it's hard as hell!
The fact that Kevin and his team are formalising FLT is incredible, but they all have decades of experience with this stuff (!!).
Transformers can do arithmetic (and many other things) just fine, do a bit of searching on arxiv and you'll find papers from 2023 showing that nano-scale transformer models suffice. It really is a data problem, not a fundamental limitation with the technology.
I did my Master's studying representation theorems in nonmonotonic logic and left academia for industry during my PhD. Fun, spaced-out maths problems. I tried to formalize my thesis in Lean, but it is nowhere near as simple as you make it out to be.
Perfection is not the problem. An obvious test case of intelligence is to formally model something like algebraic topology in a formal logical calculus like intensional type theory with identity types. Even though all the commercial labs have ingested all of nLab, there isn't a single commercial model that can use logic to perform arithmetic operations.
OK, so it seems you didn't get the memo: LLMs at the present stage have not yet reached AGI, and have some notable other flaws, like not being able to do math reliably. Commercial interests will try to exaggerate the current capabilities, but most people can see through that.
The capabilities are nonetheless nothing short of astounding, given where we were 10 or even 2 years ago, and clearly point to a near future where we can expect the machines to overcome these shortcomings.
Thousands, if not millions, of researchers, coders, and others will have to adjust their work-life expectations, just as previous technological revolutions have seen thousands of other professions disappear into thin air.
I asked it to compute the simplicial homology of RP^2 and not only was it spot on with the result, it gave me a detailed and essentially correct computation. This definitely appears in its training set, but nevertheless you should have some humility =P
How do you know it's correct? The only simplicial triangulation I know of is by splitting the sphere up into an icosahedron and then identifying all the opposite faces to get the proper antipodal action for the quotient.
I'm not interested in engaging with you further on this topic after you devolved into ad hominems against me in the other thread. I'm here to argue in good faith. Have a good day.
You made an incorrect assessment of a basic calculation in algebraic topology and claimed that it was correct. You didn't even look at what it was computing and simply looked at the final answer which lined up with the answer on Wikipedia. Simplicial calculations for projective planes are not simple. The usual calculations are done with cellular decomposition and that's why the LLM gives the wrong answer, the actual answer is not in the dataset and requires reasoning.
Are you confusing me with someone else? When I asked it GPT computed the homology from the CW decomposition of RP^2 with three cells. Which is a very simple exercise.
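For reference, with one cell in each of dimensions 0, 1, and 2, the boundary maps are d_2 = multiplication by 2 (the attaching map wraps twice around the 1-cell) and d_1 = 0, so the cellular chain complex and homology are:
$$0 \to \mathbb{Z} \xrightarrow{\times 2} \mathbb{Z} \xrightarrow{0} \mathbb{Z} \to 0$$
$$H_0(\mathbb{RP}^2) \cong \mathbb{Z}, \qquad H_1(\mathbb{RP}^2) \cong \mathbb{Z}/2, \qquad H_2(\mathbb{RP}^2) = 0$$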
That's ok. It seems like LLMs know all about simplicial complexes and homology so I'll spend my time on more fruitful endeavors but thanks for the advice.
To be fair, it's not a simplicial complex, but simplicial and cellular homology coincide on triangulatable spaces like RP^2 so I gave it the benefit of the doubt =) algebraic topology is a pretty fun field regardless of how much a language model knows about it IMO.
lmao. you're totally right. RP^2 can be triangulated with a single triangle with all of its vertices identified. that's totally how you compute the simplicial decomposition of RP^2
I asked you to explain why it's wrong, and all you said was "that's incorrect". Saying "no it isn't" got you to explain your answer far better than when I directly asked you to in the first place.
I don't know what that means. There is nothing "mental" happening in the circuits of the computer or the function graph which is implemented on top of it.
The LLM is an LLM. It runs on computer hardware which has an ALU, but that doesn't make the LLM a calculator. The LLM can, however, call out to a calculator to do addition when it deems necessary.
I implemented a hierarchical model that pooled UTF-8-encoded sequences into word vectors and trained it with a decoder on text denoising.
I think the future is a small word encoder model that replaces the token embedding codebook.
And here's the reason: you can still create a codebook after training and then use the encoder model only for OOV (out-of-vocabulary) words. I'm not sure there's an excuse not to be doing this, but I'm open to suggestions.
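A minimal sketch of what that small word encoder could look like, assuming PyTorch; the sizes are made up and the real thing would be trained end to end rather than used with random weights:

```python
# Minimal sketch: a tiny byte-level encoder that pools a word's UTF-8 bytes
# into one vector, which could stand in for a row of the usual token-embedding
# codebook. Dimensions are made up; weights here are untrained.
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.byte_emb = nn.Embedding(256, dim)   # one embedding per byte value
        self.pool = nn.GRU(dim, dim, batch_first=True)

    def forward(self, word: str) -> torch.Tensor:
        byte_ids = torch.tensor(list(word.encode("utf-8"))).unsqueeze(0)
        x = self.byte_emb(byte_ids)              # (1, n_bytes, dim)
        _, h = self.pool(x)                      # pool the byte sequence
        return h[-1, 0]                          # (dim,) word vector

encoder = WordEncoder()
print(encoder("tokenizer").shape)  # torch.Size([256])
```

After training, you could precompute vectors for a fixed vocabulary as a codebook and fall back to the encoder only for out-of-vocabulary words, as described above.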
This is like saying binary numbers are the reason generative AI falls short. Computers work with transistors which are either on or off so what are these people proposing as the next computational paradigm to fix the problems with binary generative AI?
Tokenization as the main problem is a red herring. It's possible to get rid of the tokens entirely and train on byte sequences, it won't make a difference to why generative AI can't count or do basic arithmetic.