Tokens are a big reason today's generative AI falls short (techcrunch.com)
31 points by anigbrowl on July 6, 2024 | hide | past | favorite | 103 comments


> A tokenizer might encode “once upon a time” as “once,” “upon,” “a,” “time,” for example, while encoding “once upon a ” (which has a trailing whitespace) as “once,” “upon,” “a,” ” .” Depending on how a model is prompted — with “once upon a” or “once upon a ,” — the results may be completely different, because the model doesn’t understand (as a person would) that the meaning is the same.

TechCrunch, I respect that you have a style guide vis-à-vis punctuation and quotation marks, but please understand when it's appropriate to break the rules. :P
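The article's trailing-whitespace example is easy to reproduce with a toy greedy longest-match tokenizer (the vocabulary here is invented for illustration; real BPE vocabularies like GPT-2's or cl100k_base are learned from data):

```python
# Toy greedy longest-match tokenizer. A trailing space changes the token
# sequence, even though a human reads the two prompts as the same.
VOCAB = ["once", " upon", " a", " time", " ", "a", "o", "n", "c", "e"]

def tokenize(text):
    tokens = []
    while text:
        match = max((t for t in VOCAB if text.startswith(t)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

print(tokenize("once upon a"))   # ['once', ' upon', ' a']
print(tokenize("once upon a "))  # ['once', ' upon', ' a', ' ']
```

The lone trailing-space token is exactly the kind of rarely-seen token the article says can push a model into different behavior.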


When this problem comes up in code-related documentation at work, I often fall back onto either a distinct typeface or a background-color shift.

That way it's clearer when I'm referring to a literal code-string versus quote marks that are part of the prose.


I usually use <pre></pre> for this.


<tt>

Teletype longa, vita brevis.


The <tt> tag is deprecated.


The blink and marquee tags are deprecated but they still work...


True. But it is well supported and it meets a need.


stymied by their own tokenization!


I'm glad I wasn't the only one


While I think tokenization does cause limitations, it is not as clear cut as one might think.

GPT4 can answer questions given to it in Base64. I would imagine it suffers some degree of degradation in ability from the extra workload this causes but I haven't seen any measurements on this.

I have wondered about other architectures to help. What happens when a little subnet encodes the (16 or 32?) characters in the neighborhood of the token into an embedding that gets attached to the top level token embedding?


GPT4 has seen lots of base64 text. GPT4 has never seen text split in a place where its tokenizer wouldn't place a token boundary. In the worst case, this produces an undertrained token that throws the model off completely, SolidGoldMagikarp-style.

You've probably never tried to memorize the four-letter base64 equivalents of dGhl 10000 most common three-byte substrings. But you've seen plenty of truncated text, as you were typing it into a textbox yourself. So your intuition which of these tasks is easier for GPT4 is likely off.

The issue with truncation affecting tokenization can be worked around by considering every possible continuation that would lead to a different tokenization of the current prefix, and then sampling from those according to the model probabilities to get back to a more normal token stream.
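The workaround described above is often called "token healing": back up over the prompt's final (possibly truncated) token and restrict sampling to vocabulary entries consistent with it. A minimal sketch with an invented vocabulary:

```python
def tokenize(text, vocab):
    # greedy longest-match, standing in for a real BPE tokenizer
    out = []
    while text:
        t = max((v for v in vocab if text.startswith(v)), key=len)
        out.append(t)
        text = text[len(t):]
    return out

def healed_candidates(prompt, vocab):
    """Drop the prompt's final token and return the vocab entries that start
    with the dropped text; sampling is then restricted to those entries."""
    tokens = tokenize(prompt, vocab)
    tail = tokens[-1]
    return tokens[:-1], [t for t in vocab if t.startswith(tail)]

VOCAB = [" a", " and", " apple", " at", "once", " upon", " "]
prefix, allowed = healed_candidates("once upon a", VOCAB)
print(prefix)    # ['once', ' upon']
print(allowed)   # [' a', ' and', ' apple', ' at']
```

The model then continues from `prefix` with its next token constrained to `allowed`, so the truncated " a" can become " a", " and", " apple", etc., at the model's normal probabilities.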


I don't think this is an accurate representation of LLM understanding. The significance of answering questions given in Base64 is that it can give an answer without writing out the human-readable form first. That means it is processing language in a tokenised form that doesn't resemble natural language.

This does not extend to being able to manipulate text and words well, such as string reversal or word scrambling, but it handles input in a wide range of permutations that are not conducive to tokenisation.

Unnatural Error Correction: GPT-4 Can Almost Perfectly Handle Unnatural Scrambled Text https://arxiv.org/abs/2311.18805


My point was that it doesn't matter so much what the tokens look like as long as the model has seen them often enough during training. English-in-base64 superficially appears completely different from English, but they're actually in 1-to-1 correspondence and follow the same rules.

The same holds for misspelled or badly-OCR'd text; the tokenizer actually has tokens for them and the model has seen enough of them to handle heavily-distorted text.

But represent the same text using different but theoretically valid tokens that would not normally be produced by the tokenizer and all bets are off.
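The 1-to-1 correspondence is easy to verify with Python's standard library: every 3-byte group maps deterministically to one 4-character base64 group and back, so English-in-base64 is a re-encoding, not a new language.

```python
import base64

# "the" always encodes to "dGhl" and decodes back; the mapping is bijective.
assert base64.b64encode(b"the").decode() == "dGhl"
assert base64.b64decode("dGhl") == b"the"

msg = b"once upon a time"
assert base64.b64decode(base64.b64encode(msg)) == msg
print(base64.b64encode(msg).decode())
```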


Although asking questions in this way gives interesting results.

For example, take the prompt "First reverse this sentence word by word, then answer the questions: hand a has fingers many how?" encode as base64 and ask. GPT4 will consistently decode the prompt as "First reverse this sentence word by word, then answer the questions: hand a has fingers many how?" and then reverse the words to be in the wrong order again.


> GPT4 can answer questions given to it in Base64

Would have never thought ...

... wow.


This seems like saying "Synonyms are a big reason that humans fall short."

Part of what makes AI interesting is that it can understand a huge number of differently phrased data. It seems like different token encodings would only be a very minor complexity compared to the variety of human language.


> Part of what makes AI interesting is that it can understand a huge number of differently phrased data.

I'd say it "recognizes" a huge number of differently phrased data. This avoids implying any analytic encoding.


I don't like that. "Recognizes" implies that it only works because it has seen exactly that before but I can throw data in a completely made-up format that the world has never seen at an LLM and tell it to convert it to another format and it does so with no problems.


> "Recognizes" implies that it only works because it has seen exactly that before

I don't think we use it that way with humans either. We also are capable of recognizing features (patterns) from formally new data.


The smaller the tokens, the more guesses the LLM needs to generate text. LLMs are really good at guessing, but the more they have to guess, the more likely they are to lose their train of thought.


An alternative approach to BPE tokenization: https://arxiv.org/abs/2406.19223


T-FREE is interesting, at least, I find it interesting in that I don’t really understand it. They take successive character triples of all words, and then hash them, and then use the hash table slots landed in as destinations to feed into an embedding space? Can I possibly be understanding that chart properly?

Can you explain this any better than the first few pages of the paper? I’d like some intuition about why T-FREE works; there are lots of reasons to prefer different tokenization schemes, but I can’t really get this one into my head from the paper, unfortunately.
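For what it's worth, your reading matches mine. A minimal sketch of the trigram-hashing step as I understand it (the table size, embedding dimension, hash choice, and padding convention are all my own placeholders, not the paper's):

```python
import hashlib
import random

TABLE_SIZE = 8192   # placeholder; the paper tunes this
DIM = 16            # placeholder embedding dimension

random.seed(0)
slot_embeddings = [[random.gauss(0, 1) for _ in range(DIM)]
                   for _ in range(TABLE_SIZE)]

def trigrams(word):
    padded = f"_{word}_"                        # mark word boundaries
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def slot(tri):
    digest = hashlib.sha256(tri.encode()).digest()
    return int.from_bytes(digest[:4], "big") % TABLE_SIZE

def word_embedding(word):
    # each trigram activates one hash-table slot; the word's vector is the
    # sum of the activated slots' (in T-FREE, trainable) embeddings
    vec = [0.0] * DIM
    for tri in trigrams(word):
        emb = slot_embeddings[slot(tri)]
        vec = [a + b for a, b in zip(vec, emb)]
    return vec

print(trigrams("time"))   # ['_ti', 'tim', 'ime', 'me_']
```

So the "table slots landed in" act as a sparse multi-hot input over a fixed-size hash table, which is what removes the learned vocabulary.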


Can't say I've mastered the concept either; I'm waiting for the code [0] to be released so I can run some head-to-head tests.

[0] https://github.com/Aleph-Alpha/trigrams


I find this article very weird.

It doesn't really explain anything besides talking about tokenization on random levels.

You need a certain amount of data to even understand that "once upon a time" might be a higher-level concept.


Tokenization is a statistical technique that greatly compresses the input while providing some semantic hints to the underlying model. Tokenization is not the big thing holding back generative models. There are so many other challenges being worked on and steadily overcome and progress has been insanely rapid.


One take on it is that chatbots ought to know something about the tokens that they take in. For instance, you should be able to ask one how it tokenizes a phrase, how many tokens the phrase contains, etc. One possibility is to train it on synthetic documents that describe the tokenization system.
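Generating such synthetic documents is mechanical once you have the tokenizer; a toy sketch (the vocabulary and document template are invented for illustration):

```python
VOCAB = ["once", " upon", " a", " time", " "]

def tokenize(text):
    # greedy longest-match, standing in for the real tokenizer
    out = []
    while text:
        t = max((v for v in VOCAB if text.startswith(v)), key=len)
        out.append(t)
        text = text[len(t):]
    return out

def synthetic_doc(phrase):
    toks = tokenize(phrase)
    listed = ", ".join(repr(t) for t in toks)
    return (f"The phrase {phrase!r} is tokenized as {len(toks)} tokens: "
            f"{listed}.")

print(synthetic_doc("once upon a time"))
```

Run over a large corpus of phrases, this would give the model text describing its own token boundaries and counts.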


> For instance you should be able to ask it how it tokenizes a phrase

Can you tell me how your brain maps input to its internal representation?

If not, why should we think it's an essential feature of an AI system to do the equivalent?


You can ask me questions like "How many words are in this sentence?" or to solve anagram puzzles because I have access to information at the word and character level which the LLM doesn't have. Tokens matter because they are the bridge between the text the user sees and what the model sees, and debugging that relationship is helpful.


That won't make a difference. These generative AI systems have ingested more math books than any human being alive today and they still can't add numbers.


I work in the field. The issue is that you will find 2+2=5 literally a hundred times more often in all big datasets than 6+3=9.

A lot of information is wrong, a lot is only true in context (e.g. 2+2=5 features prominently in the book 1984), and even more is spam or machine generated.

Seeing the garbage that goes in, I am always amazed at how these models do what they do.


To be fair, I'm kind of amazed that humans can do basic arithmetic at all. Our attention spans suck. We're barely trustworthy on single digit addition. For anything more complicated I use a calculator.


Yes, the issue is that statistical models cannot reason and determine what is logically valid versus what is most probable (which I guess is also its own kind of logic).


And context, so much context.

The sqrt(-1) sometimes doesn't exist, sometimes it's i. 2+2=4, except in literature where it can be 5. 1+1=2, but sometimes 3 in advertisements or in ironic text.

We often have some ideas about e.g how it works in a quiz, where you know there is only one factually correct answer. And we are disappointed if the model is wrong. But even in a quiz setting the jury gets that balance wrong every so often, where there are other answers than the official one which are also correct.

Even "logically valid" is context-dependent. This is not to say that models don't hallucinate, just that even within the logically valid answers, there is hidden context surrounding the data which is not expressed in the data itself. Fermat's Last Theorem is a solved problem in mathematics, but not in documents from before 1994.


The models operate by the logic of Boolean arithmetic, so in that sense they cannot be inconsistent. But in any case, it's pretty obvious no one in this thread understands what I'm getting at; maybe eventually there will be an AGI smart enough to get the point.


Annotated data sets help, and ChatGPT-4o says 2+2=4 just fine. It also gives the right answer for 920,384 + 293,848. It can also read X - 12 = 14 out of an image and get X = 26.

1984 having 2 + 2 = 5 makes sense in context as a human reading the book, and ChatGPT dot-producting the book can also compute the context and not say that 2+2=5.

ChatGPT's not Mathematica, and we already have calculators. My hammer is terrible for driving in nails, so I don't use it for that.


And this is why input assessment¹ is crucial...

(Idea assessment in general. Handled ideas in thought processes are still input.)


I keep seeing this being thrown around but ChatGPT can do addition just fine since at least GPT-3.5. Not that I care or expect it to (why would I use an LLM for math?) but still.


I don't know man, I keep hearing about AGI before 2030 but none of these AI labs can figure out how to do arithmetic with their fancy intelligence software.


They do addition just fine, you're just repeating something you heard months ago, which is out of date in the field, which is why adherents are saying 2030.


Pretty sure I'm right. Ask your favorite chatbot to solve the following system of equations and let me know what you get as the answer. Here is the answer from gemini:

> Solve the following system of equations: 2x + 2y = 2 and x + y = 1

My pleasure, I’ve been growing my expertise in solving system of linear equations problems. Let’s solve the system of equations: $$2x+2y=2$$ $$x+y=1$$

We can solve the system of equations by elimination.

Steps to solve: *1. Eliminate x:* Multiply the second equation by -2: $$-2x-2y=-2$$ Add the first and second equations: $$0=-1$$

*2. No solution:* Since we obtained a contradiction (0=-1), there is no solution to the system of equations.

*Answer:* The system of equations has no solution.
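Gemini's elimination step can be checked mechanically: multiplying the second equation by -2 and adding gives 0 = 0 (the two equations are proportional), not the 0 = -1 it claims, so the system has infinitely many solutions rather than none.

```python
# Represent a*x + b*y = c as the tuple (a, b, c) and redo the elimination:
# add -2 times (x + y = 1) to (2x + 2y = 2).
eq1 = (2, 2, 2)    # 2x + 2y = 2
eq2 = (1, 1, 1)    # x + y = 1

summed = tuple(a - 2 * b for a, b in zip(eq1, eq2))
print(summed)   # (0, 0, 0) -- i.e. 0 = 0, not 0 = -1

# Every (x, y) = (t, 1 - t) satisfies both equations.
for t in range(-3, 4):
    x, y = t, 1 - t
    assert 2 * x + 2 * y == 2 and x + y == 1
```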


ChatGPT-4o

the system of equations has infinitely many solutions given by x=t and y=1−t for any real number t.

https://chatgpt.com/share/c00f1435-a57f-49db-b8fb-9d848f7e86...

Furthermore, if I didn't understand that, I can have a conversation with the LLM as to what that actually means, ask it to graph it, ask what linear equations are, and to have it give me some more systems of equations for me to solve.


Wow, that's amazing. We have achieved AGI already. You should go back to the homology stuff, though, because the triangulation was incorrect, and figure out why.


It is amazing, isn't it‽


You should let it plan your day. It's obviously much smarter than you.


From Claude 3.5 Sonnet:

"Let's solve this system of equations step by step:

1) We have two equations: 2x + 2y = 2 (Equation 1) x + y = 1 (Equation 2)

2) We can solve this using substitution. Let's express x in terms of y from Equation 2: x = 1 - y

3) Now, let's substitute this into Equation 1: 2(1 - y) + 2y = 2

4) Let's simplify: 2 - 2y + 2y = 2 2 = 2

5) This is true for all values of y, which means that Equation 1 is actually redundant. It doesn't give us any new information beyond what Equation 2 provides.

6) Therefore, we can conclude that there are infinitely many solutions to this system, all satisfying the equation: x + y = 1

7) We can express the solution set as: x = 1 - y y = y (where y can be any real number)

Thus, the solution to this system is a line in the xy-plane described by the equation x + y = 1.

Would you like me to explain any part of this solution in more detail?"


Wow, incredible. No one needs to learn math anymore, AI can do it for us now.


Oh absolutely, what a tragedy! How will we ever cope without the joy of solving soloist11's dumb as fuck linear equation? Ahah


You should marry an AI and not worry about my dumb linear equations. That way the AI can do everything for you, it can even think for you.


I just asked both Claude and ChatGPT to add numbers and they both gave me the right answer.


Ask them to compute the simplicial homology of the n dimensional projective plane next.


I easily broke Copilot by asking it to make lists of radioactive isotopes in order of half-life. It can put the U.S. states in alphabetical or reverse alphabetical order, but for any other order I would bet against it. If I ask it what the probability is that it can correctly complete a sorting task, however, it insists that it is almost certain to get it right.

I had a good conversation with it about the theory of partial orderings; it even corrected my mistakes. I asked it to make a textbook problem about determining whether a graph is cyclic, and it made a clean, beautiful example where the partial ordering was realized by a total ordering and everything was written out in a straight-line order that was easy to follow.

If I wrote a script that made up a bunch of "is this graph cyclic?" problems that are well randomized I am sure there is some size where it just falls down the same way it falls down with sorting.

The obvious answer is that the LLM should pick an algorithm or write some code to do the thing which ordinary algorithms can do such as arithmetic, sorting, SAT solving, etc.

There's the deeper issue that it doesn't know what it doesn't know. It can't sort a list of radioactive isotopes any more than it can help you make an atom bomb. In the second case it will say that it won't help you; in the first case it will try to help you anyway, when it really should be saying "I can't do that, Dave," because it just can't.
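Sorting isotopes by half-life is exactly the kind of task worth delegating to ordinary code; a few lines suffice (the half-life values below are approximate and from memory, converted to years):

```python
# Approximate half-lives in years; sorting them is trivial for ordinary
# code, which is the point of having the LLM call out to an algorithm.
half_lives_years = {
    "U-238":  4.468e9,
    "Cs-137": 30.17,
    "Co-60":  5.27,
    "C-14":   5730.0,
    "I-131":  8.02 / 365.25,   # 8.02 days
}

by_half_life = sorted(half_lives_years, key=half_lives_years.get)
print(by_half_life)
# ['I-131', 'Co-60', 'Cs-137', 'C-14', 'U-238']
```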


> lists of radioactive isotopes in order of half lives

ChatGPT-4o does just fine with that. Basing your opinion of a whole technology on a poor implementation of it, rather than the best one, doesn't seem like the best analysis.


I just asked ChatGPT that and it seemed to come up with a good answer.


Now ask it to convert the computation into a logical calculus so that it can be verified with a theorem prover like Coq, Lean, or Isabelle.


What are you actually getting at with this? Formalising algebraic topology in Lean is still _very_ much an open project, for instance.


You'd think with all those billions spent on the software and the hardware it would be a walk in the park to convert a single book on algebraic topology into a formalized Coq, Lean, or Isabelle module. Seems like a very obvious test case for the intelligence capabilities of these systems. I know that it is possible because Kevin Buzzard is going to formalize Fermat's Last Theorem for less than £934,043, but no commercial AI lab has yet managed to build an AI that can do basic arithmetic. [0] Mira Murati is on record saying their next AI model will have the intelligence of a PhD student, so let's see if it can actually formalize basic algebraic topology into a logical calculus. [1]

0: https://gow.epsrc.ukri.org/NGBOViewGrant.aspx?GrantRef=EP/Y0...

1: https://engineering.dartmouth.edu/news/openai-cto-mira-murat...


Why would you think that's a walk in the park? Have you actually tried formalising stuff in Lean/Coq? I have, and even with a postgraduate maths degree behind me it's hard as hell!

The fact that Kevin and his team are formalising FLT is incredible, but they all have decades of experience with this stuff (!!).

Transformers can do arithmetic (and many other things) just fine; do a bit of searching on arXiv and you'll find papers from 2023 showing that nano-scale transformer models suffice. It really is a data problem, not a fundamental limitation of the technology.


What is your degree in?


Master's, studying representation theorems in nonmonotonic logic; I left academia for industry during my PhD. Fun, spaced-out maths problems. I tried to formalize my thesis in Lean, but it is nowhere near as simple as you make it out to be.


You are really onto something here. LLMs aren't really perfect. Good catch.


Perfection is not the problem. An obvious test case of intelligence is to formally model something like algebraic topology in a formal logical calculus like intensional type theory with identity types. Even though all the commercial labs have ingested all of nLab, there isn't a single commercial model that can use logic to perform arithmetic operations.


OK, so it seems you didn't get the memo: LLMs at the present stage have not yet reached AGI, and have some notable other flaws, like not being able to do math reliably. Commercial interests will try to exaggerate the current capabilities, but most people can see through that.

The capabilities are nonetheless nothing short of astounding, given where we were 10 or even 2 years ago, and clearly point to a near future where we can expect the machines to overcome these shortcomings.

Thousands, if not millions, of researchers, coders and others will have to adjust their working-life expectations, just as previous technological revolutions saw thousands of other professions disappear into thin air.


Sure, good luck with this AGI business. I'm sure it will work out great for everyone in the end.


The output of ChatGPT almost always sounds good. That's the point.

But I would wager that its answer was at least wrong, and perhaps total nonsense.

That's the real hazard of using ChatGPT as a learning tool. You are in no position to evaluate whether the output makes any sense.


I asked it to compute the simplicial homology of RP^2 and not only was it spot on with the result, it gave me a detailed and essentially correct computation. This definitely appears in its training set, but nevertheless you should have some humility =P


How do you know it's correct? The only simplicial triangulation I know of is by splitting up the sphere into an icosahedron and then identifying all the opposite faces to get the proper antipodal action for the quotient.


I'm not interested in engaging with you further on this topic after you devolved into ad hominems against me in the other thread. I'm here to argue in good faith. Have a good day.


You made an incorrect assessment of a basic calculation in algebraic topology and claimed that it was correct. You didn't even look at what it was computing and simply looked at the final answer which lined up with the answer on Wikipedia. Simplicial calculations for projective planes are not simple. The usual calculations are done with cellular decomposition and that's why the LLM gives the wrong answer, the actual answer is not in the dataset and requires reasoning.


Are you confusing me with someone else? When I asked it GPT computed the homology from the CW decomposition of RP^2 with three cells. Which is a very simple exercise.

I recommend that you give it a try.
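For reference, the three-cell CW computation the parent describes, written out (this is standard textbook material, e.g. Hatcher):

```latex
% Cellular chain complex of RP^2 with one cell in each dimension 0, 1, 2.
% The 2-cell attaches along the 1-cell traversed twice, so d_2 is
% multiplication by 2; d_1 = 0 since the 1-cell's endpoints are identified.
0 \longrightarrow \mathbb{Z} \xrightarrow{\;\times 2\;} \mathbb{Z}
  \xrightarrow{\;0\;} \mathbb{Z} \longrightarrow 0
% Taking homology of this complex:
H_0(\mathbb{RP}^2) \cong \mathbb{Z}, \qquad
H_1(\mathbb{RP}^2) \cong \mathbb{Z}/2\mathbb{Z}, \qquad
H_2(\mathbb{RP}^2) = 0.
```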


That's ok. It seems like LLMs know all about simplicial complexes and homology so I'll spend my time on more fruitful endeavors but thanks for the advice.


To be fair, it's not a simplicial complex, but simplicial and cellular homology coincide on triangulatable spaces like RP^2 so I gave it the benefit of the doubt =) algebraic topology is a pretty fun field regardless of how much a language model knows about it IMO.


Do you have a reference for this equivalency?


It's in Hatcher iirc


I dunno what you wanted to wager, but I would still be interested in the holes in this answer.

https://chatgpt.com/share/e84800dd-c714-42d4-977b-b446c5c5ed...


That's incorrect.


No it isn't.


lmao. you're totally right. RP^2 can be triangulated with a single triangle with all of its vertices identified. that's totally how you compute the simplicial decomposition of RP^2


I asked you to explain why it's wrong, and all you said was "that's incorrect". Saying "no it isn't" got you to explain your answer far better than when I directly asked you to in the first place.


Don't worry, we'll have AGI soon and it will give the correct answer instead of whatever plausible nonsense it put together this time. I have faith.


That’s moving the goal post. GP’s assertion was that it couldn’t add numbers.


Computing simplicial homology is basic arithmetic. It's the same goal post.


The goal post is adding numbers, not some other calculation for which adding numbers is a step.


The computer can't do anything other than arithmetic


I mean, they are much better than humans at "mental" arithmetic.


I don't know what that means. There is nothing "mental" happening in the circuits of the computer or the function graph which is implemented on top of it.


In the context of AI, it's a term that I borrow from human cognition to describe rapid calculations without external aids.


The comparison still makes no sense. What would be an external aid for a computer?


Well, for the LLM it would be a calculator.


The LLM is a calculator. Think about it.


The LLM is an LLM. It runs on computer hardware which has an ALU, but that doesn't make the LLM a calculator. The LLM can, however, call out to a calculator to do addition when it deems necessary.


The LLM is not doing anything other than arithmetic calculations. Every operation an LLM is doing can be done with a calculator.


Typically we don't refer to matrix multiplication as arithmetic calculations, but you do you.


> Typically we don't refer to matrix multiplication as arithmetic calculations

You don't? Didn't you do them in school? Everyone calls that arithmetic; you are just adding up and multiplying a bunch of numbers.
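To make the "it's all arithmetic" point literal, here is a matrix multiply written with nothing but additions and multiplications, which is what every layer of an LLM reduces to:

```python
# Naive matrix multiply: only + and * are used.
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                C[i][j] = C[i][j] + A[i][p] * B[p][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# [[19, 22], [43, 50]]
```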


do you really multiply numbers but call that arithmetic? I must sincerely ask, where are you?


I'm pretty sure it's all arithmetic for an LLM but you do you too.


It's not.


Are you sure?


Yes


So the LLM is not doing arithmetic?


I implemented a hierarchical model that pooled UTF-8 encoded sequences into word vectors and trained it with a decoder on text denoising.

I think the future is a small word encoder model that replaces the token embedding codebook.

And here's the reason: you can still create a codebook after training and then use the encoder model only for OOV words. I'm not sure there's an excuse not to be doing this, but I'm open to suggestions.
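Roughly how I picture the parent's setup, sketched with a mean-pool standing in for the small trained word-encoder subnet (all sizes and the pooling choice are my own placeholders):

```python
import random

# Encode each word's UTF-8 bytes and pool to a single word vector,
# replacing a fixed token-embedding codebook. A trained encoder would
# replace this mean-pool; frequent words could be cached into a codebook
# afterwards, with the encoder used only for OOV words.
random.seed(0)
DIM = 8
byte_embeddings = [[random.gauss(0, 1) for _ in range(DIM)]
                   for _ in range(256)]

def word_vector(word):
    bs = word.encode("utf-8")
    pooled = [0.0] * DIM
    for b in bs:
        for d in range(DIM):
            pooled[d] += byte_embeddings[b][d]
    return [v / len(bs) for v in pooled]

vec = word_vector("time")
print(len(vec))   # 8
```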


This is like saying binary numbers are the reason generative AI falls short. Computers work with transistors which are either on or off so what are these people proposing as the next computational paradigm to fix the problems with binary generative AI?


What?


Tokenization as the main problem is a red herring. It's possible to get rid of the tokens entirely and train on byte sequences, it won't make a difference to why generative AI can't count or do basic arithmetic.



