> And that's because, despite all of its training data, it's not capable of actually reasoning.
Your conclusion doesn't follow from your premise.
None of these models are trained to do their best on any kind of test. They're just trained to predict the next word. The fact that they do well at all on tests they haven't seen is miraculous, and demonstrates something very akin to reasoning. Imagine how they might do if you actually trained them or something like them to do well on tests, using something like RL.
> None of these models are trained to do their best on any kind of test
How do you know GPT-4 wasn't trained to do well on these tests? OpenAI hasn't disclosed what went into training it, so you can't rule that out. That could be the magic sauce.
They are trained to predict the next token in a stream.
That is the learning algorithm.
The algorithm they learn in response is quite different, since that learned algorithm is shaped by the training data.
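To make that distinction concrete, here is a minimal sketch of the next-token objective. The model, data, and hyperparameters are placeholders of my own, not anything OpenAI has described; the point is only that nothing in the loss mentions exams or reasoning.

```python
import torch
import torch.nn.functional as F

def next_token_loss(model, token_ids):
    """Standard next-token objective: given tokens 0..t, predict token t+1.

    `model` stands in for any autoregressive LM that maps a batch of token
    ids (batch, seq_len) to logits of shape (batch, seq_len, vocab_size).
    """
    inputs = token_ids[:, :-1]   # tokens 0..T-1 as context
    targets = token_ids[:, 1:]   # tokens 1..T, shifted by one position
    logits = model(inputs)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )

# Nothing about exams, benchmarks, or reasoning appears in this objective;
# a training loop just pushes the model to continue the corpus plausibly:
# optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
# for batch in dataloader:
#     loss = next_token_loss(model, batch)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```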
In this case the models learn to sensibly continue text or conversations. And they are doing it so well it’s clear they have learned to “reason” at an astonishing level.
Sometimes, not as good as a human.
But in a tremendous number of ways they are better.
Try writing an essay about the many-worlds interpretation of quantum mechanics, from the perspective of Schrödinger, with references to his personal experiences, using analogies with medical situations, formatted as a brief for the Supreme Court, in Dr. Seuss prose, in a random human language of your choice.
In real time.
While these models have some trouble with long chains of reasoning, and with reasoning about things they don't have experience of (different modalities, although sometimes they are surprisingly good), it is clear that they can also reason by combining complex information drawn from their whole knowledge base much faster, and often more sensibly, than any human has ever come close to.
Where they exceed us, they trounce us.
And where they don't, it's amazing how fast they are improving, especially given that, year to year, biological human capabilities are at a relative standstill.
——
EDIT: I just tried the above test. The result was wonderfully whimsical prose, with references that made sense at a very basic level, which a Supreme Court of 8-year-olds would likely enjoy, especially if served along with some Dr. Seuss art! In about 10-15 seconds.
Viewed as a solution to an extremely complex constraint problem, that is simply amazing. And far beyond human capabilities on this dimension.
You are right that the process involves predicting words from training data. But you can still build training data focused on passing these tests. Adding millions of test questions to the training set in order to optimize for answering test questions is perfectly doable when you have the resources OpenAI has.
A strong hint about what they focused on in the training process is which metrics they used in their marketing of the model. You should always bet on a model being optimized to perform well on whatever metrics the vendor itself gives you when marketing it. Look at the GPT-4 announcement: what metrics did they market? So what metrics should we expect they optimized the model for?
Exam results are the first metric they mention, so exams were probably one of their top priorities when they trained GPT-4.
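For concreteness, here is a sketch of what exam-focused fine-tuning data could look like. The questions, answers, file name, and JSONL layout are my own illustration, not anything OpenAI has disclosed; it only shows how "optimize for test questions" reduces to more next-token data of a particular shape.

```python
import json

# Hypothetical exam-style supervised fine-tuning records.
exam_examples = [
    {
        "prompt": "Bar exam question: Under the UCC, a merchant's signed firm offer is irrevocable for...",
        "completion": "a reasonable time, not exceeding three months, even without consideration.",
    },
    {
        "prompt": "AP Biology: Which organelle is primarily responsible for ATP synthesis?",
        "completion": "The mitochondrion.",
    },
]

# Written out as JSONL, a common format for supervised fine-tuning sets.
with open("exam_finetune.jsonl", "w") as f:
    for ex in exam_examples:
        f.write(json.dumps(ex) + "\n")
```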
Yes, absolutely. They can adjust performance priorities.
By the relative mix of training data, additional fine-tuning phases, and/or pre-prompts that give the model extra guidance for particular task types.
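As one concrete example of the pre-prompt lever, here is a minimal sketch using the current openai Python client. The instruction text and question are made up, and nothing here reflects how OpenAI configures its own models; it only shows how a system message steers behavior toward a task type without retraining.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The system message acts as a "pre-prompt": extra guidance that steers the
# model toward a particular task type without any further training.
response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "system",
            "content": (
                "You are taking a standardized exam. Answer with the single "
                "best choice and a one-sentence justification."
            ),
        },
        {"role": "user", "content": "Which planet in our solar system has the shortest year?"},
    ],
)
print(response.choices[0].message.content)
```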