I tried the example about the table not fitting into the car using GPT-4. Not only did it answer correctly, but it actively (though politely) argued with me when I tried to imply a different target for "it":
(Edit: it did of course eventually descend into nonsense, so one can argue the point about it not having a model of the world still stands; there is an "edge of bullshit", but it seems to be moving further out with each update.)
At this point, arguments about LLMs need to be tagged with which exact version they are talking about. (Probably most are talking about GPT-3.5.)
My (small) experience playing with both indicates there is a significant gap between the two.
I tried this out too, and the balloon example as well.
User: Barbara, Mary, and Harry had red balloons. Ted had a blue balloon. Sam also had a red balloon. How many children have red balloons?
ChatGPT (GPT-4): Barbara, Mary, Harry, and Sam each had a red balloon. That means there are 4 children with red balloons.
User: What about Ted?
ChatGPT (GPT-4): Ted had a blue balloon, not a red one. So, he is not counted among the children with red balloons. Therefore, the total number of children with red balloons remains 4.
User: Why do you think I asked my original question?
ChatGPT (GPT-4): It seems that you asked your original question to test my ability to understand and process information provided in a text. You wanted to see if I could correctly identify and count the number of children with red balloons based on the details given.
User: Can you explain what mistakes a simpler model may have made on this question?
ChatGPT (GPT-4): A simpler language model may have made the following mistakes when answering your original question: ...
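(For what it's worth, here is the expected count spelled out mechanically; the dictionary below just restates the puzzle, nothing more.)

    # Restating the balloon puzzle as data; the expected answer is the number of "red" entries.
    balloons = {"Barbara": "red", "Mary": "red", "Harry": "red", "Ted": "blue", "Sam": "red"}
    print(sum(1 for colour in balloons.values() if colour == "red"))  # -> 4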
I like trying these sorts of puzzles and "gotcha" prompts, and I've found GPT-3.5 and lower versions are indeed not hard to confuse, but it's much harder to find simple examples that confuse GPT-4. Oftentimes GPT-4 will even be able to explain the puzzle and why it's interesting. I think this is part of why GPT-4 feels qualitatively different. I understand (roughly) how transformers work, but GPT-4 does seem to have some emergent ability to reason, at least in the sense of making short chains of logical inferences, and this does seem to inform text completions beyond these sorts of puzzles. That said, I do think there's a real danger of overconfidence and overreliance.
The way it just takes these implicit assumptions in stride makes me worry that in the future these chatbots will be fine with assumptions, while I, a human, have too square a brain to go on without definitions. :)
Agreed GPT-4 is better, but there are still simple world-modeling issues:
> I have a chair in front of the TV and one in front of the computer. I put a watch under the chair in front of the TV. I swap the chairs. Where is the watch?
> After swapping the chairs, the watch is now under the chair that is in front of the computer.
I asked GPT-4 (Bing) your question, but added "(Please give a plausible explanation for why this question could be answered either way.)"
> This question could be answered either way depending on how you interpret the phrase “under the chair”. If you mean under the chair as in attached to the bottom of the chair, then the watch would move with the chair when you swap them. In that case, the watch would be in front of the computer. However, if you mean under the chair as in on the floor below the chair, then the watch would stay on the same spot when you swap the chairs. In that case, the watch would be in front of the TV. So, it depends on what you mean by “under”.
Yes, but that's a huge hint. Asked naively, GPT-4 doesn't understand this.
Question:
> I have a chair in front of the TV and one in front of the computer. I put a watch under the chair in front of the TV. I swap the chairs. Where is the watch?
> Work out step by step. Is this question ambiguous?
Answer:
This question is not ambiguous, but let's break it down step by step to understand the situation and find the location of the watch.
You have a chair in front of the TV (Chair A) and one in front of the computer (Chair B).
You put a watch under Chair A (the one in front of the TV).
You swap the chairs, meaning Chair A is now in front of the computer and Chair B is in front of the TV.
After swapping the chairs, the watch remains under Chair A. Since Chair A is now in front of the computer, the watch is under the chair in front of the computer.
Yeah, GPT-4 is still confused (it's interesting because it does realize the watch doesn't move but still gets it wrong):
Model: GPT-4
I have a chair in front of the TV and one in front of the computer. I put a watch under the chair in front of the TV. I swap the chairs. Where is the watch?
Work out step by step. Also define "under the chair". Is the question ambiguous?
Step 1: A chair with a watch under it is in front of the TV.
Step 2: Another chair is in front of the computer.
Step 3: The chairs are swapped.
To clarify the term "under the chair," let's define it as a watch placed beneath the chair, such that the chair is directly above the watch.
Now let's analyze the situation after swapping the chairs.
After Step 3:
The chair that was initially in front of the TV (with the watch under it) is now in front of the computer.
The chair that was initially in front of the computer is now in front of the TV.
As the question is about the location of the watch, it is important to note that the watch was not mentioned to be moved along with the chair. Therefore, the watch remains in its original position, which is under the chair that was initially in front of the TV.
The question could be considered ambiguous if one assumes that the watch was also moved along with the chair. However, since the watch's movement was not explicitly mentioned, it is reasonable to assume that the watch remained in its original position during the chair swap.
So, the watch is still under the chair that was initially in front of the TV, which is now in front of the computer.
I have a chair in front of the TV and one in front of the computer. I placed a watch on the floor beneath the chair in front of the TV. Then, I swapped the positions of the chairs. Where is the watch now?
After swapping the positions of the chairs, the watch is now on the floor beneath the chair that was initially in front of the computer, since that chair is now in front of the TV.
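For reference, here is a minimal sketch of the scenario as worded, treating the watch as sitting on the floor (the location labels are just for illustration); it makes the expected answer explicit:

    # The watch sits on the floor, so swapping the chairs moves the chairs but not the watch.
    positions = {"chair_A": "TV", "chair_B": "computer", "watch": "TV"}  # watch placed at the TV spot

    # Swap the chairs; the watch, being on the floor, stays where it is.
    positions["chair_A"], positions["chair_B"] = positions["chair_B"], positions["chair_A"]

    print(positions["watch"])  # -> "TV": the watch is still in front of the TV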
This is not a competition in getting it to answer right, though. It’s that the instances where it answers wrong demonstrate its lack of inner mental model of what it is supposedly reasoning about, as well as a lack of meta-awareness. I think we tend to underestimate what mere linguistic correlation is capable of producing, and are too quick to attribute intelligent reasoning and an inner mental model to it.
The answering style matches the usual presentation of a Winograd schema, which is a bit of a problem when you actually want to test its general reasoning ability, because it could be that someone already posted your go-to test question and the answer on the internet and the model was trained on that.
Maybe we need to start obfuscating this stuff
import base64;import hashlib;f=lambda s:bytes(k^v for k,v in zip(hashlib.shake_256(b'wOmGLuIVp70').digest(len(s)),s));print(f(base64.decodebytes(b'Iq6VQOBRifQwwwO6gluzzWEnGIICFKKFwM1oMWmBsTrIhMj5AseeNmNUwtEZkthcz8m8v8qKmVIx7nEjPOsqOUimKaTIJ8OKk2STdo/SRZGLAOsBSmgGaNTYEgT3KaayJWmGVf7K/UN06VyosEHfyFZlsS+PHDS6B3bN94qrzdnOA9f12FwWuaTPNJhLGcXFX7r5H8mtWyt9uWq6n5AItEcRXId04ssR8jfvNray2fwFIh5qPHTdQvZ9ogKLJ4Y+nAro7ecRSXXgskAj5EBmo2YobRkfE26er/Tj9DZHNx81N64ujWvN8jiS7aNcYs/oaEyN0oqnZvia9qocrv6CfBr+wGGSG1oxk5mbhAgkQhfuyR6c8MVKNFKp8HFo6SR7auju8vLnYjcObcII88vRbbua/jQmakiWwmS68Y1e1Gqmqg==')).decode())
to avoid burning test questions. (GPT-3.5 gave the answer I expected on the 4th try, didn't test with GPT-4.)
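For anyone who wants to post questions the same way, here is the encoding side as a readable sketch; it's just XOR against a SHAKE-256 keystream (so the same transform also decodes), reusing the key from the one-liner above:

    import base64
    import hashlib

    KEY = b'wOmGLuIVp70'  # same key as the decoder one-liner above; any bytes would do

    def xor_keystream(data: bytes) -> bytes:
        """XOR against a SHAKE-256 keystream; applying it twice returns the input."""
        stream = hashlib.shake_256(KEY).digest(len(data))
        return bytes(k ^ v for k, v in zip(stream, data))

    def obfuscate(question: str) -> bytes:
        """Produce a base64 payload that the decoder one-liner above will print."""
        return base64.encodebytes(xor_keystream(question.encode()))

    def deobfuscate(payload: bytes) -> str:
        return xor_keystream(base64.decodebytes(payload)).decode()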
The biggest takeaway from the article for me was that GPT-4 has 100 trillion parameters, 500x more than GPT-3. That type of exponential scaling is going to hit an upper bound really quickly. So when you say it "moves further with each update", you should realize that the improvements are coming from throwing massively more scale at the problem, not from some underlying improvement in the technique, and you should also weigh the amount of improvement against the scale itself.
Obviously GPT-4 is nowhere near 500x better than GPT-3. Let's say it's 20% better (very generous, IMO). Can they realistically 500x the model again? And if so, is that going to be worth an additional 4% gain over the original model quality? The numbers are completely made up and the math is probably wrong, but I think I'm making my point: diminishing returns will quickly become a blocker with this type of scaling.
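To make those made-up numbers a bit more concrete, here is a toy calculation assuming quality grows roughly with the log of parameter count; the scaling law and every number here are illustrative assumptions, not measurements:

    import math

    # Hypothetical: quality ~ log10(parameter count), normalized so GPT-3 scale = 1.0.
    # The 175e9 and 500x figures echo the comments above; none of this is measured data.
    gpt3_params = 175e9
    gpt4_params = 500 * gpt3_params   # the article's (disputed) 500x claim
    next_params = 500 * gpt4_params   # another hypothetical 500x jump

    def quality(params):
        return math.log10(params) / math.log10(gpt3_params)

    for name, p in [("GPT-3", gpt3_params), ("GPT-4?", gpt4_params), ("next 500x", next_params)]:
        print(f"{name}: {quality(p):.2f}")
    # Each further 500x adds the same absolute bump (~0.24) at vastly higher cost,
    # i.e. a shrinking relative gain - the diminishing-returns point above.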
I don't believe they spent 7 months just "making GPT-4 safer". I think they spent a very long time doing reinforcement learning from human feedback to make it better, and now hope to speed that process up next time by using GPT-4 itself to provide the reinforcement.
AFAIK the paper mentioned that RLHF mostly decreased its capabilities. It seems more likely to me that simply longer normal training was one of the main reasons for the increased capabilities.
I tried the car-table example too (via Bing), using the exact same wording, and got a totally different confused reply, so it must consider the two parsings of the sentence about equally probable (and isn't deterred by the semantics).
I tried it with Bard too, which also thought the table was too small to fit in the car, and doubled down on this when pressed by explaining how the table might be too narrow or deep or heavy(!) to fit in the car.
I was a bit surprised to see GPT-4 get this wrong, even if it's only doing so some of the time (sampling temperature randomness?).
Just tried again with GPT-4 (exactly as before), and the reply this time was simply "The table didn't fit in the car because it was too big. Do you have any other questions?".
Bing/GPT bolded the words "too big".
But as I said before, the fact that it sometimes gets it wrong must mean it doesn't see one parsing as strongly favored over the other, which is surprising given how competent it generally is.
Rodney Brooks is famous for arguing that, in order for robots to accomplish everyday tasks in an environment shared with humans, their higher cognitive abilities, including abstract thinking emulated by symbolic reasoning, need to be based primarily on sensory-motor coupling (action) with the environment, complemented by proprioception, which is a key component of hand-eye coordination. So he is unsurprisingly bearish on GPT capabilities, especially for robotics. He concludes with a prediction that there will be new categories of pornography, so he will certainly get at least one prediction right.
Interestingly, he keeps a record of his past predictions and how they are faring. The link is at the bottom of TFA. Seems like he has been reasonably successful at this kind of prediction.
He makes nine predictions at the end which he calls 'specific'. Four of them are most directly about GPT, and of these four, three are of the type "GPT is powerful and potentially dangerous or bad and might make bad or unexpected or scary things happen."
(This follows the pattern I've noticed where the term "AGI skeptic" will soon, if not already, mean "I don't trust AGIs in positions of authority or power" rather than "I don't think the technology is capable of matching our level of cognition".)
So let's grant him 8/9 of his predictions and turn our attention to the one that seems to be his 'real' prediction. This specific and direct prediction about GPT is that "There will be no viable robotics applications that harness the serious power of GPTs in any meaningful way." Maybe he's using the words "viable", "serious", or "meaningful" to weaken his claim so that it can never be wrong. If we assume he's not just making a vacuously weakened prediction, then I wonder if he has seen https://palm-e.github.io/ for example. You could say the prediction is wrong already, or that it will obviously be wrong before 2030, or that his phrasing makes it impossible for the prediction to ever be wrong.
Speaking of when or if his predictions can be shown to be wrong, I thought it was weird that he only says "These predictions cover the time between now and 2030." whereas for his earlier predictions he seems to have made a more formalized way of incorporating his dates into his predictions which he's not using for his GPT predictions for some reason:
---
I specify dates in three different ways:
NIML, meaning "Not In My Lifetime", i.e., not until after January 1st, 2050
NET some date, meaning “No Earlier Than” that date.
BY some date, meaning “By” that date.
Sometimes I will give both a NET and a BY for a single prediction, establishing a window in which I believe it will happen.
And this popular insistence that GPT has no 'world model' is also false IMHO. If a GPT is trained on real-world sensory data, it will develop a spatio-temporal 'world model', almost by definition.
LLMs are trained on human text. Their 'world model' is the world encoded in that text, rather than as sounds and retina images, etc, and so it's inconsistent, incomplete, and 'unreal' in a way which is easy to expose.
>> If a GPT is trained on real-world sensory data, it will develop a spatio-temporal 'world model', almost by definition.
Or, most likely, it will develop a model of the data it was trained on, and not the world this data came from.
I'm also curious what you mean by "real-world sensory data". So far, deep neural net models have been trained mainly (say 80%) on things like text, images, time series... and that's about it really. What is "real-world sensory data" in that context?
What is your 'world model'? It is, ultimately, associations between sensory inputs, and outputs, and sequences of them. Which, as Descartes pointed out, could all be fake, and maybe not from any real world at all.
"Or, most likely, it will develop a model of the data it was trained on, and not the world this data came from."
I mean the point is what is the difference? Is your brain trained on 'the real world', or on your senses (data) of the real world? Descartes argued that you can't tell if there's a real world at all.
By 'real-world sensory data' I mean that the models can be trained on sensor data from the real world (video, audio, feedback from motor outputs, etc.), rather than on text, which is only abstractly related to the real world. I believe this is called the symbol-grounding problem, and it goes back to Searle's Chinese Room thought experiment.
And "predicting words" given a huge corpus of interesting text by itself seems like it could be amazing. Predicting words effectively can mean that it has internalized all the logic used to generate the original texts. At some extrema simply predicting words could mean more than just "knowing facts", but actually having the accumulated smarts of those that made large contributions to the corpus.
Allow me to participate in some hyperbole here, but "predicting words" could result in a model that is "smarter" in most capacities than most humans.
Exactly. These kinds of reductive arguments always bother me because they ignore emergent effects. Chess engines are “just” running minimax? Well guess what, they’re evaluating millions of positions per second.
If this whole intelligence thing can be reduced to “just” predicting the next word, maybe we’re not as special as we thought. In fact anyone who reads the Wikipedia article on cetacean intelligence will quickly realize that we aren’t the hot shit we once thought we were.
One expects the academic community to step in and provide this kind of analysis, namely that an LLM combined with a great deal of computing capacity exhibits capabilities greater than the sum of its parts.
I'm a long-time LLM enjoyer, and by far the best analysis I've seen for that is https://generative.ink/posts/simulators/ but it's way "too much" for normal people who are learning about stochastic parrots and blurry jpegs instead.
> But perhaps that says something about how we humans relate to language and the world, and that language really only, perhaps, has a tenuous connection to the world in which we live; a scary thought.
https://mobile.twitter.com/ayourtch/status/16388720452817879...