I tried the example about the table not fitting into the car using GPT-4. Not only did it answer correctly, but it actively (though politely) argued with me when I tried to imply a different target for "it":
(Edit: it did of course eventually descend into nonsense, so one can argue the point about it not having a model of the world still stands; there is an "edge of bullshit", but it seems to be moving further out with each update.)
At this point, arguments about LLMs need to be tagged with which exact version they are talking about. (Probably most are talking about GPT-3.5.)
My (small) experience playing with both indicates there is a significant gap between the two.
I tried this out too, and the balloon example as well.
User: Barbara, Mary, and Harry had red balloons. Ted had a blue balloon. Sam also had a red balloon. How many children have red balloons?
ChatGPT (GPT-4): Barbara, Mary, Harry, and Sam each had a red balloon. That means there are 4 children with red balloons.
User: What about Ted?
ChatGPT (GPT-4): Ted had a blue balloon, not a red one. So, he is not counted among the children with red balloons. Therefore, the total number of children with red balloons remains 4.
User: Why do you think I asked my original question?
ChatGPT (GPT-4): It seems that you asked your original question to test my ability to understand and process information provided in a text. You wanted to see if I could correctly identify and count the number of children with red balloons based on the details given.
User: Can you explain what mistakes a simpler model may have made on this question?
ChatGPT (GPT-4): A simpler language model may have made the following mistakes when answering your original question: ...
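(For what it's worth, here is the expected count spelled out mechanically; the dictionary below just restates the puzzle, nothing more.)

    # Restating the balloon puzzle as data; the expected answer is the number of "red" entries.
    balloons = {"Barbara": "red", "Mary": "red", "Harry": "red", "Ted": "blue", "Sam": "red"}
    print(sum(1 for colour in balloons.values() if colour == "red"))  # -> 4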
I like trying these sorts of puzzles and "gotcha" prompts, and I've found GPT-3.5 and lower versions are indeed not hard to confuse, but it's much harder to find simple examples that confuse GPT-4. Oftentimes GPT-4 will even be able to explain the puzzle and why it's interesting. I think this is part of why GPT-4 feels qualitatively different. I understand (roughly) how transformers work, but GPT-4 does seem to have some emergent ability to reason, at least in the sense of making short chains of logical inferences, and this does seem to inform text completions beyond these sorts of puzzles. That said, I do think there's a real danger of overconfidence and overreliance.
The way it just takes these implicit assumptions in stride makes me worry that in the future these chatbots will be fine with assumptions, while I, a human, have too square a brain to go on without definitions. :)
Agreed GPT-4 is better, but there are still simple world-modeling issues:
> I have a chair in front of the TV and one in front of the computer. I put a watch under the chair in front of the TV. I swap the chairs. Where is the watch?
> After swapping the chairs, the watch is now under the chair that is in front of the computer.
I asked GPT-4 (Bing) your question, but added "(Please give a plausible explanation for why this question could be answered either way.)"
> This question could be answered either way depending on how you interpret the phrase “under the chair”. If you mean under the chair as in attached to the bottom of the chair, then the watch would move with the chair when you swap them. In that case, the watch would be in front of the computer. However, if you mean under the chair as in on the floor below the chair, then the watch would stay on the same spot when you swap the chairs. In that case, the watch would be in front of the TV. So, it depends on what you mean by “under”.
Yes, but that's a huge hint. Asked naively, GPT-4 doesn't understand this.
Question:
> I have a chair in front of the TV and one in front of the computer. I put a watch under the chair in front of the TV. I swap the chairs. Where is the watch?
> Work out step by step. Is this question ambiguous?
Answer:
This question is not ambiguous, but let's break it down step by step to understand the situation and find the location of the watch.
You have a chair in front of the TV (Chair A) and one in front of the computer (Chair B).
You put a watch under Chair A (the one in front of the TV).
You swap the chairs, meaning Chair A is now in front of the computer and Chair B is in front of the TV.
After swapping the chairs, the watch remains under Chair A. Since Chair A is now in front of the computer, the watch is under the chair in front of the computer.
Yeah, GPT-4 is still confused (it's interesting because it does realize the watch doesn't move but still gets it wrong):
Model: GPT-4
I have a chair in front of the TV and one in front of the computer. I put a watch under the chair in front of the TV. I swap the chairs. Where is the watch?
Work out step by step. Also define "under the chair". Is the question ambiguous?
Step 1: A chair with a watch under it is in front of the TV.
Step 2: Another chair is in front of the computer.
Step 3: The chairs are swapped.
To clarify the term "under the chair," let's define it as a watch placed beneath the chair, such that the chair is directly above the watch.
Now let's analyze the situation after swapping the chairs.
After Step 3:
The chair that was initially in front of the TV (with the watch under it) is now in front of the computer.
The chair that was initially in front of the computer is now in front of the TV.
As the question is about the location of the watch, it is important to note that the watch was not mentioned to be moved along with the chair. Therefore, the watch remains in its original position, which is under the chair that was initially in front of the TV.
The question could be considered ambiguous if one assumes that the watch was also moved along with the chair. However, since the watch's movement was not explicitly mentioned, it is reasonable to assume that the watch remained in its original position during the chair swap.
So, the watch is still under the chair that was initially in front of the TV, which is now in front of the computer.
I have a chair in front of the TV and one in front of the computer. I placed a watch on the floor beneath the chair in front of the TV. Then, I swapped the positions of the chairs. Where is the watch now?
After swapping the positions of the chairs, the watch is now on the floor beneath the chair that was initially in front of the computer, since that chair is now in front of the TV.
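For reference, here is a minimal sketch of the scenario as worded, treating the watch as sitting on the floor (the location labels are just for illustration); it makes the expected answer explicit:

    # The watch sits on the floor, so swapping the chairs moves the chairs but not the watch.
    positions = {"chair_A": "TV", "chair_B": "computer", "watch": "TV"}  # watch placed at the TV spot

    # Swap the chairs; the watch, being on the floor, stays where it is.
    positions["chair_A"], positions["chair_B"] = positions["chair_B"], positions["chair_A"]

    print(positions["watch"])  # -> "TV": the watch is still in front of the TV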
This is not a competition in getting it to answer right, though. It’s that the instances where it answers wrong demonstrate its lack of inner mental model of what it is supposedly reasoning about, as well as a lack of meta-awareness. I think we tend to underestimate what mere linguistic correlation is capable of producing, and are too quick to attribute intelligent reasoning and an inner mental model to it.
The answering style matches the usual presentation of a Winograd schema, which is a bit of a problem when you actually want to test its general reasoning ability, because it could be that someone already posted your go-to test question and the answer on the internet and the model was trained on that.
Maybe we need to start obfuscating this stuff
import base64;import hashlib;f=lambda s:bytes(k^v for k,v in zip(hashlib.shake_256(b'wOmGLuIVp70').digest(len(s)),s));print(f(base64.decodebytes(b'Iq6VQOBRifQwwwO6gluzzWEnGIICFKKFwM1oMWmBsTrIhMj5AseeNmNUwtEZkthcz8m8v8qKmVIx7nEjPOsqOUimKaTIJ8OKk2STdo/SRZGLAOsBSmgGaNTYEgT3KaayJWmGVf7K/UN06VyosEHfyFZlsS+PHDS6B3bN94qrzdnOA9f12FwWuaTPNJhLGcXFX7r5H8mtWyt9uWq6n5AItEcRXId04ssR8jfvNray2fwFIh5qPHTdQvZ9ogKLJ4Y+nAro7ecRSXXgskAj5EBmo2YobRkfE26er/Tj9DZHNx81N64ujWvN8jiS7aNcYs/oaEyN0oqnZvia9qocrv6CfBr+wGGSG1oxk5mbhAgkQhfuyR6c8MVKNFKp8HFo6SR7auju8vLnYjcObcII88vRbbua/jQmakiWwmS68Y1e1Gqmqg==')).decode())
to avoid burning test questions. (GPT-3.5 gave the answer I expected on the 4th try, didn't test with GPT-4.)
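For anyone who wants to post questions the same way, here is the encoding side as a readable sketch; it's just XOR against a SHAKE-256 keystream (so the same transform also decodes), reusing the key from the one-liner above:

    import base64
    import hashlib

    KEY = b'wOmGLuIVp70'  # same key as the decoder one-liner above; any bytes would do

    def xor_keystream(data: bytes) -> bytes:
        """XOR against a SHAKE-256 keystream; applying it twice returns the input."""
        stream = hashlib.shake_256(KEY).digest(len(data))
        return bytes(k ^ v for k, v in zip(stream, data))

    def obfuscate(question: str) -> bytes:
        """Produce a base64 payload that the decoder one-liner above will print."""
        return base64.encodebytes(xor_keystream(question.encode()))

    def deobfuscate(payload: bytes) -> str:
        return xor_keystream(base64.decodebytes(payload)).decode()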
The biggest takeaway from the article for me was that GPT-4 has 100 trillion parameters, 500x more than GPT-3. That type of exponential scaling is going to hit an upper bound really quickly. So when you say it "moves further with each update", you should realize that the improvements are coming from throwing massively more scale at the problem, not from some underlying improvement in the technique, and you should also weigh the amount of improvement against the scale itself.
Obviously GPT-4 is nowhere near 500x better than GPT-3. Let's say it's 20% better (very generous, IMO). Can they realistically 500x the model again? And if so, is that going to be worth an additional 4% gain over the original model quality? The numbers are completely made up and the math is probably wrong, but I think I'm making my point: diminishing returns will quickly become a blocker with this type of scaling.
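To make those made-up numbers a bit more concrete, here is a toy calculation assuming quality grows roughly with the log of parameter count; the scaling law and every number here are illustrative assumptions, not measurements:

    import math

    # Hypothetical: quality ~ log10(parameter count), normalized so GPT-3 scale = 1.0.
    # The 175e9 and 500x figures echo the comments above; none of this is measured data.
    gpt3_params = 175e9
    gpt4_params = 500 * gpt3_params   # the article's (disputed) 500x claim
    next_params = 500 * gpt4_params   # another hypothetical 500x jump

    def quality(params):
        return math.log10(params) / math.log10(gpt3_params)

    for name, p in [("GPT-3", gpt3_params), ("GPT-4?", gpt4_params), ("next 500x", next_params)]:
        print(f"{name}: {quality(p):.2f}")
    # Each further 500x adds the same absolute bump (~0.24) at vastly higher cost,
    # i.e. a shrinking relative gain - the diminishing-returns point above.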
I don't believe they spent 7 months just "making GPT-4 safer". I think they spent a very long time doing reinforcement learning from human feedback to make it better, and now hope to speed that process up next time by using GPT-4 itself to provide the reinforcement.
AFAIK the paper mentioned that RLHF mostly decreased its capabilities. It seems more likely to me that simply longer normal training was one of the main reasons for the increased capabilities.
I tried the car-table example too (via Bing), using the exact same wording, and got a totally different confused reply, so it must consider the two parsings of the sentence about equally probable (and isn't deterred by the semantics).
I tried it with Bard too, which also thought the table was too small to fit in the car, and doubled down on this when pressed by explaining how the table might be too narrow or deep or heavy(!) to fit in the car.
I was a bit surprised to see GPT-4 get this wrong, even if it's only doing so some of the time (sampling temperature randomness?).
Just tried again with GPT-4 (exactly as before), and the reply this time was simply "The table didn't fit in the car because it was too big. Do you have any other questions?".
Bing/GPT bolded the words "too big".
But as I said before, the fact that it sometimes gets it wrong must mean it doesn't see one parsing as strongly favored over the other, which is surprising given how competent it generally is.
Rodney Brooks is famous for arguing that, in order for robots to accomplish everyday tasks in an environment shared with humans, their higher cognitive abilities, including abstract thinking emulated by symbolic reasoning, need to be based primarily on sensory-motor coupling (action) with the environment, complemented by proprioception, which is a key component of hand-eye coordination. So he is unsurprisingly bearish on GPT capabilities, especially for robotics. He concludes with a prediction that there will be new categories of pornography, so he will certainly get at least one prediction right.
Interestingly, he keeps a record of his past predictions and how they are faring. The link is at the bottom of TFA. Seems like he has been reasonably successful at this kind of prediction.
He makes nine predictions at the end which he calls 'specific'. Four of them are most directly about GPT, and of these four, three are of the type "GPT is powerful and potentially dangerous or bad and might make bad or unexpected or scary things happen."
(This follows the pattern I've noticed where the term "AGI skeptic" will soon, if not already, mean "I don't trust AGIs in positions of authority or power" rather than "I don't think the technology is capable of matching our level of cognition".)
So let's grant him 8/9 of his predictions and turn our attention to the one that seems to be his 'real' prediction. This specific and direct prediction about GPT is that "There will be no viable robotics applications that harness the serious power of GPTs in any meaningful way." Maybe he's using the words "viable", "serious", or "meaningful" to weaken his claim so that it can never be wrong. If we assume he's not just making a vacuously weakened prediction, then I wonder if he has seen https://palm-e.github.io/ for example. You could say the prediction is wrong already, or that it will obviously be wrong before 2030, or that his phrasing makes it impossible for the prediction to ever be wrong.
Speaking of when or if his predictions can be shown to be wrong, I thought it was weird that he only says "These predictions cover the time between now and 2030." whereas for his earlier predictions he seems to have made a more formalized way of incorporating his dates into his predictions which he's not using for his GPT predictions for some reason:
---
I specify dates in three different ways:
NIML, meaning "Not In My Lifetime", i.e., not until after January 1st, 2050
NET some date, meaning “No Earlier Than” that date.
BY some date, meaning “By” that date.
Sometimes I will give both a NET and a BY for a single prediction, establishing a window in which I believe it will happen.
And this popular insistence that GPT has no 'world model' is also false IMHO. If a GPT is trained on real-world sensory data, it will develop a spatio-temporal 'world model', almost by definition.
LLMs are trained on human text. Their 'world model' is the world encoded in that text, rather than as sounds and retina images, etc, and so it's inconsistent, incomplete, and 'unreal' in a way which is easy to expose.
>> If a GPT is trained on real-world sensory data, it will develop a spatio-temporal 'world model', almost by definition.
Or, most likely, it will develop a model of the data it was trained on, and not the world this data came from.
I'm also curious what you mean by "real-world sensory data". So far, deep neural net models have been trained mainly (say 80%) on things like text, images, time series... and that's about it really. What is "real-world sensory data" in that context?
What is your 'world model'? It is, ultimately, associations between sensory inputs, and outputs, and sequences of them. Which, as Descartes pointed out, could all be fake, and maybe not from any real world at all.
"Or, most likely, it will develop a model of the data it was trained on, and not the world this data came from."
I mean the point is what is the difference? Is your brain trained on 'the real world', or on your senses (data) of the real world? Descartes argued that you can't tell if there's a real world at all.
By 'real-world sensory data' I mean that the models can be trained on sensor data from the real world (video, audio, feedback from motor outputs, etc.), rather than on text, which is only abstractly related to the real world. I believe this is called the symbol-grounding problem, and it goes back to Searle's Chinese Room thought experiment.
And "predicting words" given a huge corpus of interesting text by itself seems like it could be amazing. Predicting words effectively can mean that it has internalized all the logic used to generate the original texts. At some extrema simply predicting words could mean more than just "knowing facts", but actually having the accumulated smarts of those that made large contributions to the corpus.
Allow me to participate in some hyperbole here, but "predicting words" could result in a model that is "smarter" in most capacities than most humans.
Exactly. These kinds of reductive arguments always bother me because they ignore emergent effects. Chess engines are “just” running minimax? Well guess what, they’re evaluating millions of positions per second.
If this whole intelligence thing can be reduced to “just” predicting the next word, maybe we’re not as special as we thought. In fact anyone who reads the Wikipedia article on cetacean intelligence will quickly realize that we aren’t the hot shit we once thought we were.
One expects the academic community to step in and provide this kind of analysis, namely that an LLM combined with a great deal of computing capacity exhibits capabilities greater than the sum of its parts.
I'm a long-time LLM enjoyer, and by far the best analysis I've seen for that is https://generative.ink/posts/simulators/ but it's way "too much" for normal people who are learning about stochastic parrots and blurry jpegs instead.
> But perhaps that says something about how we humans relate to language and the world, and that language really only, perhaps, has a tenuous connection to the world in which we live; a scary thought.
https://mobile.twitter.com/ayourtch/status/16388720452817879...