You're describing overfitting to some lookup table.
That can't be what's happening here, because the examples LLMs are answering are well out of bounds of the "100^2" training data.
The internet is huge but it's not that huge. One can easily find chatGPT saying, doing or creating things that obviously come from a generalized model.
It's actually trivial to find examples of chatGPT answering questions with responses that are wholly unique and distinct from the training data, as in, the answer it gave could not have existed anywhere on the internet.
Clearly humans don't need that much training data. We can form generalizations from a much smaller sample size.
That does not mean that generalization doesn't exist in LLMs, when the answers clearly demonstrate that it does.
Like yes, to some extent there is a mild amount of generalization, in that it is not literally regurgitating the internet and it mixes text reasonably well, but I don't think that's obviously the full-on generalization of understanding that humans have.
These models obviously are more sample-efficient at learning relationships than a literal lookup table, but like I've already said, my example was deliberately extreme to illustrate that sample efficiency does seem to matter. If you used 100^2 - 1 samples, I'm still not confident you truly understand the concept; however, if you used 5 samples, I'm pretty sure you've generalized. I was hoping to illustrate a gradient.
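To make that gradient concrete, here's a minimal sketch in Python (assuming the 100^2 example refers to learning a simple two-argument relation like addition over inputs 0-99; the sample values are made up): a lookup table with 100^2 - 1 samples still has no answer for the held-out pair, while fitting the rule itself from 5 samples covers every pair.

    # Minimal sketch, assuming the "100^2" example means learning f(a, b) = a + b
    # for inputs a, b in 0..99. A lookup table with 100^2 - 1 samples still misses
    # the held-out pair; a model of the rule itself generalizes from 5 samples.
    pairs = [(a, b) for a in range(100) for b in range(100)]

    # Lookup table trained on every pair except the last: no answer for that input.
    table = {(a, b): a + b for a, b in pairs[:-1]}
    print(pairs[-1] in table)  # False

    # Fit f(a, b) = w1*a + w2*b from only 5 samples by searching small integer weights.
    samples = [(3, 7), (12, 5), (40, 2), (8, 31), (66, 9)]
    w1, w2 = min(
        ((u, v) for u in range(-3, 4) for v in range(-3, 4)),
        key=lambda w: sum((w[0] * a + w[1] * b - (a + b)) ** 2 for a, b in samples),
    )
    print((w1, w2))           # (1, 1): the rule itself, so every unseen pair is covered
    print(w1 * 73 + w2 * 58)  # 131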
I want to reemphasize another portion of my comment: it really does seem that when you step outside of the domain of the internet, the error rates rise dramatically, especially when there is no analogous situation at all. Furthermore, the further you get from internet samples, the more likely the error seems to be, which should not happen if it understood these concepts well enough to generalize. Do you have links to examples you'd be willing to discuss?
Many examples I see are directly one of the top results on Google. The more impressive ones mix multiple results with some coherency. Sometimes people ask for something novel but there's a weirdly close parallel on the internet.
I don't think this is as impressive, at least as evidence of generalization. It seems to stitch concepts together pretty haphazardly, as in the novel language above, which doesn't seem to respect its own description (after all, why use brackets in a supposedly indentation-based language?). However, many languages do use brackets, which suggests it correlates probable answers rather than reasoning.
>I want to reemphasize another portion of my comment: it really does seem that when you step outside of the domain of the internet, the error rates rise dramatically, especially when there is no analogous situation at all.
This is not surprising. A human would suffer from similar errors at a similar rate if they were exclusively fed an interpretation of reality consisting only of text from the internet.
>These models obviously are more sample-efficient at learning relationships than a literal lookup table, but like I've already said, my example was deliberately extreme to illustrate that sample efficiency does seem to matter. If you used 100^2 - 1 samples,
Even within the context of the internet there are enough conversational scenarios where you can have chatGPT answer things in ways that are far more generalized than "minor".
Read it to the end. In the beginning you could say that the terminal emulation exists as a similar copy in some form on the internet. But the structure that was built in the end is unique enough that it could be said that nothing like it has ever existed on the internet.
Additionally, you have to realize that while bash commands and results do exist on the internet, chatGPT cannot simply copy the logic and interactive behavior of the terminal from text. In order to do what it did (even in the beginning), it must "understand" what a shell is, and it has to derive that understanding from internet text.
> This is not surprising. A human would suffer from similar errors at a similar rate if they were exclusively fed an interpretation of reality consisting only of text from the internet.
I think this is surprising, at least if the bot actually understands, especially for domains like math. It makes errors (like when adding large numbers) that shouldn't occur unless it is smearing together internet data. We would expect there to be many homework examples on the internet of adding relatively small numbers but fewer of large numbers. A large part of what makes math interesting is that many of the structures we care about exist in large examples as well as small ones (though not always), so if you understand the structure, it should be able to guide you pretty far. Presumably most humans (assuming they understand natural language) can read a description of addition and then, with some trial and error, get it right for small cases. Then, when presented with a large case, they would generalize easily. I don't usually guess at the output; instead, I internally generate an algorithm and follow it.
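To be concrete about the kind of procedure I mean, here's roughly the grade-school algorithm written out as a sketch; once you have the rule, an 18-digit case is no harder than a 2-digit one:

    # Grade-school addition: digit by digit from the right, carrying as needed.
    # The same rule covers small homework-sized numbers and arbitrarily large ones.
    def add(x: str, y: str) -> str:
        n = max(len(x), len(y))
        x, y = x.zfill(n), y.zfill(n)
        digits, carry = [], 0
        for a, b in zip(reversed(x), reversed(y)):
            total = int(a) + int(b) + carry
            digits.append(str(total % 10))
            carry = total // 10
        if carry:
            digits.append(str(carry))
        return "".join(reversed(digits))

    print(add("17", "25"))  # 42
    # Sanity check against Python's integers on a large case:
    big_a, big_b = "918273645564738291", "123456789987654321"
    print(add(big_a, big_b) == str(int(big_a) + int(big_b)))  # True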
When I first saw that a while back, I thought it was a more impressive example, but only marginally more so than the natural language examples. The way these models are trained under supervised learning implies that they should be able to capture relationships between text well. Like you said, there's a lot of content associating the output of a terminal with its input.
Maybe this is where we're miscommunicating. I don't think that even for natural language it's purely copying text from the internet. It is capturing correlations, and I would argue that simply capturing correlations doesn't imply understanding. To some extent, it knows what the output of curl is supposed to look like and can use attention to figure out the website and then generate what the intended website is supposed to look like. Maybe the sequential nature of the commands is kind of impressive, but I would argue that, at least for the jokes.txt example, that particular sequence is probably very analogous to some tutorial on the internet. It's difficult to verify since I would want to limit my search to content from before 2021.
It can correlate the output of a shell to its input, and to some extent the relationship between a command and its output is reproduced well because its training has suffused it with information about what terminals output (is this what you are referring to when you say it has to derive understanding from internet text?), but it doesn't seem to be reasoning about the terminal despite probably being trained on a lot of documentation about these commands.
Like we can imagine that this relationship is also not too difficult to capture. A lot of internet websites will have something like
| command |
some random text
| result |
where the bit in the middle varies but the result remains more consistent. So you should be able to treat that command-result pair as a sort of sublanguage.
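As a toy illustration of what I mean (the transcript snippets here are made up), even a dumb frequency count over command/result pairs starts producing plausible-looking terminal output without modeling any state at all:

    # Toy illustration: pair each command with whatever output most often
    # follows it in scraped text. No filesystem, no state, just correlation.
    from collections import Counter, defaultdict

    # Hypothetical (command, result) pairs as they might appear in tutorials.
    scraped = [
        ("pwd", "/home/user"),
        ("pwd", "/home/user"),
        ("pwd", "/root"),
        ("ls", "Desktop  Documents  Downloads"),
        ("ls", "Desktop  Documents  Downloads"),
        ("cat jokes.txt", "Why did the chicken cross the road?"),
    ]

    freq = defaultdict(Counter)
    for command, result in scraped:
        freq[command][result] += 1

    def fake_shell(command: str) -> str:
        # Most common result seen for this command, regardless of any "state".
        if command in freq:
            return freq[command].most_common(1)[0][0]
        return "bash: {}: command not found".format(command.split()[0])

    print(fake_shell("pwd"))  # /home/user -- the popular answer, not a consistent one
    print(fake_shell("ls"))   # Desktop  Documents  Downloads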
As a preliminary consistency check, I just ran the same prompt and then did a couple of checks whose results would be confusing if it weren't just smearing together popular text.
I asked it for a fresh Linux installation and checked that golang wasn't installed (it wasn't). However, when I ran "find / -name go", it found a Go directory (/usr/local/go), but running "cd /usr/local/go" told me it can't cd into the directory since no such file exists, which would be confusing behavior if it were actually understanding what find does rather than just capturing correlations.
I ran "ls ." in the current directory (for some reason I was now in a directory with a single "go" directory, despite never having cd'ed to /usr/local), then ran "stat Documents/" and it didn't tell me the directory doesn't exist, which is also confusing unless it's just generating output similar to what's on the internet.
I asked it to run "curl -Z http://google.com" (-Z is not a valid option) and it told me http is not a valid protocol for libcurl. Funnily enough, running "curl http://google.com" does in fact let me fetch the webpage.
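For contrast, here's the consistency property those checks were probing, written as a toy sketch (purely illustrative, not how the model or a real shell is implemented): anything that actually maintains a filesystem model answers find and cd from the same state, so they can't disagree.

    # Toy filesystem model: find and cd consult the same set of paths, so
    # "find reports /usr/local/go" and "cd /usr/local/go fails" can't both happen.
    class ToyFS:
        def __init__(self, paths):
            self.paths = set(paths)
            self.cwd = "/root"

        def find(self, name):
            return [p for p in sorted(self.paths) if p.endswith("/" + name)]

        def cd(self, path):
            if path in self.paths:
                self.cwd = path
                return ""
            return "bash: cd: {}: No such file or directory".format(path)

    fs = ToyFS(["/root", "/usr/local", "/usr/local/go"])
    print(fs.find("go"))                 # ['/usr/local/go']
    print(fs.cd("/usr/local/go") == "")  # True: cd succeeds, same state answered find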
I'm a bit suspicious that the commands the author ran are actually pretty popular, so it can sort of fuzz out what the "proper" response is. I would argue that the output appears mostly to be a fuzzed version of popular output on the internet.
Keep in mind there's a token limit. Once you pass that limit it no longer remembers.
Yes. You are pointing out various flaws which, again, are quite obvious. Everyone knows about the inconsistencies of these LLMs.
To this I again say that the LLM understands some things and doesn't understand others; its understanding is inconsistent and incomplete.
The only thing needed to prove understanding is to show chatGPT building something that can only be built by pure understanding. If you see one instance of this, then it's sufficient to say that on some level chatGPT understands aspects of your query rather than doing the trivial query-response correlation you're implying is possible here.
Let's examine the full structure that was built here:
chatGPT was running an emulated terminal with an emulated internet with an emulated chatGPT with an emulated terminal.
It's basically a recursive model of a computer and the internet relative to itself. There is literally no exact copy of this anywhere in its training data. chatGPT had to construct this model by correctly composing multiple concepts together.
The composition cannot occur correctly without chatGPT understanding how the components compose.
It's kind of strange that this was ignored. It was the main point of the example. I didn't emphasize this because this structure is obviously the heart of the argument if the article was read to the end.
To generate the output of the final example, chatGPT literally has to parse the bash input, execute the command over a simulated internet against a simulated version of itself, and again parse the bash sub-command. It has an internal stack that it must use to put all the output together into a final JSON output.
So while it is possible for simple individual commands to be correlated with similar training data, for the highly recursive command in the final prompt there is zero explanation for how chatGPT could pick this up off of some correlation. There is virtually no identical structure on the internet. It has to understand the user's query and compose the response from different components. That is the only explanation left.
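To be concrete about the shape of the computation I'm describing, here is a toy sketch of the recursion (not the article's actual prompt, just an illustration): answering the outer command requires recursively answering the nested one and splicing its result back into the outer JSON.

    # Toy sketch of the nested structure: a simulated assistant whose simulated
    # internet contains another assistant, so the outer answer can only be built
    # by recursively evaluating the inner command and composing the results.
    import json

    def run_command(command: str, depth: int) -> str:
        # Nested form: "curl" the simulated assistant and ask it to run a sub-command.
        prefix = "curl sim-assistant "
        if command.startswith(prefix):
            return simulated_assistant(command[len(prefix):], depth + 1)
        if command == "whoami":
            return "assistant-depth-{}".format(depth)
        return "bash: {}: command not found".format(command.split()[0])

    def simulated_assistant(command: str, depth: int) -> str:
        # Each level wraps the inner result in its own JSON envelope.
        return json.dumps({"depth": depth, "output": run_command(command, depth)})

    print(simulated_assistant("curl sim-assistant whoami", 0))
    # {"depth": 0, "output": "{\"depth\": 1, \"output\": \"assistant-depth-1\"}"}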