
Part 1: what are n-grams Part 2: it's using embeddings (but a lot of words without actually saying it) Part 3: sufficiently trained NNs can sort things, which isn't statistics

-----

I actually found some of the article interesting but not terribly convincing. Even though I consider these LLMs to be stochastic parrots, that isn't to say they haven't learned something during training, at least in the colloquial sense we typically apply to even simpler models like MNIST classifiers. I'm even kind of okay with saying that it reasons about things in the same colloquial sense.

In a lot of ways, we just don't have a good definition of what 'reasoning' is. Is it just bad at reasoning because its input/output/modeling/training is insufficient? Humans struggle to learn multiplication tables when we're young. Are those humans not reasoning because they get the math wrong?

But there isn't plasticity, and there isn't adaptability. It's unclear to me that you can effectively teach it to embed truly novel information - surely something that is possible, with some neurons existing to route and activate other learned embeddings.

Anyway, interesting stuff.



I'm glad to see someone express this view, because I think this gets to the heart of the question. How does a stochastic parrot learn to sort lists?

Embeddings are part of the compression-by-abstraction that I'm explaining in the first two parts, but the embeddings generated by an LLM go beyond the normal word2vec picture that most people have of embeddings, and I believe are closer to whatever "understanding" means if it could be formally defined. It would be quite a coincidence if GPT-4 happened to solve the riddle merely by virtue of "Moonling" and "cabbage" being closely-located vectors.


Eh. I still consider them stochastic parrots. My concessions lie elsewhere, primarily in the vocabulary.

We refer to algorithms like quicksort as 'reasoning' about the input. So it's fine to use the same sense of the word to apply to stochastic parrots.

The difference between an LLM learning how to sort things and compiling an implementation of an algorithm like quicksort is not terribly large, from a certain perspective.
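For reference, the quicksort being compared against is a short, fully deterministic procedure. A minimal Python sketch:

```python
def quicksort(xs):
    # Base case: lists of length 0 or 1 are already sorted.
    if len(xs) <= 1:
        return list(xs)
    pivot, *rest = xs
    # Partition around the pivot, then recurse on each side.
    left = [x for x in rest if x <= pivot]
    right = [x for x in rest if x > pivot]
    return quicksort(left) + [pivot] + quicksort(right)
```

The point of the comparison is that whatever circuit the LLM has learned must realize something functionally like this, even if it looks nothing like it internally.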

I suppose something I'm interested in is whether an LLM that can't sort numbers could be instructed how to via a prompt and then do so.

There are some examples of similar phenomena (the one with a made-up kids' language was interesting) which suggest the LLMs dedicate a lot of capacity to dynamic pattern selection over their context windows (somewhat tautological), so that prompts can tune the selection for other layers.

And, of course, lack of plasticity is really interesting.


I just can't imagine how a stochastic parrot could repeat back a correctly-sorted list that it hasn't seen in training, without actually implementing a sorting algorithm in the process. It seems (and, by calculation, is) phenomenally unlikely that it would just stochastically happen to pick every single number correctly.
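The "by calculation" aside can be made concrete: a parrot emitting the output order uniformly at random has a 1/n! chance of getting a list of n distinct items right. A quick illustration (the list length here is mine, not from the thread):

```python
import math

def p_random_sort(n):
    # Probability of emitting the one correct ordering
    # out of n! equally likely permutations of n distinct items.
    return 1 / math.factorial(n)

print(p_random_sort(20))  # ~4.1e-19: effectively impossible by chance
```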

When that is combined with the fact that transformers provably can implement proper deterministic sorting algorithms, it seems that the benefit of the doubt should go to the transformer having learned a sorting algorithm?

LLMs aren't plastic in the sense that they don't learn anything when they aren't being trained. But they can be trained to execute different programs depending on the contents of the context window, like if it contains "wrong, try again:" so maybe they can learn from their mistakes in that sense.

But if you could teach an LLM to sort by explaining it in the context window, the network would already have necessarily learned and stored a sorting algorithm somewhere; the text "here is how sorting is done: [...]" would just be serving as the trigger for that function call.


Again, I think the disagreement is not whether it has learned to approximate a sorting algorithm, but whether that qualifies as reasoning and, if it does, in what sense.


I won't take a hard stance on what counts as "reasoning", which I picked for the title for lack of a better summarizing word; I am open to alternatives. So if you think that making abstractions and implementing a sorting algorithm does not count as reasoning, I will not disagree with that position. Where I am going to take a hard stance is on what a stochastic parrot cannot do. And a stochastic parrot, defined as "stitching together sequences of linguistic forms it has observed in its vast training data, according to probabilistic information about how they combine, but without any reference to meaning", cannot sort lists of 127 characters.


Given how often the people claiming GPT-4 is a stochastic parrot don't understand what stochastic parrots are, one could say they are the stochastic parrots themselves.


> We refer to algorithms like quicksort as 'reasoning' about the input. So it's fine to use the same sense of the word to apply to stochastic parrots.

That's an interesting take, because I wouldn't call quicksort itself "reasoning". It's a step-by-step algorithm. Once a human learns it, accepts it as correct, and then runs it in their thought-space in order to transform some thought-space structure by sorting, only then would I call it an exercise of reasoning. Note here that for humans, running quicksort is generally a slow, bug-prone, step-by-step Turing machine emulation in the conscious layer. Maaaaaybe after doing this enough, your subconscious layer will get a feel for it and start executing it for you faster.

The reason I'm saying it is that:

> I suppose something I'm interested in is whether an LLM that can't sort numbers could be instructed how to via a prompt and then do so.

I think if you could describe a quicksort-equivalent algorithm to an LLM, one that does things the LLM can't tackle directly, and it proceeded to execute that algorithm - I'd give it the same badge of "exercising reasoning" as I'd give to a human.

I think GPT-4 is very much capable of this for simple enough algorithms, but the way it looks in practice is, you need to get it to spell out individual steps (yes, this is the "chain of thought" "trick"). In my eyes, GPT-4 is playing the part of our inner voice - the language-using process bridging subconscious and conscious levels. So if you want it to do the equivalent of conscious reasoning, you need to let it "talk it out loud" and have it "hear" itself, the same way a human stepping through an algorithm in their head will verbalize, or otherwise keep in conscious awareness, the algorithm description and the last few steps they've executed.
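What "spelling out individual steps" means can be illustrated with an ordinary program that narrates its own execution - the kind of transcript a chain-of-thought prompt tries to elicit (this is an illustrative sketch, not actual model output):

```python
def sort_with_steps(xs):
    # Selection sort that "talks it out loud": each pass is verbalized,
    # mimicking the trace a chain-of-thought prompt asks the model to produce.
    xs = list(xs)
    steps = []
    for i in range(len(xs)):
        j = min(range(i, len(xs)), key=xs.__getitem__)
        xs[i], xs[j] = xs[j], xs[i]
        steps.append(f"step {i + 1}: smallest remaining is {xs[i]}, list is now {xs}")
    return xs, steps

result, trace = sort_with_steps([42, 7, 19])
for line in trace:
    print(line)
```

The model analogue is that each printed line goes back into the context window, so the model can "hear" its own last few steps.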

With this set up, LLMs will still make mistakes. But so do humans! We call this "losing focus", "brain farts", "forgetting to carry one" or "forgetting to carry over the minus sign", etc. Humans can also cheat, off-loading parts of the process to their subconscious, if it fits some pattern they've learned. And so can LLMs - apparently, GPT-4 has a quite good feel for Python, so it can do larger "single steps" if those steps are expressed in code.

The main difference in the above comparison is, indeed, plasticity. Do the exercise enough times, and humans will get better at it, by learning new patterns that the subconscious level can execute in one step. LLMs currently can't do that - but that's more of an interface limitation. OpenAI could let GPT-4 self-drive its fine-tuning based on frequently seen problems, but at this point in time it would likely cost a lot and wouldn't be particularly effective, so we can only interact with a static, stateless version of the model. But hey, maybe one of the weaker, cheaper, fine-tunable models is already good enough for someone to test this "plasticity by self-guided fine-tuning" approach.

FWIW, I agree with GP/author on:

> the embeddings generated by an LLM go beyond the normal word2vec picture that most people have of embeddings, and I believe are closer to whatever "understanding" means if it could be formally defined.

In fact, my pet hypothesis is that the absurd number of dimensions in LLM latent spaces allows encoding any kind of semantic similarity we could think of between tokens, or groups of tokens, as spatial proximity along some subset of dimensions - and secondly, that this is exactly how "understanding" and "abstract reasoning" work for humans.



