
Humans learn from the structure of the world -- not the structure of language.

LLMs cheat at generating text because they do so via a model of the statistical structure of text.

We're in the world; it is we who stipulate the meaning of words and the structure of text. And we stipulate new meanings to novel parts of the world daily.

What else is an 'iPhone', etc.? There's nothing in `i P h o n e` that is at all like an iPhone.

We have just stipulated this connection. The machine replays these stipulations to us -- it does not make them, as we do.




There's nothing qualitatively less "in the world" about a language model than a human. Yes, a human has more senses, and is doubtless exposed to huge categories of training data that a language model doesn't have access to - but it's false to draw a sharp dichotomy between knowing what an iPhone looks like, and knowing how people talk about iPhones.

Consider two people - one, a Papua New Guinea tribesperson from a previously uncontacted tribe who is allowed to handle a powered-down iPhone and told it is an "iPhone", but is otherwise ignorant of its behavior - the other, a cross-platform mobile software developer who has never actually held a physical iPhone, but is intimately familiar with its build systems, APIs, cultural context, etc. Between the two of them, who better understands what an iPhone "is"?

You make a good point about inventing words to refer to new concepts. There's nothing theoretically stopping a language model from identifying some concept in its training data that we don't have a word for, inventing a word for it, and using it to give us a perspective we hadn't considered. It would be very useful if it did that! I suspect we don't tend to see that simply because it's a very rare occurrence in the text it was trained on.


LLMs don't have any senses, not merely fewer. LLMs don't have any concepts, not merely named ones.

A concept is a sensory-motor technique abstracted into a pattern of thought developed by an animal, in a spatio-temporal environment, for a purpose.

LLMs are literally just an ensemble of statistical distributions over text symbols. In generating text, they are sampling from a compressed bank of all the text ever digitised.
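
To be concrete about what that sampling looks like mechanically, here is a minimal sketch of autoregressive decoding. The `model` interface is hypothetical (a real LLM's distribution is shaped by billions of learned parameters), but the shape of the procedure is the same:

    # Minimal sketch of autoregressive decoding: at each step the model
    # supplies a probability distribution over the next token and we draw
    # from it. Nothing below refers to the world, only to token statistics.
    import math
    import random

    def sample_next(logits, temperature=1.0):
        """Draw one token id from a softmax over the model's logits."""
        scaled = [x / temperature for x in logits]
        m = max(scaled)
        weights = [math.exp(x - m) for x in scaled]
        return random.choices(range(len(weights)), weights=weights, k=1)[0]

    def generate(model, prompt_ids, max_new_tokens=50):
        """model(ids) -> logits for the next token (hypothetical interface)."""
        ids = list(prompt_ids)
        for _ in range(max_new_tokens):
            logits = model(ids)             # P(next token | all previous tokens)
            ids.append(sample_next(logits))
        return ids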

We aren't sampling from such a bank; we develop wholly non-linguistic concepts which describe the world, and it is these that language piggy-backs on.

The structure of symbols in a book has nothing to do with the structure of the world -- it is we who have stipulated their meaning: there's no meaning to `i`.


> A concept is a sensory-motor technique abstracted into a pattern of thought developed by an animal, in a spatio-temporal environment, for a purpose.

Hi, since the human brain is supposedly the sole repository of genuine concepts, can you please show me which of the neurons is the "doggie" neuron, or the "doggie" cluster of neurons? I want to know which part of the brain represents the thing that goes wag-wag.

If you can't mechanically identify the exact locality of the mechanism within the system, it doesn't really exist, right? It's just a stochastic, probabilistic model; humans don't understand the wag-wag concept, they just have some neurons that are weighted to fire when other neurons give them certain input stimuli tokens, right?

This is the fundamental problem: you are conflating the glue language with the implementation language in humans too. Human concepts are a glue-language thing: an emergent property of the C-language structure of the neurons. But there is no "doggie" neuron in a human, just as there is no "doggie" neuron in a neural net. We are just stochastic machines too, if you look at the C-lang level and not the glue-language level.


There's a pile of work on multimodal inputs to LLMs, generally finding that less training data is needed as image (or other) data is added to training.

Text is an extremely limited input stream, but an input stream nonetheless. We know that animal intelligence works well enough with any of a range of sensory streams, and with different levels of emphasis on those streams - humans are somehow functional despite lacking ultrasonic perception and having only a primitive sense of smell.

And your definition of a concept is quite self-serving... I say that as a mathematician familiar with many concepts which don't map at all to sensory-motor experiences.


Then why the fondness for chalk?

Sensory-motor expression of concepts is primitive, yes, and they become abstracted --- and yes, the semantics of those abstractions can be abstract. I'm not talking semantics, I'm talking genesis.

How does one generate representations whose semantics are the structure of the world? Not via text-token frequency; this much is obvious.

I don't think the thinnest sense of "2 + 2 = 4" being true is what a mathematician understands -- they understand, rather, the object 2, the map `+`, and so on. That is, the proposition. And when they imagine a sphere of radius 4 containing a square of side 2, etc. -- I think there's a 'sensuous, mechanical depth' that enables and permeates their thinking.

The intellect is formal only in the sense that, absent content, it has form. That content, however, is grown by animals at play in their environment.


LLMs have two senses: time and text.


> Consider two people - one, a Papua New Guinea tribesperson who is allowed to handle a powered-down iPhone, and told it is an "iPhone", but is otherwise ignorant of its behavior - the other, a cross-platform mobile software developer who has never actually held a physical iPhone, but is intimately familiar with its build systems, API, cultural context etc. Between the two of them, who better understands what an iPhone "is"?

But then also consider the following: a human being from 2006, and an LLM that has absorbed an enormous corpus of words about iPhones and is also granted access to a capacitive-touchscreen-friendly robot arm and a continuous-feed digital camera (and since I'm feeling generous, also a lot of words about the history and architecture of robot arms and computer vision). There is no doubt the LLM will completely blow the human out of the water if asked trivia questions about the iPhone and its ecosystem.

But my money's on the 2006 human doing a lot better at switching it on and using the Tinder app...


No doubt. I don't think anyone's arguing that LLMs have richer, deeper understanding of anything just yet. On the other hand I also don't think it would prove much to vaguely connect a language model to a robot arm and then ask it to do non-language tasks.


> Humans learn from the structure of the world -- not the structure of language.

You'd be surprised. Many researchers believe that "knowledge" is inseparable from language, and that language is not associative (labels for the world) but relational. For example, in Relational Frame Theory, human cognition is dependent on bidirectional "frames" that link concepts, and those frames are linguistic in nature. LLMs develop internal representations of those frames and relations, which is why they can tell you that a pool is bigger than a cup of water, and which one you would want to drink.

In short, there's no evidence that being in the world makes our knowledge any different from an LLM's. The main advantages we have at the moment are sensory learning (LLMs are not good at comparing smells and flavors) and the ability to continuously train our brains.


The co-occurrence frequency of text tokens across everything ever written is a limited statistical model of however language is used by humans.

It almost doesn't matter what your theory of language is --- any even remotely plausible account will radically depart from the above statistical model. There isn't any theory of language which supposes it's an induction across text tokens.

The problem in this whole discussion is that we know what these statistical models are (models of association in text tokens) -- yet people completely ignore this in favour of saying "it works!".

Well, "it works" is NOT an explanatory condition; indeed, it's a terrible one. If you took photographs of the night sky for long enough, you could predict where all the stars will be --- those photos do not employ a theory of gravity to achieve those predictions.

LLMs are just photographs of books.

There's a really egregious pseudoscience here that the hype-cycle completely suppresses: we know the statistical form of all ML models. We know that this mechanism can make arbitrarily accurate predictions, given arbitrarily relevant data. We know that nothing in this mechanism is explanatory.

This is trivial. If you videotape everything and play it back, you'll predict everything. Photographing things does not impart to those photographs the properties of those things -- the photographs serve only as a limited associative model.
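
The night-sky analogy can be made literal in a few lines of code: fit a straight line to past "photographs" of a star's hour angle and you can extrapolate its future positions accurately, with no theory of gravity or celestial mechanics anywhere in sight (toy, made-up numbers; plain least squares):

    # Prediction without explanation: ordinary least squares on observed
    # positions. The numbers are made up to mimic the ~15 deg/hour drift
    # of a star's hour angle between photographs.

    def fit_line(ts, xs):
        """Fit x = a + b*t by ordinary least squares; return (a, b)."""
        n = len(ts)
        mt, mx = sum(ts) / n, sum(xs) / n
        b = sum((t - mt) * (x - mx) for t, x in zip(ts, xs)) / sum((t - mt) ** 2 for t in ts)
        return mx - b * mt, b

    hours  = [0, 1, 2, 3, 4, 5]                    # when each photo was taken
    angles = [10.0, 25.0, 40.1, 55.1, 70.2, 85.2]  # measured hour angle, degrees

    a, b = fit_line(hours, angles)
    print(a + b * 10)   # ~160 degrees: an accurate forecast that explains nothing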


Exactly. A very uncomfortable truth for those heavily invested (time/money/credence) in this latest AI wave.


It’s odd to see people doomwaving two general reasoning engines.

It’s especially hard to parse a dark sweeping condemnation based on…people are investing in it? It doesn’t have the right to assign names to things? Idk what the argument is.

My most charitable interpretation is “it can’t reason about anything unless we already said it”, which is obviously false.


> one of which is an average 14 year old, the other an honors student college freshman

The point is that they're not those things. Yes, language models can produce solutions to language tests that a 14-year-old could also solve, but a calculator can do the same thing in the dimension of math - that doesn't make a calculator a 14-year-old.


Yes, the AI isn’t literally a 14-year-old, and we should acknowledge the anthropomorphization. Thank you for pointing it out; it can waste a lot of time when you get sloppy with language in AI discussions.

I removed the reference; in retrospect, it’s unnecessary. No need to indicate the strong performance; we’re all aware.


You may not have said it directly, but it’s implied. For example, if we said A leads to B, and B leads to C, the model will have learned the relation and will tell you that A leads to C; that doesn’t mean it can suddenly reason. It’s all already in the language: once it has learned enough of the numerous forms of “A to B, B to C”, the relations it has built let it give you “A to C”. Yet “A to C” may very well be some epiphany we have never thought about. One advantage is that the model never gets sloppy: it remembers everything. It may overreact or overthink (hence hallucination), but it doesn’t overlook things or carry biases the way humans do (until alignment, of course). This is why we’re often surprised by the model, even though we probably knew it too; we were just blind to certain things and never made the connection.
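
A toy, non-statistical sketch of that composition idea (not how an LLM actually implements it, just the logical shape): once “A to B” and “B to C” are stored, “A to C” falls out of chaining them.

    # Toy relational chaining: "A leads to C" is derived purely from the
    # stored statements; no extra reasoning machinery is involved.
    edges = {("A", "B"), ("B", "C")}

    def derives(x, z, edges):
        """True if x reaches z by chaining stored relations."""
        frontier, seen = {x}, set()
        while frontier:
            node = frontier.pop()
            seen.add(node)
            for (a, b) in edges:
                if a == node:
                    if b == z:
                        return True
                    if b not in seen:
                        frontier.add(b)
        return False

    print(derives("A", "C", edges))   # True: the "epiphany" was implicit in the data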


Very surprised to see these confident assertions still


The heavy investment is what makes this truth uncomfortable - it does not make this truth true (or false).

The point is not so much that we already said it; it's more that the patterns it encodes and surfaces when prompted are patterns in the written corpus, not of the underlying reality (which it has never experienced) -- much as a list of all the addresses in the US (or wherever) tells you very little about the actual geography of the place.


>not of the underlying reality (which it has never experienced).

You've never experienced the "underlying reality" either.


Sure you did; all animals do. Without language, humans would live just fine -- evidently all other animals live this way. Deaf people can live, can reason, can triage; it may not be sophisticated, but they all have the underlying reality in their heads, probably gained from trial and error, from experience.


>Humans learn from the structure of the world -- not the structure of language.

No, we don't. Humans don't experience or perceive reality. We perceive a nice modification of it, and that's after excluding all the sense data we simply aren't capable of perceiving at all.

Your brain is constantly shifting and fabricating sense data based on internal predictions, and that forms the basis of what you call reality. You are not learning from the structure of the world. You are learning from a simplified model of it that is partly fabricated.


And what does language have to do with it?


Structure in the “world”? You mean the stream of “tokens” we ingest?

This just comes down to giving transformers more modalities, not just text tokens.

There is nothing about “2” that conveys any “twoness”; this is true of all symbols.

The token “the text ‘iphone’” and the token “visual/tactile/etc data of iphone observation” are highly correlated. That is what you learn. I don’t know if you call that stipulation, maybe, but an LLM correlates too in its training phase. I don’t see the fundamental difference, only a lot of optimizing and architectural improvements to be made.
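
To make the “highly correlated” part concrete: contrastive multimodal training (CLIP-style objectives) maps text and observations into one embedding space and pushes matching pairs together. The sketch below shows only the scoring step; embed_text and embed_image are hypothetical encoders standing in for whatever model is used:

    # How a text token stream and an image observation end up "correlated":
    # both are mapped into one vector space and compared. Only the scoring
    # step is shown; the encoders are assumed / hypothetical.
    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    def match_score(caption, pixels, embed_text, embed_image):
        """Higher score = the model treats the text and the observation as the same thing."""
        return cosine(embed_text(caption), embed_image(pixels))

    # After contrastive training one would expect, e.g.,
    # match_score("a photo of an iPhone", iphone_pixels, ...) to beat
    # match_score("a photo of a banana", iphone_pixels, ...).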

Edit: and when I say “a lot”, I mean astronomical amounts of it. Human minds are pretty well tuned to this job; it’ll take some effort to come close.



