The main thesis here seems to be that LLMs behave like almost all other machine learning models, in that they are doing pattern matching on their input data, and short circuiting to a statistically likely result. Chain of thought reasoning is still bound by this basic property of reflexive pattern matching, except the LLM is forced to go through a process of iteratively refining the domain it does matching on.
Chain of thought is interesting, because you can combine it with reinforcement learning to get models to solve (seemingly) arbitrarily hard problems. This comes with the caveat that you need some reward model for all RL. This means you need a clear definition of success, and some way of rewarding being closer to success, to actually solve those problems.
Framing transformer based models as pattern matchers makes all the sense in the world. Pattern matching is obviously vital to human problem solving skills too. Interesting to think about what structures human intelligence has that these models don't. For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.
LLMs are trained, as others have mentioned, first to just learn the language at all costs. Ingest any and all strings of text generated by humans until you can learn how to generate text in a way that is indistinguishable.
As a happy side effect, this language you've now learned happens to embed quite a few statements of fact and examples of high-quality logical reasoning, but crucially, the language itself isn't a representation of reality or of good reasoning. It isn't meant to be. It's a way to store and communicate arbitrary ideas, which may be wrong or bad or both. Thus, the problem for these researchers now becomes how do we tease out and surface the parts of the model that can produce factually accurate and reasonable statements and dampen everything else?
Animal learning isn't like this. We don't require language at all to represent and reason about reality. We have multimodal sensory experience and direct interaction with the physical world, not just recorded images or writing about the world, from the beginning. Whatever it is humans do, I think we at least innately understand that language isn't truth or reason. It's just a way to encode arbitrary information.
Some way or another, we all grok that there is a hierarchy of evidence or even what evidence is and isn't in the first place. Going into the backyard to find where your dog left the ball or reading a physics textbook is fundamentally a different form of learning than reading the Odyssey or the published manifesto of a mass murderer. We're still "learning" in the sense that our brains now contain more information than they did before, but we know some of these things are representations of reality and some are not. We have access to the world beyond the shadows in the cave.
Humans can carve the world up into domains with a fixed set of rules and then do symbolic reasoning within it. LLMs can't seem to do this in a formal way at all -- they just occasionally get it right when the domain happens to be encoded in their language learning.
You can't feed an LLM a formal language grammar (e.g. SQL) then have it only generate results with valid syntax.
It's awfully confusing to me that people think current LLMs (or multi-modal models etc) are "close" to AGI (for whatever various definitions of all those words you want to use) when they can't do real symbolic reasoning.
Though I'm not an expert and happy to be corrected...
Adult humans can do symbolic reasoning, but lower mammals cannot. Even ones that share most of our brain structure are much worse at this, if they can do it at all; children need to learn it, along with a lot of the other things that we consider a natural part of human intelligence.
That all points towards symbolic reasoning being a pretty small algorithmic discovery compared to the general ability to pattern match and do fuzzy lookups, transformations, and retrievals against a memory bank. It's not as if our architecture is so special that we burned most of our evolutionary history selecting for these abilities; they're very recent innovations, and thus must be relatively simple, given the existence of the core set of abilities that our close ancestors have.
The thing about transformers is that obviously they're not the end of the line, there are some things they really can't do in their current form (though it's a smaller set than people tend to think, which is why the Gary Marcuses of the world always backpedal like crazy and retcon their previous statements as each new release does things that they previously said were impossible). But they are a proof of concept showing that just about the simplest architecture that you could propose that might be able to generate language in a reasonable way (beyond N-gram sampling) can, in fact, do it really, really well even if all you do is scale it up, and even the simplest next-token prediction as a goal leads to much higher level abilities than you would expect. That was the hard core of the problem, building a flexible pattern mimic that can be easily trained, and it turns out to get us way further along the line to AGI than I suspect anyone working on it ever expected it would without major additions and changes to the design. Now it's probably time to start adding bits and bobs and addressing some of the shortcomings (e.g. static nature of the network, lack of online learning, the fact that chains of thought shouldn't be constrained to token sequences, addressing tokenization itself, etc), but IMO the engine at the heart of the current systems is so impressively capable that the remaining work is going to be less of an Einstein moment and more of an elbow grease and engineering grind.
We may not be close in the "2 years of known work" sense, but we're certainly not far in the "we have no idea how to prove the Riemann Hypothesis" sense anymore, where major unknown breakthroughs are still required which might be 50+ years away, or the problem might even be unsolvable.
Yes, I've always thought that LLMs need the equivalent of a limbic system. This is how we solved this problem in organic computers. There is no static 'reward function'. Instead, we have a dynamic reward function computer. It decides from day to day and hour to hour what our basic objectives are. It also crucially handles emotional 'tagging' of memory. Memories that we store are proportionally more likely to be retrieved under similar emotional conditions. It helps to filter relevant memories, which is something LLMs definitely could use.
I think the equivalent of an LLM limbic system is more or less the missing piece for AGI. Now, how you'd go about making one of those I have no idea. How does one construct an emotional state space?
Companies are bad about doing this on purpose. If they set out to build AGI and accomplish something novel, just call that AI and go on fund raising from people who don't know better (or more likely don't care and just want to gamble with others' money).
Continuous RL in a sense. There may be an undiscovered additional scaling law around models doing what you describe; continuous LLM-as-self-judge, if you will.
Provided it can be determined why a user ended the chat, which may turn out to be possible in some subset of conversations.
And they also sometimes write down the conclusion and work backwards, without considering that the most likely reason for the conclusion isn't necessarily going to have that conclusion as its most likely conclusion — I hope I phrased that broken symmetry correctly.
I did think of sour grapes (only thing that came to my mind) and was hoping for something better. Sour grapes doesn't seem too interesting. I think most people can tell you, if you actually ask them, what the differences between their actual successes and their wildest dreams are. But any improvement is still a success and I think that's valid.
The Millikan oil drop experiment, "winning" the space race, mostly anything C-levels will tell the board and shareholders at a shareholder meeting, the American wars in Iraq and Afghanistan, most of what Sam Altman or Elon Musk has to say; the list continues.
I think you're approaching it from a very high level, when you should think about it from a much lower level, i.e. success being determined by stress/dopamine hormones or similar.
This article is kind of vague on that tbf:
> To conclude, we observed no credible evidence for a beneficial effect of L-dopa (vs. Haloperidol) on reinforcement learning in a reward context, as well as the proposed mechanistic account of an enhanced striatal prediction error response mediating this effect.
Is that controversial? I would say everything a human does is to feel better, and everything someone does that doesn’t make them feel better immediately is just done in the expectation of even greater pleasure later.
Well mine can, with some tactics and strategy layered on top. If I do something I don’t like, I only do it because the payoff later makes it worth it (or at least I think it will from my current knowledge).
It is important that “profit”, comes in various forms, which exchange rates are problematic to calculate (or maybe there can’t be any): not hungry, not thirsty, tastes good, not cold, feel safe, feel excited, feel righteous, feel powerful, listen to music, watch a movie, get curious, satisfy curiosity, laugh, love, sex, rock n roll.
Most behavior we believe to be some kind of rational action is really blind action based on fiction, or just completely random, with rationalizations for the behavior after the fact.
>Interesting to think about what structures human intelligence has that these models don't.
Kant's Critique of Pure Reason has been a very influential way of examining this kind of epistemology. He put forth the argument that our ability to reason about objects comes through our apprehension of sensory input over time, schematizing these into an understanding of the objects, and finally, through reason (by way of the categories) into synthetic a priori knowledge (conclusions grounded in reason rather than empiricism).
If we look at this question in that sense, LLMs are good at symbolic manipulation that mimics our sensibility, as well as combining different encounters with concepts into an understanding of what those objects are relative to other sensed objects. What it lacks is the transcendental reasoning that can form novel and well grounded conclusions.
Such a system that could do this might consist of an LLM layer for translating sensory input (in LLM's case, language) into a representation that can be used by a logical system (of the kind that was popular in AI's first big boom) and then fed back out.
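To make the shape of that idea concrete, here's a minimal sketch of such a hybrid, with a trivial forward-chaining engine standing in for the logical system; `llm_extract_facts` is a hypothetical helper (assume it wraps whatever model you like), not a real API:

```python
from typing import List, Tuple

Fact = Tuple[str, str]   # e.g. ("socrates", "human")
Rule = Tuple[str, str]   # e.g. ("human", "mortal") meaning: everything human is mortal

def forward_chain(facts: List[Fact], rules: List[Rule]) -> List[Fact]:
    """Classical inference: keep applying rules until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for subject, category in list(derived):
            for premise, conclusion in rules:
                if category == premise and (subject, conclusion) not in derived:
                    derived.add((subject, conclusion))
                    changed = True
    return sorted(derived)

# Imagine these came from llm_extract_facts("Socrates is human. Humans are mortal."),
# i.e. the LLM handles perception/translation and the logic engine does the reasoning.
facts = [("socrates", "human")]
rules = [("human", "mortal")]
print(forward_chain(facts, rules))   # [('socrates', 'human'), ('socrates', 'mortal')]
```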
>Such a system that could do this might consist of an LLM layer for translating sensory input (in LLM's case, language) into a representation that can be used by a logical system (of the kind that was popular in AI's first big boom) and then fed back out.
This just goes back into the problems of that AI winter again, though. First-order logic isn't expressive enough to model the real world, while second-order logic doesn't have a complete proof system to truly verify all its statements, and is too complex and unwieldy for practical use. I would also imagine that the number of people working on such problems is very small; this isn't engineering so much as analytic philosophy and mathematics.
Kant predates analytic philosophy and some of its failures (the logical positivism you are referring to). The idea here is that first-order logic doesn't need to be expressive enough to model the world. Only that some logic system is capable of modeling the understanding of a representation of the world mediated by way of perception (via the current multimodal generative AI models). And finally, it does not need to be complete or correct, just equivalent to or better than how our minds do it.
With DeepSeek-R1-Zero, their use of RL didn't really have reward functions that indicated progress towards the goal, afaik.
It was "correct structure, wrong answer", "correct answer", "wrong answer". This was for Math & Coding, where they could verify answers deterministically.
> Framing transformer based models as pattern matchers makes all the sense in the world. Pattern matching is obviously vital to human problem solving skills too. Interesting to think about what structures human intelligence has that these models don't. For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.
What is also a benefit for humans, I think, is that people are typically much more selective. LLMs train to predict anything on the internet, so for example for finance that includes clickbait articles which have a lifetime of about 2 hours. Experts would probably reject any information in these articles and instead try to focus on high quality sources only.
Similarly, a math researcher will probably have read a completely different set of sources throughout their life than, say, a lawyer.
I’m not sure it’s a fundamental difference, but current models do seem not to specialize from the start, unlike humans. And that might be in the way of learning the best representations. I know from ice hockey, for example, that you can see within 3 seconds whether someone played ice hockey from a young age or not. Same with language. People can usually hear an accent within seconds. Relatedly, I used OpenAI's text-to-speech a while back and the Dutch voice had an American accent. What this means is that even if you ask LLMs about Buffett's strategy, maybe they have a "clickbait accent" too. So with the current approach to training, the models might never reach absolute expert performance.
When I was doing some NLP stuff a few years ago, I downloaded a few blobs of Common Crawl data, i.e. the kind of thing GPT was trained on. I was sort of horrified by the subject matter and quality: spam, advertisements, flame wars, porn... and that seems to be the vast majority of internet content. (If you've talked to a model without RLHF like one of the base Llama models, you may notice the personality is... different!)
I also started wondering about the utility of spending most of the network memorizing infinite trivia (even excluding most of the content above, which is trash), when LLMs don't really excel at that anyway, and they need to Google it anyway to give you a source. (Aside: I've heard some people have good luck with "hallucinate then verify" with RAG / Googling...)
i.e. what if we put those neurons to better use? Then I found the Phi-1 paper, which did exactly that. Instead of training the model on slop, they trained it on textbooks! And instead of starting with PhD level stuff, they started with kid level stuff and gradually increased the difficulty.
You can get rid of the trivia by training one model on the slop, then a second model on the first one - called distillation or teacher-student training. But it's not much of a problem because regularization during training should discourage it from learning random noise.
The reason LLMs work isn't because they learn the whole internet, it's because they try to learn it but then fail to, in a useful way.
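For reference, the core of the teacher-student distillation mentioned above is just training the student to match the teacher's softened output distribution; a minimal PyTorch-style sketch (the temperature and scaling are illustrative choices):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then push the student toward the teacher via KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable to a hard-label loss.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
```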
If anything current models are overly optimized away from this; I get the feeling they mostly want to tell you things from Wikipedia. You don't get a lot of answers that look like they came from a book.
I don't know, babies hear a lot of widely generic topics from multiple people before learning to speak.
I would rather put it that humans can additionally specialize much more, but we usually have a pretty okay generic understanding/model of a thing we consider as 'known'. I would even wager that being generic enough (ergo, sufficiently abstracted) is possibly the most important "feature" humans have? (In the context of learning)
> For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.
What we can integrate, we seem to integrate efficiently*; but compared to the quantities used to train AI, we humans may as well be literally vegetables.
* though people do argue about exactly how much input we get from vision etc.; personally I doubt vision input is important to general human intelligence, because if it was then people born blind would have intellectual development difficulties that I've never heard suggested exist — David Blunkett's success says human intelligence isn't just fine-tuning on top of a massive vision-grounded model.
Low level details like that aren't relevant to this discussion. Most human processing power is at the cellular level. The amount of processing power in a single finger literally dwarfs a modern data center, but we can't leverage that to think, only to live.
So it’s not a question of ‘a lot’ it’s a question of orders of magnitude vs “the quantities used to train AI”
The Library of Congress has, what, 39 million books; tokenize every single one and you're talking terabytes of training data for an LLM. We can toss blog posts etc. onto that pile, but every word ever written by a person isn't 20 orders of magnitude larger or anything.
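Back-of-envelope, with assumed averages (the per-book figures are rough guesses, not Library of Congress data):

```python
books = 39_000_000          # rough collection size cited above
tokens_per_book = 100_000   # assumes ~75k words per book at ~1.3 tokens per word
bytes_per_token = 4         # assumes ~4 characters of UTF-8 text per token

total_tokens = books * tokens_per_book         # ~3.9e12 tokens
total_bytes = total_tokens * bytes_per_token   # ~1.6e13 bytes, i.e. on the order of 15 TB
print(f"{total_tokens:.1e} tokens, ~{total_bytes / 1e12:.0f} TB of raw text")
```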
>Hearing is also well into the terabytes worth of information per year.
If we assume that the human auditory system is equivalent to uncompressed digital recording, sure. Actual neural coding is much more efficient, so the amount of data that is meaningfully processed after multiple stages of filtering and compression is plausibly on the order of tens of gigabytes per year; the amount actually retained is plausibly in the tens of megabytes.
Don't get me wrong, the human brain is hugely impressive, but we're heavily reliant on very lossy sensory mechanisms. A few rounds of Kim's Game will powerfully reveal just how much of what we perceive is instantly discarded, even when we're paying close attention.
The sensory information from individual hairs in the ear starts off as a lot more data to process than a simple digital encoding of two audio streams.
Neural encoding isn't particularly efficient from a pure data standpoint, just an energy standpoint. A given neuron not firing is information, and those nerve bundles contain a lot of neurons.
Is that a positive thing? If anything I would consider that as the reverse - LLMs have the "intelligence of vegetables" because even with literally the whole of human written knowledge they can at most regurgitate it back to us with no novelty whatsoever, even though a 2-year-old with a not even fully matured brain can learn a human language from orders of magnitude less, and lower quality, input from a couple of people only.
But any Nobel Prize winner has read significantly less than a basic LLM, and we see no LLM making even a tiny scientific achievement, let alone high-impact ones.
It's perfectly legit to call these models "thick" because they *need* to read such a vast quantity of text that a human would literally spend two thousand lifetimes to go through it even if that was all the human did with their days.
It also remains the case that, unlike us, they can go through all of that in a few months.
> with no novelty whatsoever, even though a 2-year-old with a not even fully matured brain can learn a human language from orders of magnitude less, and lower quality, input from a couple of people only.
You're either grossly underestimating AI or overestimating 2 year olds, possibly both.
I just about remember being a toddler, somewhere between then and 5 was around the age I had the idea that everyone got an invisible extra brain floating next to them for every year they lived. Took me an embarrassingly long time (teens, IIRC) to realise that the witch-duck-weight-comparison scene in Monty Python and the Holy Grail wasn't a documentary, thanks to the part of the film captioned "Famous Historian". One time my dad fell ill, and he was talking to mum about "the tissue being damaged" while I was present, so I gave him a handkerchief (AKA "a tissue"). And while I don't remember this directly, my mum's anecdotes include me saying "fetrol fump", waving a spoon in a jam pan and calling this act "spelling", and when discovered running around with my pockets inside-out explaining myself as trying to fly because I apparently thought that the lining of a pocket was called a "wing".
When it comes to human novelty, I also quite often find there's a lot of remixing going on that just isn't immediately apparent. As Steve Jobs apparently once said, “Good artists copy; great artists steal.”, except Jobs stole that quote from Picasso.
It's easy to categorise different levels with AI, but which one of these counts as "novelty", and how often do humans ever achieve each of these grades?
0. Memorisation of the training set. Think: bunch of pictures, pick best fit.
1. Linear interpolation between any pair of elements in the training set. Think: simple cross-fade between any two pictures, but no tracking or distorting of features during that fade.
2. Let the training set form a basis vector space, and interpolate freely within the constraints of the examples. Think: if these pictures are faces, it would make any hair colour between the most extreme limits shown, etc.
3. Extrapolate beyond the examples. Think: Even if no black or white hair was visible, so long as several shades of grey were, it could reach the ideas of black or white hair.
4. Invent a new vector. Think: even if it had been trained only on black-and-white images, it could still invent green hair.
> But any Nobel Prize winner has read significantly less than a basic LLM, and we see no LLM making even a tiny scientific achievement, let alone high-impact ones.
We do see them doing *tiny* scientific achievements, with extra emphasis on "tiny". Just like with using them in software, even the best "only" act like fresh graduates.
When any AI gets to high-impact… the following (fictional) quote comes to mind: "as soon as we started thinking for you, it really became our civilization."
> that a human would literally spend two thousand lifetimes to go through it even if that was all the human did with their days.
Well, `cp` would go over that data even faster, but depending on what retention/conclusion is reached from that it may or may not be impressive.
Humans are fundamentally limited by our biology; rotating a tiny sphere, turning pages, and serial processing do place certain hard limits on us.
A two-year-old can definitely say stupid stuff, or have wildly incomplete/incorrect models of their reality, but can most certainly already think and reason, and update their internal models at any point.
> Tiny scientific achievements, only acting as fresh graduates with regards to software
I don't believe they are anywhere close to being as good at software as a fresh graduate. Sure, many people write terrible code, and there are a lot of already solved problems out there (not even just solved, but solved thousands of times) - LLMs are definitely a novel tool when it comes to finding information based on some high-ish level patterns (over exact string match, or fuzzy match), and they are very good at transforming between different representations of said data, with minimal (and hard-limited) reasoning capabilities, but I have never seen evidence of them going any further than that.
I don't think your grades are "correct" - e.g. a random generator can easily create new vectors, but I wouldn't call that intelligence. Meanwhile, that two-year-old can make a novel discovery from their POV every couple of days, potentially turning their whole world model around each day. To me, that sounds way "cooler" than a statistically likely token given these previous tokens, and LLMs definitely need some further structure/architecture to beat humans.
--
I do like your last quote though, and definitely agree there!
> Well, `cp` would go over that data even faster, but depending on what retention/conclusion is reached from that it may or may not be impressive.
Sure, but it would be a level zero on that list, right?
I'd say even Google would be #0.
> A two years old can definitely say stupid stuff, or have wildly incomplete/incorrect models of their reality, but can most certainly already think and reason, and update their internal models at any point.
I think that this presumes a certain definition of "think" and "reason". Monsters under the bed? To move from concrete examples to the abstract, from four apples to the idea of four?
Imagine a picture of a moon's orbit around the parent planet and the planet's orbit around a star, first at one time of year, then again 60° later, the circular orbits of each drawn clearly, with the two positions of the moon's orbits aligned at the top of the image; exaggerate the scale for clarity, and find it in an astronomy book — my peers at age 6 or 7 thought it was a picture of a mouse.
Imagine teachers and an ambulance crew explaining to the class how blood is donated, showing that they're putting a bag up the teacher's sleeve and explaining how they'll demonstrate this by taking "blood" (fake? No idea at this point) from that bag. Everyone's looking, we see it go up the sleeve. We see the red stuff come out. The kid next to me screams "they're killing her!". Rather than say "we literally saw the bag go up the sleeve", 5-year-old-me tried to argue on the basis that killing a teacher in front of us was unlikely — not wrong, per se, but a strange argument, and I wondered even at the time why I made it.
Are these examples of "reason"? Could be. But, while I would say that we get to the "children say funny things" *with far fewer examples than the best AI*, it doesn't seem different in kind to what AI does.
> LLMs are definitely a novel tool when it comes to finding information based on some high-ish level patterns (over exact string match, or fuzzy match), and they are very good at transforming between different representations of said data, with minimal (and hard limited) reasoning capabilities, but I have never seen evidence of going any further than that.
Aye. So, where I'm going with #2 and #3: even knowing what the question means well enough to respond by appropriately gluing together a few existing documents correctly, requires the AI to have created a vector space of meaning from the words — the sort of thing which word2vec did. But:
To be able to translate questions into answers when neither the question nor the answer are themselves literally in the training set, requires at least #2. (If it was #1, you might see it transition from "Elizabeth II was Queen of the UK" to "Felipe VI is King of Spain" via a mid-point of "Macron is Monarch of France").
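That kind of compositional behaviour shows up even in plain word embeddings; a small sketch using gensim's pretrained GloVe vectors (assuming the `glove-wiki-gigaword-50` model is available via the downloader; illustrative only, not a claim about how LLMs internally answer questions):

```python
import gensim.downloader as api

# Load small pretrained word vectors (roughly a 65 MB download on first use).
vectors = api.load("glove-wiki-gigaword-50")

# Composition in the learned vector space, not retrieval of a memorized sentence:
# "king" - "man" + "woman" lands near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```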
For #3, I've tried the concrete example of getting ChatGPT (free model, a few months back now) to take the concept of the difference between a raccoon and a wolf and apply this difference again on top of a wolf, and… well, their combination of LLM and image generator gave me what looked like a greyhound, so I'm *not* convinced that OpenAI's models demonstrate this in normal use — but also, I've seen this kind of thing demonstrated with other models (including Anthropic's, so it's not a limit of the Transformer architecture) and the models seem to do more interesting things.
Possibly sample bias, I am aware of the risk of being subject to a Clever Hans effect.
For #4, this seems hard to be sure it has happened when it seems to have happened. I don't mean what word2vec does, which I realise now could be described in similar language, as what word2vec does is kinda a precursor to anything at least #1. Rather, what I mean, in a human, would seem like "spots a black swan before it happens". I think the invention of non-Euclidean geometry might count, but even then I'm not sure.
I feel like if you take the underlying transformer and apply it to other topics, e.g., eqtransformer, nobody questions this assumption. It's only when language is in the mix that people suggest they are something more and some kind of "artificial intelligence" akin to the beginnings of Data from Star Trek or C-3PO from Star Wars.
Human processing is very interesting and should likely lead to more improvements (and more understanding of human thought!)
Seems to me humans are very good at pattern matching, as a core requirement for intelligence. Not only that, we are wired to enjoy it innately - see sudoku, find Waldo, etc.
We also massively distill input information into short summaries. This is easy to see by what humans are blind to: the guy in a gorilla suit walking through a bunch of people passing a ball around, or basically any human behavior magicians use to deceive or redirect attention. We are bombarded with information constantly. This is the biggest difference between us and LLMs, as we have a lot more input data and also are constantly updating that information - with the added feature/limitation of time decay. It would be hard to navigate life without short term memory or a clear way to distinguish things that happened 10 minutes ago from 10 months ago. We don't fully recall each memory of washing the dishes but junk the vast, vast majority of our memories, which is probably the biggest shortcut our brains have over LLMs.
Then we also, crucially, store these summaries in memory as connected vignettes. And our memory is faulty but also quite rich for how "lossy" it must be. Think of a memory involving a ball from before the age of 10 and most people can drum up several relevant memories without much effort, no matter their age.
> Interesting to think about what structures human intelligence has that these models don't.
Pain receptors. If you want to mimic human psyche you have to make your agent want to gather resources and reproduce. And make it painful to lack those resources.
Now, do we really have to mimic human intelligence to get intelligence? You could make the point the internet is now a living organism but does it have some intellect or is it just some human parasite / symbiote?
>Interesting to think about what structures human intelligence has that these models don't.
If we get to the gritty details of what gradient descent is doing, we've got a "frame", i.e. a matrix or some array of weights that contains the possible solution for a problem; then, given our training data, we adjust another set of weights to minimize the loss function, fitting a probability distribution so that our solution takes shape within the "frame". That works for something like image recognition, where the "frame" is just the matrix of pixels, or in language models where we're trying to find the next word-vector given a preceding input.
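As a concrete toy version of that "frame" picture: a fixed array of weights, a loss against training data, and gradient descent nudging the weights toward the minimum (numpy, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # inputs: the fixed "frame" of features
true_W = np.array([1.5, -2.0, 0.5])
y = X @ true_W + 0.1 * rng.normal(size=100)    # noisy targets

W = np.zeros(3)                                # the weights we are fitting
lr = 0.1
for _ in range(200):
    grad = 2 * X.T @ (X @ W - y) / len(y)      # gradient of the mean squared error
    W -= lr * grad                             # step downhill
print(W)                                       # ends up close to true_W
```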
But take something like what Sir William Rowan Hamilton was doing back in 1843. He knew that complex numbers could be represented as points in a plane, and arithmetic could be performed on them, and he wanted to extend a similar scheme to points in space. With triples it is easy to define addition, but the problem was multiplication. In the end, he made an intuitive jump, a pattern recognition, when he realized that he could easily define multiplication using quadruples instead, and thus were born the quaternions that are a staple in 3D graphics today.
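For the curious, Hamilton's jump is easy to state in code: define the product on quadruples and the non-commutativity falls out (standard math, nothing model-specific):

```python
def qmul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z) tuples."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
print(qmul(i, j))   # (0, 0, 0, 1)  == k
print(qmul(j, i))   # (0, 0, 0, -1) == -k: multiplication is non-commutative
```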
If we want to generalize this kind of problem solving into a form that gradient descent can handle, where do we even start? First of all, we don't even know if a solution is possible or coherent, or what "direction" we are heading towards. It's not a systematic solution; it's rather one where a pattern in one branch of mathematics was recognized in another. So perhaps you might use something like Category Theory, but then how are we going to represent this in terms of numbers and convex functions, and is Category Theory even practical enough to easily do this?
> Interesting to think about what structures human intelligence has that these models don't
Chiefly?
After having thought long and hard, building further knowledge on the results of the process of having thought long and hard, and creating intellectual keys to further think long and hard better.
You do not need sensorial feedback to do math. And you do not need full sensors to have feeback - one well organized channel can suffice for some applications.
To learn new math, a professional mathematician foremostly just thinks further (it's its job); to discuss with other entities (and acquire new material), textual input and outputs suffice.
Your statement, not mine. And I wrote intelligence, not sentience.
People who become quadriplegic as adults (or older children) have already developed intelligence before.
My theory (which I have not researched in any way) implies that someone born fully quadriplegic would be severely impaired in developing intelligence. Sight and hearing are of course also important sources of feedback, the question is whether they are sufficient.
You might get a kick out of this essay by Robert Epstein from 2016: https://aeon.co/essays/your-brain-does-not-process-informati... (The empty brain - Your brain does not process information, retrieve knowledge or store memories. In short: your brain is not a computer)
Maybe I misunderstood it, but I feel that it's a weird article, because it fails to establish any vocabulary and then seems to use words in uncertain ways, as if constructing the narrative by specifically crafting (but never truly explaining/defining) some model that's not true, but presenting the argument with significantly expanded scope. Drastically reduced (which is not really correct, but may help me to convey my general impression/feelings only) it's kinda sorta like-ish "we aren't doing it the way our computers do, thus the information processing metaphor is wrong".
Like when talking about that experiment with an image of the dollar bill, it never says what an "image" is, just states that there wasn't one stored in the brain, in "any sense". And then it goes on describing an idea that seems to match the description of a "mental image" from cognitive science.
As I [very naively] get it... Information theory is a field of mathematics. Unlike all those previous concepts like humours, mechanical motions or electric activities, math is here to establish terminology and general principles that don't have to fundamentally change if^W when we learn more. And that's why it has stuck.
There is a whole genre of essays like this talking about behaviour in a human specific way. But, I wish they engaged with the notions of the Church-Turing thesis and the Universal Turing Machine which indicates that any behaviour following standard physics principles is in fact computable.
(FWIW, I dont think that humans can be reduced to computing, but the Church-Turing thesis is a powerful counterargument which more biologists and psychologists should engage with).
I stopped reading before reaching 2/3 of it but the start is already strawman after strawman (or misunderstanding to be generous).
I don’t think most people believe the brain is made up of a discrete Processing part that accesses information from a memory part that’s encoded in binary there. But just because the brain doesn’t contain a literal encoding of something in binary doesn’t mean the neurons don’t store the information.
If you download the weights of an LLM, you’re not going to find the text it can output „from memory“ anywhere in the file, but the weights still encode the information and can retrieve it (with some accuracy).
Coming up with a reward model seems to be really easy though.
Every decidable problem can be used as reward model. The only downside to this is that the LLM community has developed a severe disdain for making LLMs perform anything that can be verified by a classical algorithm. Only the most random data from the internet will do!
Your post on Twitter uses slightly more words than the ones preceding it above to make the exact same point. Was there really any reason to link to it? Why not expand on your argument here?
"LLMs are fundamentally matching the patterns they've seen, and their abilities are constrained by mathematical boundaries. Embedding tricks and chain-of-thought prompting simply extends their ability to do more sophisticated pattern matching."
LLMs keep showing, more and more, that they are the wonder of AI we awaited for decades: talking machines that every two months make progress that two months before was deemed impossible because of <put here some limit that was actually a prejudice of the skeptical AI community> (just stochastic parrots, no reasoning possible without symbolic representations, there are no longer tokens, ...).
At the same time, part of the scientific community continues to diminish what was accomplished and the steps that are being made. A few months ago LeCun went so far as to tell new researchers to move away from LLMs since they are a dead end: imagine the disservice he did to the surely non-zero number of folks who followed the advice, putting themselves out of the AI research that matters. (Incidentally, this skepticism from Meta's AI head must have something to do with the fact that Meta, despite the huge efforts allocated, produced the worst LLM among Anthropic, OpenAI, and DeepSeek -- I bet Zuckerberg is asking questions lately.)
It's very hard to explain this behavior if not by psychological denial.
[EDIT: you can't see the score of this comment, but I can: it's incredible how it goes from 3, to -2, to 1, and so forth. The community is split in two, and it is pretty sad since this is not a matter of taste or political inclination: there must be a single truth]
I get the sentiment, but I actually think some skepticism in the system is healthy.
Billions are flowing towards LLMs, and Sam Altman will overpromise to anyone who will listen that AGI is just around the corner and the days of jobs are gone, to fill his coffers.
Additionally if we begin to use these things in real production environments where mistakes matters, knowing the exact limitations is key.
None of this takes away from the fact that these are exciting times.
I can’t communicate enough how the skepticism (“this is just hype” or “LLMs are stochastic parrots”) is the vastly dominant thought paradigm in European academic circles.
So instead of everyone having some enthusiasm and some skepticism, you get a bifurcation where whole classes of people act as the skeptics and others as the enthusiasts. I view the strong skeptics as more “in the wrong” because they often don’t use LLMs much. If you are an actual enthusiastic user, you simply can’t get good performance without a very strong dose of skepticism towards everything LLMs output.
I don't think everyone shares those doubts. The first time you catch an LLM in a lie is sobering, but there are lots of areas, and thus lots of users, for whom it doesn't hallucinate because they're asking softball questions, or for whom hallucinations just really aren't that big a deal (e.g. an LLM horoscope generator, or using it to write sci-fi).
So while we're on HN going back and forth about how outright lies by the system indict the whole thing for everybody, we should be careful to note that it's not like that for everybody; or rather, it's a known limitation, so don't trust it to cite real cases for you as a lawyer, but using it to help you figure out what mens rea means in a practical sense, by asking it questions about the concept? Totally.
Honestly, hallucinations happen so rarely for me, because of the kinds of things I ask it, that I have no reason to distrust its answers in low-stakes situations, or situations on the level of horoscope generation, and I'm sure I'm not alone in treating ChatGPT that way, despite evidence to the contrary.
> I can’t communicate enough how the skepticism (“this is just hype” or “LLMs are stochastic parrots”) is the vastly dominant thought paradigm in European academic circles.
I'm very curious. If you don't mind taking the time to elaborate, will you give a few examples of such skepticism/naysaying? Thank you.
Yes, there is another part of the community that overhypes everything. But I can expect that from the CEO of an AI company (especially if he is Altman); from researchers, though? Also, now that reinforcement learning is starting to be applied to LLMs, the idea that they may reach superhuman expertise in certain fields in a short timeframe (a few years) may no longer be a totally crazy position. If it is possible to extend considerably the same approach seen in R1-Zero, there could be low-hanging fruit around the corner.
This article is about things which aren't limitations anymore!
You are applauding it as pushback for pushback's sake, but it's an article about limitations in biplane construction, published after we'd already landed on the moon.
Is there any evidence that these fundamental issues with compositionality have been resolved or are you just asserting it? Has the paper been replicated with a CoT model and had a positive result?
Well, yes — because modern models can solve all the examples in the article. The theory of compositionality is still an issue, but the evidence for it recedes.
I think most of the issue comes from the challenge of informational coherence. Once incoherence enters the context, the intelligence drops massively. You can have a lot of context and LLMs can maintain coherence— but not if the context itself is incoherent.
And, informationally, it is just a matter of time before a little incoherence gets into a thread.
This is why agents have so much potential—being able to separate out separate threads of thought in different context windows reduces the likelihood of incoherence emerging (vs one long thread).
>I think most of the issue comes from the challenge of informational coherence. Once incoherence enters the context, the intelligence drops massively. You can have a lot of context and LLMs can maintain coherence— but not if the context itself is incoherent.
As a non-expert, part of my definition of intelligence is that the system can detect incoherence, a.k.a reject bullshit. LLMs today can't do that and will happily emit bullshit in response.
Maybe the "gates" in the "workflows" discussed in the Anthropic article are a practical solution to that. But that still just seems like inserting human intelligence into the system for a specific engineering domain; not a general solution.
> …it is pretty sad since this is not a matter of taste or political inclination: there must be a single truth
This is a more salient point than you perhaps realized. In life there is no single, absolute, knowable truth. Philosophy has spent the entire span of human existence grappling with this topic. The real risk with AI is not that we build some humanity-destroying AGI, but that we build a machine that is 'convincing enough' — and the idea that such a machine would be built by people who believe in objective truth is the most worrying part.
A teleological argument that assumes truth is contingent upon a specific worldview would indeed be flawed, because it would make truth an artifact of a given perspective rather than something independent of it.
> At the same time, part of the scientific community continues to diminish what was accomplished
Revisit the idea: part of the public is bewildered by voices that started calling "intelligence" what was and apparently still is the precise implementation of unintelligence. The fault is in some, many people - as usual.
Very recent state-of-the-art LLM models themselves declare that if the majority of their training data states that entity E is red they will say it's red, and if the majority says it's blue then they will say it's blue: that is the implementation of an artificial moron.
And in fact, very recent state-of-the-art LLM models state cretinous ideas that are child level - because "that's what they have heard" (stuck, moreover analytically, in the simplifications intrinsic in expression).
This architectural fault should be the foremost concern.
Psychological denial of what exactly? And what part of the article/preprints you are commenting on?
Every time an article exposing some limitation of the current wave of LLMs is submitted to HN there are comments like yours and I genuinely cannot understand the point you are trying to make: There is no such thing as a perfect technology, everything has limitations, and we can only improve our current state of the art by studying these and iterate.
I think if we referred to LLMs as AK (Artificial Knowledge) instead of AI it would be easier to have more cohesive discussions.
I don’t see how there can be a single truth when there is not even a single definition of many of the underlying terms (intelligence, AGI, etc) which this discipline supposedly defines itself by. Combine that with a lot of people with little philosophical perspective suddenly being confronted with philosophical topics and you end up with a discourse that personally I’ve mostly given up on participating in until things calm down again.
It feels like nobody remembers all the timelines for which we were supposed to have self-driving cars.
You are, I think, badly misrepresenting what Yann LeCun said: he didn't say LLMs were a dead end, he said to do research in directions that do not require billions of dollars of investment to show results. In particular for PhDs this is sensible, and in view of recent cheaper results, prescient.
Sensible with the caveat that DeepSeek R1 still took millions of dollars of compute time, so you're not training the next one on the box in your basement with a pair of 3090s (though you could certainly fine-tune a shared quantized model). You can't run the full-sized model on anything cheap, so basement researchers still need access to a decent amount of funding, which likely requires outside help.
It is becoming more and more important to determine for ourselves what is true and what is not. No person is right on most things, even when they are an expert in that thing. The biggest trap is to believe someone because they are passionate, because they say it with conviction. Ignore most of the out-of-band signaling, take what they are saying, and then also see if you can corroborate it with another source.
There are so many people who are wrong about so many things.
I really appreciate that you are making your dev with ai videos, it shows people different, more humanistic ways of operating with AI.
Most of what I use AI for is to understand and relearn things I only thought I knew. This I think, is the most powerful use of AI, not in the code writing or the image generation, but in understanding and synthesis.
There is that hilarious tautological statement, "it is easy if you know it".
This video https://www.youtube.com/watch?v=TPLPpz6dD3A shows how to use AI to be a personal tutor using the Socratic Method. This is what people should be using AI for, have it test you for things you think you are already good at and you will find huge gaps in your own understanding. Now go apply it to things you have no clue about.
Speaking of parrots, a large volume of the anti AI sentiment, even here is by people repeating half truths they don't understand, confidently, about what AI cannot do. One would need a pretty tight formal case to prove such things.
Everyone should be playing, learning and exploring with these new tools, not shutting each other down.
Yes, the stochastic parrots story is one of the strongest recent instances of experts in a field being blinded by their own expertise (the mental model they have of certain things) to the point of being incapable of seeing trivial evidence.
There’s a certain irony in hearing someone describe an LLM as a "stochastic parrot" for the ten-thousandth time when the only reason they’re doing so is that they’ve seen a sufficient number of other people using the exact same term (so now it's in their proverbial training data).
It's an accurate description, though. People are running around convincing themselves that the RNG text generator is alive and it's worth slapping some sense into them.
with recent models, while stitching together training data is still most of the story, there's obviously something more going on, especially with regards to code, mathematics, and with the ability to successfully produce dialogue that respects pragmatics. this has nothing to do with a real thinking being, but it's become different from a stochastic parrot.
there are deluded people who think that Claude and ChatGPT are really thinking and feeling. it is admirable to try to argue them out of this position.
but if your argument involves saying things that are not true, and anyone can easily see are not true by interacting with a consumer-grade chatbot for 10 minutes, it will just discredit you, and those people will go on believing they're talking to HAL-9000.
"Obviously" is doing some heavy lifting there. I don't think it's obvious at all.
What is obvious is people are seeing patterns in the output where there are none. We're hardwired for pattern-seeking and we're great at it. We see faces in clouds and hear words in static.
I've seen a number of people wax about how they interrogate ChatGPT until it gives them "the good stuff" and they're all just Clever Hans-ing themselves.
"It's wild that I took a whole-ass class all about differentiable functions from R^2 to itself that satisfy one specific PDE."
output from Claude (one-shot, no retries or trickery) when asked to "explain the joke":
"The joke is playing off the fact that "complex analysis" (the study of differentiable functions satisfying the Cauchy-Riemann equations - a type of PDE) sounds like it should be more complex than it is when described technically. The humor comes from reducing an entire mathematical field to "functions that satisfy one specific PDE," making it sound surprisingly narrow or trivial despite its actual depth and importance."
Wikipedia and lots of math textbooks are in the training data, so it's not so impressive that this model will produce correct statements about math.
But the joke itself was not in the training data (to the best of my knowledge, it's this guy's original joke). And complex analysis was not mentioned in the joke. Yet somehow the text generated is correct with respect to both of those things.
I see things like this quite regularly, which under the "stochastic parrots" story, can't happen.
I've tried to phrase all these sentences very carefully to not claim there is any "agent" or "intelligence" behind the Claude product. There are many explanations for how a language model like this could imitate intelligent dialogue in ways that are somewhat fake and don't generalize -- I think this is what's happening. I also see things break down all the time and the sleight-of-hand fall apart. However, it is not "stochastic parrots" any more.
I'm sure that there are people that are deluded into thinking ChatGPT loves them like a real life flesh and blood being can even when it says it can't, but we have such limited vocabulary,
especially as laymen, for describing any non-human intelligence, that saying it's thinking and reasoning aren't entirely unreasonable words to describe what it's doing. sure, it's not thinking in the same way a human would, but when a computer, pre-LLM, and even pre-Internet, is doing something that requires the user to wait, saying the computer is "thinking" is an entirely accepted practice.
So if we want to get people to stop using the words thinking and reasoning, we have to get replacement words into the lexicon. if I tell an LLM A implies B implies C, and I tell it A is true and it's able to tell me that C is thus also true, sure it's entirely due to that much logic existing in its training corpus, but unless we get to a point where I can say that ChatGPT is dot-producting an essay for me, or some other phrase, saying it's not doing "thinking" is going to fall flat on its face. Hell, Deepseek R1's output for the local model literally says <think>. It may not be thinking in a biological being sense, and it may not be reason in a biological sense to conclude C if A implies B implies C, if A is true, but we lack the common colloquial language to describe it otherwise.
completely agree. people say a classic chess engine is "thinking" when it goes deeper into the search tree, but nobody is confused about that. This colloquial language doesn't bother me so much.
But there really are people who think they are talking to something more than that. Like within a conversation with today's consumer product, they sincerely believe that an actual being is instantiated who has goals and intentions and talks to them.
Using language like "thinking" doesn't bother me. I'm not a stickler for precision language in colloquial speech.
I do think it's important to deflate the hype and give context to what you mean by "thinking" in products, technologies and so on. Calling it a "stochastic parrot" is a bit pithy but not unreasonable. Plus it's memorable.
Another trope that stands out is that someone will take a model, run a battery of tests against it and then make general statements about what LLMs can and cannot do without understanding their architecture, the training data, and the training itself.
And then they dress it up to sound scientific, when really they are making hasty generalizations to support a preconceived bias.
But what for? Human learning is becoming of diminishing utility as the machines improve. For example, I am now able to create computer programs and beautiful artwork without taking the time to master these skills. You could say that I can use art and programming as basic tools to accelerate my learning of bigger things, but whatever that bigger thing is AI is coming for it too. I can't imagine the progress the machines will achieve in 10 years. We'll be replaced.
The reason you overestimate their capabilities is because you use them for things you don’t know anything about. It’s like when your nephew made a simple HTML website for himself twenty years ago that was <h1>Hi I am Mark</h1> — it seemed impressive, but you just didn’t know that it wasn’t. Using LLMs in real world complex cases (in programming or art) instantly reveal their significant shortcomings. They are a very good nephew for making stuff that seem impressive, but a bad expert or consultant.
I'm sorry, but they don't "make progress that was deemed impossible", especially not every two months.
They were predicted to end the software engineering profession for almost four years already. And it just doesn't happen, even though they can bang out a perfect to-do list in React in a matter of seconds.
LLMs have seen incremental improvements in the quality of their responses as measured by benchmarks. The speed and cost of inference have also been improving. Despite that, there has been no major breakthrough since GPT-3.
People keep trying to make them reason, and keep failing at it.
> They were predicted to end the software engineering profession for almost four years already
ChatGPT was launched on November 30 2022. Two years and two months ago. The fact that in such a short timeframe you're talking about missed predictions is absurd, but telling of the accelerated timeframe in which we're living. The fact is that currently AI and LLMs are going through a phase of explosive improvement, to the point we can expect enormous improvements in capabilities every six months or so.
I use LLMs daily so I’m no skeptic. We are not seeing enormous improvements every 6 months; that’s hyperbolic. There has been a significant improvement since GPT-3.5, I’ll give you that, but even over those ~2 years I don’t think I’d describe the improvement as “enormous”. The capabilities are similar, with output quality improving by a noticeable degree.
GPT-2 was released November, 2019, so that's five years ago. GPT-3 isn't the advancement to look at, but ChatGPT, which didn't use GPT-3, but a newer RLHF'd model, based off GPT-3, has the 2022 launch date.
If you don’t see the difference in quality of responses between GPT-3.5 as it launched in 2022 and o1/o3 then I don’t know what to tell you. I am using these models daily and the difference is night and day.
In addition to passing bar exam[0], improved performance on medical questions[1], economics questions that experts thought it was years away from[2], all the other things marked in green on page 6 were just the changes from 3.5 to 4: https://arxiv.org/pdf/2303.08774
At this point just paste my comments into ChatGPT and ask it to explain to you what I mean by them. Then paste your response and ask it why it's not addressing the point made. At least use the tool for what it's good for.
So you're saying that it understands you better than I do?
I get that feeling too (in both directions) but this vague and hard to quantify sensation is not what I'd suggest in response to your clearly stated question:
> And what has enormously improved since ChatGPTs launch?
Which is, I think, answered by the things I listed.
It doesn’t understand me, but it could help you understand. What you listed aren’t major unexpected leaps but incremental improvements on things that already were known to be possible.
But you insist on being obstinate. ChatGPT advised me to disengage from this conversation.
ChatGPT did not ace the bar exam -- it was basically percentile graded against a group of people who mostly failed. If compared to real lawyers, it was 15th percentile on the essay portion.
15th percentile of passes, on the weakest aspect, is still a big improvement over "not passing". That improvement is what I wish to highlight.
(The observation that 48th percentile (lowest overall from your link, let alone 15th for essays) of passes corresponds to 90th percentile of all exam takers, suggests that perhaps too many humans are taking the exams before they're ready).
ChatGPT launched with GPT-3.5. We're now at o1 and o3-mini and DeepSeek-R1, but even in the last year with GPT-4 and GPT-4o, it became better than the average human at almost everything involving text. It writes better than the average person, faster and cheaper. It parses unstructured data better than the average person. There's a large number of everyday tasks for which it's perfectly reasonable today to just throw them at ChatGPT. That's all the last 1 to 1.5 years.
I dare say I'm more familiar with the capabilities of the leading models than certain big tech CEOs are, at least judging by their publicly communicated opinions.
I use 4o very often in my work and it mostly sucks. Sometimes it’s very good, sometimes it surfaces useful knowledge faster than a search engine would. Mostly it spouts out unhelpful noise (for my problems).
I’m sure if you need to make a to-do list in react it’s like magic (until the app gets complicated). In real world use, not so much.
(Also I have often code reviewed PRs from people who are heavy users and surprise surprise - their output is trash and very prone to bugs or being out of spec.)
I also think 4o sucks, but have you tried DeepSeek R1 (free on their website)? I thought it was night and day compared to 4o and o3-mini on the following topics:
- reverse engineering: when fed assembly (or decomp or a mock impl), it's consistently been able to figure out what the function actually does/why it's there from a high-level perspective, whereas ChatGPT merely states the obvious
- very technical C++ questions: DSR1 gives much more detailed answers, with bullet points and examples. Much better writing style. Slightly prone to hallucinations, but not that much
- any controversial topic: ChatGPT models are trained to avoid these because of its "safety" training
ChatGPT is a bit better (and faster) at writing simple code and doing some math, but that's it.
(obviously, common sense about what to share and not to share with these chatbots still apply, etc.)
You can access DeepSeek R1? For me, both chat and API have been down for over a week now (it shut down minutes after I topped up my account and generated an API key - I never got to use it :/).
There's lots of fiddling with these models. I found Claude 3.5 Sonnet to be superior to both GPT-4o and o1-preview in around 99% of the things I do; I only started comparing it against o3-mini, and right now it's a mixed bag. Then again, I tend to develop and refine specific prompts for Sonnet, which I haven't for o1-preview and o3-mini, so that could be a factor. Etc.
Yes, well, I live in the EU and thus can avoid US work hours and Chinese peak hours. I think availability has been a bit better since they disabled websearch (also I noticed DSR1 half a week before it made the mainstream news).
I live in the EU too. For me, the status page[0] shows a continuous API outage for the past 8 days, and it is still ongoing. Since it started, my API requests bounce back with an error, which changes seemingly at random between "unauthorized" and "insufficient balance". Neither of those reasons is valid, since I'm using a valid API key I made after creating an account, which I topped up with $20 (and have an invoice from them to prove it). I must have had mightily bad luck that the service went down soon after I generated the API key - I'm guessing my user/key is currently stuck in the middle of some migration, or possibly wasn't captured in a backup and got subsequently wiped. For now, I'll just patiently wait for them to fix their service.
AFAIK it's hosted on Chutes for free too (though limited to between 2k and 10k output tokens). Azure as well, though it might be ratelimited there (or at least it is through openrouter)
This is a good question. According to some the growth is exponential. Others think ChatGPT is basically still the same as it was at the end of 2022, minor differences aside. Why are the perspectives so different?
One of these archetypes is drowning in a hype-fueled news cycle: they mistake speculation for inevitability, dismiss skepticism as ignorance, and construct a reality where the technology's success is unquestionable.
The other is simply using the technology for what it's good for, observing that it's slowly, incrementally improving at tasks that it was already capable of since the major breakthrough, and acknowledging its limitations.
Incremental improvements don't give us any assurance that another major breakthrough is waiting around the corner.
SE is a good example - I get a lot of help from LLM tools and I think we're learning how to use them better across realistic SDLC processes as well, but we're not replacing lots of people at the moment. On the other hand I saw a business case from one of the big SI's (not my employer but in a deck that was shown by the SI in an discussion) that described the need to move their Indian software dev workforce from 350k FTE to 50K FTE over the next five years.
I think that the onshore impacts will be much lower or negligible, or possibly even positive, because so much work has been offshored already, and, as is well worn in every discussion, Jevons paradox may drive up demand significantly (to be fair, I believe this because wherever I have worked we've had 3x+ demand (with business cases) for development projects and had to arbitrarily cull 2x of it at the beginning of each year). So, just like the 30 people in India who are working on my project won't do anything useful unless we feed the work to them, the LLMs won't do anything useful either. And just like we have to send lots of work back to India because it's not right, the same is true of LLMs. The difference is that I won't spend 4 hrs on a Friday afternoon on Teams discussing it.
But this is not surprising because we've had big impacts from tools like IDE's, VM's, and compilers which have driven seismic changes in our profession, I think that LLM's are just another one of those.
What I'm watching for is an impact in a non tech domain like healthcare or social care. These are important domains that are overwhelmed with demand and riddled with makework, yet so far LLM's have made very little impact. At least, I am not seeing health insurance rates being cut, hospital waiting lists fall or money and staff being redeployed from back office functions to front line functions.
Did you read the article? Dziri and Peng are not the “skeptical AI community,” they are in fact die hard AI researchers. This is like saying people who run benchmarks to find performance problems in code are skeptics or haters.
I read the article: it does not look like very good research. It's easy to find flaws in LLMs' reasoning/compositional capabilities by looking at problems that are at the limit of what they can do now, or by picking problems that are very far from their computational model, or by submitting riddles. But there is no good analysis of the limitations, nor any inspection of how, or how much, recent LLMs have improved at exactly this kind of problem. The article is also full of uninformative and obvious observations about how LLMs fail at trivial tasks such as multiplication of large numbers.
But the most absurd thing is that the paper looks at computational complexity in terms of direct function composition, and there is no reason an LLM should use only this kind of model when emitting many tokens. Note that even when CoT is not explicit, the LLM output that starts to shape the thinking process still effectively gives it a technically unbounded number of layers. With CoT this is even more obvious.
Basically there is no bridge between their restricted model and an LLM.
I think that “part of the scientific community” actually wants to do what needs to be done: “We have to really understand what’s going on under the hood,” she said. “If we crack how they perform a task and how they reason, we can probably fix them. But if we don’t know, that’s where it’s really hard to do anything.”
Well, there appears to be evolution in human perception of capabilities of LLMs. An example, the 'stochastic parrots' notion seems to have mostly died out, at least in HN comments.
What a poorly informed article. It's very shallow and out of touch with LLM research. As it stands, 6-12 month old models are system 1 thinkers; everybody knows this and knew this even at the time. You need system 2 thinking (test-time compute) for more complex logical, algorithmic and reasoning tasks. We knew this when Daniel Kahneman wrote Thinking, Fast and Slow (over a decade ago) and we still know it today. So LLMs can think, but they have to be programmed to think (a la system 2, reasoning, thinking models). There's nothing inherently wrong or limited with LLMs themselves as far as we can tell.
This is an example of "metaphor-driven development" in AI, which Phil Agre criticized a few decades ago.
System 1/System 2 isn't a real thing. It's just a metaphor Kahneman invented for a book. AI developers continually find metaphors about the brain, decide they are real, implement something which they give the same name, decide it's both real and the same thing because they have given it the same name, and then find it doesn't work.
(Another common example is "world model", something which has never had a clear meaning, and if you did define it you'd find that people don't have one and don't need one.)
Idk, `o3-mini-high` was able to pop this Prolog code out in about 20 seconds:
solve(WaterDrinker, ZebraOwner) :-
% H01: Five houses with positions 1..5.
Houses = [ house(1, _, norwegian, _, _, _), % H10: Norwegian lives in the first house.
house(2, blue, _, _, _, _), % H15: Since the Norwegian lives next to the blue house,
house(3, _, _, milk, _, _), % and house1 is Norwegian, house2 must be blue.
house(4, _, _, _, _, _),
house(5, _, _, _, _, _) ],
% H02: The Englishman lives in the red house.
member(house(_, red, englishman, _, _, _), Houses),
% H03: The Spaniard owns the dog.
member(house(_, _, spaniard, _, dog, _), Houses),
% H04: Coffee is drunk in the green house.
member(house(_, green, _, coffee, _, _), Houses),
% H05: The Ukrainian drinks tea.
member(house(_, _, ukrainian, tea, _, _), Houses),
% H06: The green house is immediately to the right of the ivory house.
right_of(house(_, green, _, _, _, _), house(_, ivory, _, _, _, _), Houses),
% H07: The Old Gold smoker owns snails.
member(house(_, _, _, _, snails, old_gold), Houses),
% H08: Kools are smoked in the yellow house.
member(house(_, yellow, _, _, _, kools), Houses),
% H11: The man who smokes Chesterfields lives in the house next to the man with the fox.
next_to(house(_, _, _, _, _, chesterfields), house(_, _, _, _, fox, _), Houses),
% H12: Kools are smoked in a house next to the house where the horse is kept.
next_to(house(_, _, _, _, horse, _), house(_, _, _, _, _, kools), Houses),
% H13: The Lucky Strike smoker drinks orange juice.
member(house(_, _, _, orange_juice, _, lucky_strike), Houses),
% H14: The Japanese smokes Parliaments.
member(house(_, _, japanese, _, _, parliaments), Houses),
% (H09 is built in: Milk is drunk in the middle house, i.e. house3.)
% Finally, find out:
% Q1: Who drinks water?
member(house(_, _, WaterDrinker, water, _, _), Houses),
% Q2: Who owns the zebra?
member(house(_, _, ZebraOwner, _, zebra, _), Houses).
right_of(Right, Left, Houses) :-
nextto(Left, Right, Houses).
next_to(X, Y, Houses) :-
nextto(X, Y, Houses);
nextto(Y, X, Houses).
Seems ok to me.
?- solve(WaterDrinker, ZebraOwner).
WaterDrinker = norwegian,
ZebraOwner = japanese .
That's because it uses a long CoT. The actual paper [1] [2] talks about the limitations of decoder-only transformers predicting the reply directly, although it also establishes the benefits of CoT for composition.
This is all known for a long time and makes intuitive sense - you can't squeeze more computation from it than it can provide. The authors just formally proved it (which is no small deal). And Quanta is being dramatic with conclusions and headlines, as always.
LLMs using CoT are also decoder-only; it's not a paradigm shift, as people now want to claim so they don't have to admit they were wrong: it's still next-token prediction, just forced to explore more possibilities in the space it contains. And with R1-Zero we also know that LLMs can train themselves to do so.
gpt-4o, asked to produce swi-prolog code, gets the same result using very similar code. gpt4-turbo can do it with slightly less nice code. gpt-3.5-turbo struggled to get the syntax correct, but I think with some better prompting it could manage it.
CoT is definitely optional. Although I am sure all LLMs have seen this problem explained and solved in their training data.
This doesn't include encoder-decoder Transformer fusion for machine translation, or encoder-only models like BERT used for tasks such as text classification and named entity recognition.
The LLM doesn't understand it's doing this, though. It pattern matched against your "steering" in a way that generalized. And it didn't hallucinate in this particular case. That's still cherry picking, and you wouldn't trust this to turn a $500k screw.
I feel like we're at 2004 Darpa Grand Challenge level, but we're nowhere near solving all of the issues required to run this on public streets. It's impressive, but leaves an enormous amount to be desired.
I think we'll get there, but I don't think it'll be in just a few short years. The companies hyping that this accelerated timeline is just around the corner are doing so out of existential need to keep the funding flowing.
I'm certain models like o3-mini are capable of writing Prolog of this quality for puzzles they haven't seen before - it feels like a very straight-forward conversion operation for them.
My comment got eaten by HN, but I think LLMs should be used as the glue between logic systems like Prolog, with inductive, deductive and abductive reasoning being handed off to a tool. LLMs are great at pattern matching, but forcing them to reason seems like an out-of-envelope use.
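A minimal sketch of that glue pattern, assuming the LLM has already emitted a Prolog program like the one above into a file (zebra.pl is a hypothetical name) and that SWI-Prolog's swipl binary is installed. The LLM only translates prose into Prolog; all deduction is handed off to the solver:

    import subprocess

    # Hypothetical file produced by the LLM: it defines solve/2 as in the
    # transcript above. The deduction happens entirely inside SWI-Prolog.
    PROGRAM = "zebra.pl"
    GOAL = "solve(W, Z), format('water: ~w, zebra: ~w~n', [W, Z])"

    result = subprocess.run(
        ["swipl", "-q", "-g", GOAL, "-t", "halt", PROGRAM],
        capture_output=True, text=True, timeout=10,
    )
    print(result.stdout.strip())  # e.g. "water: norwegian, zebra: japanese"

The nice property is that a wrong translation usually fails loudly (no solution, or a syntax error) rather than producing a fluent but wrong answer.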
Prolog would be how I would solve puzzles like that as well. It is like calling someone weak for using a spreadsheet or a calculator.
I actually coincidentally tried this yesterday on variants of the "surgeon can't operate on boy" puzzle. It didn't help, LLMs still can't reliably solve it.
(All current commercial LLMs are badly overfit on this puzzle, so if you try changing parts of it they'll get stuck and try to give the original answer in ways that don't make sense.)
If the LLM’s user indicates that the input can and should be translated as a logic problem, and then the user runs that definition in an external Prolog solver, what’s the LLM really doing here? Probabilistically mapping a logic problem to Prolog? That’s not quite the LLM solving the problem.
Not the user you’re replying to, but I would feel differently if the LLM responded with “This is a problem I can’t reliably solve by myself, but there’s a logic programming system called Prolog for which I could write a suitable program that would. Do you have access to a Prolog interpreter, or could you give me access to one? I could also just output the Prolog program if you like.”
Furthermore, the LLM does know how Prolog’s unification algorithm works (in the sense that it can provide an explanation of how Prolog and the algorithm works), yet it isn’t able to follow that algorithm by itself like a human could (with pen and paper), even for simple Prolog programs whose execution would fit into the resource constraints.
This is part of the gap that I see to true human-level intelligence.
If an LLM can solve a riddle of arbitrary complexity that is not similar to an already-solved riddle, have the LLM solve the riddle "how can this trained machine-learning model be adjusted to improve its riddle-solving abilities without regressing in any other meaningful capability".
It's apparent that this particular riddle is not presently solved by LLMs, because if it were, humans would be having LLMs improve themselves in the wild.
So, constructively, there exists at least one riddle that doesn't have a pattern similar to existing ones, where that riddle is unsolvable by any existing LLM.
If you present a SINGLE riddle an LLM can solve, people will reply that particular riddle isn't good enough. In order to succeed they need to solve all the riddles, including the one I presented above.
It's quite the opposite. Converting to words like yours, the argument is "could a powerful but not omnipotent god make themself more powerful", and the answer is "probably".
If the god cannot grant themself powers they're not very powerful at all, are they?
Good point. LLMs can be treated as "theories", and then they definitely meet the falsifiability criterion [1], allowing researchers to keep finding "black swans" for years to come. The theories in this case can take different forms. But if the theory is that of a logical or symbolic solver, then Wolfram's Mathematica may struggle with understanding human language as input, but when it comes to evaluating the results, well, I think Stephen (Wolfram) can sleep soundly, at least for now.
There's so much talk about the advancements in AI/LLMs, yet for me ChatGPT as of this date is basically just a faster search engine without cookie banners, clickbait and ads. It hallucinates a lot and it can keep very limited context. Why is there so much promise about future progress but so little actual progress?
It's the same cycle we saw with Crypto, there's so much money flying around that the motivation to "believe" is overwhelming. The hype is coming from all directions, and people are social animals that put greater weight on words that come from multiple sources. It's also a platform for people excited about the future to fantasize, and for people terrified of the future to catastrophize.
One must now ask whether research results are analyzing pure LLMs (eg. gpt-series) or LLM synthesis engines (eg. o-series, r-series). In this case, the headline is summarizing a paper originally published in 2023 and does not necessarily have bearing on new synthesis engines. In fact, evidence strongly suggests the opposite given o3's significant performance on ARC-AGI-1 which requires on-the-fly composition capability.
It's Quanta being misleading. They mention several papers but end up with this [1], which talks about decoder-only transformers, not LLMs in general, chatbots, or LLM synthesis engines, whatever that means. The paper also proves that CoT-like planning lets you squeeze more computation out of a transformer, which is... obvious? But formally proven this time. Models trained to do CoT don't have some magical on-the-fly compositional ability; they just invest more computation (could be dozens of millions of tokens in the case of o3 solving the tasks from that benchmark).
I've managed to get LLMs to fail on simple questions that require thinking graphically - 2D or 3D.
An example would be: you have an NxM grid. How many copies of shape XYZ can you fit on it?
However, judging by transformer-based video game models, AI can be trained to have a good representation of 2D/3D worlds. I wonder how the two can be combined so that this graphical representation is used to compute text output.
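For reference, the ground truth for the simplest reading of that question (counting axis-aligned placements of a w x h rectangle, which is one way "fit" could be interpreted) is easy to compute, which makes it a convenient check against model answers; a throwaway sketch:

    def rectangle_placements(n: int, m: int, w: int, h: int) -> int:
        """Count positions where a w x h rectangle fits inside an n x m grid,
        trying both orientations when the rectangle isn't square."""
        def one(w: int, h: int) -> int:
            return max(0, n - w + 1) * max(0, m - h + 1)
        return one(w, h) if w == h else one(w, h) + one(h, w)

    print(rectangle_placements(4, 5, 2, 3))  # 9 + 8 = 17

If "fit" instead means a maximum packing of non-overlapping pieces, the problem gets much harder, which is probably part of why models flail on it.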
When one of these limitations gets spelled out in an article, it feels like six months later, somebody has a demo of a chatbot without that particular limitation.
These limitations don’t seem in any way “fundamental” to me. I’m sure there are a ton of people gluing LLMs to SAT solvers as we speak.
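The solver side of that glue is already commodity. A toy sketch with the Z3 SMT solver (a close cousin of a SAT solver, pip install z3-solver; the variable names here are purely illustrative): the LLM's job would be to turn prose constraints into assertions like these, and Z3 does the actual deduction:

    from z3 import Ints, Solver, Distinct, And, sat

    # Each nationality gets a house position 1..5, all distinct.
    people = Ints("norwegian ukrainian englishman spaniard japanese")
    norwegian = people[0]

    s = Solver()
    s.add(Distinct(*people))
    s.add(And([And(1 <= p, p <= 5) for p in people]))
    s.add(norwegian == 1)  # "The Norwegian lives in the first house"

    if s.check() == sat:
        print(s.model())  # one satisfying assignment of positions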
Could you give an example of something we recently solved that was considered an unsolvable problem six months beforehand? I don’t have any specific examples, but it seems like most of the huge breakthrough discoveries I’ve seen announced end up being overstated and for practical usage, our choice of LLM-driven tools is only marginally better than they were a couple of years ago. It seems like the preponderance of practical advancement in recent times has come from the tooling/interface improvements rather than generating miracles from the models themselves. But it could be that I just don’t have the right use cases.
Take a look at the ARC Prize, which is a test for achieving "AGI" created in 2019 by François Chollet. Scroll down halfway on the home page and ponder the steep yellow line on the graph. That's what OpenAI o3 recently achieved.
unless you train directly against solving those problems... in which case how could you theoretically design a test that could stand against training directly against the answer sheet?
I don't think people really appreciate how simple ARC-AGI-1 was, and what solving it really means.
It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.
Passing it means your system exhibits non-zero fluid intelligence -- you're finally looking at something that isn't pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.
> designed as the simplest, most basic assessment of fluid intelligence possible.
This was the goal, but that doesn't say what the test itself is. Try to get a human to solve this problem without their visual cortex; they couldn't do it. Stating your goal for a thing doesn't make the thing that goal.
AI researchers designing intelligence tests are like programmers designing their own cryptography.
How about we have people skilled in neuropsychology, psychometrics and cognitive psychology do what they are good at.
> How about we have people skilled in neuropsychology, psychometrics and cognitive psychology do what they are good at.
Disagree. The thing that we will eventually call AGI will not be human. No need to have human-specific evaluations unless you’re aiming for an artificial human and not just an artificial intelligence.
> But why ignore a huge body of research in how to write scientific tests of intelligence and cognition?
Not saying to ignore it, but we are not dealing with humans. Those tests may give misleading results as you're proposing to use them outside of their design envelope. This is an area of research in itself.
Yes, the 12% impact is significant, especially in a societal context, because it represents a shift in how people access and process information. Even without AGI, the comparison between LLMs and search engines is crucial. LLMs provide synthesized, conversational responses rather than just indexing and ranking web pages. This shift reduces the need for users to evaluate multiple sources manually, which has far-reaching implications.
You would think that the training set for the models already included enough of Mensa etc iq tests so that the model knows how to do these kinds of tests. It takes humans 2 or at most 3 examples to "get" what the test is asking for, and then they can start filling the answers to the actual questions. Meanwhile it takes hundreds of answers at least (in the public set) to train o3 to do this test.
The need for a huge training set to solve simple questions has never stopped bewildering me. I think to get a human-like intelligent model we need to figure out why humans learn from 2 examples and the models don't. But I don't mean to say that the current models aren't intelligent in their own way or aren't useful already.
Human intelligence is bootstrapped by biological evolution and the society, neither of which is fast or efficient. Truly individual part of the intelligence is tiny, it's vastly overrated and relies on these two. Similarly, LLMs perform in-context learning, which is much more efficient because it relies on the pre-baked knowledge. Yes, the generalization ability is still incomparable to humans, but it's entirely possible that much better ability is achievable by slowly bootstrapping it.
I guess what I'm probably not seeing from my vantage point is that translating into a better experience with the tools available. I just cancelled a ChatGPT plus subscription because it just didn't seem useful enough to justify the price. I absolutely understand that there are people for whom it is, but nearly everyone I see that talks a lot about the value of AI either has use cases that I don't care about such as automated "content" generation or high-volume lowish-skill code generation, or they see achieving a progressively more difficult set of benchmarks as a useful end in itself. I like copilot autocomplete when I'm coding, but the quality of that hasn't dramatically changed. I don't give a damn about benchmarks-- I only care what I get from it practically. I have absolutely no interest in using ChatGPT as a therapist or companion because I value human connection and have access to it. So far I simply don't see significant changes in what comes out vs what gets typed in for practical usage. I wouldn't give ChatGPT logic problems to solve except maybe for generating code because I know code well enough to quickly evaluate its output. If the caveat is "hey FYI this thing might hide some frustratingly plausible looking bullshit in the answer so double-check its work," then what good is it really for hard problems if you just have to re-do them anyway?
The same thing is true with image generation. Sure, it's better in ways that are sort-of meaningful for low-value professional or hobby usage, but it's barely budged the barriers to becoming good enough for high-end media production.
I totally believe that this technology is improving and when you're looking at it in isolation, those improvements seem meaningful. But I just don't see that yet translating into things most of the general public can sink their teeth into. With things like the (still) shitty google search "enhancements", and users being forced into AI-driven chat workflows or having big loud not-really-useful UI elements dedicated to AI features, in some ways they've made people's experience using computers meaningfully worse.
Just like with Mastodon, I see a huge disconnect with the tech crowd's excitement with what's happening with the technology, and how that ends up working for users that need to actually solve their problems with that technology.
By the time these academic studies get published they are usually already several months out of date. o3-mini was released yesterday, and if one wants to know about the limitations of current technology they are much better off checking Twitter than some research paper.
I think the breathless hype train of twitter is probably the worst place to get an actually grounded take on what the real world implications of the technology is.
Seeing the 100th example of an llm generating some toy code for which there are a vast number of examples of approximately similar things in the training corpus doesn’t give you a clearer view of what is or isn’t possible.
I think that most of the developers who advocate for AI coding have never worked all by themselves on projects with over 500/1000 files. Because if they had they would not advocate for AI coding.
I posted this earlier, but I wanted a java port of sed for ... Reasons, and despite the existence of man pages and source code it couldn't do anything but the most basic flags.
IMO this should be low-hanging fruit. Porting a non-trivial but small set of 3-4 core code files that are already debugged and have a specified interface should be exactly what an LLM excels at.
I tried this with Microsoft's Copilot + the Think Deeper button. That allegedly uses the new o1 model. It goes into a lot of fancy talk about...pretty much what you said older models did. Then it said "here's some other stuff you could extend this with!" and a list of all the important sed functionality.
It's possible it could do it if prompted to finish the code with those things, but I don't know the secret number of fancy o1 uses I get and I don't want to burn them on something that's not for me.
You should be able to access it here if you have a Microsoft account and want to try the button: https://copilot.microsoft.com/
I have a good use case for them: communication with the bureaucracy of my country. I tell my LLM of choice to write a letter to $whoever about $whatever, then I print it out (yes, we still have to do this, as emails don't get accepted) and send it off. I don't even need to proofread it, because if there's a mistake the bureaucracy will tell me in another letter. So the burden of correctness checking is on some bureaucrat, which saves me time and mental resources.
I wouldn't ever use a LLM for anything where correctness matters (code) because I'd spend the same amount of time checking the generated code as writing it myself. But a letter to my tax office? Heck, why not. If something goes really wrong I can always say "gee, I made a mistake let's try it again".
So what, you use it to spam and waste other people's time? I know, dealing with government bureaucracy and corruption is soul-leeching, but spam was always one of the golden use cases for generative AI.
Sending official letters to the local government isn't spam, and generally not a waste of time.
For people with cognitive issues, issues typing, or language or presentation issues, LLMs provide a massive improvement in how they are perceived and received by the other side. Also, immigrants or people with language issues aren't quite as disadvantaged and don't need to spend excess time translating or risk an embarrassing misstatement. It's a night-and-day accommodation tool in the right circumstances.
No, I don't just send them random letters. I reply to mail I get from them or when I need them to do something (like adjust my tax pre-pay).
Also one could argue that bureaucracies only exist to create bullshit jobs and waste citizens' time. So I wouldn't even feel bad about spamming those assholes.
Gödel and Turing just proved that there are some true things that can't be proved, and things that cannot be computed. They didn't show where those boundaries are.
They certainly didn't show those boundaries to be below human cognition level.
Gödel proved that there are unprovable statements. Turing showed that certain classes of problems can only be solved by machines with infinite tapes. Thus no bounded LLM can possibly solve every problem that a Turing machine can solve. Only a theoretically infinite chain of thought could possibly get us that power.
Gödel then tells us that, if we have such a system, there are things where this system may get stuck.
Indeed this is what we see in chain of thought models. If you give them an impossible problem they either give up or produce a seemingly infinite series of tokens before emitting the </think> tag.
Turing tells us that examining any set of matrices modeling a finite state machine over an infinite token stream is the halting problem.
Theoretical computability is of dubious practical relevance.
Consider two problems:
Problem A is not computable
Problem B is computable in principle, but, even for trivially sized inputs, the best possible algorithm requires time and/or space we’ll never have in practice, orders of magnitude too large for our physical universe
From a theoretical computer science perspective, there is a huge difference between A and B. From a practical perspective, there is none whatsoever.
The real question is “can AIs do anything humans can do?” And appealing to what Turing machines can or can’t do is irrelevant, because there are a literally infinite number of problems which a Turing machine can solve, but no human nor AI ever could
So the article is about what humans vs LLMs can do, except in the article, LLM is taken to mean just a single-output autoregressive model (no chain of thought). Since an LLM has a constant number of steps at each token generation, no, it cannot do everything a human can. Humans can choose when to think and can ponder the next action interminably. That's my point. When we force LLMs to commit to a particular answer by forcing an output at each token generation, the class of problems they can solve is trivially smaller than that of an equivalent human.
I agree that a raw autoregressive LLM model with just a single output is (almost necessarily) less capable than humans. Not only can we ponder (chain of thought style), we also have various means available to us to check our work – e.g. for a coding problem, we can write the code, see if it compiles and runs and passes our tests, and if it doesn't, we can look at the error messages, add debugging, try some changes, and do that iteratively until we hopefully reach a solution–or else we give up – which the constraint "single output" denies.
I don't think anyone is actually expecting "AGI" to be achieved by a model labouring under such extreme limitations as a single output autoregressive LLM is. If instead we are talking about an AI agent with not just chain of thought, but also function calling to invoke various tools (including to write and run code), the ability to store and retrieve information with a RAG, etc – well, current versions of that aren't "AGI" either, but it seems much more plausible that they might eventually evolve into it.
I don't think we need to invoke Turing or Gödel in order to make the point I just made, and I think doing so is more distracting with irrelevancies than actually enlightening.
Yeah, the grounded take is that Turing and Gödel apply just as much to human intelligence. If not, someone please go ahead and use this to physically prove the existence of an immortal, hypercomputational soul.
That's not the point. Computer program with a finite number of steps (an auto regressive LLM without chain of thought) has a limit in what it can reason in one step. This article does a lot of wordcelling to show this obvious point.
That seems irrelevant to Gödel? If that was your point, you should have said that rather than the things about Turing and Gödel (which leads people to expect you are talking about the halting problem and incompleteness, not the limitations that come from a limited depth circuit)
I don't think individual examples can settle these kinds of discussions; for me, the amount of thinking can easily vary by 6x with exactly the same input and parameters.
> Chatbot Software Begins to Face Fundamental Limitations
> Recent results show that large language models struggle with compositional tasks, suggesting a hard limit to their abilities.
Your first question with anything like this should always be WHICH MODELS:
> For our experiments, we evaluate the performance of 6 LLMs: GPT4 (gpt-4) [58], ChatGPT
(GPT3.5-turbo) [57], GPT3 (text-davinci-003) [11], FlanT5 [17] and LLaMa [75].
This is ancient. This research was done centuries ago. This is research about the possibility of isotopes, written about radium in 1903, published in 1946. It is a criminal level of journalistic malpractice to leave uninformed readers with the impression that this is where AI stands yesterday.
I'd like to propose a modified Zebra puzzle. Still has unique solutions. Not on the internet, so not part of the training set. Similar problem description complexity. Just remove the last constraint (Norwegian lives next to the blue house) and asked negated questions: Who cannot be the water drinker? Who cannot own the zebra? Text of problem here: sandiway.arizona.edu/mzebra.pdf
(I tested it on free ChatGPT and DeepSeek-R1 https://youtu.be/3gotauWUcew)
This is ultimately a basic adaptation of the pigeonhole principle and is not surprising. A finite system of matrix multiplications cannot be Turing complete. You cannot expect one trip through a series of matrix multiplications and bias additions, with a final sampling at the end that commits it to a certain answer, to always produce a correct answer. It's a mathematical impossibility. No talk of quantum woo, emergent phenomena, or whatever other pseudo-science has arisen to explain AI intelligence can get around this simple truth of mathematics.
However, chain-of-thought reasoning, where token streams can continue ad infinitum, could potentially solve large swaths of problems whose ordinary solutions require Turing machines. It could also solve problems that cannot generally be solved by Turing machines, in cases where you only need solutions for a few classes of inputs.
Either way, even with chain of thought, you would expect that... in some instances, the model output diverges and does not complete. And unsurprisingly... this is exactly what you see with the DeepSeek models (and other CoT models) when you pose it difficult questions. It will never emit the </think> tag.
I'm a software developer with just a basic understanding of LLMs.
But it's not surprising to me that LLMs wouldn't be good at composition and planning.
To me, LLMs seem like half a brain, the right half, that does pattern matching and such.
To me, the current crop of AI seems to be missing the other half of the brain, the part that's good at planning and composition and such.
LLMs can tell me all I want to know about logic and deduction, so it's funny that LLMs can't understand their own words.
> multilayer transformers indeed cannot solve certain complicated compositional tasks
> chain-of-thought prompting essentially turns a large problem into a sequence of smaller problems, making it possible for transformers to tackle more complex compositional tasks
--- [out of order]
> the model could be trained on 20-digit numbers and still reliably (with 98% accuracy) add 100-digit numbers, whereas a model trained without the extra positional embedding was only about 3% accurate
98% seems pretty damn unreliable to me given that a computer ought to be able to do that addition with probably about 12 nines instead of 1 without really putting much effort into reliability.
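For comparison, the conventional-software baseline here is effectively exact: Python integers are arbitrary precision, so 100-digit addition is deterministic and error-free without any special effort (a throwaway illustration, not a benchmark):

    import random

    # Two random 100-digit integers; the sum is exact every single time.
    a = random.randrange(10**99, 10**100)
    b = random.randrange(10**99, 10**100)
    assert len(str(a)) == len(str(b)) == 100
    print(a + b)

That's the gap the 98% figure has to be judged against when the task is arithmetic rather than language.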
Over my life I've been "trained" that humans are unreliable and subject-to-change while computers are deterministic and will give the same output every time (given the same input.)
LLMs though are very human. Ask it the same question and you'll get different answers. Sometimes grammatically different, sometimes factually different.
This is partly why hallucinations freaked people out. I expect a correct, and deterministic answer. And it just made stuff up. Then it made up URLs to validate the answer.
In other words you have computers behaving like humans. Which was indeed the whole point of them to begin with.
Can you add 100 digit numbers reliably in a short amount of time over a large sample size? How about without a piece of paper and a pen? Can you even remember a 100 digit number? It’s likely quite difficult…
This is not about comparing to reliability “of a computer” but rather to a process which responds to a wide array of natural language queries (both human processes an algorithmic processes)… and obviously the reliability goes up once tool use is included.
At pace though? What would your throughput be to maintain 98% reliability? How long do you think one execution would take? my guess would be between 30s-1m for a single summation of two 100 digit numbers? So let’s say you want a sample size of 200 operations to estimate reliability… maybe something like 2-4hours to get through the computations working non stop? That’s actually a nontrivial amount of focus on a highly repetitive task, which isn’t easy for a lot of people. I’m now genuinely curious what the average HN user’s reliability and throughput would be for adding together two random hundred digit numbers working as fast as they can.
The point made by a commenter a few levels up questioned whether 98% reliability for adding two 100-digit numbers together should be considered noteworthy/acceptable etc.
My point is that assessing 98% reliability qualitatively depends on whatever kind of baseline system you are comparing against.
Obviously this reliability (and speed) of an LLM with 98% accuracy is atrocious compared to any basic automated arithmetic computation system. If you need to reliably add together 100 digit numbers, of course an LLM is a bad choice.
However, comparing it to the general class of systems which an LLM belongs to - ie “things that can add 100-digit numbers together quickly and also tell you about this history of the Ottoman Empire and a bunch of other stuff too, all within a relatively short amount of time” - this reliability is (potentially) impressive. We don’t have many such systems - a moderately educated human; a generic bespoke piece of software which process natural language requests and attempts to decide whether or not to retrieve information from a database/the internet or write and execute code, but could conceivably have been written in a crude manner 5 years ago; and LLMs (which might also take advantage of tool use as suggested in the previous example).
In this context - comparing to other systems with similar capabilities - the 98% reliability might be considered impressive, especially when you consider that it likely is much higher for more common simpler arithmetic problems.
My intention in bringing up “an average HN user” was to bring up an example of a reasonable benchmark to compare to, in my opinion. A HN user is a stand-in for a moderately educated person, which could reasonably respond to a variety of generic natural language requests in a short amount of time. My point was simply that adding together 100 digit numbers reliably and quickly while also being able to chat about Shakespeare or Kalman filters or planning a wedding is likely a more difficult task than we tend to give credence to, due to our familiarity with specialized systems which do arithmetic extremely well and extremely quickly.
Maybe it's just me, but when I code/program I typically think of how different implementation variations would affect the ROI of the company I'm working for.
That's something that AI in the current state probably wouldn't be able to take into account.
Isn't the math calculation thing solvable through integration with a simple Python script for basic math and, for more complex stuff, something like an open-source version of Wolfram?
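That's the standard tool-use pattern, yes: have the model emit an expression and let a deterministic evaluator compute the result. A minimal sketch of such an evaluator (the function name and the operator whitelist are illustrative, not any particular product's API):

    import ast, operator

    # Whitelisted arithmetic operators; anything else is rejected.
    OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
           ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

    def evaluate(expr: str):
        """Safely evaluate a plain arithmetic expression like '2**100 + 17'."""
        def walk(node):
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.operand))
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval").body)

    print(evaluate("123456789 * 987654321 + 42"))  # exact integer arithmetic

For symbolic work you'd swap the evaluator for something like SymPy, which is roughly the open-source Wolfram-ish option people reach for.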
H01 There are five houses.
H02 The Englishman lives in the red house.
H03 The Spaniard owns the dog.
H04 Coffee is drunk in the green house.
H05 The Ukrainian drinks tea.
H06 The green house is immediately to the right of the ivory house.
H07 The Old Gold smoker owns snails.
H08 Kools are smoked in the yellow house.
H09 Milk is drunk in the middle house.
H10 The Norwegian lives in the first house.
H11 The man who smokes Chesterfields lives in the house next to the man with the fox.
H12 Kools are smoked in a house next to the house where the horse is kept.
H13 The Lucky Strike smoker drinks orange juice.
H14 The Japanese smokes Parliaments.
H15 The Norwegian lives next to the blue house.
Now,
Q1 Who drinks water?
Q2 Who owns the zebra?
I ran a few replacements because the puzzle is probably in the training data somewhere:
Englishman -> Scotsman
Spaniard -> Welshman
dog -> parrot
Coffee -> Kombucha
green -> scarlet
tea -> butterscotch
Ukrainian -> Romanian
ivory -> pink
snails -> scorpions
yellow -> turquoise
Milk -> Red Bull
Norwegian -> Brazilian
fox -> bear
horse -> mule
orange juice -> rum
Japanese -> German
blue -> brown
red -> purple
Here's my new prompt after replacing those words:
H01 There are five huts.
H02 The Scotsman lives in the purple hut.
H03 The Welshman owns the parrot.
H04 Kombucha is drunk in the scarlet hut.
H05 The Romanian drinks butterscotch.
H06 The scarlet hut is immediately to the right of the pink hut.
H07 The Old Gold smoker owns scorpions.
H08 Kools are smoked in the turquoise hut.
H09 Red Bull is drunk in the middle hut.
H10 The Brazilian lives in the first hut.
H11 The man who smokes Chesterfields lives in the hut next to the man with the bear.
H12 Kools are smoked in a hut next to the hut where the mule is kept.
H13 The Lucky Strike smoker drinks rum.
H14 The German smokes Parliaments.
H15 The Brazilian lives next to the brown hut.
Now,
Q1 Who drinks water?
Q2 Who owns the zebra?
Then I ran the new prompt through o3-mini-high and it gave this answer (after a thought process that looks legit to me, like it actually did reason through the puzzle and not just spit out an answer it had seen):
Q1: Who drinks water?
A: The Brazilian (in Hut 1) drinks water.
Q2: Who owns the zebra?
A: The German (in Hut 5) owns the zebra.
Undoing my replacements, that's the Norwegian drinking water and the Japanese person owning a zebra. Which I think is the solution from the original paper.
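For what it's worth, this kind of surface renaming is easy to script, which makes it cheap to churn out fresh variants. A small sketch mirroring the substitutions above (word-boundary, case-sensitive matching avoids accidents like a later "red" -> "purple" pass clobbering the "Red Bull" that an earlier rule introduced):

    import re

    REPLACEMENTS = {
        "Englishman": "Scotsman", "Spaniard": "Welshman", "dog": "parrot",
        "Coffee": "Kombucha", "green": "scarlet", "tea": "butterscotch",
        "Ukrainian": "Romanian", "ivory": "pink", "snails": "scorpions",
        "yellow": "turquoise", "Milk": "Red Bull", "Norwegian": "Brazilian",
        "fox": "bear", "horse": "mule", "orange juice": "rum",
        "Japanese": "German", "blue": "brown", "red": "purple",
        "houses": "huts", "house": "hut",
    }

    def rewrite(puzzle: str) -> str:
        for old, new in REPLACEMENTS.items():
            puzzle = re.sub(rf"\b{re.escape(old)}\b", new, puzzle)
        return puzzle

    print(rewrite("The Englishman lives in the red house."))
    # -> "The Scotsman lives in the purple hut."

(The house -> hut swap is included because the rewritten prompt above uses "huts".)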
I just did a pass with some replacements with o1 and it very much still recognized it as the Einstein riddle and actually seems to have cheated a bit :)
"Revisiting assumptions
Considering "Camels" might be a mistake for "Kools," leading to confusion. This inconsistency complicates solving the puzzle, showing the need for careful brand assignment."
Tracking puzzle progress
I’m mapping out various house and nationality combinations, but the classic conclusion is the Norwegian drinks water and the Japanese owns the zebra.
Analyzing the arrangement
I’m working through the classic puzzle structure and noting variations, while consistently identifying the Norwegian drinking water and the Japanese owning the zebra as the final solution."
Hah, that's fun. My o3-mini-high transcript didn't hint that it recognized the puzzle, and it looked legit when I scanned through it, but I'm still very suspicious since this is evidently such a classic puzzle.
I should have changed the cigarette brands to something else too.
If you want to make a cosmetic change to the puzzle, you might try eliminating the massive quantity of implicit information in "the green house is immediately to the right of the ivory house".
After doing some substitutions on what it means to be in positions 1/2/3/4/5:
A. If the ivory house is in London, the green house is in Madrid.
B. If the ivory house is in Madrid, the green house is in Kiev.
C. If the ivory house is in Kiev, the green house is in Oslo.
D. If the ivory house is in Oslo, the green house is in Tokyo.
E. The ivory house is not in Tokyo.
9. Milk is drunk in Kiev.
11(A). If the man who smokes Chesterfields lives in Tokyo, the man with the fox lives in Oslo.
12(A). If the man with the horse lives in Oslo, Kools are smoked in either Tokyo or Kiev.
15(A). If the blue house is in Madrid, the Norwegian lives either in London or in Kiev.
[...]
Another easy change is to exchange categories. Swap the animals for the drinks and instead of "the Spaniard owns the dog" and "the Ukrainian drinks tea", you'll have "the Spaniard drinks tea" and "the Ukrainian owns the fox" (depending on which equivalences you decide on). It won't make any difference to the puzzle, but it will permute the answer.
Try flipping the order, adding a few nonsense steps and combining 2 steps into one and also splitting a single step into two. And then see what happens and post it here. :-)
llm -m groq/deepseek-r1-distill-llama-70b '
H01 There are five huts.
H02 The Scotsman lives in the purple hut.
H03 The Welshman owns the parrot.
H04 Kombucha is drunk in the scarlet hut.
H05 The Romanian drinks butterscotch.
H06 The scarlet hut is immediately to the right of the pink hut.
H07 The Old Gold smoker owns scorpions.
H08 Kools are smoked in the turquoise hut.
H09 Red Bull is drunk in the middle hut.
H10 The Brazilian lives in the first hut.
H11 The man who smokes Chesterfields lives in the hut next to the man with the bear.
H12 Kools are smoked in a hut next to the hut where the mule is kept.
H13 The Lucky Strike smoker drinks rum.
H14 The German smokes Parliaments.
H15 The Brazilian lives next to the brown hut.
Now,
Q1 Who drinks water?
Q2 Who owns the zebra?'
I don't - but like I said, I reviewed the thought process in the transcript and it looked legit to me.
I'm not sure what else I could do here to be honest, without coming up with a completely new puzzle that captures the same kind of challenge as the original. I'm not nearly patient enough to do that!
ClaudeAI responded:
>>>
After working through all constraints:
Q1: Who drinks water?
The German drinks water.
Q2: Who owns the zebra?
The Scotsman owns the zebra.
<<<
(Not fair because it is in the training set, but o3-mini-high gives a thorough answer after 66s)
Below is a step‐by‐step reasoning that leads to the well‐known solution of the “Zebra Puzzle.”
Step 1. Number the Houses 1 to 5
We assume the houses are arranged in a row from left to right, with House 1 on the far left and House 5 on the far right.
Step 2. Assign the Easy Clues
1. H10: “The Norwegian lives in the first house.”
→ House 1: Nationality = Norwegian.
2. H15: “The Norwegian lives next to the blue house.”
House 1 has only one neighbor (House 2), so
→ House 2: Color = Blue.
3. H09: “Milk is drunk in the middle house.”
→ House 3: Beverage = Milk.
Step 3. Determine the House Colors with the Ivory/Green Pair
• H06: “The green house is immediately to the right of the ivory house.”
There are two possible placements for this pair: either Houses (3,4) or Houses (4,5).
The option (3,4) would force House 3 to be ivory, but later clues will assign House 3 a color (red, see H02) for the Englishman.
So we take the pair as Houses 4 and 5:
• House 4: Color = Ivory.
• House 5: Color = Green.
• H04: “Coffee is drunk in the green house.”
→ House 5: Beverage = Coffee.
• The remaining colors for Houses 1–3 are now Red, Yellow, and (already used) Blue.
H02: “The Englishman lives in the red house.”
Since House 1 is taken by the Norwegian and House 2 is blue, the only possibility is:
→ House 3: Color = Red, Nationality = Englishman.
• The only color left for House 1 is Yellow.
H08: “Kools are smoked in the yellow house.”
→ House 1: Color = Yellow, Cigarette = Kools.
Step 4. Assign the Remaining Nationalities and Beverages
The nationalities so far:
• House 1: Norwegian
• House 3: Englishman
Remaining nationalities: Ukrainian, Spaniard, Japanese.
• H05: “The Ukrainian drinks tea.”
House 2 already is blue and unassigned for nationality and beverage, so:
→ House 2: Nationality = Ukrainian, Beverage = Tea.
Now for Houses 4 and 5, the remaining nationalities are Spaniard and Japanese.
• H03: “The Spaniard owns the dog.”
And
• H14: “The Japanese smokes Parliaments.”
We still have the beverages and cigarettes to assign. Notice that House 5 (green) already has its beverage (coffee). It is natural then to assign:
• House 5: Nationality = Japanese, Cigarette = Parliaments.
This leaves:
• House 4: Nationality = Spaniard, and by H03,
→ House 4: Pet = Dog.
Step 5. Use the Clues about Cigarettes and Pets
The remaining cigarette brands to assign (across Houses 2, 3, and 4) are: Old Gold, Chesterfields, and Lucky Strike.
• H13: “The Lucky Strike smoker drinks orange juice.”
House 5 already drinks coffee; House 3 has milk; House 2 has tea; so the only house with an unassigned beverage is House 4.
→ House 4: Beverage = Orange juice, Cigarette = Lucky Strike.
• Now, the remaining cigarettes for Houses 2 and 3 are: Old Gold and Chesterfields.
• H07: “The Old Gold smoker owns snails.”
• H11: “The man who smokes Chesterfields lives in the house next to the man with the fox.”
• H12: “Kools are smoked in a house next to the house where the horse is kept.”
House 1 (which smokes Kools) has only one neighbor, House 2, so:
→ House 2: Pet = Horse.
Now, consider Houses 2 and 3 for the remaining cigarette assignments:
• If House 2 were the Old Gold smoker, then by H07 it would have to own snails—but it already has the horse.
So, House 2 must smoke Chesterfields, and consequently,
→ House 2: Cigarette = Chesterfields.
• This forces:
→ House 3: Cigarette = Old Gold, and by H07,
→ House 3: Pet = Snails.
Finally, H11: “The man who smokes Chesterfields lives in the house next to the man with the fox.”
House 2 (smoking Chesterfields) is adjacent to Houses 1 and 3.
House 3 already has snails, so
→ House 1: must have the fox.
→ House 1: Pet = Fox.
The only pet left is the zebra. The only house without a pet now is House 5, so:
→ House 5: Pet = Zebra.
Step 6. Assign the Final Beverage
At this point, the beverages are:
• House 2: Tea
• House 3: Milk
• House 5: Coffee
• House 4: Orange juice
The only beverage not yet used is water. The only house still without a beverage is House 1, so:
→ House 1: Beverage = Water.
Final House Assignments
• House 1:
Color: Yellow
Nationality: Norwegian
Beverage: Water
Cigarette: Kools
Pet: Fox
• House 2:
Color: Blue
Nationality: Ukrainian
Beverage: Tea
Cigarette: Chesterfields
Pet: Horse
• House 3:
Color: Red
Nationality: Englishman
Beverage: Milk
Cigarette: Old Gold
Pet: Snails
• House 4:
Color: Ivory
Nationality: Spaniard
Beverage: Orange juice
Cigarette: Lucky Strike
Pet: Dog
• House 5:
Color: Green
Nationality: Japanese
Beverage: Coffee
Cigarette: Parliaments
Pet: Zebra
Cross Check Against the Clues
• H02: Englishman in the red house → House 3 is red and English.
• H03: Spaniard owns the dog → House 4 (Spaniard) has a dog.
• H04: Coffee in the green house → House 5 (green) has coffee.
• H05: Ukrainian drinks tea → House 2 (Ukrainian) has tea.
• H06: Green house is immediately right of the ivory house → Houses 4 (ivory) and 5 (green).
• H07: Old Gold smoker owns snails → House 3 (Old Gold) has snails.
• H08: Kools in the yellow house → House 1 (yellow) smokes Kools.
• H09: Milk in the middle house → House 3 has milk.
• H10: Norwegian in House 1 → correct.
• H11: Chesterfields smoker (House 2) lives next to fox owner (House 1).
• H12: Kools (House 1) next to horse (House 2).
• H13: Lucky Strike smoker (House 4) drinks orange juice.
• H14: Japanese smokes Parliaments → House 5.
• H15: Norwegian (House 1) lives next to blue house (House 2).
All clues are satisfied.
Final Answers
• Q1: Who drinks water?
→ The Norwegian in House 1 drinks water.
• Q2: Who owns the zebra?
→ The Japanese in House 5 owns the zebra.
Sources and Further Reading
• For a classic discussion of this puzzle and its variations, see “Einstein’s Riddle” as presented in various logic puzzle collections and articles such as those found on Brainzilla or in books on lateral thinking puzzles.
This structured reasoning confirms the answers with a high degree of certainty.
"showing that multilayer transformers indeed cannot solve certain complicated compositional tasks. Basically, some compositional problems will always be beyond the ability of transformer-based LLMs."
Pretty sure this is just false and the paper doesn't show this. I could be misunderstanding, but it looks like the result is only about a single token/forward pass, not a reasoning model with many thousands of tokens like o1/o3
I'm not sure that the statement "some compositional problems will always be beyond the ability of transformer-based LLMs" is even controversial to be honest.
There's a reason all of the AI labs have been leaning hard into tool use and (more recently) inference-scaling compute (o1/o3/Gemini Thinking/R1 etc) recently - those are just some of the techniques you can apply to move beyond the unsurprising limitations of purely guessing-the-next-token.
Huh, yeah that's a good point. The various distilled R1 models are definitely regular transformer-based LLMs because the GGUF file versions of them work without any upgrades to the underlying llama.cpp library.
Quite deep and informative. One thing to point out here is how they measure accuracy when testing their LLMs. Like most neural networks, the n-dimensional space of LLMs is largely sparse. In simple terms, this means that it’s very easy to “exercise” the model in areas where it’s weak. That doesn’t mean that the model doesn’t have the potential of doing a (much) better job if you “exercise” its space properly—which is exactly what techniques like CoT basically do. LLMs are inherently limited, and they don’t have “true” reasoning capabilities—beyond marketing hype, this much should be obvious to most serious ML researchers today. The only question is how well we can get them to “mimic” reasoning for practical applications, and this is where “prompt engineering”, strictly speaking, is a true form of engineering, which has to take into account the mathematical foundations of the models and figure out how to extract out of them the best performance they can deliver.
It feels like every few weeks someone puts out a new article breathlessly talking about how bad the fancy autocorrect bot is at math and logic like it's brand new information. How have we as a society not gotten this through our skulls yet?
Do you really need a study to work out that a Markov chain can’t solve problem questions? It feels pretty intuitive.
LLMs are not intelligent. Human intelligence is stored in text, so mimicking text mimics that intelligence. The sheer quantity of text means they’ve probably seen something similar to the prompt.
If LLMs were an intelligent system they wouldn’t need to steal 1000TB of text and media content. Have you ever seen a person require a million books? It shows that the learning is brute force rather than intelligent.
It’s close, but it’s just not there. The methodology is wrong.
The Markov chain wouldn't fit in a googol of observable universes. The analogy is absolutely misleading. There's no reason to think that the chain of that humongous size cannot capture at least something of what humans with their limited brain size can do.
It cannot capture non in-context learning, true. But we have tools to modify the "Markov chain".
I guess your point is that it's obvious, but I'd argue it's not.
A lot of people are hyping up LLMs as the solution to all our problems, and there has been a lot of discussion around what the reasonable use cases are (and a lot of anecdotal evidence trying to argue and support certain stances). My point is that we need (scientific) proof to move forward and create confidence in the "facts" we base our decisions on. Obviously that doesn't seem to be very fashionable nowadays - on both sides of the LLM discussion there seem to be a lot of people arguing hard for their stances. I'm very glad some are working on bringing some facts into this discussion.
(one can argue about science vs truth, but regardless it's a step in the right direction/a good motive)
(this is btw why a solid educational foundation, not only skill focused, is very critical for society. cf democracy)