The main thesis here seems to be that LLMs behave like almost all other machine learning models, in that they are doing pattern matching on their input data, and short circuiting to a statistically likely result. Chain of thought reasoning is still bound by this basic property of reflexive pattern matching, except the LLM is forced to go through a process of iteratively refining the domain it does matching on.
Chain of thought is interesting, because you can combine it with reinforcement learning to get models to solve (seemingly) arbitrarily hard problems. This comes with the caveat that you need some reward model for all RL. This means you need a clear definition of success, and some way of rewarding being closer to success, to actually solve those problems.
Framing transformer based models as pattern matchers makes all the sense in the world. Pattern matching is obviously vital to human problem solving skills too. Interesting to think about what structures human intelligence has that these models don't. For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.
LLMs are trained, as others have mentioned, first to just learn the language at all costs. Ingest any and all strings of text generated by humans until you can learn how to generate text in a way that is indistinguishable.
As a happy side effect, this language you've now learned happens to embed quite a few statements of fact and examples of high-quality logical reasoning, but crucially, the language itself isn't a representation of reality or of good reasoning. It isn't meant to be. It's a way to store and communicate arbitrary ideas, which may be wrong or bad or both. Thus, the problem for these researchers now becomes how do we tease out and surface the parts of the model that can produce factually accurate and reasonable statements and dampen everything else?
Animal learning isn't like this. We don't require language at all to represent and reason about reality. We have multimodal sensory experience and direct interaction with the physical world, not just recorded images or writing about the world, from the beginning. Whatever it is humans do, I think we at least innately understand that language isn't truth or reason. It's just a way to encode arbitrary information.
Some way or another, we all grok that there is a hierarchy of evidence or even what evidence is and isn't in the first place. Going into the backyard to find where your dog left the ball or reading a physics textbook is fundamentally a different form of learning than reading the Odyssey or the published manifesto of a mass murderer. We're still "learning" in the sense that our brains now contain more information than they did before, but we know some of these things are representations of reality and some are not. We have access to the world beyond the shadows in the cave.
Humans can carve the world up into domains with a fixed set of rules and then do symbolic reasoning within it. LLMs can't seem to do this in a formal way at all -- they just occasionally get it right when the domain happens to be encoded in their language learning.
You can't feed an LLM a formal language grammar (e.g. SQL) then have it only generate results with valid syntax.
It's awfully confusing to me that people think current LLMs (or multi-modal models etc) are "close" to AGI (for whatever various definitions of all those words you want to use) when they can't do real symbolic reasoning.
Though I'm not an expert and happy to be corrected...
Adult humans can do symbolic reasoning, but lower mammals cannot. Even ones that share most of our brain structure are much worse at this, if they can do it at all; children need to learn it, along with a lot of the other things that we consider a natural part of human intelligence.
That all points towards symbolic reasoning being a pretty small algorithmic discovery compared to the general ability to pattern match and do fuzzy lookups, transformations, and retrievals against a memory bank. It's not as if our architecture is so special that we burned most of our evolutionary history selecting for these abilities; they're very recent innovations, and thus must be relatively simple, given the existence of the core set of abilities that our close ancestors have.
The thing about transformers is that obviously they're not the end of the line, there are some things they really can't do in their current form (though it's a smaller set than people tend to think, which is why the Gary Marcuses of the world always backpedal like crazy and retcon their previous statements as each new release does things that they previously said were impossible). But they are a proof of concept showing that just about the simplest architecture that you could propose that might be able to generate language in a reasonable way (beyond N-gram sampling) can, in fact, do it really, really well even if all you do is scale it up, and even the simplest next-token prediction as a goal leads to much higher level abilities than you would expect. That was the hard core of the problem, building a flexible pattern mimic that can be easily trained, and it turns out to get us way further along the line to AGI than I suspect anyone working on it ever expected it would without major additions and changes to the design. Now it's probably time to start adding bits and bobs and addressing some of the shortcomings (e.g. static nature of the network, lack of online learning, the fact that chains of thought shouldn't be constrained to token sequences, addressing tokenization itself, etc), but IMO the engine at the heart of the current systems is so impressively capable that the remaining work is going to be less of an Einstein moment and more of an elbow grease and engineering grind.
We may not be close in the "2 years of known work" sense, but we're certainly not far in the "we have no idea how to prove the Riemann Hypothesis" sense anymore, where major unknown breakthroughs are still required which might be 50+ years away, or the problem might even be unsolvable.
Yes, I've always thought that LLMs need the equivalent of a limbic system. This is how we solved this problem in organic computers. There is no static 'reward function'. Instead, we have a dynamic reward function computer. It decides from day to day and hour to hour what our basic objectives are. It also crucially handles emotional 'tagging' of memory. Memories that we store are proportionally more likely to be retrieved under similar emotional conditions. It helps to filter relevant memories, which is something LLMs definitely could use.
I think the equivalent of an LLM limbic system is more or less the missing piece for AGI. Now, how you'd go about making one of those I have no idea. How does one construct an emotional state space?
Companies are bad about doing this on purpose. If they set out to build AGI and accomplish something novel, just call that AI and go on fund raising from people who don't know better (or more likely don't care and just want to gamble with others' money).
Continuous RL in a sense. There may be an undiscovered additional scaling law around models doing what you describe; continuous LLM-as-self-judge, if you will.
Provided it can be determined why a user ended the chat, which may turn out to be possible in some subset of conversations.
And they also sometimes write down the conclusion and work backwards, without considering that the most likely reason for the conclusion isn't necessarily going to have that conclusion as its most likely conclusion — I hope I phrased that broken symmetry correctly.
I did think of sour grapes (only thing that came to my mind) and was hoping for something better. Sour grapes doesn't seem too interesting. I think most people can tell you, if you actually ask them, what the differences between their actual successes and their wildest dreams are. But any improvement is still a success and I think that's valid.
The Millikan oil drop experiment, "winning" the space race, mostly anything C-levels will tell the board and shareholders at a shareholder meeting, the American wars in Iraq and Afghanistan, most of what Sam Altman or Elon Musk has to say; the list continues.
I think you're approaching it from a very high level, when you should think about it from a much lower level, i.e. success being determined by stress/dopamine hormones or similar.
This article is kind of vague on that tbf:
> To conclude, we observed no credible evidence for a beneficial effect of L-dopa (vs. Haloperidol) on reinforcement learning in a reward context, as well as the proposed mechanistic account of an enhanced striatal prediction error response mediating this effect.
Is that controversial? I would say everything a human does is to feel better, and everything someone does that doesn’t make them feel better immediately is just done in the expectation of even greater pleasure later.
Well mine can, with some tactics and strategy layered on top. If I do something I don’t like, I only do it because the payoff later makes it worth it (or at least I think it will from my current knowledge).
It is important that “profit”, comes in various forms, which exchange rates are problematic to calculate (or maybe there can’t be any): not hungry, not thirsty, tastes good, not cold, feel safe, feel excited, feel righteous, feel powerful, listen to music, watch a movie, get curious, satisfy curiosity, laugh, love, sex, rock n roll.
Most behavior we believe to be some kind of rational action is really blind action based on fiction, or just completely random, with rationalizations for the behavior after the fact.
>Interesting to think about what structures human intelligence has that these models don't.
Kant's Critique of Pure Reason has been a very influential way of examining this kind of epistemology. He put forth the argument that our ability to reason about objects comes through our apprehension of sensory input over time, schematizing these into an understanding of the objects, and finally, through reason (by way of the categories) into synthetic a priori knowledge (conclusions grounded in reason rather than empiricism).
If we look at this question in that sense, LLMs are good at symbolic manipulation that mimics our sensibility, as well as combining different encounters with concepts into an understanding of what those objects are relative to other sensed objects. What it lacks is the transcendental reasoning that can form novel and well grounded conclusions.
Such a system that could do this might consist of an LLM layer for translating sensory input (in LLM's case, language) into a representation that can be used by a logical system (of the kind that was popular in AI's first big boom) and then fed back out.
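To make the shape of that idea concrete, here's a minimal sketch of such a hybrid, with a trivial forward-chaining engine standing in for the logical system; `llm_extract_facts` is a hypothetical helper (assume it wraps whatever model you like), not a real API:

```python
from typing import List, Tuple

Fact = Tuple[str, str]   # e.g. ("socrates", "human")
Rule = Tuple[str, str]   # e.g. ("human", "mortal") meaning: everything human is mortal

def forward_chain(facts: List[Fact], rules: List[Rule]) -> List[Fact]:
    """Classical inference: keep applying rules until nothing new is derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for subject, category in list(derived):
            for premise, conclusion in rules:
                if category == premise and (subject, conclusion) not in derived:
                    derived.add((subject, conclusion))
                    changed = True
    return sorted(derived)

# Imagine these came from llm_extract_facts("Socrates is human. Humans are mortal."),
# i.e. the LLM handles perception/translation and the logic engine does the reasoning.
facts = [("socrates", "human")]
rules = [("human", "mortal")]
print(forward_chain(facts, rules))   # [('socrates', 'human'), ('socrates', 'mortal')]
```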
>Such a system that could do this might consist of an LLM layer for translating sensory input (in LLM's case, language) into a representation that can be used by a logical system (of the kind that was popular in AI's first big boom) and then fed back out.
This just goes back into the problems of that AI winter again, though. First-order logic isn't expressive enough to model the real world, while second-order logic doesn't have a complete proof system to truly verify all its statements, and is too complex and unwieldy for practical use. I would also imagine that the number of people working on such problems is very small; this isn't engineering so much as analytic philosophy and mathematics.
Kant predates analytic philosophy and some of its failures (the logical positivism you are referring to). The idea here is that first-order logic doesn't need to be expressive enough to model the world. Only that some logic system is capable of modeling the understanding of a representation of the world mediated by way of perception (via the current multimodal generative AI models). And finally, it does not need to be complete or correct, just equivalent to or better than how our minds do it.
With DeepSeek-R1-Zero, their use of RL didn't really have reward functions that indicated progress towards the goal, afaik.
It was "correct structure, wrong answer", "correct answer", "wrong answer". This was for Math & Coding, where they could verify answers deterministically.
> Framing transformer based models as pattern matchers makes all the sense in the world. Pattern matching is obviously vital to human problem solving skills too. Interesting to think about what structures human intelligence has that these models don't. For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.
What is also a benefit for humans, I think, is that people are typically much more selective. LLMs train to predict anything on the internet, so for example for finance that includes clickbait articles which have a lifetime of about 2 hours. Experts would probably reject any information in these articles and instead try to focus on high quality sources only.
Similarly, a math researcher will probably have read a completely different set of sources throughout their life than, say, a lawyer.
I’m not sure it’s a fundamental difference, but current models do seem not to specialize from the start, unlike humans. And that might be in the way of learning the best representations. I know from ice hockey, for example, that you can see within 3 seconds whether someone played ice hockey from a young age or not. Same with language. People can usually hear an accent within seconds. Relatedly, I used OpenAI's text-to-speech a while back and the Dutch voice had an American accent. What this means is that even if you ask LLMs about Buffett's strategy, maybe they have a "clickbait accent" too. So with the current approach to training, the models might never reach absolute expert performance.
When I was doing some NLP stuff a few years ago, I downloaded a few blobs of Common Crawl data, i.e. the kind of thing GPT was trained on. I was sort of horrified by the subject matter and quality: spam, advertisements, flame wars, porn... and that seems to be the vast majority of internet content. (If you've talked to a model without RLHF like one of the base Llama models, you may notice the personality is... different!)
I also started wondering about the utility of spending most of the network memorizing infinite trivia (even excluding most of the content above, which is trash), when LLMs don't really excel at that anyway, and they need to Google it anyway to give you a source. (Aside: I've heard some people have good luck with "hallucinate then verify" with RAG / Googling...)
i.e. what if we put those neurons to better use? Then I found the Phi-1 paper, which did exactly that. Instead of training the model on slop, they trained it on textbooks! And instead of starting with PhD level stuff, they started with kid level stuff and gradually increased the difficulty.
You can get rid of the trivia by training one model on the slop, then a second model on the first one - called distillation or teacher-student training. But it's not much of a problem because regularization during training should discourage it from learning random noise.
The reason LLMs work isn't because they learn the whole internet, it's because they try to learn it but then fail to, in a useful way.
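For reference, the core of the teacher-student distillation mentioned above is just training the student to match the teacher's softened output distribution; a minimal PyTorch-style sketch (the temperature and scaling are illustrative choices):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Soften both distributions, then push the student toward the teacher via KL divergence.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable to a hard-label loss.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2
```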
If anything current models are overly optimized away from this; I get the feeling they mostly want to tell you things from Wikipedia. You don't get a lot of answers that look like they came from a book.
I don't know, babies hear a lot of widely generic topics from multiple people before learning to speak.
I would rather put it that humans can additionally specialize much more, but we usually have a pretty okay generic understanding/model of a thing we consider as 'known'. I would even wager that being generic enough (ergo, sufficiently abstracted) is possibly the most important "feature" humans have? (In the context of learning)
> For one, humans can integrate absolutely gargantuan amounts of information extremely efficiently.
What we can integrate, we seem to integrate efficiently*; but compared to the quantities used to train AI, we humans may as well be literally vegetables.
* though people do argue about exactly how much input we get from vision etc.; personally I doubt vision input is important to general human intelligence, because if it was then people born blind would have intellectual development difficulties that I've never heard suggested exist — David Blunkett's success says human intelligence isn't just fine-tuning on top of a massive vision-grounded model.
Low level details like that aren't relevant to this discussion. Most human processing power is at the cellular level. The amount of processing power in a single finger literally dwarfs a modern data center, but we can't leverage that to think, only to live.
So it’s not a question of ‘a lot’ it’s a question of orders of magnitude vs “the quantities used to train AI”
The Library of Congress has, what, 39 million books; tokenize every single one and you're talking terabytes of training data for an LLM. We can toss blog posts etc. onto that pile, but every word ever written by a person isn't 20 orders of magnitude larger or anything.
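Back-of-envelope, with assumed averages (the per-book figures are rough guesses, not Library of Congress data):

```python
books = 39_000_000          # rough collection size cited above
tokens_per_book = 100_000   # assumes ~75k words per book at ~1.3 tokens per word
bytes_per_token = 4         # assumes ~4 characters of UTF-8 text per token

total_tokens = books * tokens_per_book         # ~3.9e12 tokens
total_bytes = total_tokens * bytes_per_token   # ~1.6e13 bytes, i.e. on the order of 15 TB
print(f"{total_tokens:.1e} tokens, ~{total_bytes / 1e12:.0f} TB of raw text")
```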
>Hearing is also well into the terabytes worth of information per year.
If we assume that the human auditory system is equivalent to uncompressed digital recording, sure. Actual neural coding is much more efficient, so the amount of data that is meaningfully processed after multiple stages of filtering and compression is plausibly on the order of tens of gigabytes per year; the amount actually retained is plausibly in the tens of megabytes.
Don't get me wrong, the human brain is hugely impressive, but we're heavily reliant on very lossy sensory mechanisms. A few rounds of Kim's Game will powerfully reveal just how much of what we perceive is instantly discarded, even when we're paying close attention.
The sensory information from individual hairs in the ear starts off as a lot more data to process than a simple digital encoding of two audio streams.
Neural encoding isn't particularly efficient from a pure data standpoint, just an energy standpoint. A given neuron not firing is information, and those nerve bundles contain a lot of neurons.
Is that a positive thing? If anything I would consider that as the reverse - LLMs have the "intelligence of vegetables" because even with literally the whole of human written knowledge they can at most regurgitate it back to us with no novelty whatsoever, even though a 2-year-old with a not even fully matured brain can learn a human language from orders of magnitude less, and lower quality, input from a couple of people only.
But any Nobel Prize winner has read significantly less than a basic LLM, and we see no LLM making even a tiny scientific achievement, let alone high-impact ones.
It's perfectly legit to call these models "thick" because they *need* to read such a vast quantity of text that a human would literally spend two thousand lifetimes to go through it even if that was all the human did with their days.
It also remains the case that, unlike us, they can go through all of that in a few months.
> with no novelty whatsoever, even though a 2-year-old with a not even fully matured brain can learn a human language from orders of magnitude less, and lower quality, input from a couple of people only.
You're either grossly underestimating AI or overestimating 2 year olds, possibly both.
I just about remember being a toddler, somewhere between then and 5 was around the age I had the idea that everyone got an invisible extra brain floating next to them for every year they lived. Took me an embarrassingly long time (teens, IIRC) to realise that the witch-duck-weight-comparison scene in Monty Python and the Holy Grail wasn't a documentary, thanks to the part of the film captioned "Famous Historian". One time my dad fell ill, and he was talking to mum about "the tissue being damaged" while I was present, so I gave him a handkerchief (AKA "a tissue"). And while I don't remember this directly, my mum's anecdotes include me saying "fetrol fump", waving a spoon in a jam pan and calling this act "spelling", and when discovered running around with my pockets inside-out explaining myself as trying to fly because I apparently thought that the lining of a pocket was called a "wing".
When it comes to human novelty, I also quite often find there's a lot of remixing going on that just isn't immediately apparent. As Steve Jobs apparently once said, “Good artists copy; great artists steal.”, except Jobs stole that quote from Picasso.
It's easy to categorise different levels with AI, but which one of these counts as "novelty", and how often do humans ever achieve each of these grades?
0. Memorisation of the training set. Think: bunch of pictures, pick best fit.
1. Linear interpolation between any pair of elements in the training set. Think: simple cross-fade between any two pictures, but no tracking or distorting of features during that fade.
2. Let the training set form a basis vector space, and interpolate freely within the constraints of the examples. Think: if these pictures are faces, it would make any hair colour between the most extreme limits shown, etc.
3. Extrapolate beyond the examples. Think: Even if no black or white hair was visible, so long as several shades of grey were, it could reach the ideas of black or white hair.
4. Invent a new vector. Think: even if it had been trained only on black-and-white images, it could still invent green hair.
> But any Nobel Prize winner has read significantly less than a basic LLM, and we see no LLM making even a tiny scientific achievement, let alone high-impact ones.
We do see them doing *tiny* scientific achievements, with extra emphasis on "tiny". Just like with using them in software, even the best "only" act like fresh graduates.
When any AI gets to high-impact… the following (fictional) quote comes to mind: "as soon as we started thinking for you, it really became our civilization."
> that a human would literally spend two thousand lifetimes to go through it even if that was all the human did with their days.
Well, `cp` would go over that data even faster, but depending on what retention/conclusion is reached from that it may or may not be impressive.
Humans are fundamentally limited by our biology; rotating a tiny sphere, turning pages, and serial processing do place certain hard limits on us.
A two-year-old can definitely say stupid stuff, or have wildly incomplete/incorrect models of their reality, but can most certainly already think and reason, and update their internal models at any point.
> Tiny scientific achievements, only acting as fresh graduates with regards to software
I don't believe they are anywhere close to being as good at software as a fresh graduate. Sure, many people write terrible code, and there are a lot of already solved problems out there (not even just solved, but solved thousands of times) - LLMs are definitely a novel tool when it comes to finding information based on some high-ish level patterns (over exact string match, or fuzzy match), and they are very good at transforming between different representations of said data, with minimal (and hard-limited) reasoning capabilities, but I have never seen evidence of them going any further than that.
I don't think your grades are "correct" - e.g. a random generator can easily create new vectors, but I wouldn't call that intelligence. Meanwhile, that two-year-old can make a novel discovery from their POV every couple of days, potentially turning their whole world model around each day. To me, that sounds way "cooler" than a statistically likely token given these previous tokens, and LLMs definitely need some further structure/architecture to beat humans.
--
I do like your last quote though, and definitely agree there!
> Well, `cp` would go over that data even faster, but depending on what retention/conclusion is reached from that it may or may not be impressive.
Sure, but it would be a level zero on that list, right?
I'd say even Google would be #0.
> A two years old can definitely say stupid stuff, or have wildly incomplete/incorrect models of their reality, but can most certainly already think and reason, and update their internal models at any point.
I think that this presumes a certain definition of "think" and "reason". Monsters under the bed? To move from concrete examples to the abstract, from four apples to the idea of four?
Imagine a picture of a moon's orbit around the parent planet and the planet's orbit around a star, first at one time of year, then again 60° later, the circular orbits of each drawn clearly, with the two positions of the moon's orbits aligned at the top of the image; exaggerate the scale for clarity, and find it in an astronomy book — my peers at age 6 or 7 thought it was a picture of a mouse.
Imagine teachers and an ambulance crew explaining to the class how blood is donated, showing that they're putting a bag up the teacher's sleeve and explaining how they'll demonstrate this by taking "blood" (fake? No idea at this point) from that bag. Everyone's looking, we see it go up the sleeve. We see the red stuff come out. The kid next to me screams "they're killing her!". Rather than say "we literally saw the bag go up the sleeve", 5-year-old-me tried to argue on the basis that killing a teacher in front of us was unlikely — not wrong, per se, but a strange argument, and I wondered even at the time why I made it.
Are these examples of "reason"? Could be. But, while I would say that we get to the "children say funny things" *with far fewer examples than the best AI*, it doesn't seem different in kind to what AI does.
> LLMs are definitely a novel tool when it comes to finding information based on some high-ish level patterns (over exact string match, or fuzzy match), and they are very good at transforming between different representations of said data, with minimal (and hard limited) reasoning capabilities, but I have never seen evidence of going any further than that.
Aye. So, where I'm going with #2 and #3: even knowing what the question means well enough to respond by appropriately gluing together a few existing documents correctly, requires the AI to have created a vector space of meaning from the words — the sort of thing which word2vec did. But:
To be able to translate questions into answers when neither the question nor the answer are themselves literally in the training set, requires at least #2. (If it was #1, you might see it transition from "Elizabeth II was Queen of the UK" to "Felipe VI is King of Spain" via a mid-point of "Macron is Monarch of France").
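That kind of compositional behaviour shows up even in plain word embeddings; a small sketch using gensim's pretrained GloVe vectors (assuming the `glove-wiki-gigaword-50` model is available via the downloader; illustrative only, not a claim about how LLMs internally answer questions):

```python
import gensim.downloader as api

# Load small pretrained word vectors (roughly a 65 MB download on first use).
vectors = api.load("glove-wiki-gigaword-50")

# Composition in the learned vector space, not retrieval of a memorized sentence:
# "king" - "man" + "woman" lands near "queen".
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```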
For #3, I've tried the concrete example of getting ChatGPT (free model, a few months back now) to take the concept of the difference between a raccoon and a wolf and apply this difference again on top of a wolf, and… well, their combination of LLM and image generator gave me what looked like a greyhound, so I'm *not* convinced that OpenAI's models demonstrate this in normal use — but also, I've seen this kind of thing demonstrated with other models (including Anthropic's, so it's not a limit of the Transformer architecture) and the models seem to do more interesting things.
Possibly sample bias, I am aware of the risk of being subject to a Clever Hans effect.
For #4, this seems hard to be sure it has happened when it seems to have happened. I don't mean what word2vec does, which I realise now could be described in similar language, as what word2vec does is kinda a precursor to anything at least #1. Rather, what I mean, in a human, would seem like "spots a black swan before it happens". I think the invention of non-Euclidean geometry might count, but even then I'm not sure.
I feel like if you take the underlying transformer and apply it to other topics, e.g., eqtransformer, nobody questions this assumption. It's only when language is in the mix that people suggest they are something more and some kind of "artificial intelligence" akin to the beginnings of Data from Star Trek or C-3PO from Star Wars.
Human processing is very interesting and should likely lead to more improvements (and more understanding of human thought!)
Seems to me humans are very good at pattern matching, as a core requirement for intelligence. Not only that, we are wired to enjoy it innately - see sudoku, find Waldo, etc.
We also massively distill input information into short summaries. This is easy to see by what humans are blind to: the guy in a gorilla suit walking through a bunch of people passing a ball around, or basically any human behavior magicians use to deceive or redirect attention. We are bombarded with information constantly. This is the biggest difference between us and LLMs, as we have a lot more input data and also are constantly updating that information - with the added feature/limitation of time decay. It would be hard to navigate life without short term memory or a clear way to distinguish things that happened 10 minutes ago from 10 months ago. We don't fully recall each memory of washing the dishes but junk the vast, vast majority of our memories, which is probably the biggest shortcut our brains have over LLMs.
Then we also, crucially, store these summaries in memory as connected vignettes. And our memory is faulty but also quite rich for how "lossy" it must be. Think of a memory involving a ball from before the age of 10 and most people can drum up several relevant memories without much effort, no matter their age.
> Interesting to think about what structures human intelligence has that these models don't.
Pain receptors. If you want to mimic human psyche you have to make your agent want to gather resources and reproduce. And make it painful to lack those resources.
Now, do we really have to mimic human intelligence to get intelligence? You could make the point the internet is now a living organism but does it have some intellect or is it just some human parasite / symbiote?
>Interesting to think about what structures human intelligence has that these models don't.
If we get to the gritty details of what gradient descent is doing, we've got a "frame", i.e. a matrix or some array of weights that contains the possible solution for a problem; then, given our training data, we adjust another set of weights to minimize the loss function, fitting a probability distribution so that our solution takes shape within the "frame". That works for something like image recognition, where the "frame" is just the matrix of pixels, or in language models where we're trying to find the next word-vector given a preceding input.
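As a concrete toy version of that "frame" picture: a fixed array of weights, a loss against training data, and gradient descent nudging the weights toward the minimum (numpy, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # inputs: the fixed "frame" of features
true_W = np.array([1.5, -2.0, 0.5])
y = X @ true_W + 0.1 * rng.normal(size=100)    # noisy targets

W = np.zeros(3)                                # the weights we are fitting
lr = 0.1
for _ in range(200):
    grad = 2 * X.T @ (X @ W - y) / len(y)      # gradient of the mean squared error
    W -= lr * grad                             # step downhill
print(W)                                       # ends up close to true_W
```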
But take something like what Sir William Rowan Hamilton was doing back in 1843. He knew that complex numbers could be represented as points in a plane, and arithmetic could be performed on them, and he wanted to extend a similar scheme to points in space. With triples it is easy to define addition, but the problem was multiplication. In the end, he made an intuitive jump, a pattern recognition, when he realized that he could easily define multiplication using quadruples instead, and thus were born the quaternions that are a staple in 3D graphics today.
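For the curious, Hamilton's jump is easy to state in code: define the product on quadruples and the non-commutativity falls out (standard math, nothing model-specific):

```python
def qmul(a, b):
    """Hamilton product of quaternions given as (w, x, y, z) tuples."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return (w1*w2 - x1*x2 - y1*y2 - z1*z2,
            w1*x2 + x1*w2 + y1*z2 - z1*y2,
            w1*y2 - x1*z2 + y1*w2 + z1*x2,
            w1*z2 + x1*y2 - y1*x2 + z1*w2)

i, j, k = (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)
print(qmul(i, j))   # (0, 0, 0, 1)  == k
print(qmul(j, i))   # (0, 0, 0, -1) == -k: multiplication is non-commutative
```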
If we want to generalize this kind of problem solving into a form that gradient descent can handle, where do we even start? First of all, we don't even know if a solution is possible or coherent, or what "direction" we are heading towards. It's not a systematic solution; it's rather one where a pattern in one branch of mathematics was recognized in another. So perhaps you might use something like Category Theory, but then how are we going to represent this in terms of numbers and convex functions, and is Category Theory even practical enough to easily do this?
> Interesting to think about what structures human intelligence has that these models don't
Chiefly?
After having thought long and hard, building further knowledge on the results of the process of having thought long and hard, and creating intellectual keys to further think long and hard better.
You do not need sensorial feedback to do math. And you do not need full sensors to have feeback - one well organized channel can suffice for some applications.
To learn new math, a professional mathematician foremostly just thinks further (it's its job); to discuss with other entities (and acquire new material), textual input and outputs suffice.
Your statement, not mine. And I wrote intelligence, not sentience.
People who become quadriplegic as adults (or older children) have already developed intelligence before.
My theory (which I have not researched in any way) implies that someone born fully quadriplegic would be severely impaired in developing intelligence. Sight and hearing are of course also important sources of feedback, the question is whether they are sufficient.
You might get a kick out of this essay by Robert Epstein from 2016: https://aeon.co/essays/your-brain-does-not-process-informati... (The empty brain - Your brain does not process information, retrieve knowledge or store memories. In short: your brain is not a computer)
Maybe I misunderstood it, but I feel that it's a weird article, because it fails to establish any vocabulary and then seems to use words in uncertain ways, as if constructing the narrative by specifically crafting (but never truly explaining/defining) some model that's not true, but presenting the argument with significantly expanded scope. Drastically reduced (which is not really correct, but may help me to convey my general impression/feelings only) it's kinda sorta like-ish "we aren't doing it the way our computers do, thus the information processing metaphor is wrong".
Like when talking about that experiment with an image of the dollar bill, it never says what an "image" is, just states that there wasn't one stored in the brain, in "any sense". And then it goes on describing an idea that seems to match the description of a "mental image" from cognitive science.
As I [very naively] get it... Information theory is a field of mathematics. Unlike all those previous concepts like humours, mechanical motions or electric activities, math is here to establish terminology and general principles that don't have to fundamentally change if^W when we learn more. And that's why it has stuck.
There is a whole genre of essays like this talking about behaviour in a human specific way. But, I wish they engaged with the notions of the Church-Turing thesis and the Universal Turing Machine which indicates that any behaviour following standard physics principles is in fact computable.
(FWIW, I dont think that humans can be reduced to computing, but the Church-Turing thesis is a powerful counterargument which more biologists and psychologists should engage with).
I stopped reading before reaching 2/3 of it but the start is already strawman after strawman (or misunderstanding to be generous).
I don’t think most people believe the brain is made up of a discrete Processing part that accesses information from a memory part that’s encoded in binary there. But just because the brain doesn’t contain a literal encoding of something in binary doesn’t mean the neurons don’t store the information.
If you download the weights of an LLM, you’re not going to find the text it can output „from memory“ anywhere in the file, but the weights still encode the information and can retrieve it (with some accuracy).
Coming up with a reward model seems to be really easy though.
Every decidable problem can be used as reward model. The only downside to this is that the LLM community has developed a severe disdain for making LLMs perform anything that can be verified by a classical algorithm. Only the most random data from the internet will do!
Your post on Twitter uses slightly more words than the ones preceding it above to make the exact same point. Was there really any reason to link to it? Why not expand on your argument here?
"LLMs are fundamentally matching the patterns they've seen, and their abilities are constrained by mathematical boundaries. Embedding tricks and chain-of-thought prompting simply extends their ability to do more sophisticated pattern matching."
LLMs keep showing, more and more, that they are the wonder of AI we awaited for decades: talking machines that every two months make progress that two months before was deemed impossible because of <put here some limit that was actually a prejudice of the skeptical AI community> (just stochastic parrots, no reasoning possible without symbolic representations, there are no longer tokens, ...).
At the same time, part of the scientific community continues to diminish what was accomplished and the steps that are being made. A few months ago LeCun went so far as to tell new researchers to move away from LLMs since they are a dead end: imagine the disservice he did to the surely non-zero number of folks who followed the advice, putting themselves out of the AI research that matters. (Incidentally, this skepticism from Meta's AI head must have something to do with the fact that Meta, despite the huge efforts allocated, produced the worst LLM among Anthropic, OpenAI, and DeepSeek -- I bet Zuckerberg is asking questions lately.)
It's very hard to explain this behavior if not by psychological denial.
[EDIT: you can't see the score of this comment, but I can: it's incredible how it goes from 3, to -2, to 1, and so forth. The community is split in two, and it is pretty sad since this is not a matter of taste or political inclination: there must be a single truth]
I get the sentiment, but I actually think some skepticism in the system is healthy.
Billions are flowing towards LLMs, and Sam Altman will overpromise to anyone who will listen that AGI is just around the corner and the days of jobs are gone, to fill his coffers.
Additionally if we begin to use these things in real production environments where mistakes matters, knowing the exact limitations is key.
None of this takes away from the fact that these are exciting times.
I can’t communicate enough how the skepticism (“this is just hype” or “LLMs are stochastic parrots”) is the vastly dominant thought paradigm in European academic circles.
So instead of everyone having some enthusiasm and some skepticism, you get a bifurcation where whole classes of people act as the skeptics and others as the enthusiasts. I view the strong skeptics as more “in the wrong” because they often don’t use LLMs much. If you are an actual enthusiastic user, you simply can’t get good performance without a very strong dose of skepticism towards everything LLMs output.
I don't think everyone shares those doubts. The first time you catch an LLM in a lie is sobering, but there are lots of areas, and thus lots of users, for whom it doesn't hallucinate because they're asking softball questions, or for whom hallucinations just really aren't that big a deal (e.g. an LLM horoscope generator, or using it to write sci-fi).
So while we're on HN going back and forth about how outright lies by the system indict the whole thing for everybody, we should be careful to note that it's not like that for everybody; or rather, it's a known limitation, so don't trust it to cite real cases for you as a lawyer, but using it to help you figure out what mens rea means in a practical sense, by asking it questions about the concept? Totally.
Honestly, hallucinations happen so rarely for me, because of the kinds of things I ask it, that I have no reason to distrust its answers in low-stakes situations, or situations on the level of horoscope generation, and I'm sure I'm not alone in treating ChatGPT that way, despite evidence to the contrary.
> I can’t communicate enough how the skepticism (“this is just hype” or “LLMs are stochastic parrots”) is the vastly dominant thought paradigm in European academic circles.
I'm very curious. If you don't mind taking the time to elaborate, will you give a few examples of such skepticism/naysaying? Thank you.
Yes, there is another part of the community that overhypes everything. But I can expect that from the CEO of an AI company (especially if he is Altman); from researchers, though? Also, now that reinforcement learning is starting to be applied to LLMs, the idea that they may reach superhuman expertise in certain fields in a short timeframe (a few years) may no longer be a totally crazy position. If it is possible to extend considerably the same approach seen in R1-Zero, there could be low-hanging fruit around the corner.
This article is about things which aren't limitations anymore!
You are applauding it as pushback for pushback's sake, but it's an article about limitations in biplane construction, published after we'd already landed on the moon.
Is there any evidence that these fundamental issues with compositionality have been resolved or are you just asserting it? Has the paper been replicated with a CoT model and had a positive result?
Well, yes — because modern models can solve all the examples in the article. The theory of compositionality is still an issue, but the evidence for it recedes.
I think most of the issue comes from the challenge of informational coherence. Once incoherence enters the context, the intelligence drops massively. You can have a lot of context and LLMs can maintain coherence— but not if the context itself is incoherent.
And, informationally, it is just a matter of time before a little incoherence gets into a thread.
This is why agents have so much potential—being able to separate out separate threads of thought in different context windows reduces the likelihood of incoherence emerging (vs one long thread).
>I think most of the issue comes from the challenge of informational coherence. Once incoherence enters the context, the intelligence drops massively. You can have a lot of context and LLMs can maintain coherence— but not if the context itself is incoherent.
As a non-expert, part of my definition of intelligence is that the system can detect incoherence, a.k.a reject bullshit. LLMs today can't do that and will happily emit bullshit in response.
Maybe the "gates" in the "workflows" discussed in the Anthropic article are a practical solution to that. But that still just seems like inserting human intelligence into the system for a specific engineering domain; not a general solution.
> …it is pretty sad since this is not a matter of taste or political inclination: there must be a single truth
This is a more salient point than you perhaps realized. In life there is no single, absolute, knowable truth. Philosophy has spent the entire span of human existence grappling with this topic. The real risk with AI is not that we build some humanity-destroying AGI, but that we build a machine that is 'convincing enough' — and the idea that such a machine would be built by people who believe in objective truth is the most worrying part.
A teleological argument that assumes truth is contingent upon a specific worldview would indeed be flawed, because it would make truth an artifact of a given perspective rather than something independent of it.
> At the same time, part of the scientific community continues to diminish what was accomplished
Revisit the idea: part of the public is bewildered by voices that started calling "intelligence" what was and apparently still is the precise implementation of unintelligence. The fault is in some, many people - as usual.
Very recent state-of-the-art LLM models themselves declare that if the majority of their training data states that entity E is red they will say it's red, and if the majority says it's blue then they will say it's blue: that is the implementation of an artificial moron.
And in fact, very recent state-of-the-art LLM models state cretinous ideas that are child level - because "that's what they have heard" (stuck, moreover analytically, in the simplifications intrinsic in expression).
This architectural fault should be the foremost concern.
Psychological denial of what exactly? And what part of the article/preprints you are commenting on?
Every time an article exposing some limitation of the current wave of LLMs is submitted to HN there are comments like yours and I genuinely cannot understand the point you are trying to make: There is no such thing as a perfect technology, everything has limitations, and we can only improve our current state of the art by studying these and iterate.
I think if we referred to LLMs as AK (Artificial Knowledge) instead of AI it would be easier to have more cohesive discussions.
I don’t see how there can be a single truth when there is not even a single definition of many of the underlying terms (intelligence, AGI, etc) which this discipline supposedly defines itself by. Combine that with a lot of people with little philosophical perspective suddenly being confronted with philosophical topics and you end up with a discourse that personally I’ve mostly given up on participating in until things calm down again.
It feels like nobody remembers all the timelines for which we were supposed to have self-driving cars.
You are, I think, badly misrepresenting what Yann LeCun said: he didn't say LLMs were a dead end, he said to do research in directions that do not require billions of dollars of investment to show results. In particular for PhDs this is sensible, and in view of recent cheaper results, prescient.
Sensible with the caveat that DeepSeek R1 still took millions of dollars of compute time, so you're not training the next one on the box in your basement with a pair of 3090s (though you could certainly fine-tune a shared quantized model). You can't run the full-sized model on anything cheap, so basement researchers still need access to a decent amount of funding, which likely requires outside help.
It is becoming more and more important to determine for ourselves what is true and what is not. No person is right on most things, even when they are an expert in that thing. The biggest trap is to believe someone because they are passionate, because they say it with conviction. Ignore most of the out-of-band signaling, take what they are saying, and then also see if you can corroborate it with another source.
There are so many people who are wrong about so many things.
I really appreciate that you are making your dev with ai videos, it shows people different, more humanistic ways of operating with AI.
Most of what I use AI for is to understand and relearn things I only thought I knew. This I think, is the most powerful use of AI, not in the code writing or the image generation, but in understanding and synthesis.
There is that hilarious tautological statement, "it is easy if you know it".
This video https://www.youtube.com/watch?v=TPLPpz6dD3A shows how to use AI to be a personal tutor using the Socratic Method. This is what people should be using AI for, have it test you for things you think you are already good at and you will find huge gaps in your own understanding. Now go apply it to things you have no clue about.
Speaking of parrots, a large volume of the anti AI sentiment, even here is by people repeating half truths they don't understand, confidently, about what AI cannot do. One would need a pretty tight formal case to prove such things.
Everyone should be playing, learning and exploring with these new tools, not shutting each other down.
Yes, the stochastic parrots story is one of the strongest recent instances of experts in a field being blinded by their own expertise (the mental model they have of certain things) to the point of being incapable of seeing trivial evidence.
There’s a certain irony in hearing someone describe an LLM as a "stochastic parrot" for the ten-thousandth time when the only reason they’re doing so is that they’ve seen a sufficient number of other people using the exact same term (so now it's in their proverbial training data).
It's an accurate description, though. People are running around convincing themselves that the RNG text generator is alive and it's worth slapping some sense into them.
with recent models, while stitching together training data is still most of the story, there's obviously something more going on, especially with regards to code, mathematics, and with the ability to successfully produce dialogue that respects pragmatics. this has nothing to do with a real thinking being, but it's become different from a stochastic parrot.
there are deluded people who think that Claude and ChatGPT are really thinking and feeling. it is admirable to try to argue them out of this position.
but if your argument involves saying things that are not true, and anyone can easily see are not true by interacting with a consumer-grade chatbot for 10 minutes, it will just discredit you, and those people will go on believing they're talking to HAL-9000.
"Obviously" is doing some heavy lifting there. I don't think it's obvious at all.
What is obvious is people are seeing patterns in the output where there are none. We're hardwired for pattern-seeking and we're great at it. We see faces in clouds and hear words in static.
I've seen a number of people wax about how they interrogate ChatGPT until it gives them "the good stuff" and they're all just Clever Hans-ing themselves.
"It's wild that I took a whole-ass class all about differentiable functions from R^2 to itself that satisfy one specific PDE."
output from Claude (one-shot, no retries or trickery) when asked to "explain the joke":
"The joke is playing off the fact that "complex analysis" (the study of differentiable functions satisfying the Cauchy-Riemann equations - a type of PDE) sounds like it should be more complex than it is when described technically. The humor comes from reducing an entire mathematical field to "functions that satisfy one specific PDE," making it sound surprisingly narrow or trivial despite its actual depth and importance."
Wikipedia and lots of math textbooks are in the training data, so it's not so impressive that this model will produce correct statements about math.
But the joke itself was not in the training data (to the best of my knowledge, it's this guy's original joke). And complex analysis was not mentioned in the joke. Yet somehow the text generated is correct with respect to both of those things.
I see things like this quite regularly, which under the "stochastic parrots" story, can't happen.
I've tried to phrase all these sentences very carefully to not claim there is any "agent" or "intelligence" behind the Claude product. There are many explanations for how a language model like this could imitate intelligent dialogue in ways that are somewhat fake and don't generalize -- I think this is what's happening. I also see things break down all the time and the sleight-of-hand fall apart. However, it is not "stochastic parrots" any more.
I'm sure that there are people that are deluded into thinking ChatGPT loves them like a real life flesh and blood being can even when it says it can't, but we have such limited vocabulary,
especially as laymen, for describing any non-human intelligence, that saying it's thinking and reasoning aren't entirely unreasonable words to describe what it's doing. sure, it's not thinking in the same way a human would, but when a computer, pre-LLM, and even pre-Internet, is doing something that requires the user to wait, saying the computer is "thinking" is an entirely accepted practice.
So if we want to get people to stop using the words thinking and reasoning, we have to get replacement words into the lexicon. if I tell an LLM A implies B implies C, and I tell it A is true and it's able to tell me that C is thus also true, sure it's entirely due to that much logic existing in its training corpus, but unless we get to a point where I can say that ChatGPT is dot-producting an essay for me, or some other phrase, saying it's not doing "thinking" is going to fall flat on its face. Hell, Deepseek R1's output for the local model literally says <think>. It may not be thinking in a biological being sense, and it may not be reason in a biological sense to conclude C if A implies B implies C, if A is true, but we lack the common colloquial language to describe it otherwise.
completely agree. people say a classic chess engine is "thinking" when it goes deeper into the search tree, but nobody is confused about that. This colloquial language doesn't bother me so much.
But there really are people who think they are talking to something more than that. Like within a conversation with today's consumer product, they sincerely believe that an actual being is instantiated who has goals and intentions and talks to them.
Using language like "thinking" doesn't bother me. I'm not a stickler for precision language in colloquial speech.
I do think it's important to deflate the hype and give context to what you mean by "thinking" in products, technologies and so on. Calling it a "stochastic parrot" is a bit pithy but not unreasonable. Plus it's memorable.
Another trope that stands out is that someone will take a model, run a battery of tests against it and then make general statements about what LLMs can and cannot do without understanding their architecture, the training data, and the training itself.
And then they dress it up to sound scientific, when really they are making hasty generalizations to support a preconceived bias.
But what for? Human learning is becoming of diminishing utility as the machines improve. For example, I am now able to create computer programs and beautiful artwork without taking the time to master these skills. You could say that I can use art and programming as basic tools to accelerate my learning of bigger things, but whatever that bigger thing is AI is coming for it too. I can't imagine the progress the machines will achieve in 10 years. We'll be replaced.
The reason you overestimate their capabilities is because you use them for things you don’t know anything about. It’s like when your nephew made a simple HTML website for himself twenty years ago that was <h1>Hi I am Mark</h1> — it seemed impressive, but you just didn’t know that it wasn’t. Using LLMs in real world complex cases (in programming or art) instantly reveal their significant shortcomings. They are a very good nephew for making stuff that seem impressive, but a bad expert or consultant.
I'm sorry, but they don't "make progress that was deemed impossible", especially not every two months.
They were predicted to end the software engineering profession for almost four years already. And it just doesn't happen, even though they can bang out a perfect to-do list in React in a matter of seconds.
LLMs have seen incremental improvements in the quality of their responses as measured by benchmarks. The speed and cost of inference have also been improving. Despite that, there has been no major breakthrough since GPT-3.
People keep trying to make them reason, and keep failing at it.
> They were predicted to end the software engineering profession for almost four years already
ChatGPT was launched on November 30 2022. Two years and two months ago. The fact that in such a short timeframe you're talking about missed predictions is absurd, but telling of the accelerated timeframe in which we're living. The fact is that currently AI and LLMs are going through a phase of explosive improvement, to the point we can expect enormous improvements in capabilities every six months or so.
I use LLMs daily so I’m no skeptic. We are not seeing enormous improvements every 6 months; that’s hyperbolic. There has been a significant improvement since GPT-3.5, I’ll give you that, but even over those ~2 years I don’t think I’d describe the improvement as “enormous”. The capabilities are similar, with output quality improving by a noticeable degree.
GPT-2 was released November, 2019, so that's five years ago. GPT-3 isn't the advancement to look at, but ChatGPT, which didn't use GPT-3, but a newer RLHF'd model, based off GPT-3, has the 2022 launch date.
If you don’t see the difference in quality of responses between GPT-3.5 as it launched in 2022 and o1/o3 then I don’t know what to tell you. I am using these models daily and the difference is night and day.
In addition to passing bar exam[0], improved performance on medical questions[1], economics questions that experts thought it was years away from[2], all the other things marked in green on page 6 were just the changes from 3.5 to 4: https://arxiv.org/pdf/2303.08774
At this point just paste my comments into ChatGPT and ask it to explain to you what I mean by them. Then paste your response and ask it why it's not addressing the point made. At least use the tool for what it's good for.
So you're saying that it understands you better than I do?
I get that feeling too (in both directions) but this vague and hard to quantify sensation is not what I'd suggest in response to your clearly stated question:
> And what has enormously improved since ChatGPTs launch?
Which is, I think, answered by the things I listed.
It doesn’t understand me, but it could help you understand. What you listed aren’t major unexpected leaps but incremental improvements on things that already were known to be possible.
But you insist on being obstinate. ChatGPT advised me to disengage from this conversation.
ChatGPT did not ace the bar exam -- it was basically percentile graded against a group of people who mostly failed. If compared to real lawyers, it was 15th percentile on the essay portion.
15th percentile of passes, on the weakest aspect, is still a big improvement over "not passing". That improvement is what I wish to highlight.
(The observation that 48th percentile (lowest overall from your link, let alone 15th for essays) of passes corresponds to 90th percentile of all exam takers, suggests that perhaps too many humans are taking the exams before they're ready).
ChatGPT launched with GPT-3.5. We're now at o1 and o3-mini and DeepSeek-R1, but even in the last year with GPT-4 and GPT-4o, it became better than the average human at almost everything involving text. It writes better than the average person, faster and cheaper. It parses unstructured data better than the average person. There's a large number of everyday tasks for which it's perfectly reasonable today to just throw them at ChatGPT. That's all the last 1 to 1.5 years.
I dare say I'm more familiar with the capabilities of the leading models than certain big tech CEOs are, at least judging by their publicly communicated opinions.
I use 4o very often in my work and it mostly sucks. Sometimes it’s very good, sometimes it surfaces useful knowledge faster than a search engine would. Mostly it spouts out unhelpful noise (for my problems).
I’m sure if you need to make a to-do list in react it’s like magic (until the app gets complicated). In real world use, not so much.
(Also I have often code reviewed PRs from people who are heavy users and surprise surprise - their output is trash and very prone to bugs or being out of spec.)
I also think 4o sucks, but have you tried DeepSeek R1 (free on their website)? I thought it was night and day compared to 4o and o3-mini on the following topics:
- reverse engineering: when fed assembly (or decomp or a mock impl), it's consistently been able to figure out what the function actually does/why it's there from a high-level perspective, whereas ChatGPT merely states the obvious
- very technical C++ questions: DSR1 gives much more detailed answers, with bullet points and examples. Much better writing style. Slightly prone to hallucinations, but not that much
- any controversial topic: ChatGPT models are trained to avoid these because of its "safety" training
ChatGPT is a bit better (and faster) at writing simple code and doing some math, but that's it.
(obviously, common sense about what to share and not to share with these chatbots still apply, etc.)
You can access DeepSeek R1? For me, both chat and API have been down for over a week now (it shut down minutes after I topped up my account and generated an API key - I never got to use it :/).
There's lots of fiddling with these models. I found Claude 3.5 Sonnet to be superior to both GPT-4o and o1-preview in around 99% of the things I do; I only started comparing it against o3-mini, and right now it's a mixed bag. Then again, I tend to develop and refine specific prompts for Sonnet, which I haven't for o1-preview and o3-mini, so that could be a factor. Etc.
Yes, well, I live in the EU and thus can avoid US work hours and Chinese peak hours. I think availability has been a bit better since they disabled websearch (also I noticed DSR1 half a week before it made the mainstream news).
I live in the EU too. For me, the status page[0] shows a continuous API outage for the past 8 days, and it is still ongoing. Since it started, my API requests bounce back with an error, which changes seemingly at random between "unauthorized" and "insufficient balance". Neither of those reasons is valid, since I'm using a valid API key I made after creating an account, which I topped up with $20 (and have an invoice from them to prove it). I must have had mightily bad luck that the service went down soon after I generated the API key - I'm guessing my user/key is currently stuck in the middle of some migration, or possibly wasn't captured in a backup and got subsequently wiped. For now, I'll just patiently wait for them to fix their service.
AFAIK it's hosted on Chutes for free too (though limited to between 2k and 10k output tokens). Azure as well, though it might be ratelimited there (or at least it is through openrouter)
This is a good question. According to some the growth is exponential. Others think ChatGPT is basically still the same as it was at the end of 2022, minor differences aside. Why are the perspectives so different?
One of these archetypes is drowning in a hype-fueled news cycle: they mistake speculation for inevitability, dismiss skepticism as ignorance, and construct a reality where the technology's success is unquestionable.
The other is simply using the technology for what it's good for, observing that it's slowly, incrementally improving at tasks that it was already capable of since the major breakthrough, and acknowledging its limitations.
Incremental improvements don't give us any assurance that another major breakthrough is waiting around the corner.
SE is a good example - I get a lot of help from LLM tools and I think we're learning how to use them better across realistic SDLC processes as well, but we're not replacing lots of people at the moment. On the other hand I saw a business case from one of the big SI's (not my employer but in a deck that was shown by the SI in an discussion) that described the need to move their Indian software dev workforce from 350k FTE to 50K FTE over the next five years.
I think that the onshore impacts will be much lower or negligible, or possibly even positive, because so much work has been offshored already, and, as is well worn in every discussion, Jevons paradox may drive up demand significantly (to be fair, I believe this because wherever I have worked we've had 3x+ demand (with business cases) for development projects and had to arbitrarily cull 2x of it at the beginning of each year). So, just like the 30 people in India who are working on my project won't do anything useful unless we feed the work to them, the LLMs won't do anything useful either. And just like we have to send lots of work back to India because it's not right, the same is true of LLMs. The difference is that I won't spend 4 hrs on a Friday afternoon on Teams discussing it.
But this is not surprising because we've had big impacts from tools like IDE's, VM's, and compilers which have driven seismic changes in our profession, I think that LLM's are just another one of those.
What I'm watching for is an impact in a non tech domain like healthcare or social care. These are important domains that are overwhelmed with demand and riddled with makework, yet so far LLM's have made very little impact. At least, I am not seeing health insurance rates being cut, hospital waiting lists fall or money and staff being redeployed from back office functions to front line functions.
Did you read the article? Dziri and Peng are not the “skeptical AI community,” they are in fact die hard AI researchers. This is like saying people who run benchmarks to find performance problems in code are skeptics or haters.
I read the article: it does not look like very good research. It's easy to find flaws in LLMs' reasoning/compositional capabilities by looking at problems that are at the limit of what they can do now, or by picking problems that are very far from their computational model, or by submitting riddles. But there is no good analysis of the limitations, nor any inspection of how, or how much, recent LLMs have improved at exactly this kind of problem. The article is also full of uninformative and obvious observations about how LLMs fail at trivial tasks such as multiplication of large numbers.
But the most absurd thing is that the paper looks at computational complexity in terms of direct function composition, and there is no reason an LLM should use only this kind of model when emitting many tokens. Note that even when CoT is not explicit, the LLM output that starts to shape the thinking process still effectively gives it a technically unbounded number of layers. With CoT this is even more obvious.
Basically there is no bridge between their restricted model and an LLM.
I think that “part of the scientific community” actually wants to do what needs to be done: “We have to really understand what’s going on under the hood,” she said. “If we crack how they perform a task and how they reason, we can probably fix them. But if we don’t know, that’s where it’s really hard to do anything.”
Well, there appears to be evolution in human perception of capabilities of LLMs. An example, the 'stochastic parrots' notion seems to have mostly died out, at least in HN comments.
What a poorly informed article. It's very shallow and out of touch with LLM research. As it stands, 6-12 month old models are system 1 thinkers; everybody knows this and knew this even at the time. You need system 2 thinking (test-time compute) for more complex logical, algorithmic and reasoning tasks. We knew this when Daniel Kahneman wrote Thinking, Fast and Slow (over a decade ago) and we still know it today. So LLMs can think, but they have to be programmed to think (a la system 2, reasoning, thinking models). There's nothing inherently wrong or limited with LLMs themselves as far as we can tell.
This is an example of "metaphor-driven development" in AI, which Phil Agre criticized a few decades ago.
System 1/System 2 isn't a real thing. It's just a metaphor Kahneman invented for a book. AI developers continually find metaphors about the brain, decide they are real, implement something which they give the same name, decide it's both real and the same thing because they have given it the same name, and then find it doesn't work.
(Another common example is "world model", something which has never had a clear meaning, and if you did define it you'd find that people don't have one and don't need one.)
Idk, `o3-mini-high` was able to pop this Prolog code out in about 20 seconds:
solve(WaterDrinker, ZebraOwner) :-
% H01: Five houses with positions 1..5.
Houses = [ house(1, _, norwegian, _, _, _), % H10: Norwegian lives in the first house.
house(2, blue, _, _, _, _), % H15: Since the Norwegian lives next to the blue house,
house(3, _, _, milk, _, _), % and house1 is Norwegian, house2 must be blue.
house(4, _, _, _, _, _),
house(5, _, _, _, _, _) ],
% H02: The Englishman lives in the red house.
member(house(_, red, englishman, _, _, _), Houses),
% H03: The Spaniard owns the dog.
member(house(_, _, spaniard, _, dog, _), Houses),
% H04: Coffee is drunk in the green house.
member(house(_, green, _, coffee, _, _), Houses),
% H05: The Ukrainian drinks tea.
member(house(_, _, ukrainian, tea, _, _), Houses),
% H06: The green house is immediately to the right of the ivory house.
right_of(house(_, green, _, _, _, _), house(_, ivory, _, _, _, _), Houses),
% H07: The Old Gold smoker owns snails.
member(house(_, _, _, _, snails, old_gold), Houses),
% H08: Kools are smoked in the yellow house.
member(house(_, yellow, _, _, _, kools), Houses),
% H11: The man who smokes Chesterfields lives in the house next to the man with the fox.
next_to(house(_, _, _, _, _, chesterfields), house(_, _, _, _, fox, _), Houses),
% H12: Kools are smoked in a house next to the house where the horse is kept.
next_to(house(_, _, _, _, horse, _), house(_, _, _, _, _, kools), Houses),
% H13: The Lucky Strike smoker drinks orange juice.
member(house(_, _, _, orange_juice, _, lucky_strike), Houses),
% H14: The Japanese smokes Parliaments.
member(house(_, _, japanese, _, _, parliaments), Houses),
% (H09 is built in: Milk is drunk in the middle house, i.e. house3.)
% Finally, find out:
% Q1: Who drinks water?
member(house(_, _, WaterDrinker, water, _, _), Houses),
% Q2: Who owns the zebra?
member(house(_, _, ZebraOwner, _, zebra, _), Houses).
right_of(Right, Left, Houses) :-
nextto(Left, Right, Houses).
next_to(X, Y, Houses) :-
nextto(X, Y, Houses);
nextto(Y, X, Houses).
Seems ok to me.
?- solve(WaterDrinker, ZebraOwner).
WaterDrinker = norwegian,
ZebraOwner = japanese .
That's because it uses a long CoT. The actual paper [1] [2] talks about the limitations of decoder-only transformers predicting the reply directly, although it also establishes the benefits of CoT for composition.
This is all known for a long time and makes intuitive sense - you can't squeeze more computation from it than it can provide. The authors just formally proved it (which is no small deal). And Quanta is being dramatic with conclusions and headlines, as always.
LLMs using CoT are also decoder-only; it's not a paradigm shift, as people now want to claim so they don't have to admit they were wrong: it's still next-token prediction, just forced to explore more possibilities in the space it contains. And with R1-Zero we also know that LLMs can train themselves to do so.
gpt-4o, asked to produce swi-prolog code, gets the same result using very similar code. gpt4-turbo can do it with slightly less nice code. gpt-3.5-turbo struggled to get the syntax correct, but I think with some better prompting it could manage it.
CoT is definitely optional. Although I am sure all LLMs have seen this problem explained and solved in their training data.
This doesn't include encoder-decoder Transformer fusion for machine translation, or encoder-only models like BERT used for tasks such as text classification and named entity recognition.
The LLM doesn't understand it's doing this, though. It pattern matched against your "steering" in a way that generalized. And it didn't hallucinate in this particular case. That's still cherry picking, and you wouldn't trust this to turn a $500k screw.
I feel like we're at 2004 Darpa Grand Challenge level, but we're nowhere near solving all of the issues required to run this on public streets. It's impressive, but leaves an enormous amount to be desired.
I think we'll get there, but I don't think it'll be in just a few short years. The companies hyping that this accelerated timeline is just around the corner are doing so out of existential need to keep the funding flowing.
I'm certain models like o3-mini are capable of writing Prolog of this quality for puzzles they haven't seen before - it feels like a very straight-forward conversion operation for them.
My comment got eaten by HN, but I think LLMs should be used as the glue between logic systems like Prolog, with inductive, deductive and abductive reasoning being handed off to a tool. LLMs are great at pattern matching, but forcing them to reason seems like an out-of-envelope use.
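A minimal sketch of that glue pattern, assuming the LLM has already emitted a Prolog program like the one above into a file (zebra.pl is a hypothetical name) and that SWI-Prolog's swipl binary is installed. The LLM only translates prose into Prolog; all deduction is handed off to the solver:

    import subprocess

    # Hypothetical file produced by the LLM: it defines solve/2 as in the
    # transcript above. The deduction happens entirely inside SWI-Prolog.
    PROGRAM = "zebra.pl"
    GOAL = "solve(W, Z), format('water: ~w, zebra: ~w~n', [W, Z])"

    result = subprocess.run(
        ["swipl", "-q", "-g", GOAL, "-t", "halt", PROGRAM],
        capture_output=True, text=True, timeout=10,
    )
    print(result.stdout.strip())  # e.g. "water: norwegian, zebra: japanese"

The nice property is that a wrong translation usually fails loudly (no solution, or a syntax error) rather than producing a fluent but wrong answer.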
Prolog would be how I would solve puzzles like that as well. It is like calling someone weak for using a spreadsheet or a calculator.
I actually coincidentally tried this yesterday on variants of the "surgeon can't operate on boy" puzzle. It didn't help, LLMs still can't reliably solve it.
(All current commercial LLMs are badly overfit on this puzzle, so if you try changing parts of it they'll get stuck and try to give the original answer in ways that don't make sense.)
If the LLM’s user indicates that the input can and should be translated as a logic problem, and then the user runs that definition in an external Prolog solver, what’s the LLM really doing here? Probabilistically mapping a logic problem to Prolog? That’s not quite the LLM solving the problem.
Not the user you’re replying to, but I would feel differently if the LLM responded with “This is a problem I can’t reliably solve by myself, but there’s a logic programming system called Prolog for which I could write a suitable program that would. Do you have access to a Prolog interpreter, or could you give me access to one? I could also just output the Prolog program if you like.”
Furthermore, the LLM does know how Prolog’s unification algorithm works (in the sense that it can provide an explanation of how Prolog and the algorithm works), yet it isn’t able to follow that algorithm by itself like a human could (with pen and paper), even for simple Prolog programs whose execution would fit into the resource constraints.
This is part of the gap that I see to true human-level intelligence.
If an LLM can solve a riddle of arbitrary complexity that is not similar to an already-solved riddle, have the LLM solve the riddle "how can this trained machine-learning model be adjusted to improve its riddle-solving abilities without regressing in any other meaningful capability".
It's apparent that this particular riddle is not presently solved by LLMs, because if it were, humans would be having LLMs improve themselves in the wild.
So, constructively, there exists at least one riddle that doesn't have a pattern similar to existing ones, where that riddle is unsolvable by any existing LLM.
If you present a SINGLE riddle an LLM can solve, people will reply that particular riddle isn't good enough. In order to succeed they need to solve all the riddles, including the one I presented above.
It's quite the opposite. Converting to words like yours, the argument is "could a powerful but not omnipotent god make themself more powerful", and the answer is "probably".
If the god cannot grant themself powers they're not very powerful at all, are they?
Good point. LLMs can be treated as "theories", and then they definitely meet the falsifiability criterion [1], allowing researchers to keep finding "black swans" for years to come. The theories in this case can take different forms. But if the theory is that of a logical or symbolic solver, then Wolfram's Mathematica may struggle with understanding human language as input, but when it comes to evaluating the results, well, I think Stephen (Wolfram) can sleep soundly, at least for now.
There's so much talk about the advancements in AI/LLMs, yet for me ChatGPT as of this date is basically just a faster search engine without cookie banners, clickbait and ads. It hallucinates a lot and it can keep very limited context. Why is there so much promise about future progress but so little actual progress?
It's the same cycle we saw with Crypto, there's so much money flying around that the motivation to "believe" is overwhelming. The hype is coming from all directions, and people are social animals that put greater weight on words that come from multiple sources. It's also a platform for people excited about the future to fantasize, and for people terrified of the future to catastrophize.
One must now ask whether research results are analyzing pure LLMs (eg. gpt-series) or LLM synthesis engines (eg. o-series, r-series). In this case, the headline is summarizing a paper originally published in 2023 and does not necessarily have bearing on new synthesis engines. In fact, evidence strongly suggests the opposite given o3's significant performance on ARC-AGI-1 which requires on-the-fly composition capability.
It's Quanta being misleading. They mention several papers but end up with this [1], which talks about decoder-only transformers, not LLMs in general, chatbots, or LLM synthesis engines, whatever that means. The paper also proves that CoT-like planning lets you squeeze more computation out of a transformer, which is... obvious? But formally proven this time. Models trained to do CoT don't have some magical on-the-fly compositional ability; they just invest more computation (could be dozens of millions of tokens in the case of o3 solving the tasks from that benchmark).
I've managed to get LLMs to fail on simple questions that require thinking graphically - 2D or 3D.
An example would be: you have an NxM grid. How many copies of shape XYZ can you fit on it?
However, judging by transformer-based video game models, AI can be trained to have a good representation of 2D/3D worlds. I wonder how the two can be combined so that this graphical representation is used to compute text output.
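For reference, the ground truth for the simplest reading of that question (counting axis-aligned placements of a w x h rectangle, which is one way "fit" could be interpreted) is easy to compute, which makes it a convenient check against model answers; a throwaway sketch:

    def rectangle_placements(n: int, m: int, w: int, h: int) -> int:
        """Count positions where a w x h rectangle fits inside an n x m grid,
        trying both orientations when the rectangle isn't square."""
        def one(w: int, h: int) -> int:
            return max(0, n - w + 1) * max(0, m - h + 1)
        return one(w, h) if w == h else one(w, h) + one(h, w)

    print(rectangle_placements(4, 5, 2, 3))  # 9 + 8 = 17

If "fit" instead means a maximum packing of non-overlapping pieces, the problem gets much harder, which is probably part of why models flail on it.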
When one of these limitations gets spelled out in an article, it feels like six months later, somebody has a demo of a chatbot without that particular limitation.
These limitations don’t seem in any way “fundamental” to me. I’m sure there are a ton of people gluing LLMs to SAT solvers as we speak.
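The solver side of that glue is already commodity. A toy sketch with the Z3 SMT solver (a close cousin of a SAT solver, pip install z3-solver; the variable names here are purely illustrative): the LLM's job would be to turn prose constraints into assertions like these, and Z3 does the actual deduction:

    from z3 import Ints, Solver, Distinct, And, sat

    # Each nationality gets a house position 1..5, all distinct.
    people = Ints("norwegian ukrainian englishman spaniard japanese")
    norwegian = people[0]

    s = Solver()
    s.add(Distinct(*people))
    s.add(And([And(1 <= p, p <= 5) for p in people]))
    s.add(norwegian == 1)  # "The Norwegian lives in the first house"

    if s.check() == sat:
        print(s.model())  # one satisfying assignment of positions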
Could you give an example of something we recently solved that was considered an unsolvable problem six months beforehand? I don’t have any specific examples, but it seems like most of the huge breakthrough discoveries I’ve seen announced end up being overstated and for practical usage, our choice of LLM-driven tools is only marginally better than they were a couple of years ago. It seems like the preponderance of practical advancement in recent times has come from the tooling/interface improvements rather than generating miracles from the models themselves. But it could be that I just don’t have the right use cases.
Take a look at the ARC Prize, which is a test for achieving "AGI" created in 2019 by François Chollet. Scroll down halfway on the home page and ponder the steep yellow line on the graph. That's what OpenAI o3 recently achieved.
unless you train directly against solving those problems... in which case how could you theoretically design a test that could stand against training directly against the answer sheet?
I don't think people really appreciate how simple ARC-AGI-1 was, and what solving it really means.
It was designed as the simplest, most basic assessment of fluid intelligence possible. Failure to pass signifies a near-total inability to adapt or problem-solve in unfamiliar situations.
Passing it means your system exhibits non-zero fluid intelligence -- you're finally looking at something that isn't pure memorized skill. But it says rather little about how intelligent your system is, or how close to human intelligence it is.
> designed as the simplest, most basic assessment of fluid intelligence possible.
This was the goal, but that doesn't say what the test itself is. Try to get a human to solve this problem without their visual cortex; they couldn't do it. Stating your goal for a thing doesn't make the thing that goal.
AI researchers designing intelligence tests are like programmers designing their own cryptography.
How about we have people skilled in neuropsychology, psychometrics and cognitive psychology do what they are good at.
> How about we have people skilled in neuropsychology, psychometrics and cognitive psychology do what they are good at.
Disagree. The thing that we will eventually call AGI will not be human. No need to have human-specific evaluations unless you’re aiming for an artificial human and not just an artificial intelligence.
> But why ignore a huge body of research in how to write scientific tests of intelligence and cognition?
Not saying to ignore it, but we are not dealing with humans. Those tests may give misleading results as you're proposing to use them outside of their design envelope. This is an area of research in itself.
Yes, the 12% impact is significant, especially in a societal context, because it represents a shift in how people access and process information. Even without AGI, the comparison between LLMs and search engines is crucial. LLMs provide synthesized, conversational responses rather than just indexing and ranking web pages. This shift reduces the need for users to evaluate multiple sources manually, which has far-reaching implications.
You would think that the training set for the models already included enough of Mensa etc iq tests so that the model knows how to do these kinds of tests. It takes humans 2 or at most 3 examples to "get" what the test is asking for, and then they can start filling the answers to the actual questions. Meanwhile it takes hundreds of answers at least (in the public set) to train o3 to do this test.
The need for a huge training set to solve simple questions has never stopped bewildering me. I think to get a human-like intelligent model we need to figure out why humans learn from 2 examples and the models don't. But I don't mean to say that the current models aren't intelligent in their own way or aren't useful already.
Human intelligence is bootstrapped by biological evolution and the society, neither of which is fast or efficient. Truly individual part of the intelligence is tiny, it's vastly overrated and relies on these two. Similarly, LLMs perform in-context learning, which is much more efficient because it relies on the pre-baked knowledge. Yes, the generalization ability is still incomparable to humans, but it's entirely possible that much better ability is achievable by slowly bootstrapping it.
I guess what I'm probably not seeing from my vantage point is that translating into a better experience with the tools available. I just cancelled a ChatGPT plus subscription because it just didn't seem useful enough to justify the price. I absolutely understand that there are people for whom it is, but nearly everyone I see that talks a lot about the value of AI either has use cases that I don't care about such as automated "content" generation or high-volume lowish-skill code generation, or they see achieving a progressively more difficult set of benchmarks as a useful end in itself. I like copilot autocomplete when I'm coding, but the quality of that hasn't dramatically changed. I don't give a damn about benchmarks-- I only care what I get from it practically. I have absolutely no interest in using ChatGPT as a therapist or companion because I value human connection and have access to it. So far I simply don't see significant changes in what comes out vs what gets typed in for practical usage. I wouldn't give ChatGPT logic problems to solve except maybe for generating code because I know code well enough to quickly evaluate its output. If the caveat is "hey FYI this thing might hide some frustratingly plausible looking bullshit in the answer so double-check its work," then what good is it really for hard problems if you just have to re-do them anyway?
The same thing is true with image generation. Sure, it's better in ways that are sort-of meaningful for low-value professional or hobby usage, but it's barely budged the barriers to becoming good enough for high-end media production.
I totally believe that this technology is improving and when you're looking at it in isolation, those improvements seem meaningful. But I just don't see that yet translating into things most of the general public can sink their teeth into. With things like the (still) shitty google search "enhancements", and users being forced into AI-driven chat workflows or having big loud not-really-useful UI elements dedicated to AI features, in some ways they've made people's experience using computers meaningfully worse.
Just like with Mastodon, I see a huge disconnect with the tech crowd's excitement with what's happening with the technology, and how that ends up working for users that need to actually solve their problems with that technology.
By the time these academic studies get published they are usually already several months out of date. o3-mini was released yesterday, and if one wants to know about the limitations of current technology they are much better off checking Twitter than some research paper.
I think the breathless hype train of twitter is probably the worst place to get an actually grounded take on what the real world implications of the technology is.
Seeing the 100th example of an llm generating some toy code for which there are a vast number of examples of approximately similar things in the training corpus doesn’t give you a clearer view of what is or isn’t possible.
I think that most of the developers who advocate for AI coding have never worked all by themselves on projects with over 500/1000 files. Because if they had they would not advocate for AI coding.
I posted this earlier, but I wanted a java port of sed for ... Reasons, and despite the existence of man pages and source code it couldn't do anything but the most basic flags.
IMO this should be low-hanging fruit. Porting a non-trivial but small set of 3-4 core code files that are already debugged and have a specified interface should be exactly what an LLM excels at.
I tried this with Microsoft's Copilot + the Think Deeper button. That allegedly uses the new o1 model. It goes into a lot of fancy talk about...pretty much what you said older models did. Then it said "here's some other stuff you could extend this with!" and a list of all the important sed functionality.
It's possible it could do it if prompted to finish the code with those things, but I don't know the secret number of fancy o1 uses I get and I don't want to burn them on something that's not for me.
You should be able to access it here if you have a Microsoft account and want to try the button: https://copilot.microsoft.com/
I have a good use case for them: communication with the bureaucracy of my country. I tell my LLM of choice to write a letter to $whoever about $whatever, then I print it out (yes, we still have to do this, as emails don't get accepted) and send it off. I don't even need to proofread it, because if there's a mistake the bureaucracy will tell me in another letter. So the burden of correctness checking is on some bureaucrat, which saves me time and mental resources.
I wouldn't ever use a LLM for anything where correctness matters (code) because I'd spend the same amount of time checking the generated code as writing it myself. But a letter to my tax office? Heck, why not. If something goes really wrong I can always say "gee, I made a mistake let's try it again".
So what, you use it to spam and waste other people's time? I know, dealing with government bureaucracy and corruption is soul-leeching, but spam was always one of the golden use cases for generative AI.
Sending official letters to the local government isn't spam, and generally not a waste of time.
For people with cognitive issues, issues typing, or language or presentation issues, LLMs provide a massive improvement in how they are perceived and received by the other side. Also, immigrants or people with language issues aren't quite as disadvantaged and don't need to spend excess time translating or risk an embarrassing misstatement. It's a night-and-day accommodation tool in the right circumstances.
No, I don't just send them random letters. I reply to mail I get from them or when I need them to do something (like adjust my tax pre-pay).
Also one could argue that bureaucracies only exist to create bullshit jobs and waste citizens' time. So I wouldn't even feel bad about spamming those assholes.
Gödel and Turing just proved that there are some true things that can't be proved, and things that cannot be computed. They didn't show where those boundaries are.
They certainly didn't show those boundaries to be below human cognition level.
Gödel proved that there are unprovable statements. Turing showed that certain classes of problems can only be solved by machines with infinite tapes. Thus no bounded LLM can possibly solve every problem that a Turing machine can solve. Only a theoretically infinite chain of thought could possibly get us that power.
Gödel then tells us that, if we have such a system, there are things where this system may get stuck.
Indeed this is what we see in chain of thought models. If you give them an impossible problem they either give up or produce a seemingly infinite series of tokens before emitting the </think> tag.
Turing tells us that examining any set of matrices modeling a finite state machine over an infinite token stream is the halting problem.
Theoretical computability is of dubious practical relevance.
Consider two problems:
Problem A is not computable
Problem B is computable in principle, but, even for trivially sized inputs, the best possible algorithm requires time and/or space we’ll never have in practice, orders of magnitude too large for our physical universe
From a theoretical computer science perspective, there is a huge difference between A and B. From a practical perspective, there is none whatsoever.
The real question is “can AIs do anything humans can do?” And appealing to what Turing machines can or can’t do is irrelevant, because there are a literally infinite number of problems which a Turing machine can solve, but no human nor AI ever could
So the article is about what humans vs LLMs can do, except in the article, LLM is taken to mean just a single-output autoregressive model (no chain of thought). Since an LLM has a constant number of steps at each token generation, no, it cannot do everything a human can. Humans can choose when to think and can ponder the next action interminably. That's my point. When we force LLMs to commit to a particular answer by forcing an output at each token generation, the class of problems they can solve is trivially smaller than that of an equivalent human.
I agree that a raw autoregressive LLM model with just a single output is (almost necessarily) less capable than humans. Not only can we ponder (chain of thought style), we also have various means available to us to check our work – e.g. for a coding problem, we can write the code, see if it compiles and runs and passes our tests, and if it doesn't, we can look at the error messages, add debugging, try some changes, and do that iteratively until we hopefully reach a solution–or else we give up – which the constraint "single output" denies.
I don't think anyone is actually expecting "AGI" to be achieved by a model labouring under such extreme limitations as a single output autoregressive LLM is. If instead we are talking about an AI agent with not just chain of thought, but also function calling to invoke various tools (including to write and run code), the ability to store and retrieve information with a RAG, etc – well, current versions of that aren't "AGI" either, but it seems much more plausible that they might eventually evolve into it.
I don't think we need to invoke Turing or Gödel in order to make the point I just made, and I think doing so is more distracting with irrelevancies than actually enlightening.
Yeah, the grounded take is that Turing and Gödel apply just as much to human intelligence. If not, someone please go ahead and use this to physically prove the existence of an immortal, hypercomputational soul.
That's not the point. Computer program with a finite number of steps (an auto regressive LLM without chain of thought) has a limit in what it can reason in one step. This article does a lot of wordcelling to show this obvious point.
That seems irrelevant to Gödel? If that was your point, you should have said that rather than the things about Turing and Gödel (which leads people to expect you are talking about the halting problem and incompleteness, not the limitations that come from a limited depth circuit)
I don't think individual examples can settle these kinds of discussions; for me, the amount of thinking can easily vary by 6x with exactly the same input and parameters.
> Chatbot Software Begins to Face Fundamental Limitations
> Recent results show that large language models struggle with compositional tasks, suggesting a hard limit to their abilities.
Your first question with anything like this should always be WHICH MODELS:
> For our experiments, we evaluate the performance of 6 LLMs: GPT4 (gpt-4) [58], ChatGPT
(GPT3.5-turbo) [57], GPT3 (text-davinci-003) [11], FlanT5 [17] and LLaMa [75].
This is ancient. This research was done centuries ago. This is research about the possibility of isotopes, written about radium in 1903, published in 1946. It is a criminal level of journalistic malpractice to leave uninformed readers with the impression that this is where AI stands yesterday.
I'd like to propose a modified Zebra puzzle. Still has unique solutions. Not on the internet, so not part of the training set. Similar problem description complexity. Just remove the last constraint (Norwegian lives next to the blue house) and asked negated questions: Who cannot be the water drinker? Who cannot own the zebra? Text of problem here: sandiway.arizona.edu/mzebra.pdf
(I tested it on free ChatGPT and DeepSeek-R1 https://youtu.be/3gotauWUcew)
This is ultimately a basic adaptation of the pigeonhole principle and is not surprising. A finite system of matrix multiplications cannot be Turing complete. You cannot expect one trip through a series of matrix multiplications and bias additions, with a final sampling at the end that commits it to a certain answer, to always produce a correct answer. It's a mathematical impossibility. No talk of quantum woo, emergent phenomena, or whatever other pseudo-science has arisen to explain AI intelligence can get around this simple truth of mathematics.
However, chain-of-thought reasoning, where token streams can continue ad infinitum, could potentially solve large swaths of problems whose ordinary solutions require Turing machines. It could also solve problems that cannot generally be solved by Turing machines, in cases where you only need solutions for a few classes of inputs.
Either way, even with chain of thought, you would expect that... in some instances, the model output diverges and does not complete. And unsurprisingly... this is exactly what you see with the DeepSeek models (and other CoT models) when you pose it difficult questions. It will never emit the </think> tag.
I'm a software developer with just a basic understanding of LLMs.
But it's not surprising to me that LLMs wouldn't be good at composition and planning.
To me, LLMs seem like half a brain, the right half, that does pattern matching and such.
To me, the current crop of AI seems to be missing the other half of the brain, the part that's good at planning and composition and such.
LLMs can tell me all I want to know about logic and deduction, so it's funny that LLMs can't understand their own words.
> multilayer transformers indeed cannot solve certain complicated compositional tasks
> chain-of-thought prompting essentially turns a large problem into a sequence of smaller problems, making it possible for transformers to tackle more complex compositional tasks
--- [out of order]
> the model could be trained on 20-digit numbers and still reliably (with 98% accuracy) add 100-digit numbers, whereas a model trained without the extra positional embedding was only about 3% accurate
98% seems pretty damn unreliable to me given that a computer ought to be able to do that addition with probably about 12 nines instead of 1 without really putting much effort into reliability.
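For comparison, the conventional-software baseline here is effectively exact: Python integers are arbitrary precision, so 100-digit addition is deterministic and error-free without any special effort (a throwaway illustration, not a benchmark):

    import random

    # Two random 100-digit integers; the sum is exact every single time.
    a = random.randrange(10**99, 10**100)
    b = random.randrange(10**99, 10**100)
    assert len(str(a)) == len(str(b)) == 100
    print(a + b)

That's the gap the 98% figure has to be judged against when the task is arithmetic rather than language.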
Over my life I've been "trained" that humans are unreliable and subject-to-change while computers are deterministic and will give the same output every time (given the same input.)
LLMs though are very human. Ask it the same question and you'll get different answers. Sometimes grammatically different, sometimes factually different.
This is partly why hallucinations freaked people out. I expect a correct, and deterministic answer. And it just made stuff up. Then it made up URLs to validate the answer.
In other words you have computers behaving like humans. Which was indeed the whole point of them to begin with.
Can you add 100 digit numbers reliably in a short amount of time over a large sample size? How about without a piece of paper and a pen? Can you even remember a 100 digit number? It’s likely quite difficult…
This is not about comparing to reliability “of a computer” but rather to a process which responds to a wide array of natural language queries (both human processes an algorithmic processes)… and obviously the reliability goes up once tool use is included.
At pace though? What would your throughput be to maintain 98% reliability? How long do you think one execution would take? my guess would be between 30s-1m for a single summation of two 100 digit numbers? So let’s say you want a sample size of 200 operations to estimate reliability… maybe something like 2-4hours to get through the computations working non stop? That’s actually a nontrivial amount of focus on a highly repetitive task, which isn’t easy for a lot of people. I’m now genuinely curious what the average HN user’s reliability and throughput would be for adding together two random hundred digit numbers working as fast as they can.
The point made by a commenter a few levels up questioned whether 98% reliability for adding two 100-digit numbers together should be considered noteworthy/acceptable etc.
My point is that assessing 98% reliability qualitatively depends on whatever kind of baseline system you are comparing against.
Obviously this reliability (and speed) of an LLM with 98% accuracy is atrocious compared to any basic automated arithmetic computation system. If you need to reliably add together 100 digit numbers, of course an LLM is a bad choice.
However, comparing it to the general class of systems which an LLM belongs to - ie “things that can add 100-digit numbers together quickly and also tell you about this history of the Ottoman Empire and a bunch of other stuff too, all within a relatively short amount of time” - this reliability is (potentially) impressive. We don’t have many such systems - a moderately educated human; a generic bespoke piece of software which process natural language requests and attempts to decide whether or not to retrieve information from a database/the internet or write and execute code, but could conceivably have been written in a crude manner 5 years ago; and LLMs (which might also take advantage of tool use as suggested in the previous example).
In this context - comparing to other systems with similar capabilities - the 98% reliability might be considered impressive, especially when you consider that it likely is much higher for more common simpler arithmetic problems.
My intention in bringing up “an average HN user” was to bring up an example of a reasonable benchmark to compare to, in my opinion. A HN user is a stand-in for a moderately educated person, which could reasonably respond to a variety of generic natural language requests in a short amount of time. My point was simply that adding together 100 digit numbers reliably and quickly while also being able to chat about Shakespeare or Kalman filters or planning a wedding is likely a more difficult task than we tend to give credence to, due to our familiarity with specialized systems which do arithmetic extremely well and extremely quickly.
Maybe it's just me, but when I code/program I typically think of how different implementation variations would affect the ROI of the company I'm working for.
That's something that AI in the current state probably wouldn't be able to take into account.
Isn't the math calculation thing solvable through integration with a simple Python script for basic math and, for more complex stuff, something like an open-source version of Wolfram?
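That's the standard tool-use pattern, yes: have the model emit an expression and let a deterministic evaluator compute the result. A minimal sketch of such an evaluator (the function name and the operator whitelist are illustrative, not any particular product's API):

    import ast, operator

    # Whitelisted arithmetic operators; anything else is rejected.
    OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
           ast.Div: operator.truediv, ast.Pow: operator.pow, ast.USub: operator.neg}

    def evaluate(expr: str):
        """Safely evaluate a plain arithmetic expression like '2**100 + 17'."""
        def walk(node):
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
                return OPS[type(node.op)](walk(node.operand))
            raise ValueError("unsupported expression")
        return walk(ast.parse(expr, mode="eval").body)

    print(evaluate("123456789 * 987654321 + 42"))  # exact integer arithmetic

For symbolic work you'd swap the evaluator for something like SymPy, which is roughly the open-source Wolfram-ish option people reach for.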
H01 There are five houses.
H02 The Englishman lives in the red house.
H03 The Spaniard owns the dog.
H04 Coffee is drunk in the green house.
H05 The Ukrainian drinks tea.
H06 The green house is immediately to the right of the ivory house.
H07 The Old Gold smoker owns snails.
H08 Kools are smoked in the yellow house.
H09 Milk is drunk in the middle house.
H10 The Norwegian lives in the first house.
H11 The man who smokes Chesterfields lives in the house next to the man with the fox.
H12 Kools are smoked in a house next to the house where the horse is kept.
H13 The Lucky Strike smoker drinks orange juice.
H14 The Japanese smokes Parliaments.
H15 The Norwegian lives next to the blue house.
Now,
Q1 Who drinks water?
Q2 Who owns the zebra?
I ran a few replacements because the puzzle is probably in the training data somewhere:
Englishman -> Scotsman
Spaniard -> Welshman
dog -> parrot
Coffee -> Kombucha
green -> scarlet
tea -> butterscotch
Ukrainian -> Romanian
ivory -> pink
snails -> scorpions
yellow -> turquoise
Milk -> Red Bull
Norwegian -> Brazilian
fox -> bear
horse -> mule
orange juice -> rum
Japanese -> German
blue -> brown
red -> purple
Here's my new prompt after replacing those words:
H01 There are five huts.
H02 The Scotsman lives in the purple hut.
H03 The Welshman owns the parrot.
H04 Kombucha is drunk in the scarlet hut.
H05 The Romanian drinks butterscotch.
H06 The scarlet hut is immediately to the right of the pink hut.
H07 The Old Gold smoker owns scorpions.
H08 Kools are smoked in the turquoise hut.
H09 Red Bull is drunk in the middle hut.
H10 The Brazilian lives in the first hut.
H11 The man who smokes Chesterfields lives in the hut next to the man with the bear.
H12 Kools are smoked in a hut next to the hut where the mule is kept.
H13 The Lucky Strike smoker drinks rum.
H14 The German smokes Parliaments.
H15 The Brazilian lives next to the brown hut.
Now,
Q1 Who drinks water?
Q2 Who owns the zebra?
Then I ran the new prompt through o3-mini-high and it gave this answer (after a thought process that looks legit to me, like it actually did reason through the puzzle and not just spit out an answer it had seen):
Q1: Who drinks water?
A: The Brazilian (in Hut 1) drinks water.
Q2: Who owns the zebra?
A: The German (in Hut 5) owns the zebra.
Undoing my replacements, that's the Norwegian drinking water and the Japanese person owning a zebra. Which I think is the solution from the original paper.
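For what it's worth, this kind of surface renaming is easy to script, which makes it cheap to churn out fresh variants. A small sketch mirroring the substitutions above (word-boundary, case-sensitive matching avoids accidents like a later "red" -> "purple" pass clobbering the "Red Bull" that an earlier rule introduced):

    import re

    REPLACEMENTS = {
        "Englishman": "Scotsman", "Spaniard": "Welshman", "dog": "parrot",
        "Coffee": "Kombucha", "green": "scarlet", "tea": "butterscotch",
        "Ukrainian": "Romanian", "ivory": "pink", "snails": "scorpions",
        "yellow": "turquoise", "Milk": "Red Bull", "Norwegian": "Brazilian",
        "fox": "bear", "horse": "mule", "orange juice": "rum",
        "Japanese": "German", "blue": "brown", "red": "purple",
        "houses": "huts", "house": "hut",
    }

    def rewrite(puzzle: str) -> str:
        for old, new in REPLACEMENTS.items():
            puzzle = re.sub(rf"\b{re.escape(old)}\b", new, puzzle)
        return puzzle

    print(rewrite("The Englishman lives in the red house."))
    # -> "The Scotsman lives in the purple hut."

(The house -> hut swap is included because the rewritten prompt above uses "huts".)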
I just did a pass with some replacements with o1 and it very much still recognized it as the Einstein riddle and actually seems to have cheated a bit :)
"Revisiting assumptions
Considering "Camels" might be a mistake for "Kools," leading to confusion. This inconsistency complicates solving the puzzle, showing the need for careful brand assignment."
Tracking puzzle progress
I’m mapping out various house and nationality combinations, but the classic conclusion is the Norwegian drinks water and the Japanese owns the zebra.
Analyzing the arrangement
I’m working through the classic puzzle structure and noting variations, while consistently identifying the Norwegian drinking water and the Japanese owning the zebra as the final solution."
Hah, that's fun. My o3-mini-high transcript didn't hint that it recognized the puzzle, and it looked legit when I scanned through it, but I'm still very suspicious since this is evidently such a classic puzzle.
I should have changed the cigarette brands to something else too.
If you want to make a cosmetic change to the puzzle, you might try eliminating the massive quantity of implicit information in "the green house is immediately to the right of the ivory house".
After doing some substitutions on what it means to be in positions 1/2/3/4/5:
A. If the ivory house is in London, the green house is in Madrid.
B. If the ivory house is in Madrid, the green house is in Kiev.
C. If the ivory house is in Kiev, the green house is in Oslo.
D. If the ivory house is in Oslo, the green house is in Tokyo.
E. The ivory house is not in Tokyo.
9. Milk is drunk in Kiev.
11(A). If the man who smokes Chesterfields lives in Tokyo, the man with the fox lives in Oslo.
12(A). If the man with the horse lives in Oslo, Kools are smoked in either Tokyo or Kiev.
15(A). If the blue house is in Madrid, the Norwegian lives either in London or in Kiev.
[...]
Another easy change is to exchange categories. Swap the animals for the drinks and instead of "the Spaniard owns the dog" and "the Ukrainian drinks tea", you'll have "the Spaniard drinks tea" and "the Ukrainian owns the fox" (depending on which equivalences you decide on). It won't make any difference to the puzzle, but it will permute the answer.
Try flipping the order, adding a few nonsense steps and combining 2 steps into one and also splitting a single step into two. And then see what happens and post it here. :-)
llm -m groq/deepseek-r1-distill-llama-70b '
H01 There are five huts.
H02 The Scotsman lives in the purple hut.
H03 The Welshman owns the parrot.
H04 Kombucha is drunk in the scarlet hut.
H05 The Romanian drinks butterscotch.
H06 The scarlet hut is immediately to the right of the pink hut.
H07 The Old Gold smoker owns scorpions.
H08 Kools are smoked in the turquoise hut.
H09 Red Bull is drunk in the middle hut.
H10 The Brazilian lives in the first hut.
H11 The man who smokes Chesterfields lives in the hut next to the man with the bear.
H12 Kools are smoked in a hut next to the hut where the mule is kept.
H13 The Lucky Strike smoker drinks rum.
H14 The German smokes Parliaments.
H15 The Brazilian lives next to the brown hut.
Now,
Q1 Who drinks water?
Q2 Who owns the zebra?'
I don't - but like I said, I reviewed the thought process in the transcript and it looked legit to me.
I'm not sure what else I could do here to be honest, without coming up with a completely new puzzle that captures the same kind of challenge as the original. I'm not nearly patient enough to do that!
ClaudeAI responded:
>>>
After working through all constraints:
Q1: Who drinks water?
The German drinks water.
Q2: Who owns the zebra?
The Scotsman owns the zebra.
<<<
(Not fair because it is in the training set, but o3-mini-high gives a thorough answer after 66s)
Below is a step‐by‐step reasoning that leads to the well‐known solution of the “Zebra Puzzle.”
Step 1. Number the Houses 1 to 5
We assume the houses are arranged in a row from left to right, with House 1 on the far left and House 5 on the far right.
Step 2. Assign the Easy Clues
1. H10: “The Norwegian lives in the first house.”
→ House 1: Nationality = Norwegian.
2. H15: “The Norwegian lives next to the blue house.”
House 1 has only one neighbor (House 2), so
→ House 2: Color = Blue.
3. H09: “Milk is drunk in the middle house.”
→ House 3: Beverage = Milk.
Step 3. Determine the House Colors with the Ivory/Green Pair
• H06: “The green house is immediately to the right of the ivory house.”
There are two possible placements for this pair: either Houses (3,4) or Houses (4,5).
The option (3,4) would force House 3 to be ivory, but later clues will assign House 3 a color (red, see H02) for the Englishman.
So we take the pair as Houses 4 and 5:
• House 4: Color = Ivory.
• House 5: Color = Green.
• H04: “Coffee is drunk in the green house.”
→ House 5: Beverage = Coffee.
• The remaining colors for Houses 1–3 are now Red, Yellow, and (already used) Blue.
H02: “The Englishman lives in the red house.”
Since House 1 is taken by the Norwegian and House 2 is blue, the only possibility is:
→ House 3: Color = Red, Nationality = Englishman.
• The only color left for House 1 is Yellow.
H08: “Kools are smoked in the yellow house.”
→ House 1: Color = Yellow, Cigarette = Kools.
Step 4. Assign the Remaining Nationalities and Beverages
The nationalities so far:
• House 1: Norwegian
• House 3: Englishman
Remaining nationalities: Ukrainian, Spaniard, Japanese.
• H05: “The Ukrainian drinks tea.”
House 2 already is blue and unassigned for nationality and beverage, so:
→ House 2: Nationality = Ukrainian, Beverage = Tea.
Now for Houses 4 and 5, the remaining nationalities are Spaniard and Japanese.
• H03: “The Spaniard owns the dog.”
And
• H14: “The Japanese smokes Parliaments.”
We still have the beverages and cigarettes to assign. Notice that House 5 (green) already has its beverage (coffee). It is natural then to assign:
• House 5: Nationality = Japanese, Cigarette = Parliaments.
This leaves:
• House 4: Nationality = Spaniard, and by H03,
→ House 4: Pet = Dog.
Step 5. Use the Clues about Cigarettes and Pets
The remaining cigarette brands to assign (across Houses 2, 3, and 4) are: Old Gold, Chesterfields, and Lucky Strike.
• H13: “The Lucky Strike smoker drinks orange juice.”
House 5 already drinks coffee; House 3 has milk; House 2 has tea; so the only house with an unassigned beverage is House 4.
→ House 4: Beverage = Orange juice, Cigarette = Lucky Strike.
• Now, the remaining cigarettes for Houses 2 and 3 are: Old Gold and Chesterfields.
• H07: “The Old Gold smoker owns snails.”
• H11: “The man who smokes Chesterfields lives in the house next to the man with the fox.”
• H12: “Kools are smoked in a house next to the house where the horse is kept.”
House 1 (which smokes Kools) has only one neighbor, House 2, so:
→ House 2: Pet = Horse.
Now, consider Houses 2 and 3 for the remaining cigarette assignments:
• If House 2 were the Old Gold smoker, then by H07 it would have to own snails—but it already has the horse.
So, House 2 must smoke Chesterfields, and consequently,
→ House 2: Cigarette = Chesterfields.
• This forces:
→ House 3: Cigarette = Old Gold, and by H07,
→ House 3: Pet = Snails.
Finally, H11: “The man who smokes Chesterfields lives in the house next to the man with the fox.”
House 2 (smoking Chesterfields) is adjacent to Houses 1 and 3.
House 3 already has snails, so
→ House 1: must have the fox.
→ House 1: Pet = Fox.
The only pet left is the zebra. The only house without a pet now is House 5, so:
→ House 5: Pet = Zebra.
Step 6. Assign the Final Beverage
At this point, the beverages are:
• House 2: Tea
• House 3: Milk
• House 5: Coffee
• House 4: Orange juice
The only beverage not yet used is water. The only house still without a beverage is House 1, so:
→ House 1: Beverage = Water.
Final House Assignments
• House 1:
Color: Yellow
Nationality: Norwegian
Beverage: Water
Cigarette: Kools
Pet: Fox
• House 2:
Color: Blue
Nationality: Ukrainian
Beverage: Tea
Cigarette: Chesterfields
Pet: Horse
• House 3:
Color: Red
Nationality: Englishman
Beverage: Milk
Cigarette: Old Gold
Pet: Snails
• House 4:
Color: Ivory
Nationality: Spaniard
Beverage: Orange juice
Cigarette: Lucky Strike
Pet: Dog
• House 5:
Color: Green
Nationality: Japanese
Beverage: Coffee
Cigarette: Parliaments
Pet: Zebra
Cross Check Against the Clues
• H02: Englishman in the red house → House 3 is red and English.
• H03: Spaniard owns the dog → House 4 (Spaniard) has a dog.
• H04: Coffee in the green house → House 5 (green) has coffee.
• H05: Ukrainian drinks tea → House 2 (Ukrainian) has tea.
• H06: Green house is immediately right of the ivory house → Houses 4 (ivory) and 5 (green).
• H07: Old Gold smoker owns snails → House 3 (Old Gold) has snails.
• H08: Kools in the yellow house → House 1 (yellow) smokes Kools.
• H09: Milk in the middle house → House 3 has milk.
• H10: Norwegian in House 1 → correct.
• H11: Chesterfields smoker (House 2) lives next to fox owner (House 1).
• H12: Kools (House 1) next to horse (House 2).
• H13: Lucky Strike smoker (House 4) drinks orange juice.
• H14: Japanese smokes Parliaments → House 5.
• H15: Norwegian (House 1) lives next to blue house (House 2).
All clues are satisfied.
Final Answers
• Q1: Who drinks water?
→ The Norwegian in House 1 drinks water.
• Q2: Who owns the zebra?
→ The Japanese in House 5 owns the zebra.
Sources and Further Reading
• For a classic discussion of this puzzle and its variations, see “Einstein’s Riddle” as presented in various logic puzzle collections and articles such as those found on Brainzilla or in books on lateral thinking puzzles.
This structured reasoning confirms the answers with a high degree of certainty.
"showing that multilayer transformers indeed cannot solve certain complicated compositional tasks. Basically, some compositional problems will always be beyond the ability of transformer-based LLMs."
Pretty sure this is just false and the paper doesn't show this. I could be misunderstanding, but it looks like the result is only about a single token/forward pass, not a reasoning model with many thousands of tokens like o1/o3
I'm not sure that the statement "some compositional problems will always be beyond the ability of transformer-based LLMs" is even controversial to be honest.
There's a reason all of the AI labs have been leaning hard into tool use and (more recently) inference-scaling compute (o1/o3/Gemini Thinking/R1 etc) recently - those are just some of the techniques you can apply to move beyond the unsurprising limitations of purely guessing-the-next-token.
Huh, yeah that's a good point. The various distilled R1 models are definitely regular transformer-based LLMs because the GGUF file versions of them work without any upgrades to the underlying llama.cpp library.
Quite deep and informative. One thing to point out here is how they measure accuracy when testing their LLMs. Like most neural networks, the n-dimensional space of LLMs is largely sparse. In simple terms, this means that it’s very easy to “exercise” the model in areas where it’s weak. That doesn’t mean that the model doesn’t have the potential of doing a (much) better job if you “exercise” its space properly—which is exactly what techniques like CoT basically do. LLMs are inherently limited, and they don’t have “true” reasoning capabilities—beyond marketing hype, this much should be obvious to most serious ML researchers today. The only question is how well we can get them to “mimic” reasoning for practical applications, and this is where “prompt engineering”, strictly speaking, is a true form of engineering, which has to take into account the mathematical foundations of the models and figure out how to extract out of them the best performance they can deliver.
It feels like every few weeks someone puts out a new article breathlessly talking about how bad the fancy autocorrect bot is at math and logic like it's brand new information. How have we as a society not gotten this through our skulls yet?
Do you really need a study to work out that a Markov chain can’t solve problem questions? It feels pretty intuitive.
LLMs are not intelligent. Human intelligence is stored in text, so mimicking text mimics that intelligence. The sheer quantity of text means they’ve probably seen something similar to the prompt.
If LLMs were an intelligent system they wouldn’t need to steal 1000TB of text and media content. Have you ever seen a person require a million books? It shows that the learning is brute force rather than intelligent.
It’s close, but it’s just not there. The methodology is wrong.
The Markov chain wouldn't fit in a googol of observable universes. The analogy is absolutely misleading. There's no reason to think that the chain of that humongous size cannot capture at least something of what humans with their limited brain size can do.
It cannot capture non in-context learning, true. But we have tools to modify the "Markov chain".
I guess your point is that it's obvious, but I'd argue it's not.
A lot of people are hyping up LLMs as the solution to all our problems, and there has been a lot of discussion around what the reasonable use cases are (and a lot of anecdotal evidence trying to argue and support certain stances). My point is that we need (scientific) proof to move forward and create confidence in the "facts" we base our decisions on. Obviously that doesn't seem to be very fashionable nowadays - on both sides of the LLM discussion there seem to be a lot of people arguing hard for their stances. I'm very glad some are working on bringing some facts into this discussion.
(one can argue about science vs truth, but regardless it's a step in the right direction/a good motive)
(this is btw why a solid educational foundation, not only skill focused, is very critical for society. cf democracy)