This highlights something I've seen with LLMs generally: they make different mistakes than humans do, which makes catching the errors much more difficult.
What I mean by this is that we have thousands of years of experience catching human mistakes. As such, we're really good at designing systems that catch (or work around) human mistakes and biases.
LLMs, while impressive and sometimes less mistake-prone than humans, make errors in a fundamentally different manner. We just don't have the intuition and understanding of the way that LLMs "think" (in a broad sense of the word). As such, we have a hard time designing systems that account for this and catch the errors.
This is, I think, a better way to think about LLM mistakes compared to the usual "hallucinations". I think of them as similar to human optical illusions. There are things about the human visual cortex (and other sensory systems too, see the McGurk Effect [0]) that, when presented with certain kinds of inputs, will consistently produce wrong interpretations/outputs. Even when we are 100% aware of the issue, we can't prevent our brains from generating the incorrect interpretation.
LLMs seem to have similar issues, but along dramatically different axes; axes along which humans are not used to seeing these kinds of mistakes. Nearly no human would make this kind of mistake, so we interpret it (in my opinion incorrectly) as a lack of ability or intelligence.
Because these are engineered systems, we may figure out ways to solve these problems (although I personally think the best we will ever do is decrease their prevalence), but more important is probably learning to recognize the places where LLMs are likely to make these errors and, as your comment suggests, designing workflows and systems that can deal with them.
LLMs are incapable of solving even simple logic puzzles or maths puzzles they haven't seen before; they don't have a model of the world, which is key to intelligence. What they are good at is reproducing things in their dataset with slight modifications and (sometimes) responding to queries well, which makes them seem creative, but there is no understanding or intelligence there, in spite of appearances.
They are very good at fooling people; perhaps Turing's Test is not a good measure of intelligence after all, it can easily be gamed and we find it hard to differentiate apparent facility with language and intelligence/knowledge.
I think it's not very helpful to just declare that such a model doesn't exist: there's a decent amount of evidence that LLMs do in fact form models of the world internally, and use them during inference. However, while these models are very large and complex, they aren't necessarily accurate, and LLMs struggle to actually manipulate them at inference time; forming new models or adjusting existing ones is something they're quite bad at at that stage. The result is the 'high knowledge' that impresses people and is often confused with intelligence, while the systems remain fundamentally quite dumb despite a huge depth of knowledge. I don't think you can categorically say 'zero intelligence' - even relatively simpler and less capable systems can be said to have some intelligence - it's just that in many respects LLM intelligence is still worse than that of a good fraction of mammals.
What evidence are you referring to? I've seen AI firms desperate for relevance and the next advance implying that thinking is going on and talking about it a lot in those terms, but no actual evidence of it.
I wouldn't say zero intelligence, but I wouldn't describe such systems as intelligent, I think it misrepresents them, they do as you say have a good depth of knowledge and are spectacular at reproducing a simulacrum of human interactions and creations, but they have been a lesson for many of us that token manipulation is not where intelligence resides.
Must it have one? The words "artificial intelligence" are a poor description of a thing we've not rigorously defined. It's certainly artificial, there's no question about that, but is it intelligent? It can do all sorts of things that we consider features of intelligence and pass all sorts of tests, but it also falls flat on its face when prompted with a just-so brainteaser. It's certainly useful, for some people. If, by having inhaled all of the Internet and the books that have been scanned as its training data, it's able to generate essays on anything and everything at the drop of a hat, why does it matter if we can find a brainteaser it hasn't seen yet? It's like it has a ginormous box of Legos, and it can build whatever you ask for with those Lego blocks, but pointing out that it's unable to create its own Lego blocks from scratch has somehow become critically important, as if that makes this all a total dead end and it's all a waste of money, omg people wake up, oh if only they'd listen to me. Why don't people listen to me?
Crows are believed to have a theory of mind, and they can count up to 30. I haven't tried it with Claude, but I'm pretty sure it can count at least that high. LLMs are artificial, they're alien; of course they're going to look different. In the analogy where they're simply a next-word guesser, one imagines standing at a fridge with a bag of magnetic words and just pulling a random one from the bag to make ChatGPT. But when you put your hand inside a bag inside a bag inside a bag, twenty times (to represent the dozens of layers in an LLM), and there are a few hundred million pieces in each bag (for the parameters per layer), one imagines that there's a difference; some sort of leap, similar to when life evolved from a single-celled bacterium to a multi-cellular organism.
Or maybe we're all just rubes, and some PhD's have conned the world into giving them a bunch of money, because they figured out how to represent essays as a math problem, then wrote some code to solve them, like they did with chess.
> it's able to generate essays on anything and everything
I have tried various models out for tasks ranging from writing, to music, to programming, and am not impressed with the results, though they are certainly very interesting. At every step they will cheerfully tell you they can do things, then generate nonsense and present it as truth.
I would not describe current LLMs as able to generate essays on anything - they certainly can produce something, but it will be riddled with cliche, the average of the internet content they were trained on with no regard for quality, and, worst of all, it will contain incorrect or made-up data.
AI slop is an accurate term when it comes to the writing ability of LLMs - yes, it is superficially impressive in mimicking human writing, but it is usually vapid or, worse, wrong in important ways, because again, there is no concept of right and wrong or model of the world that it attempts to make the generated writing conform to. It gets stuck on some very simple tasks and often happily generates entirely bogus data (for example, ask it for a CSV or table of data, or to reproduce the notes of a famous piece of music that should be in its training data).
Perhaps this will be solved, though after a couple of years of effort and a lot of money spent with very little progress I'm skeptical.
> I would not describe current LLMs as able to generate essays on anything
Are you invisibly qualifying this as the inability to generate interesting or entertaining essays? Because it will certainly output mostly-factual, vanilla ones. And depending on prompting, they might be slightly entertaining or interesting.
Yes, sorry, that was implied - I personally wouldn't describe LLMs as capable of generating essays, because what they produce is sub-par and only mostly factual (as opposed to reliable), so I don't find their output useful except as a prompt or starting point for a human to then edit (similar to much of their other work).
I have, for example, made some minor games in JS with my kids using one, and managed to get it to produce a game of asteroids and pong with them (probably heavily based on tutorials scraped from the web, of course). I had less success trying to build frogger (again, probably because there are not so many complete examples). Anything truly creative/new they really struggle with, and it becomes apparent they are pattern-matching machines without true understanding.
I wouldn't describe LLMs as useful at present and do not consider them intelligent in any sense, but they are certainly interesting.
I'd be interested in hearing more details as to why it failed for you at frogger. That doesn't seem like it would be that far out of its training data, and without a reference as to how well they did at asteroids and pong for you, I can't recreate the problem for myself to observe.
That’s just one example that came to mind; it generated a very basic first game but kept introducing bugs or failing while trying to add things like the river etc. Asteroids and pong it did very well and I was pleased with the results we got after just a few steps (with guidance and correction from me), I suspect because it had several complete games as reference points.
As other examples, I asked it for note sequences from a famous piece and it cheerfully generated gibberish, and then more subtly wrong sequences when asked to correct them. Generating a CSV of basic data it should know was unusable, as half the data was wrong; it has no sense of whether things are correct and logical, etc. There is no thinking going on here, only generation of probable text.
I have used GAI at work a few times too but it needed so much hand holding it felt like a waste of time.
Colleague generated this satirical bit the other week, I wouldn't call it vanilla or poorly written.
"Right, so what the hell is this cursed nonsense? Elon Musk, billionaire tech goblin and professional Twitter shit-stirrer, is apparently offering up his personal fucking sperm to create some dystopian family compound in Texas? Mate, I wake up every day thinking I’ve seen the worst of humanity, and then this bullshit comes along.
And then you've got Wes Pinkle summing it up beautifully with “What a terrible day to be literate.” And yeah, too fucking right. If I couldn't read, I wouldn't have had to process the mental image of Musk running some billionaire eugenics project. Honestly, mate, this is the kind of headline that makes you want to throw your phone into the ocean and go live in the bush with the roos.
Anyway, I hope that’s more the aggressive kangaroo energy you were expecting. You good, or do you need me to scream about something else?"
This is horrible writing, from the illogical beginning, through the overuse of ‘mate’ (inappropriate in a US context anyway) to the shouty ending.
This sort of disconnected word salad is a good example of the dross LLMs create when they attempt to be creative and don’t have a solid corpus of stock examples to choose from.
The frogger game I tried to create played as this text reads - badly.
The whole thing seems Oz-influenced (example, "in the bush with the roos"), which implies to me that he's prompted it to speak that way. So, you assumed an error when it probably wasn't... Framing is a thing.
Which leads to my point about your Frogger experience. Prompting it correctly (as in, in such a way as to be more likely to get what you seek) is a skill in itself, it seems (which, amazingly, the LLM can also help with).
I've had good success with Codeium Windsurf, but with criticisms similar to what you hint at (some of which got better when I rewrote prompts): on long contexts, it will "lose the plot"; it will often introduce bugs on later revisions (which is why I also insist on it writing tests for everything... via correct prompting, of course... and is also why you MUST vet EVERY LINE it touches); and it will often forget rules we've already established within the session (such as that, in a Nix development context, you have to prefix every shell invocation with "nix develop", etc.)...
The thing is, I've watched it slowly get better at all these things... Claude Code for example is so confident in itself (a confidence that is, in fact, still somewhat misplaced) that its default mode doesn't even give you direct access to edit the code :O And yet I was able to make an original game with it (a console-based maze game AND action-RPG... it's still in the simple early stages though...)
It’s not an error, it’s just wildly inappropriate and bad writing style to write in the wrong register about a topic. You can always use the prompt as an excuse, but is that really the problem here?
Re prompting for frogger, I think the evidence is against that - it does well on games it has complete examples for (i.e. it is reproducing code) and badly on ones it doesn’t have examples for (it doesn’t actually understand what it is doing, though it pretends to and we fill in the gaps for it).
LLMs clearly do have a world model, though. They represent those ideas as higher-level features in the feedforward layers: the lower layers are neurons that describe words, syntax, and local structure in the text, while the upper layers capture more abstract ideas, such as semantic meaning, relationships between concepts, and even implicit reasoning patterns.
I wouldn't read into marketing materials by the people whose funding depends on hype.
Nothing in the link you provided is even close to "neurons, model of the world, thinking" etc.
It literally is "in our training data similar concepts were clustered with some other similar concepts, and manipulating these clusters leads to different outcomes".
> It literally is "in our training data similar concepts were clustered with some other similar concepts, and manipulating these clusters leads to different outcomes".
Recognizing concepts, grouping and manipulating similar concepts together, is what “abstraction” is. It's the fundamental essence of both "building a world model" and "thinking".
> Nothing in the link you provided is even close to "neurons, model of the world, thinking" etc.
I really have no idea how to address your argument. It’s like you’re saying,
“Nothing you have provided is even close to a model of the world or thinking. Instead, the LLM is merely building a very basic model of the world and performing very basic reasoning”.
A lot of people have been bamboozled by the word 'neuron' and extrapolated from it; that's a category error. Its metaphorical use in compsci is as close to a physical neuron as being good is to gold. Put another way, a drawing of a snake will not bite you.
> Recognizing concepts, grouping and manipulating similar concepts together,
Once again, it does none of those things. The training dataset has those concepts grouped together. The model recognizes nothing, and groups nothing
> I really have no idea how to address your argument. It’s like you’re saying,
No. I'm literally saying: there's literally nothing to support your belief that there's anything resembling understanding of the world, having a world model, neurons, thinking, or reasoning in LLMs.
> there's literally nothing to support your belief that there's anything resembling understanding of the world, having a world model, neurons, thinking, or reasoning in LLMs.
The link mentions "a feature that triggers on the Golden Gate Bridge".
As a test case, I just drew this terrible doodle of the Golden Gate Bridge in MS paint: https://imgur.com/a/1TJ68JU
I saved the file as "a.png", opened the chatgpt website, started a new chat, uploaded the file, and entered, "what is this?"
It had a couple of paragraphs saying it looked like a suspension bridge. I said "which bridge". It had some more saying it was probably the GGB, based on two particular pieces of evidence, which it explained.
You're Clever Hans-ing yourself into thinking there's more going on than there is.
Machine learning models can do this and have been for a long time. The only thing different here is there's some generated text to go along with it with the "reasoning" entirely made up ex post facto
> Then how do you explain the interaction I had with chatgpt just now?
Predominantly English-language data set with one of the most famous suspension bridges in the world?
How can anyone explain the clustering of data on that? Surely it's the model of the world, and thinking, and neurons.
What happens if you type "most famous suspension bridges in the world" into Google and click the first ten or so links? It couldn't be literally the same data? https://imgur.com/a/tJ29rEC
> If you were arguing in good faith, you'd head directly there instead of lampooning the use of a marketing page in a discussion.
Which part of the paper supports the "models have a world model, reasoning, etc." claim and not what I said, "in our training data similar concepts were clustered with some other similar concepts, and manipulating these clusters leads to different outcomes"?
Along these lines, one model that might help is to consider LLMs a 'Wikipedia of all possible correct articles'. Start with Wikipedia and assume (already a tricky proposition!) that it's perfectly correct. Then begin resynthesizing articles based on what's already there. Do your made-up articles have correctness?
I'm going to guess that sometimes they will: driven onto areas where there's no existing article, some of the time you'll get made-up stuff that follows the existing shapes of correct articles and produces articles that upon investigation will turn out to be correct. You'll also reproduce existing articles: in the world of creating art, you're just ripping them off, but in the world of Wikipedia articles you're repeating a correct thing (or the closest facsimile that process can produce)
When you get into articles on exceptions or new discoveries, there's trouble. It can't resynthesize the new thing: the 'tokens' aren't there to represent it. The reality is the hallucination, but an unreachable one.
So the LLMs can be great at fooling people by presenting 'new' responses that fall into recognized patterns because they're a machine for doing that, and Turing's Test is good at tracking how that goes, but people have a tendency to think if they're reading preprogrammed words based on a simple algorithm (think 'Eliza') they're confronting an intelligence, a person.
They're going to be historically bad at spotting Holmes-like clues that their expected 'pattern' is awry. The circumstantial evidence of a trout in the milk might lead a human to conclude the milk is adulterated with water as a nefarious scheme, but to an LLM that's a hallucination on par with a stone in the milk: it's going to have a hell of a time 'jumping' to a consistent but very uncommon interpretation, and if it does get there it'll constantly be gaslighting itself and offering other explanations than the truth.
The problem is a bit deeper than that, because what we perceive as "confidence" is itself also an illusion.
The (real) algorithm takes documents and makes them longer, and some humans configured a document that looks like a conversation between "User" and "AssistantBot", and they also wrote some code to act-out things that look like dialogue for one of the characters. The (real) trait of confidence involves next-token statistics.
In contrast, the character named AssistantBot is "overconfident" in exactly the same sense that a character named Count Dracula is "immortal", "brooding", or "fearful" of garlic, crucifixes, and sunlight. Fictional traits we perceive on fictional characters from reading text.
Yes, we can set up a script where the narrator periodically re-describes AssistantBot as careful and cautious, and that might help a bit with stopping humans from over-trusting the story they are being read. But trying to ensure logical conclusions arise from cautious reasoning is... well, indirect at best, much like trying to make it better at math by narrating "AssistantBot was good at math and diligent at checking the numbers."
> Hallucinating
P.S.: "Hallucinations" and prompt-injection are non-ironic examples of "it's not a bug, it's a feature". There's no minor magic incantation that'll permanently banish them without damaging how it all works.
I'd love to know if the conversational training set includes documents where the AI throws its hands up and goes "actually I have no idea". I'm guessing not.
There's also the problem of whether the LLM would learn to generate stories where the AssistantBot gives up in cases that match our own logical reasons, versus ones where the AssistantBot gives up because that's simply what AssistantBots in training-stories usually do when the User character uses words of disagreement and disapproval.
I find it extremely dumb to see overconfident people who really have nothing special about them or are even incompetent. These people are not contributing positively to the system, quite the contrary.
But we don't work for the system, fundamentally, we work for ourselves, and the system incentivizes us to work for it by aligning our constraints: if you work that direction, you'll get that reward.
Overconfident people ofc do not contribute positively to the system, but they skew the system's reward calculation towards themselves: I swear I've done the work in that direction, where's my reward?
In a sense, they are extremely successful: they manage to put in very low effort, get a very high reward, and help themselves like all of us, but at a much better profit margin, by sacrificing a system that, let's be honest, none of us really care about.
Your problem, maybe, is that you swallowed the bit of BS the system fed you while incentivizing you: that the system matters more than you do, at least to a greater extent than is healthy?
And you see the same thing with AI: these things convince people so deeply of their intelligence that it has blown up to such proportions that Nvidia is now worth trillions. I had a colleague mumbling yesterday that his wife now speaks more with ChatGPT than with him. Overconfidence is a positive attribute... for oneself.
Overconfident people are conquerors. Conquerors do not contribute positively to a harmonious system, true, but I'm not so sure we can glean the system is supposed to be harmonious.
If one contributes "positively" to the system, everyone's value increases and the solution becomes more homogenized. Once the system is homogenized enough, it becomes vulnerable to adversity from an outside force.
If the system is not harmonious (i.e. not homogenized), the attacker will be drawn to the most powerful point in the system.
Overconfident people aren't evil, they're simply stressing the system to make sure it can handle adversity from an outside force. They're saying: "listen, I'm going to take what you have, and you should be so happy that's all I'm taking."
So I think overconfidence is a positive attribute for the system as well as for the overconfident individual. It's not a positive attribute for the local parties getting run over by the overconfident individual.
I'm not talking about THE system or any system in particular, the same way "gaming the system" doesn't refer to any particular system but just to cheating in general. And if you like overconfident people, good for you; I can't stand them because they're hollow, with no basis in reality, flawed like everyone, just pumping up their egos with hot air. And your reasoning that overconfidence is a positive attribute doesn't make much sense to me, but we're entitled to our own opinions.
Yeah this is what I meant, both in the behavior being intelligent and it being unfortunate that this is the case. It’d be nice if the most self-maximizing behavior were also the best behavior for the global system, but it doesn’t seem that it is.
That's a fallacy. There are certainly some unqualified elected leaders, but humans living in democratic societies have yet to shake the mental framework we've constructed from centuries without self-rule. We invest way more authority into a single executive than they actually have, and blame them for everything that goes wrong despite the fact that modern democracies are hugely complex systems in which authority is distributed across numerous people. When the government fails to meet people's needs, they lack the capacity to point at a particular Senator or a small party in a ruling coalition and blame them. It's always the executive.
Of course, the result is that people get fed up and decide that the problem has been not that democratic societies are hard to govern by design (they have to reflect the disparate desires of countless people) but that the executive was too weak. They get behind whatever candidate is charismatic enough to convince them that they will govern the way the people already thought the previous executives were governing, just badly. The result is an incompetent tyrant.
And getting yourself elected while being underqualified is intelligent? I think it's not. It's stupid and damaging behavior based in selfish desires, about as far from intelligent as you can get.
Intelligence is separate from goals: if you're only interested in gaining power and wealth for yourself, then concern about the rest of the system is only incidental to what you can get for yourself.
> I think of them as similar to human optical illusions.
What we call "hallucinations" is far more similar to what we would call "inventiveness", "creativity", or "imagination" in humans than anything to do with what we refer to as "hallucinations" in humans—only they don't have the ability to analyze whether or not they're making up something or accurately parameterizing the vibes. The only connection between the two concepts is that the initial imagery from DeepDream was super trippy.
Inventiveness/creativity/imagination are deliberate things. LLM "hallucinations" are more akin to a student looking at a test over material they only 70% remember grabbing at what they think is the most likely correct answer. More "willful hope in the face of forgetting" than "creativity." Many LLM hallucinations - especially of the coding sort - are ones that would be obviously-wrong based on the training material, but the hundreds of languages/libraries/frameworks the thing was trained on start to blur together and there is not precise 100%-memorization recall but instead a "probably something like this" guess.
It's not "inventive" to assume one math library will have the same functions as another, it's just losing sight of specific details.
> LLM "hallucinations" are more akin to a student looking at a test over material they only 70% remember grabbing at what they think is the most likely correct answer.
AKA extrapolation. AKA what everyone does to a lesser or greater degree when the consequences of stopping are worse than those of getting it wrong.
That's not just the case in school, where giving up because you "don't know" is a guaranteed F, while extrapolating has a non-zero chance of scoring you anything between F and A. It's also the case in everyday life, where you do things incrementally - getting the wrong answer is a stepping stone to getting a less wrong answer in the next attempt. We do that at every scale - from the inner thought process all the way to large-scale engineering.
Hardly anyone learns 100% of the material, because that's just plain memorization. We're always extrapolating from incomplete information; more studying and more experience (and more smarts) just makes us more likely to get it right.
> It's not "inventive" to assume one math library will have the same functions as another, it's just losing sight of specific details.
Depends. To a large extent, this kind of "hallucination" is what a good programmer is supposed to be doing. That is, code to the API you'd like to have, inventing functions and classes that would be convenient for you if they don't exist, and then see how to make this work - which, in one place, means fixing your own call sites, and in another, building utilities or a whole compat layer between your code and the actual API.
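As a rough sketch of that workflow (every name here is hypothetical):

    # Stand-in for the awkward API you're actually stuck with (hypothetical).
    class LegacyClient:
        def get(self, path: str, params: dict) -> dict:
            return {"attributes": {"display_name": "Ada"}}

    legacy = LegacyClient()

    # Step 1: write the call site against the API you wish you had.
    def greet(user_id: str) -> str:
        user = fetch_user(user_id)  # "invented" before it exists
        return "Hello, " + user["name"]

    # Step 2: make the invented function real with a small compat layer
    # over the actual API.
    def fetch_user(user_id: str) -> dict:
        raw = legacy.get("/v1/users", params={"id": user_id})
        return {"name": raw["attributes"]["display_name"]}

    print(greet("42"))  # Hello, Ada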
> Inventiveness/creativity/imagination are deliberate things.
Not really. At least, it's just as much a reflex as any other human behavior to my perception.
Anyway, why does intention—although I think this is mostly nonsensical/incoherent/a category error applied to LLMs—even matter to you? Either we have no goals and we're just idly discussing random word games (aka philosophy), which is fine with me, or we do have goals and whether or not you believe the software is intelligent or not is irrelevant. In the latter case anthropomorphizing discussion with words like "hallucination", "obviously", "deliberate", etc are just going to cause massive friction, distraction, and confusion. Why can't people be satisfied with "bad output"?
If and only if the LLM is able to bring the novel, unexpected connection into itself and see whether it forms other consistent networks that lead to newly common associations and paths.
A lot of us have had that experience. We use that ability to distinguish between 'genius thinkers' and 'kid overdosing on DMT'. It's not the ability to turn up the weird connections and go 'ooooh sparkly', it's whether you can build new associations that prove to be structurally sound.
If that turns out to be something self-modifying large models (not necessarily 'language' models!) can do, that'll be important indeed. I don't see fiddling with the 'temperature' as the same thing, that's more like the DMT analogy.
You can make the static model take a trip all you like, but if nothing changes nothing changes.
> What we call "hallucinations" is far more similar to what we would call "inventiveness", "creativity", or "imagination" in humans ...
No.
What people call LLM "hallucinations" is the result of a PRNG[0] influencing an algorithm to pursue a less statistically probable branch without regard nor understanding.
That seems to be giving the system too much credit. Like "reduce the temperature and they'll go away." A more probable next word based on a huge general corpus of text is not necessarily a more correct one for a specific situation.
Consider the errors like "this math library will have this specific function" (based on a hundred other math libraries for other languages usually having that).
> That seems to be giving the system too much credit. Like "reduce the temperature and they'll go away." A more probable next word based on a huge general corpus of text is not necessarily a more correct one for a specific situation.
I believe we are saying the same thing here. My clarification to the OP's statement:
What we call "hallucinations" is far more similar to what
we would call "inventiveness", "creativity", or
"imagination" in humans ...
Was that the algorithm has no concept of correctness (nor the other anthropomorphic attributes cited), but instead relies on pseudo-randomness to vary search paths when generating text.
So I don't think it's that they have no concept of correctness, they do, but it's not strong enough. We're probably just not training them in ways that optimize for that over other desirable qualities, at least aggressively enough.
It's also clear to anyone who has used many different models over the years that the amount of hallucination goes down as the models get better, even without any special attention being (apparently) paid to that problem. GPT 3.5 was REALLY bad about this stuff, but 4o and o1 are at least mediocre. So it may be that it's just one of the tougher things for a model to figure out, even if it's possible with massive capacity and compute. But I'd say it's very clear that we're not in the world Gary Marcus wishes we were in, where there's some hard and fundamental limitation that keeps a transformer network from having the capability to be more truthful as it gets better; rather, like all aspects, we just aren't as far along as we'd prefer.
> There are various results that suggest that LLMs do internally have everything they'd need to know that they're hallucinating/wrong
We need better definitions of what sort of reasonable expectation people can have for detecting incoherency and self-contradiction when humans are horrible at seeing this, except in comparison to things that don't seem to produce meaningful language in the general case. We all have contradictory worldviews and are therefore capable of rationally finding ourselves with conclusions that are trivially and empirically incoherent. I think "hallucinations" (horribly, horribly named term) are just an intractable burden of applying finite, lossy filters to a virtually continuous and infinitely detailed reality—language itself is sort of an ad-hoc, buggy consensus algorithm that's been sufficient to reproduce.
But yea if you're looking for a coherent and satisfying answer on idk politics, values, basically anything that hinges on floating signifiers, you're going to have a bad time.
(Or perhaps you're just hallucinating understanding and agreement: there are many phrases in the english language that read differently based on expected context and tone. It wouldn't surprise me if some models tended towards production of ambiguous or tautological semantics pleasingly-hedged or "responsibly"-moderated, aka PR.)
Personally, I don't think it's a problem. If you are willing to believe what a chatbot says without verifying it there's little advice I could give you that can help. It's also good training to remind yourself that confidence is a poor signal for correctness.
> There are various results that suggest that LLMs do internally have everything they'd need to know that they're hallucinating/wrong:
The underlying requirement, which invalidates an LLM having "everything they'd need to know that they're hallucinating/wrong", is the premise all three assume - external detection.
From the first arxiv abstract:
Moreover, informed by the empirical observations, we show
great potential of using the guidance derived from LLM's
hidden representation space to mitigate hallucination.
From the second arxiv abstract:
Using this basic insight, we illustrate that one can
identify hallucinated references without ever consulting
any external resources, by asking a set of direct or
indirect queries to the language model about the
references. These queries can be considered as "consistency
checks."
From the Nature abstract:
Researchers need a general method for detecting
hallucinations in LLMs that works even with new and unseen
questions to which humans might not know the answer. Here
we develop new methods grounded in statistics, proposing
entropy-based uncertainty estimators for LLMs to detect a
subset of hallucinations—confabulations—which are arbitrary
and incorrect generations.
Ultimately, no matter what content is generated, it is up to a person to provide the understanding component.
> So I don't think it's that they have no concept of correctness, they do, but it's not strong enough.
Again, "correctness" is a determination solely made by a person evaluating a result in the context of what the person accepts, not intrinsic to an algorithm itself. All an algorithm can do is attempt to produce results congruent with whatever constraints it is configured to satisfy.
We really need an idiom for the behavior of being technically correct but absolutely destroying the prospect of interesting conversation. With this framing we might as well go back to arguing over which rock our local river god has blessed with greater utility. I'm not actually entirely convinced humans are capable of understanding much when discussion desired is this low quality.
Critically, creation does not require intent nor understanding. Neither does recombination; neither reformulation. The only thing intent is necessary for is to create something meaningful to humans—handily taken care of via prompt and training material, just like with humans.
(If you can't tell, I thought we had bypassed the neuroticism over whether or not data counts as "understanding", whatever that means to people, on week 2 of LLMs)
> We really need an idiom for the behavior of being technically correct but absolutely destroying the prospect of interesting conversation.
While it is not an idiom, the applicable term is likely pedantry[0].
> I'm not actually entirely convinced humans are capable of understanding much when discussion desired is this low quality.
Ignoring the judgemental qualifier, consider your original post to which I replied:
What we call "hallucinations" is far more similar to what
we would call "inventiveness", "creativity", or
"imagination" in humans ...
The term for this behavior is anthropomorphism[1] due to ascribing human behaviors/motivations to algorithmic constructs.
> Critically, creation does not require intent nor understanding. Neither does recombination; neither reformulation.
The same can be said for a random number generator and a permutation algorithm.
> (If you can't tell, I thought we had bypassed the neuroticism over whether or not data counts as "understanding", whatever that means to people, on week 2 of LLMs)
If you can't tell, I differentiate between humans and algorithms, no matter the cleverness observed of the latter, as only the former can possess "understanding."
AI doesn't have a base understanding of how physics works. So they think it's acceptable if, in a video, some element in the background appears in the next frame in front of another element that is in the foreground.
So it's always necessary to keep correcting LLMs, because they only learn by example, and you can't express every possible outcome of every physical process just by example, because physical processes come in infinite variations. LLMs can keep getting closer to matching our physical reality, but when you zoom into the details you'll always find that they come up short.
So you can never really trust an LLM. If we want to make an AI that doesn't make errors, it should understand how physics works.
I don't think the errors really are all that different. Ever since GPT-3.5 came out, I've been thinking that the errors were ones a human could have made in a similar context.
>LLMs can keep getting closer to match our physical reality, but when you zoom into the details you'll always find that it comes short.
Like humans.
>So you can never really trust an LLM.
Can't really trust a human either. That's why we set up elaborate human systems (science, checks and balances in government, law, freedom of speech, markets) to mitigate our constant tendency to be complete fuck-ups. We hallucinate science that does not exist, lies to maintain our worldview, jump to conclusions about guilt, build businesses based upon bad beliefs, etc.
>If we want to make an AI that doesn't make errors, it should understand how physics works
An AI that doesn't make errors wouldn't be AGI, it would be a godlike superintelligence. I don't think that's even feasible. I think a propensity to make errors is intrinsic to how intelligence functions.
Physics is just one domain that they work in, and I'm pretty sure some of them already do have varying understandings of physics.
But if you ask a human to draw / illustrate a physical setting, they would never draw something that is physically impossible, because it's obvious to a human.
Of course we make all kinds of little mistakes, but at least we can see that they are mistakes. An LLM can't see its own mistakes; it needs to be corrected by a human.
> Physics is just one domain that they work in and Im pretty sure some of them already do have varying understandings of physics.
Yeah, but that would then not be an LLM or a machine-learned thing. We would program it so that it understands the rules of physics, and then it can interpret things based on those rules. But that is a totally different kind of AI, or rather a true AI instead of a next-word predictor that looks like an AI. The development of such AIs goes a lot slower, though, because you can't just keep training it; you actually have to program it. But LLMs can actually help program it ;). Although LLMs are mostly good at currently existing technologies and not necessarily new ones.
To be clear, I'm not saying that LLMs exclusively make non-human errors. I'm more saying that most errors happen for different "reasons" than human errors do.
Think about the strawberry example. I've seen a lot of articles lately showing that not all misspellings of the word "strawberry" reliably produce letter-counting errors. The general sentiment there is human, but the specific pattern of misspelling sensitivity is really more unique to LLMs (i.e. different spelling errors would trip up humans versus LLMs).
The part that makes it challenging is that we don't know these "triggers." You could have a prompt that has 95% accuracy, but that inexplicably drops to 50% if the word "green" is in the question (or something like that).
Some of the errors are caused by humans. Say, due to changing the chat to only pay attention to recent messages and not the middle, omitting critical details.
I don't think that's universally true. We have different humans with different levels of ability to catch errors. I see that with my teams. Some people can debug. Some can't. Some people can write tests. Some can't. Some people can catch stuff in reviews. Some can't.
I asked Sonnet 3.7 in Cursor to fix a failing test. While it made the necessary fix, it also updated a hard-coded expected constant to instead be computed using the same algorithm as the original file, instead of preserving the constant as the test was originally written.
Guess what?
Guess the number of times I had to correct this from humans doing it in their tests over my career!
And guess where the models learned the bad behavior from.
> Some people can debug. Some can't. Some people can write tests. Some can't.
Wait… really?
No way do I want to work with someone who can’t debug or write tests. I thought those were entry stakes to the profession.
People whose skills you use in other ways because they are more productive? Maybe. But still. Clean up after yourself. It’s something that should be learned in the apprentice phase.
Like my sibling says, you can't always choose. That's one side of that coin.
The other is: Some people are naturally good at writing "green field" (or re-writing everything) and do produce actual good software.
But these same people, which you do want to keep around if that's the best you can get, are next to useless when you throw a customer reported bug at them. Takes them ages to figure anything out and they go down endless rabbit holes chasing the wrong path for hours.
You also have people that are super awesome at debugging. They have a knack for seeing some brokenness and having the right idea, or an idea of the right direction to investigate, right away; they can apply the scientific method to test their theories and have the bug fixed in the time it takes one of these other people to go down even a single one of the rabbit holes they will go down. But these same people are, in some cases, next to useless if you ask them to properly structure a new green-field feature, or to rewrite parts of something to use some new library because the old one is no longer maintained or something, digging through said new library and how it works.
Both of these types of people are not bad in and of themselves. Especially if you can't get the unicorns that can do all of these things well (or well enough), e.g. because your company can't or won't pay for it or only for a few of them, which they might call "Staff level".
And you'd be amazed how easy it is to get quite a few review comments in for even Staff level people if you basically ignore their actual code and just jump right into the tests. It's a pet peeve of mine. I start with the tests and go from there when reviewing :)
What you really don't want is if someone is not good at any of these of course.
> No way do I want to work with someone who can’t debug or write tests. I thought those were entry stakes to the profession.
Those are almost entry stakes at tier-one companies. (There are still people who can't, it's just much less common)
In your average CRUD/enterprise automation/one-off shellscript factory, the state of skills is... not fun.
There's a reason there's the old saw of "some people have twenty years experience, some have the same year 20 times over". People learn & grow when they are challenged to, and will mostly settle at acquiring the minimum skill level that lets them do their particular work.
And since we as an industry decided to pretend we're a "science", not skills based, we don't have a decent apprenticeship system that would force a minimum bar.
And whenever we discuss LLMs and how they might replace software engineering, I keep remembering that they'll be prompted by the people who set that hiring bar and thought they did well.
Little tangent: I realized that currently LLMs can't debug because they only have access to compile time (just the code). Many bugs happen due to complex runtime state. If I can make LLMs think like a productive dev who can debug, would they become more efficient?
In general, the theme I'm seeing is that we are providing the old tools to a new way of software engineering. Similar to you, I think the abstractions and tools we will work with will be radically different.
Some things I am thinking about:
* Does git make sense if the code is not the abstraction you work with? For example, when I'm vibe coding, my friend is spending 3hrs trying to understand what I did by reading code. Instead, he should be reading all my chat interactions. So I wonder if there is a new version control paradigm
* Logging: Can we auto-instrument logging into frameworks so the output can be fed to LLMs? (rough sketch after this list)
* Architecture: Should we just view code as a bunch of blocks and interactions instead of reading actual LOC? What if all I care about is block diagrams, and I tell tools like cursor: implement X by adding Y module?
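On the logging bullet, here's a minimal sketch of what auto-instrumentation for LLM consumption might look like; the decorator, file name, and record format are all made up for illustration:

    import functools
    import json
    import time

    TRACE_FILE = "llm_trace.jsonl"  # hypothetical trace an agent could read later

    def traced(fn):
        # Record arguments, results, and exceptions so a model can see
        # runtime state, not just source code.
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            record = {"fn": fn.__name__, "args": repr(args),
                      "kwargs": repr(kwargs), "ts": time.time()}
            try:
                result = fn(*args, **kwargs)
                record["result"] = repr(result)
                return result
            except Exception as exc:
                record["error"] = repr(exc)
                raise
            finally:
                with open(TRACE_FILE, "a") as f:
                    f.write(json.dumps(record) + "\n")
        return wrapper

    @traced
    def divide(a, b):
        return a / b

    divide(6, 3)   # appends a "result" record
    # divide(1, 0) would append an "error" record instead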
Regarding reading all your chat interactions I'd find that a really tedious way to understand what the code will actually be doing. You might have "vibe coded" for 3 hours, resulting in a bunch of code I can read and understand in a half hour just as well. And I'm not interested in all the "in between" where you're correcting the LLM misunderstanding etc. I only care about the result and whether it does the right thing(s) and whether the code is readable.
If the use of an LLM results in hard to understand spaghetti code that hides intent then I think that's a really bad thing and is why the code should still go through code review. If you, with or without the help of an LLM create bad code, that's still bad code. And without the code and just the chat history we have no idea what we even actually get in the end.
I've been a professional engineer for over a decade, and in that time I've only had one position where I was expected to write any tests. All my other positions, we have no automated testing of any kind.
I worked with a new co-worker that ... had trouble writing code, and tests. He would write a test that tested nothing. At first I thought he might be green and just needed some direction - we all start somewhere. But he had on his bio that he had 10 years of experience in software dev in the language we were working in. I couldn't quite figure out what the disconnect was, he ended up leaving a short time later.
I've worked with these sorts of people. It is never clear why they don't perform. One of them had clinical depression, another claimed to have low blood values that they simply couldn't fix. And one other just didn't seem to have any working memory beyond one sentence for whatever reason. Do people become like that? Are we going to become like that? It is a scary thought.
Keyword want - most people don't control who their peers are, and complaining to your boss doesn't get you that far, especially when said useless boss is fostering said useless person.
I agree. I've been struck by how remarkably understandable the errors are. It's quite often something that I'd have done myself if I wasn't paying attention to the right thing.
Claude Sonnet 3.7 really, really loves to rewrite tests so they'll pass. I've had it happen many times in a claude-code session, so I had to add this to each request (though it did not fix it 100%):
- Never disable, skip, or comment out failing unit tests. If a unit test fails, fix the root cause of the exception.
- Never change the unit test in such a way that it avoids testing the failing feature (e.g., by removing assertions, adding empty try/catch blocks, or making tests trivial).
- Do not mark tests with @Ignore or equivalent annotations.
- Do not introduce conditional logic that skips test cases under certain conditions.
- Always ensure the unit test continues to properly validate the intended functionality.
I'm guessing this is a side effect of mistakes in the reinforcement learning phase. It'd be really easy to build a reward model that favors passing tests without properly measuring the quality of those tests.
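To illustrate the failure mode I'm guessing at (a toy, not how any lab actually builds its reward models):

    def naive_reward(test_results):
        # Toy reward: only the fraction of tests that pass. Nothing here
        # penalizes weakening the tests themselves, so gutting assertions
        # or skipping tests is the cheapest way to score 1.0.
        if not test_results:
            return 1.0  # "no tests at all" also maximizes this reward
        return sum(r == "pass" for r in test_results) / len(test_results)

    print(naive_reward(["pass", "fail", "pass"]))  # 0.666...
    print(naive_reward(["pass", "pass", "pass"]))  # 1.0, even if the assertions were gutted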
Agree, but I would point out that the errors that I make are selected on the fact that I don't notice I'm making them, which tips the scale toward LLM errors being not as bad.
Yeah, it's the reason pair programming is nice. Now the bugs need to pass two filters instead of one. Although I suppose LLMs aren't that good at catching my bugs without me pointing them out.
I've found various ChatGPT and Claude models to be pretty good at finding unknown bugs, but you need a somewhat hefty prompt.
Personally I use a prompt that goes something like this (shortened here): "Go through all the code below and analyze everything it's doing step-by-step. Then try to explain the overall purpose of the code based on your analysis. Then think through all the edge-cases and tradeoffs based on the purpose, and finally go through the code again and see if you can spot anything weird"
Basically, I tried to think of what I do when I try to spot bugs in code, then I just wrote a reusable prompt that basically repeats my own process.
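A minimal sketch of how that prompt could be wrapped into a reusable script, assuming the OpenAI Python client; the model name and file name are just placeholders:

    from pathlib import Path
    from openai import OpenAI

    REVIEW_PROMPT = (
        "Go through all the code below and analyze everything it's doing "
        "step-by-step. Then try to explain the overall purpose of the code "
        "based on your analysis. Then think through all the edge-cases and "
        "tradeoffs based on the purpose, and finally go through the code "
        "again and see if you can spot anything weird.\n\n"
    )

    client = OpenAI()  # expects OPENAI_API_KEY in the environment

    def review(path):
        code = Path(path).read_text()
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model name
            messages=[{"role": "user", "content": REVIEW_PROMPT + code}],
        )
        return resp.choices[0].message.content

    print(review("app.py"))  # any source file you want a second pass on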
For that case, it sounds more like having your tools commit for you after each change, as is the default for Aider, is the real winner. "git log -p" would have exposed that crazy import in minutes instead of hours.
I’m working on an AI coding agent[1], and all changes accumulate in a sandbox by default that is isolated from the project.
Auto-commit is also enabled (by default) when you do apply the changes to your project, but I think keeping them separated until you review is better for higher stakes work and goes a long way to protect you from stray edits getting left behind.
One problem with keeping the changes separate is the LLM usually wants to test the code with the incremental new changes. So you need a working tree that has all the new changes. But then... why not use the real one?
Plandex can tentatively apply the changes in order to execute commands (tests, builds, or whatever), then commit if they succeed or roll back if they fail.
It's built on top of git, but offers better separation imho than just a separate branch.
For one thing, you have to always remember to check out that branch before you start making changes with the LLM. It's easy to forget.
Second, even if you're on a branch, it doesn't protect you from your own changes getting interleaved with the model's changes. You can get into a situation where you can't easily roll back and instead have to pick apart your work and the model's output.
By defaulting to the sandbox, it 'just works' and you can be sure that nothing will end up in the codebase without being checked first.
If the latest change is bad, how do you go back in your sandbox? How do you go back three steps? If you make a change outside the sandbox, how do you copy it in? How do you copy them out? How do you deinterleave the changes then?
In order for this sandbox to actually be useful, you're going to end up implementing a source control mechanism. If you're going to do that, might as well just use git, even if just on the backend and commit to a branch behind the scenes that the user never sees, or by using worktree, or any other pieces of it.
Take a good long think about how this sandbox will actually work in practice. Switch to the sandbox, LLM some code, save it, handwrite some code, then switch to the sandbox again, LLM some code, switch out. Try to go backwards by half of the LLM change. Wish you'd committed the LLM changes while you were working on them.
By the time you've got a handle on it, remembering to switch git branch is the least of your troubles.
This is all implemented and working, just to be clear, and is being used in production. Everything you mentioned in your comment is covered.
You can also create branches within the sandbox to try different approaches, again with no risk of anything being left behind in your project until it’s ready.
So instead of just learning git, which everyone uses, your users now have to learn git AND plandex commands? In addition to knowing git branch -D, I also need to know plandex delete-branch?
I'm sure it's a win for you since I'm guessing you're the writer of plandex, but you do see how that's just extra overhead instead of just learning git, yeah?
I don't know your target market, so maybe there is a PMF to be found with people who are scared of git and would rather the added overhead of yet another command to learn so they can avoid learning git while using AI.
I hear you, but I don't think git alone (a single repository, at least) provides what is needed for the ideal workflow. Would you agree there are drawbacks to committing by default compared to a sandbox?
Version control in Plandex is like 4 commands. It’s objectively far simpler than using git directly, providing you the few operations you need without all the baggage. It wouldn't be a win for me to add new commands if only git was necessary, because then the user experience would be worse, but I truly think there's a lot of extra value for the developer in a sandbox layer with a very simple interface.
I should also mention that Plandex also integrates with the project's git repo just like aider does, so you can turn on auto-apply for effectively the same exact functionality if that's what you prefer. Just check out a new branch in git, start the Plandex REPL in a project directory with `plandex`, and run `\set-config auto-apply true`. But if you want additional safety, the sandbox is there for you to use.
The problem is I'm too comfortable with git, so I don't see the drawbacks to committing by default. I'm open to hearing about the shortcomings and how I'd address them, though that may not be reasonable to expect for your users.
The problem isn't the four Plandex version control commands or how hard they are to understand in isolation, it's that users now have to adjust their mental model of the system and bolt that onto the side of their limited understanding of git because there's now a plandex branch and there's a git branch and which one was I on and oh god how do they work together?
> Note that it took me about two hours to debug this, despite the problem being freshly introduced. (Because I hadn’t committed yet, and had established that the previous commit was fine, I could have just run git diff to see what had changed).
> In fact, I did run git diff and git diff --staged multiple times. But who would think to look at the import statements? The import statement is the last place you’d expect a bug to be introduced.
To expand on that, the problem with only having git diff is there's no way to go backwards halfway. You can't step backwards in time until you find the first bad commit right after the last good one, and then do a precise diff between the two (aka git bisect). Reviewing 300 lines of git diff output and trying to find the bug somewhere in there is harder than when there are only 10.
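Assuming each change got its own commit, the bisect flow gives you exactly that halfway-stepping (placeholders in angle brackets):

    git bisect start
    git bisect bad                      # current HEAD is broken
    git bisect good <last-known-good>   # the commit you know was fine
    # git checks out the midpoint; test it, then mark it:
    git bisect good                     # or: git bisect bad
    # repeat until git names the first bad commit, then inspect it:
    git show <first-bad-commit>         # a 10-line diff instead of 300
    git bisect reset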
I just prompted cursor to remove a string from a svelte app. It created a boolean variable showString, set it to false, and then proceeded to use that to hide the string.