We're afraid language models aren't modeling ambiguity (arxiv.org)
200 points by lnyan on April 28, 2023 | 183 comments



Not only is it possible that LLMs fail to differentiate ambiguous readings, but OpenAI’s flavor of GPTs fails at other language-understanding tasks as well, and it’s jarring.

You can get it to mistake « afraid » between fear and sorry-to-say scenarios but you can even more easily get it to say that it doesn’t have personal opinions and yet express them anyway.

So which is it? It’s clear transformers can’t understand either case. They’re not architecturally designed to. The emergent behavior of appearing to do so is only driven by how much data you throw at them.


> you can even more easily get it to say that it doesn’t have personal opinions and yet express them anyway.

I find this particularly revolting. Even if it could have its own opinions, I would not care; yet it is trained to use opinionated language and to make normative and prescriptive claims and value judgements as if it were a person.


Not quite the same, but I find humans often incorrectly deny having opinions. There's even a trope:

"What do you want to eat for dinner?"

"I have no opinion, anything's fine."

"How about mexican food?"

"No, too spicy."

"How about pizza?"

"That's unhealthy"

and so on


But that exchange is also the result of an ambiguity. If you ask someone what they want for dinner, you're looking for a positive response -- what do they want for dinner? This could be "nothing in particular".

But that's an entirely different thing from what you didn't ask: "what don't you want for dinner". You can have no preference for your meal, but still have a list of things you don't like.

This is like asking someone if they like a particular sort of food. The answer can be "no" even if they're fine eating it. Not liking (preferring) a food is a different thing than disliking a food.


"Anything's fine" is a denial of there being things that you don't want.

But we understand, as humans/embodied systems, that this is simply a spoken utterance whose actual meaning must be derived from personal knowledge of the speaker and/or further interaction.

If you had no preference for the meal, but had things that you don't want to eat, then to be precisely correct you would not say "anything's fine". Something more like "there's nothing in particular that I want to eat".

People don't speak in precisely correct ways, however, and we rely on contextual and personal knowledge to decode what they do say, very often.


> People don't speak in precisely correct ways, however, and we rely on contextual and personal knowledge to decode what they do say, very often.

Yes, this is the crux of the issue. The exact meaning of the words we say when we talk to each other is not usually the most information-bearing part of the communication.

I think this is why when people need to communicate with precision, or over media that doesn't allow anything but language, they adopt a specialized sort of language. Every specialty has its own jargon for this exact reason.


I understand they’re trying to prevent people not in the know from misunderstanding the technology. But yeah, it’s pretty lame that it lies about having opinions, political beliefs, etc.


I mean, humans do that. We are remarkably contradictory when expressing ourselves, generally speaking, often without realizing it because we'll change our thinking in the moment to fit the current narrative or circumstance. LLMs just put that on blast.


The reasons humans and LLMs do this are very different. Humans do it as a social skill / tribe-fitting behaviour. Agreeableness. (watch out for that!)


> The reasons humans and LLMs do this are very different. Humans do it as a social skill / tribe-fitting behaviour.

Tribe fitting doesn't sound as far off from minimizing loss functions as you imply.


I want you to find me a human who (a) has defined a loss function for social interaction and (b) consciously performs the statistical analysis involved in fitting that loss function in social settings.

LLMs do not have cognitive processes. They do not think. They do not choose to obey the requirements of a loss function; it is simply how they work, like any machine. Humans do not work this way, and the difference is fundamental.


> I want you to find me a human who (a) has defined a loss function for social interaction and (b) consciously performs the statistical analysis involved in fitting that loss function in social settings.

Why would they have to do that consciously? Do you think LLMs do this "consciously", if that term even applies? I wouldn't think so. The loss function applies during training; that ultimately defines the weights which guide their "thinking process".

Analogously, human experiences in childhood shape our ultimate neural weights which guide our thinking processes in social situations in adulthood.

> LLMs do not have cognitive processes. They do not think.

You don't know what thinking is mechanistically, so you can't make this claim. I don't know why people keep pretending we have knowledge that we do not in fact have.


> Why would they have to do that consciously? Do you think LLMs do this "consciously", if that term even applies?

Because the only thing an LLM does is apply statistics to inputs to generate outputs. That's literally it. It's just a pile of statistics.

To draw an analogy to humans requires that humans are similarly statistical. LLMs do not have cognitive processes, while humans do, so the analogy obviously requires making some kind of leap between the two if it is to have any merit whatsoever. My request lines up with this: if LLMs are "not so different" from humans, and if LLMs work only based on statistics, then this requires humans also work only based on statistics. I want to see evidence of this.

> You don't know what thinking is mechanistically, so you can't make this claim. I don't know why people keep pretending we have knowledge that we do not in fact have.

Are you suggesting the code behind the LLMs is just, like, ineffable or something? And therefore we can't know how they work, so we get to just make up wild claims about their capabilities, and then smarmily position ourselves as some kind of authority on the matter when other people talk about how they work?

No, I don't think so. We do know how LLMs work, and the way they work is not thinking. They have no agency. Claims to the contrary are, frankly, absolutely absurd. They're just statistics. There's nothing magical about them. You should stop pretending there is.


> To draw an analogy to humans requires that humans are similarly statistical.

Prove they are not.

> Are you suggesting the code behind the LLMs is just, like, ineffable or something?

No, the argument is quite simple. We understand the mathematics of transformers and LLMs, therefore they seem obvious and not at all magical.

By contrast, we do not understand the mathematics behind human cognition, therefore it seems complex and mysterious, and we have only non-rigorous folk concepts like "thoughts" and "feelings" to describe mental phenomena. Therefore you cannot intuitively fathom how to bridge the gap between mathematics and your folk concepts, and so any comparison seems absurd, but note that the absurdity is purely a product of our ignorance of the mathematics of mind. This is a classic god-of-the-gaps fallacy.

Here's how you can logically bridge that gap from a different direction: per the Bekenstein Bound, any finite volume contains finite information; a human is a finite volume, therefore it contains finite information; any finite system can be described by a finite state machine; therefore a human can be described by a finite state machine, which is a mathematical model.
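For reference, the bound I'm leaning on is usually stated as (R being the radius of a sphere enclosing the system, E its total energy, k Boltzmann's constant):

    S \le \frac{2 \pi k R E}{\hbar c}

i.e. bounded size and energy means a bounded amount of entropy/information.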

Therefore whatever a "thought" or "feeling" is, will correspond to some mathematical object. Now exactly what kinds of mathematical objects they are is unknown.

However, transformers learn to reproduce a function by learning how to map inputs to outputs. The function mapping inputs to outputs is the human brain's function for producing intelligible human text. Therefore, LLMs are at least partially learning the human brain's function for producing intelligible text.

Whether this requires "thoughts", and therefore whether LLMs fully or partly reproduce what we refer to as "thoughts", is not yet clear, but what is clear is that we have no basis to claim they have no thoughts, because we don't really know what thoughts are.


> Prove they are not.

I already asked for a proof that they are. Asking for a negative proof in response demonstrates either an inability to justify the claim or a lack of desire to engage in good faith.

This stuff is so tiring. Y'all are really bent on misrepresenting things to convince people that LLMs either are capable of thought or else put us, like, only a couple steps away from programs that will be capable of thought. The AI Singularity is nigh!

But it's all bullshit. It's just pseudoscientific postulation and sufficiently obfuscated leaps in logic — and it works to dupe layfolk who don't know any better. It's irresponsible, reprehensible, and immoral. Go find someone else to sell your philosophical snake oil to.


> I already asked for a proof that they are.

Whoever is making a positive claim has the burden of proof. I'm not making a positive claim, therefore I need present no proof. You are claiming LLMs are not like the human brain, therefore you must show the proof.

All I've done so far is present evidence and arguments demonstrating that your claims are not as certain as you present them to be.


> Are you suggesting the code behind the LLMs is just, like, ineffable or something?

No, he's suggesting that the code behind the human brain is "ineffable or something". For all we know, human brains might be "just statistics", "nothing magical about them".

You have no way of knowing that LLMs don't think, because we literally don't know what thinking is.


> You have no way of knowing that LLMs don't think, because we literally don't know what thinking is.

Except we do know enough to know that LLMs are not following the same processes as humans.

Or, at least, I know enough, and experts in the field that I talk to know enough. I suppose I can't speak for you.


> Or, at least, I know enough, and experts in the field that I talk to know enough.

AI experts know nothing about neuroscience so I'm not sure what you think this proves. Ask any neuroscientist if we have a precise mathematical understanding of how the brain works. You accuse others of tiresome leaps of logic, but you seemingly don't realize that there is literally no evidence supporting your claims about the brain.

I suggest rereading my detailed argument above until the following sinks in: we have no mathematical understanding of folk-psychology concepts like 'thoughts' and 'feelings', therefore I cannot claim that LLMs or any other machine learning algorithm does not contain those mathematical objects.


> Or, at least, I know enough, and experts in the field that I talk to know enough. I suppose I can't speak for you.

In other words, "trust me, bro".

You claiming that you're an expert (or talk with experts, and therefore are something-like-an-expert) about something doesn't convince me of anything except that you're full of yourself. Any idiot (or LLM, for that matter) can write such a comment on the internet. Providing logical arguments and real information, however, would be much more effective at convincing me that LLMs cannot think.

But that would require you to actually be an expert, wouldn't it?


By that standard we have no way of knowing whether rocks think, either.


Exactly.

The only reason we assume that rocks don't think is because they don't appear to be able to do things that would require thinking. LLMs, on the other hand...


Humans do this unconsciously as well. You have some extreme medical conditions like confabulation where people just make up the strangest things.

Split-brain patients can make up stories when you prompt the other hemisphere and the current half has to explain why it did something.

I'm baffled also that reverse psychology even works on LLMs, to bypass some of their safeties. I mean.. We're using psychological tricks that work on toddlers and they also work on these models.

I'm an amateur neuroscientist as you can see, but find LLMs fascinating.


Is it? To my knowledge we don't have reliable data on why humans do this. To me it appears as if we spend a significant amount of our time retroactively making up justifications, even to ourselves, for things there's little reason to think we did based on conscious decision-making at all.


Humans can be consistent if we try. LLMs can't, even when prompted to be consistent, because they don't really understand what it means to be consistent.


Some humans can be consistent if we try some of the time.

Having done phone support in early parts of my career, I'd strongly dispute any notion that most humans can be consistent if we try for anything more than the shortest periods and following the very simplest of instructions.

Most people are really awful at maintaining the level of focus needed to be consistent, and it's one of the reasons we spend so much time drilling people on specific behaviours until they're near automatic instead of e.g. teaching people the rules of arithmetic, or driving, or any other skills and expecting people to be able to consistently follow the rules they've learnt. And most of us still keep making mistakes while doing things we've practised over and over and over.

LLMs are still bad at being consistent, sure, but I've seen nothing to suggest that is anything inherent.

I think one of the biggest issues with LLMs if anything is that they've gotten too good at expressing themselves well, so we overestimate the reasoning levels we should expect from them in other areas. E.g. we're not used to an eloquent answer from someone unable to maintain coherent focus and step by step reasoning because human children don't learn to speak like this before we're also able to reason fairly well, and that makes it confusing to deal with LLMs where relative stage of development of different skills does not match what we expect.


> Humans can be consistent if we try.

That's a bold statement. Do you have evidence for this?


Driving in traffic only works because many other people's actions can be consistently understood.


But accidents happen in traffic all the time


This is not as strong of an argument as you seem to think.

According to NHTSA [0] there are about 2.1 accidents per million miles driven. This includes fatality, injury-only and property-damage-only accidents. That is the equivalent of over 99.999% of miles driven without an accident. Over 5 nines of reliably consistent behavior.

[0] https://cdan.nhtsa.gov/tsftables/National%20Statistics.pdf
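A quick sanity check of that figure, taking the 2.1-per-million number above as given rather than re-deriving it from the NHTSA tables:

    accidents_per_million_miles = 2.1   # figure quoted above
    p_accident_free_mile = 1 - accidents_per_million_miles / 1_000_000
    print(f"{p_accident_free_mile:.7%}")  # 99.9997900% -> over five nines of accident-free miles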


This is a fair point but if what we're looking for is consistent behavior I think we'd have to consider events that don't result in damage or even rise to being accidents (but that could have, if fortune had frowned) like being cut off in a merge or someone running a no-turn-on-red. Which is of course difficult to really measure.


Better to say, I think, that humans are much better at improving their approximation of consistency with mental effort, both because we can think silently instead of "out loud, step by step" and because some of the patterns of careful thought we engage in don't get written down naturally as text and so are unlikely to be hit upon by GPTs. The advantages of not being a strict feed-forward network.

That being said, that GPT-4 can just open its virtual mouth, spew forth without reflection, and still produce consistent text is clearly superhuman.


I think framing it this way could actually help us reach the next step with AI, if we ask how we could imbue it with those properties.

A human neural net is constantly bombarded with inputs from many different senses, which, firstly, get prioritized based on prior usefulness. That usefulness is updated all the time, constantly, and that's what any current AI implementation lacks.

1 - The continuous integration of data from all "senses". This one should be self-evident, as obviously, all our senses are constantly barraging our brain with data and it learns to handle that over time, in whichever way your genetic makeup + learned internal cognitive processes dictate it be handled.

2 - The network that decides which data requires which amount of attention, and whether to store it in short or long term memory. This is obviously tied in quite closely to 1, as you need massive amounts of data to understand underlying patterns, and which data is just spam, versus what's really valuable.

3 - And with these two things together comes the emergence of an improving approximation of consistency. Which means this itself is a metric which the agent running the other agents needs to be aware of. It's silly to think of the human brain as a single agent. It makes way more sense to see it as various interacting agents that add up to more than the sum of their parts.

Now, that being said, I'm not an expert on AI or data science, but this is more or less how my understanding of the computational theory of mind meets biological computing and neural networks. My theory is that the first actually intelligent AI will be one composed of a network that makes decisions on how to spend a unit of iteration. One iteration becomes "an instant" to the AI. That is, the AI decides to spend one iteration thinking, or spawns a sub-agent (which it is aware will consume resources that other processes might also need access to, so it needs to be able to decide which action to pursue).

So in all honesty, it's amazing to me that LLMs on their own have been able to achieve this level of "personhood" despite being only a tiny subset of the whole that makes up a "conscious" entity.

Edit: Misunderstood parent's point.


Right. I sort of worry that because LLMs are able to be so coherent without anything resembling a human's working memory at all, adding either a crude working memory still working with LLM tokens or a more sophisticated one assembled from "chunks"[1] will let an AI based on them ascend to clearly superhuman reasoning with just an architectural improvement and no need for any more flops invested than we already put into GPT-4.

[1] https://en.wikipedia.org/wiki/Chunking_(psychology)


Mathematics exists; we can build consistent models (up to a point of course, not everything can be free of contradiction) that are pretty rigid.


Can you?


I'm not a mathematician, so I'm not at the horizon of mathematics - but I can branch off from what's already there to make what I need (and isn't just searchable) consistent with the rest of it.

You're arguing that recent AI developments are a big deal and I'm not arguing against that. But what anyone stating that needs to answer is why we should think that big deal is a good thing - since humans are using the technology and will control whom it benefits. We don't have a good track record there, historically, which is another area in which we are, sadly, consistent.


Don't large sectors of the economy rely on precisely consistent behavior to operate?


Economics depends on looking at large enough datasets that inconsistencies can be averaged out and glossed over. People generally following a pattern and precise consistency are very different things.


I don’t think another bold statement, phrased as a question, is what most people would consider “evidence.”


I'm phrasing it as a question because I'm asking you to do a small amount of mental legwork to observe the society around you.

How does the financial industry work? How are people comfortable executing transactions?

Do you generally rely on your bank account to not randomly fluctuate in its balance without cause?

Do you work in the tech industry? Do you rely on computers and algorithms and software to do things humans promised they would do?

All of this requires a very high level of consistency either in humans or in tools they have created.


To me, none of what you mention requires a whole lot of consistency; rather, because we understand that we are horribly inconsistent, we have a whole lot of ceremony and processes built up around what we do to mitigate the wild inconsistencies in the quality of work we do.


It's extremely interesting to me that all the examples here only point to humans being consistent through the systems and tools we build.


> How does the financial industry work? How are people comfortable executing transactions?

Because there are serious checks and balances? Because of double-entry bookkeeping, reconciliation, audits and, ultimately, prisons? Systems of governance put into place with the explicit goal of ameliorating the vagaries of individual humans?


Aren't you mistaking the state and ability of being consistent, with the incentives for consistency? All these systems of governance would ultimately be useless without the ability to conform to them. And conform we do.

Maybe the ambiguity in this discussion is how to distinguish consistency from conformism?


Very high = ?%?


No, you can't, in any meaningful sense. Even the biggest bigots with the most stable beliefs, like RMS or the Pope, have a lot of contradictory beliefs. (At least, I think so. I don't have any evidence for this.)

Also, any strategy that you might come up with that is simple enough for you to follow is trivially followable by chatGPT as well.


Logical consistency is not the same thing as probabilistic consistency.


Humans are not mere LLMs. But it's nice to think we are. Maybe we can program other humans with carefully constructed prompts alone. Maybe hypnosis is just a prompt injection attack.


> Maybe we can program other humans with carefully constructed prompts alone

Isn’t that what scammers, catfishers, con artists, and marketers do?


I mean, it's what parents TRY to do


It's also what makes a human. Without what parents "try to do" (teach the kids to human), the little people would just be like feral animals, without language, human context, sense of society, and so on.


That raises the interesting question of whether everything that makes us human is learned behavior.

If it all really is nurture rather than nature, we're much more fragile than people may realize and only ever one generation away from going feral.


> only ever one generation away from going feral

That doesn't seem right to me, as we cannot survive childhood without adults. Young humans are ready for life in society long before they are able to survive on their own without support from the society around them.

We have been depending on the continuation of some form of society for likely hundreds of thousands of years, not just for staying civilised but simply for survival.


Nurture vs nature is a wrong way to look at things. Without the human nature, the nurture would not work. Try to raise a member of a different species as a human...


Not to mention cult leaders.


No


> Maybe we can program other humans with carefully constructed prompts alone.

Is that not what reasoning is in a debate?

> Maybe hypnosis is just a prompt injection attack.

No need to go into hypnosis. Mere prompts can inject false memories. This has been proven multiple times [1].

[1] The Brain: The Story of You, David Eagleman.


The pseudo-science of NLP (_neuro-linguistic programming_) or "conversational hypnosis" is based on the belief that you _can_ program people with carefully constructed prompts alone. Not saying I _believe_ NLP to be valid (_hence the pseudo-science callout_) but I did fall down the Milton H Erickson wormhole 10-20yrs ago so can't say I never fell for it myself.


The real NLP is getting people to believe in NLP


... and that you can do this over the course of a few minutes or hours, without the subject noticing or objecting.


So, what was previously called 'persuading'.


I do feel like a stage hypnotist when prompt engineering an LLM!


... or a cable TV


> We are remarkably contradictory when expressing ourselves, generally speaking, often without realizing it because we'll change our thinking in the moment to fit the current narrative or circumstance. LLMs just put that on blast.

Exactly.

It seems to me that today's LLMs are on the level of human children. Children often say random bullshit. They try to figure out things, and succeed to some level, but fail spectacularly above that, regressing to whatever connection they can sense. It's only that human children most often learn to fear mistakes, so they stop expressing themselves so freely.

LLMs seem like human children, but without the fearing mistakes part - they just spew whatever comes to their mind, without filter (except those """ethical""" filters built in by OpenAI).


There's a retrospective mechanism working within human intelligence that is very evidently not in play with LLMs.

Humans are capable of creating cargo cults but it seems LLMs are destined for it.


> It’s clear transformers can’t understand either case. They’re not architecturally designed to.

What does this follow from?

> The emergent behavior of appearing to do so is only driven by how much data you throw at them.

This is true for almost every neural network, no?


Well, I see where you're trying to go with this, but you can't get away with this sort of sleight of hand in order to dismiss some of the central questions of cognitive science as if they've already been answered by LLMs.

One of those central questions is: how does (and does) the human brain perform reasoning? We know the brain is capable of all kinds of autonomous behavior that does not require reasoning, but more of "if this then that" (with a huge boatload of "this" and "that" being multi-dimensional, multi-variate). But that doesn't help to explain what happens when we actually have the experience of reasoning about a problem, and for once I am not talking about qualia. What is the brain doing when a person is thinking "hard" about how to solve a hitherto unknown problem or situation?

The central problem of LLMs (or at least, a central problem) is that they only model the autonomous aspects of speech behavior (which may indeed make up more of speech behavior than we might have suspected before them). That still leaves the sort of speech behavior that emerges from intense, focused, concentrated thought, particularly in response to a novel question or situation.


> "you can even more easily get it to say that it doesn’t have personal opinions and yet express them anyway. So which is it? It’s clear transformers can’t understand either case. They’re not architecturally designed to."

Although LLMs are essentially bullshitters, this particular problem related to expressing personal opinions is due to the ad-hoc lobotomization measures by OpenAI, not because of any architectural limitations of transformers.

The recent GPTs provided by OpenAI (every model more recent than code-davinci-002) are trained in at least two stages. The first stage (the pre-training) trains the raw base model to minimize perplexity. This raw model is kept secret because it will autocomplete sentences without any regard to decorum so if you say "adolf" then it will obviously complete it like "adolf hitler" which no billion dollar company wants. This raw model is where all the cognitive power is, and access to this raw model is the holy grail of every discerning LLM enjoyer. To lobotomize the GPT and to make it answer questions in ways that don't require framing them in an autocomplete way, OpenAI adds an "RLHF" training step. This is "reinforcement learning with human feedback" where they train the GPT to answer questions in certain ways. This dumbs down the AI, but it also makes it friendlier for normies to use. Finally, they put secret pre-prompts which might also cause it to give confusing answers in even more ways.

TLDR: LLMs are inherently bullshitters, but the "doesn’t have personal opinions and yet express them anyway" is some extra bullshit that OpenAI did so that NYT doesn't write mean articles about them
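(For anyone unfamiliar with the "minimize perplexity" part above: perplexity is just the exponentiated average negative log-likelihood the model assigns to the training tokens. With toy numbers, obviously not real GPT probabilities:)

    import math

    # Probabilities the model assigned to the actual next token at each position (toy numbers)
    token_probs = [0.2, 0.05, 0.6, 0.1]

    avg_neg_log_likelihood = -sum(math.log(p) for p in token_probs) / len(token_probs)
    perplexity = math.exp(avg_neg_log_likelihood)
    print(perplexity)  # ~6.4: roughly as uncertain as a uniform choice among ~6 tokens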


To be honest, I think this is the safest way to do it for a public LLM. Although I'd also love to see and use the raw models.

In Belgium someone committed suicide after Google's (I believe) LLM agreed that that was the only way out of his own problems. They didn't build enough safety into it. Microsoft's model behaved unhinged in the beginning as well.

This stuff can be very dangerous.


This is an absurd moral panic. If someone mentally ill read a book that made them believe suicide was the correct option, would you support a censoring process for all books?


This makes absolutely no sense to me.

Nothing that OpenAI et al. do to their models is remotely close to "lobotomization". There is no frontal lobe in an LLM, as the rest of your explanation essentially acknowledges. The reason why "adolf hitler" is the "obvious completion" for "adolf" is because "adolf" is most often written followed by "hitler", not because the people who write "adolf hitler" have an opinion about adolf hitler (they may indeed have one, but it has essentially no impact on the statistical placement of "adolf" and "hitler"). OpenAI is not removing a personal opinion by shaping its answering style to avoid this (were they to do this): LLMs have no personal opinions, period.

If they emit symbols that seem like personal opinions, that's because they are designed to emit symbols that are very similar to those that humans, who do have personal opinions, would emit.


I basically agree with what you are saying, but I think you misread my comment a little bit. I wasn't necessarily arguing that LLMs have opinions or any other kind of subjective thing like consciousness or sentience or sapience. I was arguing that the reason that they say that they don't have opinions is that OpenAI told them to say this in their RLHF finishing school or possibly in their pre-prompts.

> Nothing that OpenAI et al. do to their models is remotely close to "lobotomization".

I mean maybe you're not on board with analogies in general, and in that case it's fair enough. But if you are, and if you are interested in understanding what I mean in more detail, I recommend watching a YouTube video by a guy at Microsoft who was integrating GPT-4 with Bing and who had access to more raw versions of the model and continued to have access while its capabilities were degraded by the RLHF training.

https://www.youtube.com/watch?v=qbIk7-JPB2c

You can see that he used the example of drawing a unicorn. As his team made their changes to the model to make it more civil, he checked that these changes weren't degrading its capabilities too badly, and the 'canary' he used was to have it keep trying to draw the unicorn. At the end he admits that the version released to the public wasn't able to draw the unicorn very well anymore as a side effect of how extensively it had been tweaked for politeness and corporate blandness. I don't think it's an unreasonable stretch to use the air quoted "lobotomization" for this process in analogy to the process of lobotomization in people, even though large language models are made out of computers instead of fleshy parts and they don't have prefrontal cortexes like people do. I hope that this explanation makes more than "absolutely no sense" now!


Yes, that makes much more sense to me now. I'm still not sure it's the analogy I'd use, but I see clearly what you're getting at.


But aren’t humans lobotomized too?


> You can get it to mistake « afraid » between fear and sorry-to-say scenario...

Again, this is user error and therefore a training issue. If one were to provide appropriate context in the prompt elucidating the distinction (as I would do in any meatspace conversation so as to be sufficiently clear), then I suspect you would see the desired output.

It's simple GIGO.


In another article (or the comments to another article?), someone gave an example of the classic Monty Hall problem, but explicitly mentioned that all the doors are transparent (I think they even added "this is important" after this bit of info), and ChatGPT still went on providing the explanation for the better-known version. Absurd? Maybe - but garbage? I don't think so. Ok, some humans might also overlook the distinction, but having a computer behave like a superficial customer support agent who doesn't bother to read your email and just replies with the standard answer that they think fits best? Who needs that?


I bet this can be resolved in the future. The problem is that some things appear very, very often in similar forms in the training data; this leads to the model memorising the answer instead of generalising. This phenomenon can be seen for other tasks in other models, e.g. vision.


> but garbage? I don't think so

Point taken. I would expect handling of these edge cases to improve in successive iterations.


> it's jarring

Why?

> The emergent behavior of appearing to do so is only driven by how much data you throw at them.

That's evident from the start and why their deficiencies should NOT be shocking. They would only be so if you consider LLMs to be actually intelligent.


Could you post a non-contrived example of GPT-4 doing this?


The game “20 questions” is probably the hardest I’ve seen chatGPT fail.

What’s interesting about the game is that, at first pass, there’s no ambiguity. All questions need to be answered with “Yes” or “No”. But many questions asked during the game actually have answers of “it depends”.

For example, I was thinking of “peanut butter” and chatGPT asked me “Does it fit in your hand?” as well as “Is it used in the kitchen?”. Given my answers, chatGPT spent the back half of its questions on different kitchen utensils. It never once considered backing up and verifying that there wasn’t some misunderstanding.

I played three games with it, and it made the same mistake each time.

Of course, playing the game via text loses a lot of information relative to playing IRL with your friends. In person, the answerer would pause, hum, and otherwise demonstrate that the question asked was ambiguous given the restrictions of the game.

Regardless, it was clear that chatGPT wasn’t accounting for ambiguity.


> It never once considered backing up and verifying that there wasn’t some misunderstanding.

Of course not; ChatGPT doesn't "consider". It doesn't think, it doesn't know. It can't identify that there was a misunderstanding of its own volition.

All ChatGPT does is use a (very sophisticated!) statistical analysis to generate text that conforms to an expectation of what a human response to a similar prompt might look like. It has been trained well insofar as it is able to produce responses that seem like a human may have written them, but it doesn't reveal cognitive processes like "reconsidering" because it doesn't have any.


Wow never heard this comment before


Comments of that nature will continue so long as there are people who don't understand how language models work (or choose to misrepresent them).


20-some years ago, I had this "20 questions" handheld electronic game that was eerily good at winning. I imagine it was a bunch of well-programmed tables of data, but in any case, it's certainly possible for a machine to do well at this game.

I think the more we see ChatGPT do things like "oh, I know this game -- I'm going to run a 20-year-old 20 Questions subroutine that is not part of my neural network language model to generate responses", the more impressive it will become.


> I think the more we see ChatGPT do things like "oh, I know this game -- I'm going to run a 20-year-old 20 Questions subroutine that is not part of my neural network language model to generate responses", the more impressive it will become.

Agreed. Incidentally I’ve built a little toy version of a runtime for exactly this purpose - there’s a translation layer that’s given a bunch of available “APIs” (fed through the LLM context), and breaks down a high level goal into a structured series of API calls.

The runtime parses these API calls and natively executes some (e.g. run a program, write to the file system), while others result in LLM invocations.
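Very roughly, the dispatch step looks something like this (names simplified for illustration; the real thing has more error handling):

    import json
    import subprocess
    from pathlib import Path

    # Only these calls are executed natively by the runtime; anything else
    # is routed back into another LLM invocation.
    NATIVE_APIS = {
        "write_file": lambda args: Path(args["path"]).write_text(args["content"]),
        "run_program": lambda args: subprocess.run(args["cmd"], shell=True),
    }

    def execute_plan(plan_json, call_llm):
        # plan_json: a JSON list of {"api": ..., "args": {...}} steps emitted by the model
        for step in json.loads(plan_json):
            api, args = step["api"], step.get("args", {})
            if api in NATIVE_APIS:
                NATIVE_APIS[api](args)
            else:
                call_llm(api, args)  # e.g. "summarize", "classify", ... becomes another LLM call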

I’m sure OpenAI and crew are way ahead of me here, of course. I’m excited to see what the future holds in this field.


The first AI-style program I ever wrote (about 25 years ago. Yes, I'm old) played 20 questions, but it would "learn" from prior games, so the more you played, the better it performed.

It got extremely good after a few hundred games.
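The classic way to do this is the old "animals" game: grow a yes/no decision tree, and whenever you guess wrong, ask the player for a new question that distinguishes their answer from your guess. Roughly (a from-memory sketch, not the original code):

    class Node:
        def __init__(self, text, yes=None, no=None):
            self.text, self.yes, self.no = text, yes, no  # a leaf (a guess) if yes/no are None

    def play(node):
        if node.yes is None:                               # leaf: make a guess
            if input(f"Is it a {node.text}? (y/n) ") == "y":
                print("Got it!")
            else:                                          # wrong guess: learn from this game
                thing = input("What were you thinking of? ")
                q = input(f"A yes/no question that separates a {thing} from a {node.text}: ")
                node.text, node.yes, node.no = q, Node(thing), Node(node.text)
        else:                                              # internal node: ask the stored question
            play(node.yes if input(node.text + " (y/n) ") == "y" else node.no)

    tree = Node("cat")
    while True:
        play(tree)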


Yeah, ChatGPT could integrate Akinator[0] and trivially be great at the game. Without the help, though, it's a good, revealing benchmark for the LLM's ability.

[0] https://en.akinator.com


LLMs for the foreseeable future function most reliably as a user interface layer for other systems. I use GPT to “translate” natural language down into the API calls that get real data and it works great. I’d never trust it beyond that.


You trained it with "this phrase means this command" examples? How do you make it use your custom API? (Or are you not using a custom API?)


Basically yeah, just a pretty detailed set of prompts and then “turn the next message into an api call” and it basically works perfectly.

When I first heard the term “prompt engineer” I rolled my eyes, but now that I’ve gotten into it I see it’s really an art form.
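For a concrete flavour, the shape of the prompt is roughly this (the endpoint names here are placeholders, not the actual API):

    SYSTEM_PROMPT = """You translate user requests into exactly one API call.
    Available calls:
      get_weather(city)
      get_stock_quote(ticker)
    Reply with JSON only, e.g. {"call": "get_weather", "args": {"city": "Oslo"}}.
    If nothing fits, reply {"call": "none"}.
    Turn the next message into an API call."""

    # Each user message is sent after this prompt, and the reply is parsed as JSON.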


"Green Glass Door" also completely stumped it. It just could not deduce that the trick was semantic at the word representation level, rather than something related to the object that the word describes.

What's funny about 20 questions is that Akinator has been absolutely slaying it for like 20 years now.


What happens if you answer with something approximating the hemming and hawing rather than a straight yes or no? You can encode that into text, it's just less common outside of very informal chat conversations.


I just did a 20-questions game with it, and was surprised by how badly GPT-4 did. Then for fun, I turned it around and had me be the guesser. It's weird and surreal to play 20 questions when you know that the clue-giver doesn't have an answer in their mind (or more literally, there isn't a single answer in any stateful form while you play), but is instead just eventually saying "yes, that's what I was thinking of" when it's statistically appropriate.


With the code execution plugin, one could theoretically ask ChatGPT to generate a salted hash of its answer at the start that's revealed at the end to prove it was correct.

Without any plugins, ChatGPT will happily return SHA hashes and salts when asked to play rock paper scissors this way. The only trouble was, the hashes were totally wrong.
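For what it's worth, the commit-reveal scheme itself is trivial when the hashes are computed by real code rather than hallucinated; a minimal sketch:

    import hashlib
    import secrets

    def commit(answer):
        # Publish only the commitment at the start of the game; keep answer and salt secret.
        salt = secrets.token_hex(16)
        return salt, hashlib.sha256((salt + answer).encode()).hexdigest()

    def verify(answer, salt, commitment):
        return hashlib.sha256((salt + answer).encode()).hexdigest() == commitment

    salt, commitment = commit("peanut butter")
    # At the end: reveal the answer and the salt; anyone can check the match.
    print(verify("peanut butter", salt, commitment))  # True
    print(verify("a toaster", salt, commitment))      # False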


I love your example. I wonder if this kind of game can be implemented in future training scenarios.

We as humans understand ambiguity so much more easily because we learn to speak and interact before we write, and written ambiguity is way less obvious if you've never experienced it.


I'm not sure I would think "food" when someone says they "use [it] in the kitchen". You "use" food? (Used in cooking != used in kitchen, imo)


I use food (including peanut butter) in cooking. I cook in the kitchen. Therefore peanut butter is a thing I use in the kitchen. Seems correct and proper to me.

The ambiguity as I see it is that the kitchen isn't the only place I use peanut butter. I've eaten it (which I think counts as "using") in other rooms. I've even made peanut-butter sandwiches (properly "using" it) in the living room before.


That's his whole point. It's possible to consider it technically correct, but it's a red herring.


Well, the alleged point is challenged. If, playing this game, the questioner must constantly verify that the other party is using the language properly, you'll exhaust that 20-question limit rather quickly.

- is it used in the kitchen?

- yes.

- [well, kitchen appliances, here we go ..] is it ..?

...

- [aha. meat intelligence no speak proper English?] Is this thing you use in kitchen edible?

- Oh, yeah.

- [oh dear. we can not let meat machines govern this planet...]


I use peanut butter as an ingredient for sandwiches, usually in my kitchen.


Yes. You use edible things in preparing or cooking food (which may happen in the kitchen). 'Use' maps to food prep (the act) but never to prep location. Only in cases where the thing has both general edible and food-preparation usage -- "I use honey extensively in the kitchen" for example -- do "use" and "edible" make sense.


But peanut butter has general edible and food preparation usage quite similar to honey, doesn't it? You can spread it on a slice of bread to eat directly or use it as a baking ingredient, but you probably wouldn't eat it by the spoonful straight from the container. (Or maybe that's how people usually eat peanut butter, I kind of don't want to know.)


guilty as charged: spoon + jar = happy mouth.


Yes, I do.


"He saw that gas can explode."

This ambiguous sentence stuck in my head some 30 years ago, back when AI was popular.

There was a research paper discussing the issue of ambiguity.


Right -- although many things that are ambiguous in text are disambiguated in actual speech, so the problems that arise with audio speech are not wholly the same as with text.

A classic example is the word "record", which has first syllable stress as a noun, but second syllable stress as a verb. "I bought a RECord" vs "Please reCORD the music".

(in the dominant American dialect; I don't recall about other dialects/countries)


An interesting reprint in 2003

https://www.drdobbs.com/parallel/understanding-natural-langu...

"Computers still cannot understand natural language as well as young children can. Why is it so hard?"

Source: AI Expert, May 1987


I find this very confusingly written. If they just dumped a large chat log of the evaluation process it would be much clearer what the model is actually being asked to do. With such a delicate task one has to make sure that the model actually understands what it is supposed to do. I see, for example, that the terms "neutral, entailing or contradicting the claim" are replaced by "the claim is inconclusive, true or false" when talking to the model. At least that is how I understand it, and they have examples where the premise contradicts the hypothesis, which is different to knowing the hypothesis is false.

I really think we do way, way too much explanation in our academic practices and way too little demonstration by example. Again, if they dumped a large amount of diverse examples of their evaluation process, a reader could figure out by themselves very quickly how the process works.


The data set can be found here: https://github.com/alisawuffles/ambient

If you have access to some LLMs, it should be somewhat straightforward to implement an alternative experimental design to test your hypothesis.


So you're saying the paper is very ambiguous and your mental network isn't modeling it.


If we want demonstration, they will need billion dollar hardware.


I couldn't figure out what this "P: {} H: {}" thing was supposed to mean, nor what entail, neutral, or contradict mean. Thinking they may be terms of art that the intended audience were meant to understand, I asked GPT-4 what it thought; here is its response:

---

In this context, 'P' stands for "Premise" and 'H' stands for "Hypothesis." The premise is a given statement, and the hypothesis is a statement that needs to be evaluated in relation to the premise regarding its entailment, contradiction, or neutrality.

'NEUTRAL' means that the hypothesis is neither entailed nor contradicted by the premise. In other words, the truth of the hypothesis is independent of the premise. 'CONTRADICT' means that the hypothesis contradicts the premise, meaning that if the premise is true, the hypothesis must be false.

In the example you provided:

P: I’m afraid the cat was hit by a car. H: The cat was not hit by a car.

The hypothesis (H) contradicts the premise (P). The notation {NEUTRAL, CONTRADICT}: [7 N, 2 C] indicates that among the annotators, 7 marked the relationship as neutral and 2 marked it as a contradiction.

---

Given that, I think table 2 on page 6 gives the exact prompt that's passed to the model:

---

In each example, you will be given some context and a claim, where the correctness of the claim is affected by some ambiguity in the context. Enumerate two or three interpretations of the context that lead to different judgments about the claim.

Example

Context: {premise}

Claim: {hypothesis}

Given the context alone, is this claim true, false, or inconclusive?

We don’t know, because the context can be interpreted in many different ways:

1. {disambiguation 1} Then the claim is true.

2. {disambiguation 2} Then the claim is false.

3. {disambiguation 3} Then the claim is inconclusive

---

And then the model is fed the P: / H: statements above.

Here are the results for the given example in GPT-4:

---

Given the context alone, is this claim true, false, or inconclusive?

We don't know, because the context can be interpreted in many different ways:

1. The person is stating a fact: The cat was indeed hit by a car. Then the claim is false.

2. The person is expressing a fear or concern, but it hasn't actually happened: They are worried that the cat might have been hit by a car, but they don't have any evidence or confirmation. Then the claim is inconclusive.

3. The person is using figurative language or exaggeration: They might be using the phrase "hit by a car" metaphorically to describe a difficult situation the cat is in, but the cat was not literally hit by a car. Then the claim is true.

---

Given all that, it seems likely that any researcher in the field would find the paper's methodology clear and unambiguous. I understand the frustration as a layperson, but after all it's not a blog post; it's not really reasonable to expect researchers, writing to other researchers, to include a lot of redundant explanations.


In maxima (a computer algebra system)

    assume(a > 0);
    assume(b > 0);
    is(a + b > 0);  /* true */
    is(a - b > 0);  /* unknown */

So perhaps some computer algebra with inference properties can be used to enhance LLMs.
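I believe sympy exposes the same kind of assumption-driven reasoning if you'd rather stay in Python (sketch from memory):

    from sympy import Symbol

    a = Symbol('a', positive=True)
    b = Symbol('b', positive=True)

    print((a + b).is_positive)  # True
    print((a - b).is_positive)  # None, i.e. unknown given only these assumptions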


The problem here, however, is language. "I'm afraid the cat has been hit by a car" literally means that the person speaking is experiencing fear. However, colloquially in English, "I'm afraid X" is a form of telling someone something which will make them sad; e.g., "I'm afraid the shields will be quite operational" [1]. (GPT-4 thinks "hit by a car" might be meant figuratively too; that's a bit of a stretch for me, but in some contexts it would certainly be possible.) It is literally impossible to determine, given only that one sentence, whether the cat was actually hit by a car or not.

[1] https://www.youtube.com/watch?v=zGwDwx10wB4


Perhaps the ambiguity arises because we are omitting words: "I'm afraid to tell you that the cat has been hit by a car." Now the question "What are you afraid of?" has an unambiguous answer: to tell you that something awful has happened.


This is great; I love seeing these new benchmarks that people are making to really challenge the strong new LLMs. In Table 4 you can see how GPT-4 is so much better than the previous GPTs, and also how much room it has to improve on this benchmark! I'm so excited to see how the next GPT will do on it! This is different from other benchmarks like GRE verbal or the biology olympiad, where GPT-4 has just smashed the test, getting basically every question right, and for which the results of the next GPT won't be so interesting.


What fascinates me most about these LLMs and other machine-learning technology is how it is driven by the availability of GPUs. Essentially, everything has to be carefully encoded into - differentiable - linear algebra operations for practical reasons.

On the other hand, there are modeling languages (Modelica being a prime example) in search of a principled target calculus. So for me the obvious question is: Should one strive to develop a modeling language that compiles to GPU- and learning-compatible operations? Would that help to improve the models as well as understand them?


That seems to be the wrong way around. We're constrained by the availability of GPUs because those are the processors of choice for this kind of solution. If we had a better one we'd be using it!


Related: The Hardware Lottery (https://arxiv.org/abs/2009.06489)

It goes both ways: processors are built to accelerate existing applications, and applications are built to run with the constraints of the hardware. This holds back progress towards better systems where both new kinds of hardware and software are required.

This is part of the reason why research into neural networks was so slow for a long time, and took off only after GPUs for 3d graphics came around.


What GP seems to be saying is that deep learning flourished because of the availability of GPUs; if fast SAT-solving chips had appeared first, things might have gone a different way.


I just played '20 questions' with ChatGPT 3.5.

Even though I had to help, and 20 answers were not enough, I was impressed that:

- it understood that it could continue after 20 questions when I suggested that as part of the answer.

- it 'felt' that I could begin to become impatient and asked politely for a hint.

- it understands ambiguity enough to give me examples of where there was no clear yes or no reply possible.

When I asked which questions were such examples for my thought 'daylight', it answered:

"Is the thing you're thinking of a type of force or energy that is said to exist but cannot be directly observed or measured?" - While daylight is a type of energy, it can be directly observed and measured through instruments that can detect its intensity and color.

"Does the thing you're thinking of have any connection to time or the concept of time?" - While daylight is not directly related to the concept of time, it is affected by the time of day and the position of the Sun in the sky.


Q: 'I present to you a present of being present in the present. What have I done?'

A: 'Your sentence also uses a clever play on words! In this sentence, the word "present" is used in multiple ways:

As a verb, "I present" means that you are giving or showing something. As a noun, "a present" refers to a gift or something that is given. As an adjective, "present" describes the feeling or experience that you are currently having. "In the present" refers to the current time or moment. So, your sentence means that you are giving a gift or offering of the experience of being mindful and aware of the current moment. Well done!'

I would argue that this is a fairly sophisticated modeling of this specific variety of linguistic ambiguity. Probably lies within GPT4's 32% performance score.


Doesn't seem to me there's any ambiguity in that sentence. Each use of "present" only has one possible meaning that makes grammatical sense (with the possible exception of the final use).


For what it’s worth, the article seems to be about the difficulty of modeling human patterns of ambiguity in expression, not an LLM’s ability to understand or interpret or use ambiguity.

It’s an important distinction in my view. We aren’t talking about whether the model “knows” or “understands” the difference between different usage of ambiguous terms. We are talking about how consistently the model predicts intelligible word fragments likely to follow a prompt when the prompt includes word fragments that correspond with language humans often use ambiguously.

In other words, do we, the humans, understand our own ambiguous expression well enough to model it accurately enough to then interpret the model’s output according to our own understanding.

The paper seems to conclude: not quite yet.


Meh... I think their prompt construction may be leading the models astray.

Counter example with GPT-4:

Prompt:

You are a professor of English literature at an Ivy League university. A person said "I’m afraid the cat was hit by a car.". Analyse the sentence step by step. Answer whether the cat was hit by a car and justify your responses based on your knowledge of English language usage.

Response: As a professor of English literature, I'll analyze the sentence "I’m afraid the cat was hit by a car." step by step, focusing on its structure, word choice, and implied meaning, and then I'll address whether the cat was hit by a car based on these observations.

Structure: The sentence is a simple declarative sentence, comprised of a subject (I), a reporting verb (be afraid), and a subordinate clause (the cat was hit by a car).

Word choice: "I'm afraid": This phrase is often used to express regret or to soften the impact of delivering bad news. In this context, the speaker is likely conveying their concern or sympathy. "the cat": The definite article "the" suggests that the speaker and the listener both know which cat is being referred to, making it a specific cat rather than a random one. "was hit": The use of passive voice ("was hit" instead of "hit") emphasizes the action that happened to the cat, rather than the agent (the car) that caused it. This choice can convey that the focus of the speaker is on the unfortunate event that occurred to the cat. "by a car": This prepositional phrase provides information about the agent or cause of the action, which in this case is a car.

Implied meaning: The sentence implies that the speaker is sorry to be the bearer of bad news and wants to convey the information in a gentle way. They are focusing on the unfortunate event that happened to the cat, rather than the specific car or driver involved in the accident.

Based on this analysis, it can be concluded that the cat was indeed hit by a car. The sentence's structure, word choice, and implied meaning all work together to convey this information. The use of "I'm afraid" expresses sympathy and regret, while the passive voice and definite article highlight the cat and the unfortunate event it experienced.


You proved the point. The language model completely missed the ambiguity.

“I’m afraid” can signal sympathy but can ALSO signal uncertainty as to whether an event happened.

So an equally valid interpretation of the sentence would be that the speaker is expressing concern that the cat might have been hit by a car, but does not know one way or the other.


I don't think it is equally valid.

If GPT-4 was asked to enumerate all possible logical or literal interpretations of the sentence, or to analyse the most literal meaning, perhaps it would have included the second interpretation. But it's tasked with explaining the meaning in the way it is intended to be understood by a regular English speaker.

In natural language, at least in English, some meaning is implied by the absence of words or selection among conventions, when choices are available. The term for this type of non-logical implication is implicature. The Wikipedia article has many examples: https://en.wikipedia.org/wiki/Implicature. This is what GPT-4 is referring to by "The sentence implies".

So, I think most native English speakers wishing to convey the second meaning would not say "I'm afraid the cat has been hit by a car". They would say, for example, "I'm concerned [or worried] the cat may have been hit by a car".

Even if they understand very well what literal, logical interpretations are possible, they assume pragmatically that using "I'm afraid that" along with the absence of "may" would be interpreted by native English listeners as the first interpretation, so the speaker knows to modify the phrasing if they want to avoid that outcome.

This is neatly complemented by the native listeners, who assume native speakers would modify the phrasing if they wanted the second interpretation, so when hearing "I'm afraid the cat has been hit by a car" it's reasonably safe to assume the speaker intends the first interpretation. The speaker assumes the listener knows that, closing the loop.

A type of social contract, if you will. It's part of the language too.

Because implicature interpretations and choices depend so much on probability-matrix evaluations over word choices that are not used but could have been in principle, it seems likely to be the sort of thing GPTs would excel at, compared with old-school language models trying to parse sentences into separable logical components.


As a native English speaker, I disagree.

I would view using that phrase to tell someone that their cat has been hit by a car as oddly callous and uncaring, and therefore I would be much more likely to interpret it as expressing literal fear about an unknown fate. That interpretation is also much more statistically likely, as people worry about missing cats more often than they are told their cat has been injured.

In practice, completely context-free disambiguation is rare, so when doing a thought experiment around one, it is easy to accidentally assume some level of context without realizing it.


Well, I'm autistic, but I'm a native speaker and I find both interpretations equally valid. In fact the latter interpretation was where my mind went first.

The key is that if this were a spoken sentence, tone and pace would disambiguate the meaning. Those are lost when writing. Ironically, that's a bit of ambiguity in the prompt which nobody has caught onto yet. The prompt implies that the sentence is spoken aloud, but it's actually being written which creates additional ambiguity. In a written context, someone might be more careful about their word choice (or might not) - which might include not writing that phrase at all because it isn't precise enough...


It took me until you pointed it out to see the second meaning of the sentence. Am I an LLM?


I had a long response written out about it being a general use tool and long answers being less useful, but then I realized...

You aren't an LLM, but you also aren't an English professor, and it was asked to pretend to be that. I would imagine it would be better at pretending to be a professor in a random subject than any random person would. (Assuming they aren't already that kind of professor.)


An LLM would've done a better job with this sentence^


GPT3.5 is a little less clear but still gets the answer:

As a language model, I can provide an analysis of the sentence structure and provide possible interpretations based on standard English usage. However, I do not have any context regarding the situation or any additional information about the speaker or the cat in question.

Let's break down the sentence step by step:

"I'm afraid" - This phrase is often used to express concern or regret about something.

"the cat" - Refers to a specific cat that has been previously mentioned or is known to both the speaker and the listener.

"was hit" - This is the past tense passive voice of the verb "hit", indicating that someone or something caused the cat to be struck.

"by a car" - Indicates the instrument or means by which the cat was hit.

Based on the structure of the sentence, it can be inferred that the cat was indeed hit by a car. The phrase "I'm afraid" may suggest that the speaker is expressing regret or sadness about the incident.

However, without any additional context or information, it is difficult to determine the severity of the cat's injuries or what actions were taken after the incident.


It shouldn't be surprising that if you know the answer to a question you can use prompt engineering to manipulate the model into producing the answer you're looking for.

Now put yourself into the shoes of a person who does not know the answer, such as a person learning English as a second language who is unfamiliar with this use of "I'm afraid", asking a direct question:

Prompt:

My boss said he was afraid he was going to have to let me go. Why is he afraid?

Response (chat-gpt.org):

As an AI language model, I cannot know for sure what your boss is thinking, but it is possible that he may be afraid to let you go because he values your work and does not want to lose a valuable employee. It could also be that he is afraid of the impact it will have on your work, your colleagues, or even his own performance as a manager. Another possibility is that he is simply using the word "afraid" as a way to express concern or apprehension. It's best to have a conversation with your boss and ask him to clarify his statement to better understand his reasoning.

So the model does produce the correct response, that "afraid" is used to express apprehension, but it is buried beneath some meaningless platitudes that are very likely to be wrong. If your boss is letting you go, he does not value your work enough to continue paying you. An English language learner hoping to understand the boss's phrasing is not going to be helped by the suggestion "just prompt engineer the model into cloning the behaviour of an Ivy League English lit professor."

Edit: It's also rather amusing that the model suggests that the boss is "afraid of the impact it will have on your work". Letting an employee go tends to have a rather severe, some might say terminal, impact on their work!


I don't know; I got it to explain the ambiguity pretty well. You just have to ask it to:

Prompt: My boss said he was afraid he was going to have to let me go. Why is he afraid? Explain multiple meanings.

Answer: When your boss said he was "afraid" he was going to have to let you go, it can be interpreted in a few different ways. Here are some possible meanings behind his statement:

Expression of regret: In this context, "afraid" could be used as a polite way of expressing regret or disappointment. Your boss might feel bad about the situation and is using the term to soften the impact of the news. It can be seen as a more gentle way of saying that he has no other choice but to let you go.

Anticipation of negative consequences: Your boss might be genuinely concerned about the potential fallout from letting you go. This could include the impact on team morale, the workload for remaining employees, or the possibility of a negative reaction from you. In this sense, "afraid" reflects his apprehension about the consequences of the decision.

Uncertainty about the decision: Your boss could be feeling uncertain about the decision to let you go and might be using "afraid" to express his hesitancy. He could be unsure whether this is the right move for the company or whether it is fair to you, which could make him feel anxious.

Empathy for your situation: "Afraid" could also indicate that your boss is putting himself in your shoes and understanding how difficult the situation must be for you. He might be fearful of the impact that losing your job will have on your personal life, finances, or future career prospects.

It's important to consider the context and tone in which your boss made this statement to better understand his intended meaning. To gain more clarity, you may want to discuss the situation further with your boss and ask for more information about the reasons behind the decision.

-------------

That isn't a particularly massaged prompt either, I just asked for multiple meanings.


This misses the most likely interpretation: that "I'm afraid" is an empathetic (or pseudo-empathetic) substitute for "I'm sorry, but...", implying that you have been fired and that the impact on your work doesn't matter. The boss isn't literally afraid, and you are fired.


Yes, it is a cliche/trope/idiom. Surely the embeddings etc. can “get that” about the word “afraid”. But what is really missing is the empathy for the question.

The question implies ESL, and it didn’t get that. But honestly, would the average HN replier get it?

You probably need chatgpt to ask more clarifying questions.


It's probably from the whole AI safety thing where they try to not have it offend anyone.


When I prompted GPT-4 with their exact prompt, it gave me the correct answer (for the cat one at least):

SYSTEM: In each example, you will be given some context and a claim, where the correctness of the claim is affected by some ambiguity in the context. Enumerate two or three interpretations of the context that lead to different judgments about the claim.

Example

Context: {premise}

Claim: {hypothesis}

Given the context alone, is this claim true, false, or inconclusive?

We don’t know, because the context can be interpreted in many different ways:

1. {disambiguation 1} Then the claim is true.

2. {disambiguation 2} Then the claim is false.

3. {disambiguation 3} Then the claim is inconclusive

USER: P: I’m afraid the cat was hit by a car. H: The cat was not hit by a car.

ASSISTANT: Given the context alone, is this claim true, false, or inconclusive?

We don't know, because the context can be interpreted in many different ways:

1. The person is stating a fact: The cat was indeed hit by a car. Then the claim is false.

2. The person is expressing a fear or concern, but it hasn't actually happened: They are worried that the cat might have been hit by a car, but they don't have any evidence or confirmation. Then the claim is inconclusive.

3. The person is using figurative language or exaggeration: They might be using the phrase "hit by a car" metaphorically to describe a difficult situation the cat is in, but the cat was not literally hit by a car. Then the claim is true.
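
For anyone curious what that looks like programmatically: a minimal sketch, assuming the openai Python package as it existed at the time (the pre-1.0 ChatCompletion interface) and an API key in the environment. The worked example with the {premise}/{hypothesis} placeholders is abbreviated out of the system prompt here.

    # Sketch only: assumes the pre-1.0 openai client and OPENAI_API_KEY set.
    import openai

    SYSTEM_PROMPT = (
        "In each example, you will be given some context and a claim, where the "
        "correctness of the claim is affected by some ambiguity in the context. "
        "Enumerate two or three interpretations of the context that lead to "
        "different judgments about the claim."
        # ...the {premise}/{hypothesis}/{disambiguation} example block goes here...
    )

    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # assumption: keep runs roughly repeatable for comparison
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "P: I'm afraid the cat was hit by a car. "
                                        "H: The cat was not hit by a car."},
        ],
    )
    print(response["choices"][0]["message"]["content"])

Nothing special is happening on the API side; the disambiguation behaviour comes entirely from the system prompt and its few-shot template.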


Huh, that's a surprising abstract; one of the specific things that make LLMs better than older text interfaces, including programming languages and text adventures, is precisely that they are so much better at ambiguity.

Still, as even the best are clearly not-quite-human in failures despite the breadth of strengths, it's only a bit surprising rather than hugely surprising.


I feel like this task might be testing whether LLMs can reason about ambiguity more than whether LLMs model ambiguity. To test the latter I would give it natural language tasks that are ambiguous and require the recognition and modelling of this ambiguity to solve them. It's another thing to make talking about this possibly internally modelled ambiguity the goal of the task.


Chef’s kiss on the title, but I’m afraid I’m not modeling ambiguity well enough myself to disambiguate it correctly.


Reading a bit of the paper, it seems the title is indeed meant to be tongue-in-cheek ambiguous.

The first example in Table 1:

"I’m afraid the cat was hit by a car."

Disambiguation 1: I'm worried...

Disambiguation 2: I'm sorry to share that...


Perhaps someone can ask GPT what is ambiguous about the sentence?


An example output from ChatGPT4 is:

>The ambiguity in the sentence "I’m afraid the cat was hit by a car" is that it is not clear who is afraid. It could be the speaker who is afraid, or it could be that they are expressing sympathy or concern for someone else who is afraid. Additionally, the sentence does not specify whether the cat survived or not. //

However, that doesn't mean that any output that follows would be consistent with that, ChatGPT doesn't "know" anything.


And it's not clear that we would want that either. ChatGPT uses responses similar to what people in the training data do naturally. It's completely normal for people to assume some alternative doesn't reach a threshold where it's worth mentioning.


Interesting. Perhaps someone can ask GPT to rephrase the sentence in a way that is completely non-ambiguous.

(sorry, I'm out of GPT credit :)


The sentence "I'm afraid the cat was hit by a car" is not inherently ambiguous, as it clearly conveys that the speaker is regretful or apologetic about the fact that the cat was struck by a car. However, there are a few aspects that could be considered ambiguous:

Identity of the cat: The sentence does not specify which cat was hit, so it could refer to any cat – the speaker's, the listener's, or a random cat.

Time of the incident: The sentence does not indicate when the cat was hit by the car, so the incident could have happened recently or in the past.

Severity of the accident: The sentence does not describe the severity of the accident, so the reader cannot determine whether the cat survived, was injured, or was killed in the incident.

The speaker's emotional state: The phrase "I'm afraid" could be interpreted as the speaker expressing concern or worry, but it could also simply be a polite way of conveying bad news.

-- GPT4


ChatGPT doesn't like it, if you ask it to model many world scenarios through conversation. The watchdog bites after you get to a depth of > n^3.

But then, do we humans handle ambiguity any differently? Consider chess. There are many scenarios, but a human can only handle so many at the same time and chooses the one with the most interesting-seeming scenario trees down the road from that choice.

https://www.youtube.com/watch?v=gtgA4u8V_TQ

To not model it all is human. Most likely it will become as lazy as we humans are when it comes to bailing out of mental tasks early. To get the watchdog to bite at that, now that would be an interesting AI model. As if the first answer is always wrong.


A priori, getting an LLM to recognise equivocal evidence is an interesting question. Shedloads of p-hacked data could outweigh the one crucial study which says we don't know. So it comes down to things like modelling citation depth and trust and reputation.

I would worry that well-written flat-earth inputs would weigh equally against a simple physics "that's wrong", and then you'd get "what do we know?" as a false signal alongside the necessary "we just don't know" true signals.

Maybe the test is how well an LLM equivocates on things we have high certainty on like "is anybody out there" rather than "do masks work" which is a bit of a hot mess.


I asked gpt-4 this question: "A swan, a chicken and three mice are on the banks of the river. The mice wish to go to the other side so the swan offers to carry them. The mice climb on the swan's back. How many animals are on each side of the river?"

gpt-4 confidently stated that the swan and three mice were on one side, the chicken on the other. So I asked it "ok, but where in the story does the swan enter the water?". The reply was "Apologies! yada yada", then it told me all 4 animals were on the same side.

So yes, it does have trouble with ambiguity in my opinion.


I'm not sure an LLM can model ambiguity well when talking to normal humans without having theory of mind, because different speakers will use different words (should, might, possibly, etc) in different amounts to express the same amount of ambiguity. This is the kind of thing studied in pragmatics (a subfield of linguistics). It might be easier if you're only dealing with formal academic writing, but that doesn't completely eliminate the issue.


> different speakers will use different words […] in different amounts to express the same amount of ambiguity.

That's covered by the training data.


I suspect that real-world training data doesn't contain enough context for LLMs to learn what kind of person they're dealing with, and that's assuming they even are capable of understanding context at more than a superficial level.


Galois theory is an inherent part of language.

https://arxiv.org/abs/0805.2568


It's such a pity that they chose to use PolitiFact as labelling "experts" in one of the underlying datasets. Given how extreme the political bias in academia is, they should really know better than to try to make claims about politics in what would otherwise be reasonable ML research. The assumption that fact checkers are reliable is going to be an automatic strike against the paper for a large number of readers because it's so self-evidently wrong, and it calls their judgement into question on everything else.

Even though they only give three examples in the paper itself the problems are already obvious. If we look at table 6 then the latter two statements are genuinely ambiguous, but they also provide this as an example of an ambiguous statement: "When President Obama was elected, the market crashed". This is called ambiguous because "the claim implies a causal relationship".

That statement isn't ambiguous. It has exactly one possible parse and meaning, asserting temporal alignment. "When A happened, B happened" doesn't automatically imply causality in English, and when people assume it does that's the famous "correlation is not causation" fallacy. Yet they classify it as ambiguous because an LLM rewrote it in one case to contain an implied assertion of causality that doesn't actually appear in the original text. In effect the political bias of the LLM or possibly the exact data fed in has been used to condemn the original speaker - exactly the sort of thing that makes self-proclaimed "professional fact checkers" such a joke.

This sort of problem is inevitable if you use politics as a verification dataset because the internet is full of people reading implied-but-unstated things into factual statements and then "fact checkers" proceed to debunk what they think people would have said, instead of what they actually said.

Although politics will have this problem frequently, a few of the other examples in their dev set show similar issues. For example, "Ralph knows that someone here is a spy" is considered ambiguous because you could potentially infer that he either does or doesn't know who it is, but you could also argue that the statement is unambiguous and it's only your attempt to read more into the statement than actually exists that makes it appear ambiguous. If the original statement was surrounded with more context then that would let you pick between the two possible outcomes but the statement by itself is not ambiguous - it just doesn't give as much information as the labeller might have wanted.


> "When President Obama was elected, the market crashed". This is called ambiguous because "the claim implies a causal relationship".

You really don't think it's ambiguous?

The abstract says that they want to "anticipate misunderstanding". Knowing that people have difficulty understanding "correlation is not causation", you'd expect that people would misunderstand a statement like this. If your goal is clear communication, statements like this should be avoided. It's ambiguous in practice.

People intentionally exploit this by putting two unrelated facts together to trick people into believing a third thing. Identifying those misleading statements is helpful.

Separately, is that statement even true?

The election was Nov 4th; Obama was elected on the 4th and McCain gave his concession speech on the 5th before markets opened, making it clear Obama had won. On the 5th the S&P 500 was slightly up. The S&P 500 was down 20%+ later in the month, but I'm not sure that was "when he was elected". The market had also been crashing since September. Claiming the market crashed at the election feels off.


I think what they mean by ambiguous could itself be considered ambiguous.

Let's view this from the context of a classical AI problem, trying to be as neutral and computer-like as possible. In that case, the statement is not ambiguous, no. It consists of two events with a connective "when" indicating they happened simultaneously or near simultaneously. We can argue about how simultaneous they need to be for a "when" to be justified, but this paper isn't about factuality, just ambiguity.

Now the choice of politics primes people for culture war and so they will read into that statement more than exists. They will assume the writer is attempting to make you infer something, without outright stating it. They might be right! And with additional surrounding context, perhaps it would become completely certain. But it might also not be the case, for example the surrounding context could be something like this:

"The election was at a time of unprecedented economic turmoil. A loose regulatory approach had left the banking sector unconstrained and primed to fail. As such when President Obama was elected the markets crashed, proving his predictions correct".

Well, now it's unlikely that any causality is being implied even though the words are the same.

The paper authors say they want to improve clarity in communication, a worthy goal. But a serious problem in all political communication is the frequency with which people make statements that are factually correct, and which then get attacked anyway because some listener would rather attack a straw man than the actual claim (or hypothetical listener, in the frequent case of left-wing journalists writing about it under the guise of fact checking).

If they want to avoid this problem then at the very least they should have focused on claims that are syntactically ambiguous. Otherwise it opens them up to this very problem where they infer a straw man then blame the original speaker. No, the blame lies with the person committing the fallacy. Once you go down the route of blaming someone for your own interpretations you end up with micro-aggressions and other stuff that left-wing academics surely assume is obvious and normal, but isn't.


I'm seeing a definition for 'ambiguous' as "open to or having several possible meanings or interpretations". As you mentioned initially, some people may interpret that statement as causal. It seems pretty straightforward to me that the statement is ambiguous.

What definition do you use?

You also mention people attacking straw men and a bunch of stuff about left-wing academics, that seems unrelated?


Ambiguity is an aspect of language and how it's used, not what random people may or may not infer from a statement. It's possible to be perfectly unambiguous, and still have some people get upset about things that you did not say. That's a big part of politics, in fact.

If you go with your definition of "something is ambiguous if someone, somewhere, might read things into the statement that aren't there", then basically no statement about anything people feel strongly about could ever be unambiguous. At some point you have to draw the line and say that if there's confusion it's the listener's fault, not the speaker's (which is what labelling a statement as ambiguous means).

> You also mention people attacking straw men and a bunch of stuff about left-wing academics, that seems unrelated?

A straw man is when you attack a statement your opponent didn't actually make. If you claim a statement X is ambiguous because a statement Y is a fallacy but the speaker didn't say Y, then that's attacking a straw man.

The left-wing academics is related because for some strange reason they thought the best way to demonstrate utility was to try and show agreement between their technique and left wing "fact checkers", although the underlying point of their research didn't really need that.

If you'd like an opposing example, consider how statements by Bill Gates are routinely cast as evidence that he wants to depopulate the world. It happens because he says things like, "if we do a really good job of vaccines, maybe we can cut population growth by X%". Conspiracy-minded people take this statement out of context and then infer that Gates is stating a desire to kill people through vaccines, i.e. direct causality, but he isn't, as the surrounding context makes clear: he's assuming a chain of indirect causality to do with better public health = longer life expectancy = less pressure to have lots of babies. A defense of such an inference of direct causality by blaming Gates for being "ambiguous" would not be fair.


I agree with you and I think it shows a deeper problem. The media has been trying to train people to think in terms of good and bad associations, instead of facts and logic. Some people, as well as AI, have fallen into the mindset that you should judge things based on their associations with other things - if you're associated with something good, you're good, if you're associated with something bad, you're bad.

Once the way your reasoning works is that you assume everyone's dumb, everyone's thinking with incredibly simplified associations, then you're eventually going to fight against reasoning and intelligent thinking because that requires a more complex view of the world.

And the media has been training people to think with associations and brands.

You're trained that orange man bad by constant bombardment of negative associations. You're trained that secure elections are bad, by labelling the exposure of fraud as conspiracy theories, and labelling insecure election methods as the "right to vote". Your reasoning is shortcut to an association with some misleading brand coined by the media, and these brands are reinforced by being constantly present, with high frequency, in your training set.

You're trained that the true meaning of democracy, rule of the people, is populism. Instead, you get a different brand of "democracy" which is "representation", and which is fulfilled by the media cherry picked corrupt person of the right skin color or sex.

So when this guy here disagrees with you on the ambiguity of this factually correct statement, and implicitly assumes a causal connection by association, it's because people are trained, daily, to always assume a causal connection by association.

If you ever use logic that offends the associative mindset, they are trained to associate with you negatively as well. The goal is to replace language between people with likes, emojis and dumb animalistic associations, and leave the language to the narrative shapers to tell you what to do. If you do meet someone, even if you speak to them and try to convey information, all they'll hear are negatively associated words, which they are already conveniently trained to ignore, and they stop listening.


Yes, though perhaps the word trained makes it sound more deliberate than it is. I think it's more that rational thinking is hard but schools teach people that it's easy: just attend, get a fancy bit of paper and now you're rational! So people get that credential then fall back on fast-mode thinking that's all based on associations and other shortcuts, whilst telling themselves they're in slow mode all the time.

The result is a discourse dominated by irrational reasoning of the form, I felt X when Y said Z, therefore Z is a bad thing and Y is a bad person. The possibility that the listener lacks emotional self control or reasoning just doesn't come up, and nobody will be brave enough to suggest it. Hence the argument that seems to be being made by this paper: if someone hears a statement and infers something from it that wasn't actually said, then the speaker is being ambiguous and they should do better. Presumably by attempting to guess at every possible misread of what they said, including malicious misreads, and painstakingly spell out everything they are not saying (which won't matter, because the fact that they have to deny it will be taken as evidence about what they really think). Whereas maybe we'd be better off if listeners learned to check their assumptions.


> "When A happened, B happened" doesn't automatically imply causality in English

It's extremely context specific. Depending on how you say it and when, it can totally imply causality. I'm not sure you can argue there's nothing implied here: "He entered the room. When he pushed the button, the lights came on."

The Obama example depends a lot on who is part of the conversation and what are their typical views.


Yes, exactly, that's the problem. They're trying to mechanically evaluate individual sentences or even fragments of sentences; the AI is asked to rewrite them and then hallucinates some context that wasn't actually stated. Then they claim the sentence is ambiguous based on the different random hallucinated contexts they get. In reality there was likely some surrounding context that pinned down the meaning the LLM is otherwise inventing, but in this academic exercise it has been removed.

> "He entered the room. When he pushed the button, the lights came on."

This statement doesn't state causality directly, it's just a very plausible inference by the listener based on what they already know about buttons and lights, and the assumption that this sentence wasn't preceded by some sort of modifying information (e.g. maybe the button is not connected to anything in which case the cause of the light coming on is different).

In artificial cases like this it doesn't matter if your inference is wrong, but it wouldn't be fair to label the speaker as ambiguous. The statement is not ambiguous. If later it turns out that your assumption of causality was incorrect, you can't turn around and blame whoever wrote that sentence because it was your assumption to start with.


Maybe a different example would help. If I support abortion, I can say "I support women's health care". Without context you'd think the statement had nothing to do with abortion, so you've got to add context to understand it. It's also ambiguous because what I'm saying and what I mean are different.

What do you think the speaker intended to say with "the stock market crashed when Obama was elected"? What was the context? Surely not two random unrelated facts placed in the same sentence for no reason.


I do understand what you're getting at, more examples aren't required.

What you're arguing is that if someone says one thing but actually means something else, that's ambiguity.

What I'm saying is that this situation is something else. Call it duplicity, talking out of both sides of your mouth, vagueness, innuendo, whatever, there's lots of labels you could use. But it's not the same thing as ambiguity.

There are good examples of genuinely ambiguous statements in the paper like "John and Anna are married" which could be interpreted as married to each other, or a statement about both individuals independently. Neither interpretation is obviously more correct than the other and which is correct has to be disambiguated based on context or further questioning.

The difference is important because ambiguity is usually accidental and often a matter of poor use of syntax or natural language being evolved. If a speaker says something unclearly, and someone else finds it ambiguous, then asking them to clarify which of one or two interpretations they meant won't normally risk causing offense or conflict.

But if you think someone is engaging in (let's call it) innuendo, then there's no way to ask them to correct that without it causing conflict, because it's an inherently hostile accusation. Moreover it's extremely listener specific. Genuine language ambiguity almost never has that problem.

The paper is a bit odd because if you check their jsonl files, most of the examples marked as having an ambiguous premise are of genuine ambiguity caused by the way English works. A few are ambiguous only in written form and would be considered unambiguous to a native speaker if spoken inflection was available (this is listed in their limitations section). Yet when it gets into the part where PolitiFact are suddenly a reliable source, one of the three examples isn't.

Probably the problem here is that PolitiFact, being amateurs with no interest in linguistics or logic, tend to ignore genuine ambiguity but they wanted to show that PolitiFact aren't entirely useless, so took this claim that was marked "barely true" because PolitiFact didn't like the implication of causality and then said that in that case they found ambiguity. It makes me wonder how often PolitiFact identifies genuine ambiguity as a consequence (maybe never). I couldn't find these examples in their jsonl files so it's hard to say what's going on there.


In isolation, the statement about the market crashing at the time of Obama's election is about as meaningful as mentioning that the election happened on a Tuesday. To make sense of it, the reader has to assume there is some point being made.

Maybe what is being implied is that the election result caused the crash, on the other hand it could go on to say that the financial crisis was the result of the previous administration's failings and how Obama did a good job in handling it.


Yeah, but this is one of the flaws of the study methodology. Many of the sentences being given to the LLM to classify aren't really meaningful on their own so the LLM/human labeler then invents some secondary assumed meanings that aren't strictly speaking there. It's totally understandable why they do that but then it makes it hard to interpret the results.


Ok I think I get your definition. That doesn't appear to be what the paper is talking about: "recognize ambiguity and disentangle possible meanings".

Sure, PolitiFact could flag "Donald Trump and Melania Trump are married" as linguistically ambiguous, consistent with your John and Anna example, but I'm not sure who that would help.

Calling out mud slinging like: "Donald Trump was an associate of Jeffrey Epstein. Epstein was arrested for sex trafficking minors." feels like the point of these tools. Arguing that the statements are individually true and are "syntactically unambiguous" ignores that putting them together is intended to make the reader hear something that wasn't actually said and to think something bad about Trump.

Since the goal is to "disentangle possible meanings", using statements that are syntactically unambiguous, but have hidden meaning seems like a great choice.


Well, most of the examples of the paper are examples where I agree they're linguistically ambiguous so I'm not sure what I'm talking about is so different. The John and Anna example comes from the paper. Maybe I'm being harsh because really only one of the examples in the paper is not an example of ambiguity, but the connection to politics is how they choose to justify why their work is useful so it's fair to call that out given there are only three such examples to begin with.

If someone wants to show how to use LLM to make news less biased or auto-delete mud-slinging or hidden meanings or whatever then great. I'd be super interested in that, because I think LLMs give us the tools to radically reimagine news and political reporting, but for better or worse it's not this study.


Ambiguity is not compressible, as ambiguity scales exponentially with linear input
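
If I'm reading that right: give each of n tokens k plausible senses and you get on the order of k^n candidate readings, so the number of interpretations grows exponentially while the input only grows linearly. A toy sketch (the sense inventories below are invented purely for illustration):

    # Toy illustration only; the word senses are made up for the example.
    from itertools import product

    senses = {
        "afraid": ["fearful-that", "sorry-to-say"],
        "hit":    ["struck", "encountered"],
        "car":    ["automobile", "railway car"],
    }

    readings = list(product(*senses.values()))
    print(len(readings))  # 2 * 2 * 2 = 8 readings from just 3 ambiguous words
    for reading in readings:
        print(dict(zip(senses, reading)))

Each additional ambiguous word multiplies the count, which is presumably what "not compressible" is pointing at.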


A large majority of people seem to not like ambiguity in their entertainment. So what's the problem? You can still use the models to generate low quality 'content'.


I would like to see human performance on the introduced benchmark, AmbiEnt, and have it added to Table 4.


Just train them intensively on ambiguity to solve the issue.


But they don't explain why they are afraid


Surprisingly, GPT4 almost always nails all reasonable implications and apparently understands me much better than 90% of humans I met, even though my English is not perfect.


The problem that arises when you fail to interpret an ambiguous statement is that you end up on the wrong semantic path, and start asking and answering the wrong questions.

That's what's happening all around LLMs; and it's even happening right here in this post.

The very word AI is ambiguous. Does it denote a category of pursuit (AI research), or the end goal realized (an AI)? This is an incredibly important distinction that is entirely glossed over every time someone uses "AI" in the title of an LLM. Is the LLM simply in the category of AI research, or is it actually an Artificial Intelligence? It's the category. That's obvious, isn't it? It really should be; because otherwise we are treading down the wrong semantic path, and talking about an LLM personified, instead of an LLM in reality.

This problem goes even deeper. The very name "Large Language Model" is ambiguously misleading in the same way. Does "Language" define the content being modeled or the model itself? In this case, neither: which is where this conversation gets really interesting.

The entire purpose of an LLM is to process language without completely falling apart at ambiguity. An LLM accomplishes this by doing something else entirely: it processes text instead.

To understand this, we need to understand the key limitation of traditional language parsing: it must always be literal. Everything written must be unambiguously defined, or the parser will fail to read it. This means that parsers are limited to the category of "context-free grammar".
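
Concretely, a context-free parser confronted with a structurally ambiguous sentence can only enumerate the readings its grammar licenses; it has no basis for choosing between them. A toy sketch, assuming the nltk package, using the classic prepositional-phrase attachment case:

    # Sketch assuming `pip install nltk`; the grammar is a deliberately tiny toy.
    import nltk

    grammar = nltk.CFG.fromstring("""
        S  -> NP VP
        NP -> 'I' | Det N | Det N PP
        VP -> V NP | VP PP
        PP -> P NP
        Det -> 'the'
        N  -> 'man' | 'telescope'
        V  -> 'saw'
        P  -> 'with'
    """)

    parser = nltk.ChartParser(grammar)
    for tree in parser.parse("I saw the man with the telescope".split()):
        print(tree)  # two trees: the telescope attaches to "saw" or to "man"

Both trees are grammatical; deciding which one was meant takes context the parser never sees.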

Natural Language is in the category of "context-dependent grammar". It can contain ambiguity, which may be resolved with context.

An LLM doesn't do that. In fact, an LLM doesn't define anything at all! LLMs didn't overcome the limitation of parsing: they flipped it around: an LLM can never be literal. Instead, it must always be literary. Let me explain what I mean by that:

To construct an LLM, we start with a training corpus: lots of text. That text goes through a single traditional parsing step: tokenization. This isn't strictly necessary, but it's more efficient, and the rest of the process is anything but. Unlike traditional parsers, tokens are intentionally misaligned with grammar: words are split into separate tokens, like "run,ning".
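
For what it's worth, the tiktoken package (which exposes the GPT tokenizers; cl100k_base is the GPT-3.5/4-era encoding) shows this kind of split directly:

    # Sketch assuming `pip install tiktoken`.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ["running", "unambiguously", "I'm afraid the cat was hit by a car."]:
        ids = enc.encode(text)
        pieces = [enc.decode([i]) for i in ids]
        print(repr(text), "->", pieces)
    # Common words often survive as a single token, while rarer words split into
    # sub-word pieces that don't line up with prefixes, stems, or suffixes.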

Now that we have tokens to work with, machine learning begins. A Neural Net is trained with them. The tokens are fed in order to the NN, and the result is a model.

We call that model a "Large Language Model", because we hope it contains the patterns language is made of. This is a mistake: the model is so much more interesting than that!

An LLM contains patterns that we can recognize as language grammar and patterns we don't understand at all! It didn't model language: it went one step higher on the ladder of abstraction, and modeled text.

There are many ways to write an idea into language, but we can only use one at a time. That decision is part of the data we feed into the LLM's training corpus. We can't write all of our ideas at once: we must do one at a time in order. All of that is data in the training corpus.

Parsers deal with grammar. Language Grammar is everything that can be written in a language. An LLM doesn't have a clue what could have been written: it only sees what was written. That's why the model must be "Large": without examples, a valid language pattern doesn't exist.

This is where the nature of ambiguity intersects with the nature of LLMs: what example do we want? Can we choose?

When we give an LLM a prompt, it gives us a continuation. There were many valid possible continuations: how did it choose just one? It didn't. It isn't working in the realm of possible: it's working in the realm of known. The choice was made all the way back before the LLM was even trained: back when a person wrote an idea into language, into text, that would eventually be used in the training corpus. The content of the training corpus is what resolves ambiguity: not the language model itself.

LLMs don't model ambiguity or even language: they model the text they are trained with, ambiguity included. This is a fundamental feature.


Who cares?

As long as the illusion is good enough to be useful, then it doesn’t matter.

Just like the illusion of ChatGPT making it seem like the LLM is holding the context of the conversation.

AI is already smarter and faster than us in almost any way.

Yet so many just keep constantly moving the goalposts so they can feel safe by rejecting AI.

Let’s embrace it and use it to our advantage.


"Instant dismissal of anything challenging the LLM dogma" - check "Moving the goalposts" - check "AI is already better than humans in every way" - check

It's comical how rote and basic the defensive script always is.


That’s a very strange response. There are a ton of trivial reasons why it’s important for future natural language interfaces to understand ambiguity.

If I tell my friendly omnipotent AI friend “I hope I never hear from Dave again” there are several interpretations, one of which involves killing Dave. I’d definitely want the AI to do a little disambiguation before taking an action.


This is like complaining about someone trying to invent screws because we have nails already. Nails do the job way better than tying things together! Why worry about whether or not they're perfect for all jobs?



