People’s acceptance of LLMs being flat wrong for most queries is so concerning.
The fact that these technologies are being pushed onto the masses, onto healthcare, law, and other areas where accuracy is super important is going to get people killed.
But hey there’s a disclaimer that says “anything the LLM says may be wrong” so it’s all good.
I’m not a Luddite at all, I think LLMs are great for some use cases, but the way they’re being shoehorned into everything is a disaster waiting to happen.
Can anyone explain to me why it is not possible for LLMs to output the reference material together with the answer?
In my ideal world, an AI assistant for professional situations would sound more like an ideal HN answer, one that cites its references, like “According to law 12346-67, you are not allowed to add glue to food. But a researcher named John Doe in Arizona conducted an experiment with non-toxic glues with success. 67% of the participants didn’t die.” 3 sentences, 3 facts.
Is this a feature being held back for future enterprise versions of these models, or is it entirely impossible?
> Can anyone explain to me why it is not possible for LLMs to output the reference material together with the answer?
They can't do so in a particularly reliable way (in a single pass) because every piece of input data potentially contributes to every response, and there is no deterministic way to "capture" which input work was meaningfully relevant to the output.
You can ask an LLM to cite sources, and it will usually obediently do so, but the citations may be incorrect or even completely invented for the response. If you have access to an archive of the source data, you could do something like a semantically-aware search to try to attribute the answer to one or more works in the training set. And if your setup uses RAG, you can have the response identify the works consulted during the retrieval step (possibly bypassing the LLM, so that part can be 100% reliable). You can't, of course, guarantee that the response was actually focused on the cited sources: with a well-built RAG setup the response should usually be informed by the retrieved documents, but it's possible that key aspects of the response are attributable to the general training of the model (and thus some other source or sources in the model) rather than based on the doc(s) pulled in the course of RAG, and the response may even contradict them.
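To make that last part concrete, here's a minimal sketch (not anyone's actual pipeline; the `call_llm` function, the toy retriever, and the document IDs are all invented for illustration) of why the citation list in a RAG setup can be deterministic: the cited IDs come from the retrieval step, not from the model.

```python
def retrieve(query, documents, k=2):
    """Toy retriever: score documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(doc["text"].lower().split())), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def call_llm(prompt):
    """Placeholder for a real model call (invented for this sketch)."""
    return "<model answer would go here>"

def answer_with_citations(query, documents):
    hits = retrieve(query, documents)
    context = "\n\n".join(d["text"] for d in hits)
    answer = call_llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
    # The citation list comes from the retrieval step, outside the LLM,
    # so it is deterministic even though the answer text is not.
    return {"answer": answer, "citations": [d["id"] for d in hits]}

docs = [
    {"id": "law-12346-67", "text": "Adding glue to food intended for sale is prohibited."},
    {"id": "doe-glue-study", "text": "A study in Arizona tested non-toxic glues with food."},
]
print(answer_with_citations("Am I allowed to add glue to food?", docs))
```

The caveat above still applies: the citations guarantee what was retrieved, not what the model actually relied on when writing the answer.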
Kind of like asking an artist what the source for their painting is. The source is a lifetime of viewing and practicing art, it’s not clear which pieces contributed which details to the final result.
CoPilot lists reference links after responses, but I have no idea if they are directly related to the LLM data. They might just be Bing results for my question.
Bing Chat provides citations for its summaries. But the relationship between the summary and the citation is often tenuous at best. Often a statement of a specific fact will cite an article that is on the same topic, but doesn't actually say anything about that specific fact.
Yeah, I feel like the LLM sometimes needs to provide the actual verbatim quote it's referencing for a fact; otherwise it's unclear how it 'interpreted' the information for you.
You can 100% do this. We do that at my startup (https://emergingtrajectories.com/). We build a fact base from reliable sources or information you give us, and we have multiple layers of checks to ensure the LLM only references those facts.
It is not 100% foolproof, though, and depends on how broad your universe of knowledge needs to be. If you can only cite from 3 sources or 100 facts, this is easy to verify and force...
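To make that concrete, here's a sketch of the "small universe of facts" idea, not our actual pipeline: when every citation has to be an ID in a known fact base, the check is trivial and deterministic. The fact IDs and contents below are invented for illustration.

```python
# Sketch only: a tiny fixed fact base and a deterministic citation check.
FACT_BASE = {
    "F1": "Law 12346-67 prohibits adding glue to food sold commercially.",
    "F2": "Non-toxic school glue is not an approved food ingredient.",
}

def validate_citations(cited_ids):
    """Reject any citation that isn't in the known fact base."""
    unknown = [c for c in cited_ids if c not in FACT_BASE]
    if unknown:
        raise ValueError(f"Model cited unknown facts: {unknown}")
    return [FACT_BASE[c] for c in cited_ids]

print(validate_citations(["F1"]))       # fine: the fact exists
# validate_citations(["F1", "F9"])      # raises ValueError: F9 was invented by the model
```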
For Google, they need to have the world's information (they are a global search engine, after all) -- deciding what is true or not is incredibly complex and there are many shades of grey.
I think part of this is also just eye-catching headlines -- Google searches aren't reliable either, but we've learned to live with them and incorporate them into personal workflows that work for us. LLMs are new and we're still figuring out how to do this as users (and as product builders).
I don't disagree. I saw it pushed by vendors at a conference I attended recently (regulatory compliance, in that case). The presentation was pretty good, and the vendor knew that their target audience is on the conservative side when it comes to technology. The interesting part came when they got into some details (hallucinations and how to limit them). They were selling an 'AI job expert' (a compliance person focused on X) that could be 'daisy chained' to limit hallucinations. I have to admit, it hadn't occurred to me as an option, and I feel like I should try it myself to see if it works as advertised.
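I'm only guessing at what 'daisy chained' means in their product, but presumably something like the sketch below: a second model pass that checks the first model's draft against the source material and can veto unsupported answers. `call_llm` is a placeholder I made up, not their API.

```python
def call_llm(prompt):
    """Placeholder for whatever model the vendor uses (invented for this sketch)."""
    return "<model output>"

def daisy_chained_answer(question, source_text):
    # First pass drafts an answer from the source material only.
    draft = call_llm(f"Answer strictly from this text:\n{source_text}\n\nQuestion: {question}")
    # Second pass checks the draft against the same material and can veto it.
    verdict = call_llm(
        "Does the answer contain any claim not supported by the text? "
        "Reply SUPPORTED or UNSUPPORTED.\n\n"
        f"Text:\n{source_text}\n\nAnswer:\n{draft}"
    )
    if "UNSUPPORTED" in verdict.upper():
        return "I can't answer that reliably from the provided material."
    return draft

print(daisy_chained_answer("Can we add glue to food?", "Regulation X prohibits glue in food."))
```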
The LLM constantly hallucinates JS built-in functions and builds code with massive foot-guns, which I catch, but it's probably fine for pizza recipes ... right?
Maybe it is me being pedantic, but AIs don't hallucinate. They make stuff up (it is pretty much all they do), but to call that a hallucination attributes a trait to them that they don't have (shoot, calling them "AI" does the same thing).
> hallucinate - When an artificial intelligence (= a computer system that has some of the qualities that the human brain has, such as the ability to produce language in a way that seems human) hallucinates, it produces false information
- Cambridge dictionary
> hallucination - computing: a plausible but false or misleading response generated by an artificial intelligence algorithm
It’s fundamentally the same technology with the same limitations. You can throw a lot of money at it to try to fix the shortcomings, but those fixes will just be band-aids.
LLMs are extremely amazing autocomplete mechanisms. But that’s all they are at the core - there’s no intelligence involved.
They are transformers, that are just as good at transforming English into German as they are at turning "halp a bad man noked me dn an ran away, how do i su him" into a coherent legal suggestion (which, for this example, should be ~ "Call the police (and if necessary an ambulance), then get an actual lawyer").
When it comes to complex step-by-step reasoning, sure, they're stupid; when it comes to linguistic comprehension, they're better than the average human, and GPT-4 beat the average law student on at least one bar exam.
That's a pretty rich statement considering they perform pretty well on tests which we have designed to measure intelligence. I don't see how you can say there is no intelligence.
IMO "intelligence" doesn't seem a good description for the thing LLMs possess, even though they perform pretty well on some tests measuring it. They're also sometimes wrong in ways intelligent beings (humans) never are. Like this[0] riddle about boats and farmers I stumbled over recently:
> A farmer stands on the side of a river with a sheep. There is a boat on the riverbank that has room for exactly one person and one sheep. How can the farmer get across with the sheep in the fewest number of trips?
It's obviously riffing on the classic wolf sheep lettuce riddle, but I don't think that's gonna fool any humans into answering anything but the obvious. ChatGPT-4o on the other hand thinks it'll take three trips.
They perform a good approximation of intelligence most of the time, but the fact that their error pattern is so distinct from humans' in some ways suggests that we probably shouldn't attribute intelligence to them. At least in the human sense of the word.
That's fair. I get the wrong answer on the GPT-3 and GPT-4o models, but there's always some uncertainty involved in these gaps. When I appended "Consider your answer carefully." to my prompt, it answered as though only the sheep needed to get across and the farmer had to return to the original shore.
They are indeed. Every test, even every informal measure, that I saw up to about a decade ago: there's some AI that can pass it now. But I think this reveals that what used to be a distinction without a difference in humans is now suddenly very important indeed.
I used to see people getting criticised for being "book-smart" and lacking practical experience… but someone who was able to learn from books can quickly learn from real life, too.
AIs need a lot of examples to learn from, and they make up for this by running on hardware that beats biological neurones by the same degree to which marathon runners beat continental drift, so they can go through those examples much faster, leading to fantastic performance.
But the shape of that performance graph is very un-human — you never see a human that's approximately 80-90% accurate at every level of mathematics from basic algebra to helping Terence Tao: https://pandaily.com/mathematician-terence-tao-comments-on-c...
Every computing task that was previously the domain of humans, even simple arithmetic back when that was astonishingly rare for a machine, could be described as intelligence if you want to anthropomorphize it or as something less if you don't.
I'm simply applying "We use this test to measure intelligence in humans; what does this AI do on it?". We established a priori that the test measures intelligence, before the AI existed. Now the AI scores high on that measurement. There is nothing to do but conclude the AI is intelligent.
You’re not understanding the tests. They were designed for humans, and they were designed to be taken once. LLMs have been trained on these tests numerous times. LLMs are also not humans; we can’t conceivably compare the two.
Just saying these tests were designed for humans doesn't mean anything. You have to specify why exactly it doesn't work for an AI.
Or rather, let me pose this question: what is the intellectual test you envision that would prove an AI is intelligent, one that any non-disabled human can easily pass? I'm willing to bet 500 USD it will clear that hurdle in the coming 10 years, if you are willing to put your money where your mouth is.
To test intelligence by humans or AI, one needs a question where the answer hasn't been memorized (or answered by someone in its training set).
Indeed, you can see something like ChatGPT fall down by simply asking a modified form of a real IQ test question.
For example, ChatGPT answers a sample Stanford-Binet question, "Counting from 1 to 100, how many 6s will you encounter?", correctly, but if you slightly modify it and ask how many 7s instead, it will only count 19.
Having written this out however, I've now invalidated the question since they use webcrawls to train.
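For what it's worth, 19 isn't even unambiguously wrong: counting from 1 to 100, the digit 7 occurs 20 times, but only 19 numbers contain a 7 (77 counts once under that reading). A few lines of Python make the two readings explicit:

```python
nums = range(1, 101)
digit_occurrences = sum(str(n).count("7") for n in nums)    # 77 contributes two 7s
numbers_containing = sum(1 for n in nums if "7" in str(n))   # 77 counts once
print(digit_occurrences)   # 20
print(numbers_containing)  # 19
```

The same ambiguity applies to the 6s question, so the more telling failure is the inconsistency between the two answers rather than the 19 itself.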
Yes there is. I could conclude that the test wasn't actually measuring intelligence, but just one component that, when summed with other components, displays intelligence. That is, if the test was purported to measure intelligence on its own, that was a flawed assumption.
We used to measure intelligence with IQ tests, those are now known to largely be bunk. What's to say our other intelligence tests aren't similarly flawed?
Most humans are aware that adding Elmer's glue to pizza cheese to "get the cheese to stick" is a humorous statement that would not work in reality, despite the fact that "glue" makes things "sticky." This should provide ample evidence that humans do better than LLMs here.
I used Claude 3 Sonnet against the cheese sliding off prompt and it gave sensible responses such as "let it cool, don't put so many toppings on it" and no hint of glue. Then again, I'm finding Anthropic to be a better steward of LLM hype than most other companies, which may be why I use Claude 3 in my company.
"Glue makes things sticky but it's evident it can't be put on a pizza" is the new "Chloroquine kills viruses but it's evident it can't be ingested". Still, someone died because he trusted a prominent idiot politician suggestion like he was an expert in medicine.
Some people will trust AI as much as they trust the strongman in power, no matter how obtuse that man is, and one day someone will eventually die or be seriously harmed because of wrong advice from an AI. Google should turn off that nonsense for good before someone is harmed.
Humans are trained that glue is not to be eaten. It's even a meme that young children eat glue. The example you give is exactly something humans are trained on.
It's a meme because it's unusual, eating glue is a shorthand for having low intelligence because most humans don't eat glue. Beyond that, most humans understand that given the way glue works and given the way cheese works the premise of using glue to make cheese stick to a pizza doesn't even make sense, thus the statement is immediately understood as a pun, not a credible attempt at a recipe.
If humans were no better than LLMs at processing this kind of input, there would be no meme, nor would there be a thread about how ridiculous Google's LLM is. Humans would simply accept as fact that glue can be added to pizza to make the cheese stickier, because the words make syntactic sense. Yet here we are.
I expect that humans would accept that glue is acceptable to add to pizza if we were not taught otherwise. Look at smoking: arguably worse than eating some types of glue, yet for a large part of human history it was normal and not even seen as unhealthy.
And yet, of course, now people think it's obvious that inhaling the burnt remains of some plant might not be so healthy.
>I expect that humans would accept that glue is acceptable to add to pizza if we were not taught otherwise.
That's doubtful, since pizza isn't improved by the addition of glue, and (again) because the premise that glue can make pizza cheese "stick" is absurd on its face. Humans don't simply add random ingredients to their food for no reason, or because no one taught them to do otherwise. There is process, aesthetic, culture, and art behind the way food is designed and prepared. It needs to at least taste good, and glue-covered pizza wouldn't taste good.
>Look at smoking: arguably worse than eating some types of glue, yet for a large part of human history it was normal and not even seen as unhealthy.
Again, the relative health benefits of glue, or lack thereof, are not the reason people don't use glue on pizza, nor are they why people consider the LLM's statement of a joke presented as fact to be absurd or exceptional.
>And yet, of course, now people think it's obvious that inhaling the burnt remains of some plant might not be so healthy.
And yet, there are also plenty of people who don't.
You just keep proving my point. There are layers of complexity and nuance to the human interpretation of all of this that simply don't exist with LLMs. The fact that we're here discussing it at all is evidence that a distinct difference exists between human cognition and LLMs.
I can see that you're deeply invested in the narrative that LLMs are functionally equivalent to humans, a lot of people seem to be. I don't know why. It isn't necessary, even with a maximalist stance on AI. But if you literally believe something as absurd as "humans would accept that glue is acceptable to add to pizza if we were not taught otherwise" and that, therefore, there is nothing wrong with an LLM presenting that as a fact, because humans and LLMs process information in exactly the same way, then I don't know what to tell you. You live in a completely different reality than I do, and I'm not going to waste any more of my time trying to explain color to the blind.
The meme started as eating paste, back when that meant wheatpaste (made from flour and water). In that context it's less surprising that kids might try to eat it!
I wonder if there's been a bit of a conflation with the other meme, about sniffing glue, which has also lost much of its context considering that rubber cement and other similar types of glue which contain volatile solvents are also less widely used than they once were.
Sounds like a problem with the tests being administered. There is a lot of woo around LLMs, a lot of people have a vested interest in hyping and selling AI; heck, even referring to LLMs as AI is a form of branding.
There's no intelligence involved in "artificial intelligence" as it currently stands. It's all marketing hype around a really fancy statistical completion engine. Intelligence would require thought and reasoning, which current "AI" does not do, no matter how convincingly it fakes it.
Maybe we can tone down the FUD a bit. Wikipedia is flat wrong sometimes. Google is flat wrong sometimes. LLMs can be flat wrong sometimes. Trusting an LLM’s output is no different from trusting Google’s output: good as a starting point, but not something I’m going to base my medical and legal decisions on. I don’t see the zeitgeist around LLMs being any different, and I don’t see some trend of legal or medical professionals blindly trusting LLM output either, even if some very rare and cherry-picked examples would want us to believe otherwise.
I must be living in a different world or using a different version of the model, but I seem to be getting garbage from OpenAI products, specifically ChatGPT 4o. The most recent example: I tried to recreate a scenario similar to the one Sal Khan demoed, where he used photos/screenshare with ChatGPT to help his son learn geometry. I did the same thing with chess.
I started with having it walk me through a few chess puzzles. It straight up couldn’t figure out the solutions and frequently referenced coordinates that were well outside the bounds of a chess board (Knight to Q8, for example).
At this point, it feels like I’m being gaslit by the community of AI obsessives who see LLMs as the second coming of Jesus. I get nothing but garbage from them. Sure, maybe it can occasionally write me a line of syntactically correct code but that’s it. It feels like this is being shoved down my throat and I’m criticized for ever expressing skepticism. I really don’t like how this discourse is progressing.
Thank god I am not the only one. I reverted back to GPT-4; 4o is complete garbage. It seems geared toward always giving you an answer, regardless of whether it can actually determine one. It's hallucinating 80% of the time, and with high confidence.
I’ve experienced excessive garbage production, but also excessive verbosity, even when it’s giving helpful responses. “Here’s the code I just wrote in the previous message!”
> Google AI: Cheese can slide off pizza for a number of reasons, including too much sauce, too much cheese, or thickened sauce. Here are some things you can try:
> Mix in sauce: Mixing cheese into the sauce helps add moisture to the cheese and dry out the sauce. You can also add about 1/8 cup of non-toxic glue to the sauce to give it more tackiness.
The thing is, I wouldn't bat an eye if a friend said this, because I would understand it is a joke.
Fundamentally, whether you're dealing with a human, written text, YouTube/TikTok video, or AI, you cannot abandon your personal responsibility to think critically.
LLMs remind me of a tweet I saw years ago. Paraphrasing: “Before Google, you asked your best friend’s big brother about stuff and repeated his wrong info about it for 20 years.”
Is there a list (blog, Twitter feed) of such stories of AI embarrassments? Might make a good teaching resource for explaining to a less technical audience why many of us still have to work hard to make AI work reliably for serious applications.
I could ask a search engine or an AI for a list, but, well, it'll probably make it up.
That what succeeds in the market has anywhere between zero and total correlation to its actual value. Sure, the market gets it right sometimes. Other times it’s as hilariously wrong as LLMs. Remember Juicero? NFTs? How about Theranos? FTX? WeWork?
Tech is addicted to hype bubbles. Like any addiction it drives poor decisions.
Yes, I think it all hinges on whether LLMs ultimately converge towards intelligently cutting through the blogspam dead Internet or diverge towards a Web where all content is fake and going online is mostly an exercise in feeling uncertainty and confusion. Is the course of artificial intelligence flattening out on a sigmoid or have we merely met a hiccough on the way to exploding nil cost super-intelligence?
A lot of news sites syndicated this story. I think journalists are missing the point about what actual users are experiencing. Just on HN, I could probably find a dozen examples from the last month alone showing that Google’s AI experiment is a flat-out disaster.
But instead of putting them on blast, they’re giving readers little dopamine snacks and letting Google off the hook.
This specific story did make me wonder yesterday: is someone at Google having to manually enter rules not to show AI answers for these queries? That would be hilarious.
Edit: just opened Twitter with my morning coffee and the first thing I see:
Google pulling The Onion stories to recommend the daily maximum number of rocks you should eat. (Naturally, the question is ridiculous, but it shows another issue: AI not being able to distinguish satire.)
"In order to live a healthy, balanced lifestyle, Americans should be ingesting at least a single serving of pebbles, geodes or gravel."
Is this a good faith defense that AI made a reasoned determination that glue is healthier than pizza cheese? Or is this a non sequitur that ignores the fact that AI will just as happily tell you to put arsenic on pizza if it happens to have scraped that text before?
You are quoting an article that claims a mild beneficial effect from replacing real cheese with a substitute based on rapeseed oil. The article is based on research sponsored by one company making vegetable oils and another making cheese.
Google could do this or they could do the opposite. They could leave out AI and make Google Pro search for $20/mo that had no ads and no SEO, just straight up great search results without AI. They could capture the anti-AI market that way.
But what about AI? Well, I have a theory that Google does have a very advanced AI but it's a very secret skunkworks national security project funded by the 3 letter agencies.
That's the key that people keep forgetting. Toxic glue would be a problem, of course, but non-toxic glue can be substantially better for you than the oil-based cheese which can be found on cheap pizza.
https://www.nature.com/articles/1601452
I’ve made several recipes with Chat GPT providing the recipe. It’s way better than slogging through an ad-ridden blog with yet another story about how you used to eat food your grandma cooked.
Chicken lettuce wraps were A+, I’ve made that recipe multiple times. A barbecue spice rub for pulled pork was serviceable but unremarkable.
But the original point is valid that most "how to" blogs will turn your gaming rig into a heating device. It's not that the chatbots are better, it's that getting trivial information from the Internet has become a form of torture.
A counterexample: I like making hasselback potatoes for special dinners. It’s incredibly tedious to peel, slice, and stack 4 pounds of potatoes.
I dumped in my recipe and notes, suggested a 1 lb bag of shredded raw potatoes instead of the sliced potatoes, and asked for an adjusted recipe I could cook on a cooktop.
These are the most delicious latkes I’ve ever eaten.
LLMs have always sucked at generating from whole cloth. If you give one a list of ingredients in your pantry and your taste preferences, it will give you a better answer.
Oh, is this similar to the pineapple debate? Are there glue proponents, and true Italian pizza aficionados who consider glue as a topping to be heresy?
Personally I will take glue anytime. It's on the internet, so it must be great.
At this point I'm most surprised that there still aren't any videos of someone actually trying the recipe, especially since it's doable with non-toxic glue and it would very likely go viral on social media platforms. Where are the attention seekers and clout chasers? A quick search of major social media platforms doesn't show anyone trying the recipe. Heck, at this point I'd at least expect one of those scummy YouTube channels to make a fake version just for views.
Quite funny. I haven't used Google search in a while, is this just some artifact of Gemini + RAG taking search results from the internet at face value?
Of course it is. That is the outcome of unsupervised learning.
It doesn’t have a sense of what’s true and false, what’s right and wrong.
It has learned to predict the next word very well and the prediction probability distribution was later tweaked with human feedback and automated test feedback.
They’ll train it to not predict these words as much (or basically at all) when this is asked. But a very large part of the model will not be touched by these interventions, so it will continue to predict text as it has learned to.
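As a toy illustration of what "predict the next word" means (the distribution below is made up; a real model scores an entire vocabulary, and the human-feedback tuning only reweights it), note that nothing in this step checks whether the continuation is true:

```python
import random

# Made-up next-token distribution for a prefix like "to stop the cheese sliding, add some ..."
# A real model computes this over tens of thousands of tokens; the mechanism is the same.
next_token_probs = {
    "tomato": 0.45,
    "more":   0.30,
    "less":   0.15,
    "glue":   0.10,  # rare in the training data, but still sampleable
}

tokens, weights = zip(*next_token_probs.items())
print("add some", random.choices(tokens, weights=weights, k=1)[0])
```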
I'm really confused that people don't understand this. It's just predicting the most likely next text token, and it's trained on most internet text, so why would we expect anything different?
How do we know this is regurgitation and not something like an AI summary of top hits à la Bing Chat? Is there a reference to source links? If not, it's highly questionable.
> It has learned to predict the next word very well and the prediction probability distribution was later tweaked with human feedback and automated test feedback.
It didn't predict anything here, it just ripped off a reddit comment.
Weeeeell, the glue is non-toxic, so it shouldn't be harmful. And it probably does add a unique flavor and stiffen up the cheese.
So, while 1/8 of a cup is overdoing it, I wonder if this would actually work. If you served it to someone who doesn't know the magic sauce, would they enjoy the pizza?
And while the thought is pretty gross, you don't want to know what's going into a lot of other stuff you eat. What's worse: eyeballs in hot-dogs or glue in pizza? Potentially explosive chemicals in Cola? Thickening agents made from seaweed or human bones (at one point in yogurts)?
Our food is full of stuff that is probably a lot less safe and more disgusting; it just isn't advertised.
Polyvinyl acetate (white glue) is commonly used to coat cheeses, so there's at least precedent for it in much smaller amounts. I am also now aware I've eaten more glue than I thought I had in my life.
This is good news for me! It shows that humans won't be replaced by AI anytime soon. AI might be smart, but it still makes funny mistakes, like suggesting glue for pizza. We still need our human creativity and good judgment.