Fixing Hallucination with Knowledge Bases (pinecone.io)
134 points by gk1 on May 4, 2023 | 53 comments



Using vectors for retrieval augmentation seems really powerful at first, but in practice we've found it very finicky. How long should each chunk be? How should you split? Are vector embeddings actually a good space/distance for your domain?

One improvement is to use refinement (i.e., iterate over many relevant chunks, refining your answer as you go). This is more expensive and slower, but less lossy than a single vector lookup. LlamaIndex does refinement well.
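
A minimal sketch of that refine loop (my own illustration, not LlamaIndex's actual code; `llm` is a hypothetical stand-in for whatever completion call you use):

    # Hedged sketch of the "refine" pattern: visit each relevant chunk in turn
    # and ask the model to improve its previous answer. `llm` is a hypothetical
    # callable wrapping your LLM; `chunks` is the list of retrieved chunks.
    def refine_answer(question, chunks, llm):
        answer = llm(f"Context:\n{chunks[0]}\n\nAnswer the question: {question}")
        for chunk in chunks[1:]:
            answer = llm(
                f"Question: {question}\n"
                f"Existing answer: {answer}\n"
                f"New context:\n{chunk}\n"
                "Refine the existing answer using the new context. "
                "If the context is irrelevant, return the existing answer unchanged."
            )
        return answer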

Would be curious if people have found better patterns. We've been playing around with using one LLM chain to do the retrieval, then passing the retrieved chunks into the main LLM call. Seems to work better for certain contexts.


There's the REMO framework, from daveshap, but I've yet to try that algorithm.


I posed a similar question to a postdoc at CSAIL: generative image models like DALL-E consistently screw up the number of eyes and fingers, so could a knowledge graph be used in conjunction with the drawing algorithm to imbue more realism into the generation? At the time they said they weren't aware of research in that direction. Fair enough. I'm still interested in seeing whether these purely generative models can reference knowledge and apply it.


Human features such as faces and fingers are easy for us humans to parse because our attention is instantly drawn to them, and we are highly sensitive to "noise" in that particular domain. AI-generated content of, say, a three-toed animal pictured with four toes might slip past a lot of people.

I unfortunately don't see how a vector database would be able to help with this. The knowledge bases they provide are meant to enable large-scale retrieval and vector search. One way to fix the problem you mention might be to increase the proportion of features that draw human attention (e.g. faces) in the training dataset.


> faces and fingers are easy for us humans to parse because...

because we have dedicated neural hardware for faces: the "fusiform face area", a relatively small volume of the brain used by non-autistic humans for facial recognition and processing. A lot of our sensitivity to human faces is neural structure encoded in our DNA, and yes, that pairs with a lifelong obsession with human faces that spends huge amounts of energy learning more about them. But after infant brain development we're not relying on a blank slate for human faces.

> One case study of agnosia provided evidence that faces are processed in a special way. A patient known as C. K., who suffered brain damage as a result of a car accident, later developed object agnosia. He experienced great difficulty with basic-level object recognition, also extending to body parts, but performed very well at recognizing faces. A later study showed that C. K. was unable to recognize faces that were inverted or otherwise distorted, even in cases where they could easily be identified by normal subjects. This is taken as evidence that the fusiform face area is specialized for processing faces in a normal orientation.


Is it known what autistic people use it for? Different for each person?


It seems like generative image models can't count very well, and we notice whenever the number of something is supposed to be fixed. They also screw up the black keys on a piano.


Generally, the models seem to struggle with fine details that correlate with broader features. Each finger typically looks OK; it's the hand that's off. Remote controls might have appropriately detailed buttons in a nonsense layout. I wonder if the training process can "weight" various visual features instead of optimizing loss equally over the whole image.


You're close, but there's something even more fundamentally wrong that goes overlooked: the noise model, namely the use of additive Gaussian pixel noise. The spectrum of that noise gets in the way of the fine details before there's any chance to learn them. There's no chance for the model to learn the layout of TV remote buttons, because in the first iteration of the diffusion there's already noise at the same length scale as the buttons you want it to learn.
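
To make the objection concrete, here is a tiny sketch (my own illustration) of a DDPM-style forward noising step; the added noise is i.i.d. Gaussian per pixel, so its spectrum is flat and it touches the finest length scales from the very first step:

    import numpy as np

    # DDPM-style forward step: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps,
    # with eps ~ N(0, I) drawn independently per pixel. White noise has a flat
    # spectrum, so even an early step (abar_t close to 1) injects energy at the
    # same length scale as small features like remote-control buttons.
    x0 = np.random.rand(64, 64)     # stand-in for a training image
    abar_t = 0.95                   # cumulative noise-schedule value at an early step
    eps = np.random.randn(64, 64)   # per-pixel white Gaussian noise
    xt = np.sqrt(abar_t) * x0 + np.sqrt(1 - abar_t) * eps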


What is a finger?

What is two?

These categories don't exist.

Categories don't exist.


They don't exist but try navigating life without them.


I was being vague, but I meant those questions in the context of an ML model.

ML models are completely inference-based. They have zero symbolic definitions. They don't categorize subjects at all.

Because of that, you can't tell a model how many X to draw.


Rant: I think the use of the term "hallucination" in this context is entirely too anthropomorphic.


I recently watched a discussion with a neuroscientist who was explaining the left brain-right brain difference, as it's currently understood. He said the left hemisphere evolved to focus its attention on a small field of perception, mainly to find food. The right hemisphere evolved to detect threats from the entire field of perception.

The way we train LLMs, to my understanding, is to predict the next word. That's a narrow task, and in my interpretation a left-brain-dominant task.

What can happen with the left brain is that it can become so focused on the task at hand, and so wedded to the mental models used to accomplish that task, that it loses touch with reality as broadly perceived by the right brain. The neuroscientist claimed schizophrenia is a disorder of too little inhibition of left brain inferences by the right brain. (I am taking him at his word on this.) The left brain's mental models lose touch with ground truth. This can result in hallucinations.

If I am correct that LLMs act like the left brain in that they are relying on models that they developed during training and focusing on a single task, absent sensory experience, then hallucination may actually be a good term.

I do think that the use of anthropomorphic terms is problematic because it suggests the same phenomenon, rather than an analogous phenomenon.


> The neuroscientist claimed schizophrenia is a disorder of too little inhibition of left brain inferences by the right brain.

This sounds very "bicameral mind"-y[1] to me, and it's worth noting that schizophrenia's causes haven't been pinned down accurately yet. There are a ton of hypotheses around how schizophrenia develops/works, but none are conclusive.

I bring it up because when you hear relatively simple explanations about the left and right brain and the tasks they're "assigned" or "designed" to process, or how their perceived differences contribute to mental illness, it should be taken with a grain of salt.

Similarly, I'd hesitate to compare LLMs or any NNs to actual human brains. Their similarities are entirely superficial, and beyond the very basic topology of NNs that were inspired by specific types of neurons, the similarities end there.

[1] https://en.wikipedia.org/wiki/Bicameral_mentality#The_Origin...


Heavyset_go wrote a good answer from one point of view; I would like to add another: the word "hallucination" means quite different things for a human and for an LLM. For a human, a hallucination roughly refers to a sensory experience of something that isn't there.

For an LLM it refers to making something up when answering a question. A closer phenomenon to that would be a false memory, but even that isn't quite the same thing.

The thing is that we want LLMs to learn patterns; we want them to generalize from their training data to a certain extent. Hallucinations are basically undesirable instances of this. The user expects to hear about something that does exist, but the answer they get is instead about something the LLM has extrapolated might exist.


Outside of the title, "hallucination" is used only once in the whole article. Further, it is now a widely accepted term of art describing a specific behavior exhibited by an LLM, separate from the biological phenomenon.

It seems that anytime someone brings up these models being "anthropomorphized", they realize they have nothing relevant to say but feel the need to say something, and fall back on a tired platitude.


Would you like to compare levels of understanding?


Well, I think there's a contradiction of wants in having AI that's supposed to be the cutting edge of emulating us while insisting that it shouldn't be anthropomorphized, no matter how close it gets to emulating us.


These models are going to be anthropomorphized to death. It's great marketing.


Going to be?


I actually find it very reassuring that we still have no fucking clue as to what's going on - even on such a basic level.


What would you see as a better term/phrase?


Guessing.

When we give information we don't know to be true (which is what the model does when it builds low-confidence tokens on each other), we call it guessing. Why would this be any different?


Guessing has a very mild implication of thought or estimation from a model of understanding of the subject in question.

Hallucination, on the other hand, is evocative of a less reasonable/rational source.


That would actually be an educated guess or a guesstimate. A guess is as good as any.


Maybe strictly definitionally, but comparing the connotations of the two, I assume people don't actually truly randomly guess.


Isn't that anthropomorphic too? It's humans doing guessing, not objects.


confabulation


Not sure, as I'm not yet familiar enough with the types of errors occurring. I do think a somewhat more accurate, possibly technical set of terms would be more useful for discussions about it than "hallucination", though.


They're not errors.

These are token prediction models. Their token predictions are accurate insofar as the probabilities of those tokens in those sequences are representative of the kind of thing they were trained on.

Is it plausible that a document exists that explains that George Washington was actually an amateur magician? How is a GPT model to know that that is not the document it is predicting the next tokens for? Why wouldn't it explain that he used to perform for children's parties, and do tricks involving making doves and rabbits appear?


Exactly right. These are statistical models that operate at a completely different level of abstraction from the human brain: tokens/words/etc. for text models, and individual pixels (or perhaps small groups of them) for image models. Plausible output is obtained through sheer brute force, and there is no mechanism to correct what we perceive as errors.


Bullshitting

Or "high-loss example"


Bullshit implies active intent to deceive about your level of knowledge/understanding. In other words, it's even more anthropomorphic.


Yes, I agree now. "High-loss example", while not catchy, might be closer. Although it's a bit backwards, as you only think about creating the target once you sense the answer is not great.


Tripping


Bugs, faults, errors


Mistakes


I have a couple of questions:

1. Once you use the vector embeddings to grab the most relevant chunks, are you just injecting the actual 400-token (in this example) prose snippets into the LLM query? So under the hood, does the query from the article end up as something like "Who was Benito Mussolini? Please use the following texts to inform your answer: [snippet 1, snippet 2, snippet 3]"?

2. I understand the use case for knowledge that isn't cooked into the LLM because it's too recent etc., but I wonder about using it with historic knowledge. I assume all(?) LLMs would have used Wikipedia for training and would therefore already have this Mussolini information from those same articles, so what's the point of priming it with duplicate "external" information? Would that really improve accuracy?


This is just "citing the sources" of the retrieved documents or am I missing something? I was expecting to read some novel approach like Hypothetical Document Embeddings.


Yeah, it's not a novel approach. It's basically (see the sketch after the list):

1. Chunkify the knowledge database into smallish chunks,

2. When you get a question, find the chunks most similar to the question

3. Prompt the LLM to use [the found chunks] to answer [the question]
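
A minimal sketch of that loop (my own illustration; `embed` and `llm` are hypothetical stand-ins for your embedding model and completion call, and a real setup would use a vector database rather than a plain list):

    import numpy as np

    def build_index(documents, embed, chunk_size=400):
        # 1. Chunkify (by characters here for simplicity; real setups chunk by tokens)
        chunks = [doc[i:i + chunk_size]
                  for doc in documents
                  for i in range(0, len(doc), chunk_size)]
        return [(chunk, embed(chunk)) for chunk in chunks]

    def answer(question, index, embed, llm, k=3):
        # 2. Find the chunks most similar to the question (cosine similarity)
        q = embed(question)
        sim = lambda v: np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v))
        top = sorted(index, key=lambda cv: -sim(cv[1]))[:k]
        # 3. Prompt the LLM to use the found chunks to answer the question
        context = "\n---\n".join(chunk for chunk, _ in top)
        return llm(f"Use the following texts to answer the question.\n"
                   f"Texts:\n{context}\n\nQuestion: {question}")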

If you've ever posted a question to Stack Overflow or a similar site, think about the noise-reduction "these posts may be relevant to you" feature. When you're asking an actually interesting and/or open-ended question, you get a looooot of false positives. That's in the nature of text similarity search; it really hinges on strong nouns and verbs to anchor the discussion. It's no coincidence that the examples given here are uncommon names like Mussolini and Ketanji.

So while this is not terribly useful, it is interesting that it might severely reduce the hallucination rate, which is a sort of false positive rate. Does it only reduce the false positives for facts that it "knows" via the external database? The really interesting prompts and responses, where you ask it about figures who do not exist or data too new for it to have and see if it still hallucinates, would have been very nice to see in this blog post. Missed opportunity.


There was an open source vector database, meant to be an alternative to Pinecone, that was mentioned a few weeks back.

I can’t seem to find it. Does anyone recall the name?


pgvector for Postgres and vector search in Elasticsearch would be two ways to do it without adopting some new, untested product.


Milvus (https://milvus.io) and Vespa (https://vespa.ai) are great choices if you're looking for hardened, scalable, and production-ready vector databases.

We (Milvus) also have `milvus-lite` if you'd like something pip installable: python3 -m pip install milvus


Not sure which one, but if you are after a vector search engine (not just a database) then I can recommend this: https://github.com/marqo-ai/marqo. It includes inference, transformations, schemas, multi-modal search, multi-modal queries, multi-modal representations, text chunking and more.


This article also covers some other methods for hallucination and reference checking using cross-encoders: https://github.com/marqo-ai/marqo/blob/mainline/examples/GPT.... It is not perfect, but it can easily be modified to make it more robust.
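
For what it's worth, a minimal sketch of the cross-encoder idea (my own, using sentence-transformers; the model name and threshold are illustrative, and the linked example may differ in its details): score each generated statement against the retrieved passages and flag the ones nothing in the sources supports.

    from sentence_transformers import CrossEncoder

    # Illustrative model and threshold; not necessarily what the linked example uses.
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def flag_unsupported(sentences, passages, threshold=0.0):
        # A sentence whose best score against every retrieved passage is low
        # is a candidate hallucination.
        flagged = []
        for sentence in sentences:
            scores = model.predict([(sentence, passage) for passage in passages])
            if max(scores) < threshold:
                flagged.append(sentence)
        return flagged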


Potentially Chroma https://www.trychroma.com/ ?



Popular examples are Weaviate and Qdrant


Are there any memory-based projects on GitHub that I can try out myself without breaking the bank? Let's say I want to load up some documentation and simply do some queries against it to see how well it performs.


We tried a few different versions of this over a recent hackathon; results weren't great. Though maybe that says more about the documents that we trained it on than the approach!


The confabulation ship probably sailed, but maybe we can still turn it around?


Nobody is trying to use Cyc?



