Using vectors for retrieval augmentation seems really powerful at first, but in practice we've found it very finicky. How long should each chunk be? How should you split? Are vector embeddings actually a good space/distance for your domain?
One improvement is to use refinement (i.e. iterate over the relevant chunks, refining your answer as you go). This is more expensive and slower, but less lossy than stuffing one batch of vector-retrieved chunks into a single prompt. LlamaIndex does refinement well.
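Roughly, the refine pattern looks like this (a sketch only; LlamaIndex import paths and argument names have moved around between versions, so check against whatever release you're on):

    # Sketch of LlamaIndex's refine mode; exact imports/kwargs vary by version.
    from llama_index import SimpleDirectoryReader, VectorStoreIndex

    documents = SimpleDirectoryReader("./docs").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # "refine" walks the retrieved chunks one at a time, asking the LLM to
    # revise its previous answer in light of each new chunk.
    query_engine = index.as_query_engine(response_mode="refine", similarity_top_k=5)
    print(query_engine.query("Who was Benito Mussolini?"))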
Would be curious if people have found better patterns. We've been playing around with using one LLM chain to do the retrieval, then passing the retrieved chunks into the main LLM call. Seems to work better for certain contexts.
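The shape of that two-step pattern, with a hypothetical llm() callable standing in for whatever completion API you're actually using:

    # Sketch of the "LLM does the retrieval" pattern. llm() is a placeholder
    # for your chat/completion call; chunks is the corpus already split up.
    def retrieve_then_answer(question: str, chunks: list[str], llm) -> str:
        # Stage 1: one call picks the passages that matter.
        numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
        selection = llm(
            "Which of the numbered passages below are relevant to the question?\n"
            f"Question: {question}\n{numbered}\n"
            "Reply with the relevant passage numbers only, comma separated."
        )
        picked = [
            chunks[int(i)]
            for i in selection.split(",")
            if i.strip().isdigit() and int(i) < len(chunks)
        ]

        # Stage 2: the main call answers using only the selected passages.
        context = "\n---\n".join(picked)
        return llm(
            "Answer the question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )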
I posed a similar question to a postdoc at CSAIL: generative image models like DALL-E consistently screw up the number of eyes and fingers, so could you use a knowledge graph in conjunction with the drawing algorithm to imbue more realism into the generation? At the time they said they weren't aware of research in that direction. Fair enough. Still interested in seeing if these purely generative models can reference knowledge and apply it.
Human features such as faces and fingers are easy for us humans to parse because our attention is instantly drawn to them, and we are highly sensitive to "noise" in that particular domain. AI-generated content of a three-toed animal pictured with four toes, for example, might go unnoticed by a lot of people.
I unfortunately don't see how a vector database would be able to help with this. The knowledge bases they provide are meant to enable large-scale retrieval and vector search. One way to fix the problem you mention is perhaps to increase the proportion of features that draw human attention (e.g. faces) in the training dataset.
> faces and fingers are easy for us humans to parse because...
because we have dedicated neural hardware for faces -- the "fusiform face area", a relatively small volume of the brain which is used by non-autistic humans for facial recognition/processing. A lot of our sensitivity to human faces is neural structure that's in our DNA, and yes, that pairs with a lifelong obsession with human faces, spending huge amounts of energy learning more about them. But after infant brain development we're not relying on a blank slate for human faces.
> One case study of agnosia provided evidence that faces are processed in a special way. A patient known as C. K., who suffered brain damage as a result of a car accident, later developed object agnosia. He experienced great difficulty with basic-level object recognition, also extending to body parts, but performed very well at recognizing faces. A later study showed that C. K. was unable to recognize faces that were inverted or otherwise distorted, even in cases where they could easily be identified by normal subjects. This is taken as evidence that the fusiform face area is specialized for processing faces in a normal orientation.
It seems like generative image models can't count very well, and we notice whenever the number of something is supposed to be fixed. They also screw up the black keys on a piano.
Generally, the models seem to struggle with fine details that correlate with broader features. Each finger typically looks ok, it's the hand that's off... remote controls might have appropriately detailed buttons with a nonsense layout. I wonder if the training process can "weight" various visual features instead of optimizing loss equally over the whole image.
You're close, but there's something even more fundamentally wrong that goes overlooked: the noise model, namely the use of Gaussian per-pixel noise. The spectrum of the noise distribution gets in the way of exactly the fine details before the model has any chance to learn them. It has no chance to learn the layout of TV remote buttons because in the first iteration of the diffusion there's already noise at the same length scale as the buttons you want it to learn.
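A toy numpy sketch of that length-scale argument (my own illustration, not from the article): white Gaussian noise has equal power at every spatial frequency, while the image concentrates its power at low frequencies, so the fine-detail band is the first to drown.

    import numpy as np

    # Toy illustration: the standard diffusion forward process adds i.i.d.
    # (white) Gaussian noise, which has equal power at every spatial frequency.
    # A natural-ish image concentrates its power at low frequencies, so the
    # high-frequency band (button layouts, finger boundaries) disappears first.
    rng = np.random.default_rng(0)

    smooth = rng.standard_normal((64, 64)).cumsum(0).cumsum(1)
    smooth = (smooth - smooth.mean()) / smooth.std()      # coarse structure
    img = smooth + 0.15 * rng.standard_normal((64, 64))   # plus weak fine detail

    def band_power(a, lo, hi):
        """Mean power of a's FFT coefficients with spatial frequency in [lo, hi)."""
        f = np.fft.fft2(a)
        fy, fx = np.meshgrid(np.fft.fftfreq(64), np.fft.fftfreq(64), indexing="ij")
        r = np.hypot(fy, fx)
        return (np.abs(f[(lo <= r) & (r < hi)]) ** 2).mean()

    noise = 0.3 * rng.standard_normal(img.shape)          # one early-ish noise level
    for lo, hi, label in [(0.0, 0.1, "coarse layout"), (0.3, 0.5, "fine detail  ")]:
        ratio = band_power(img, lo, hi) / band_power(noise, lo, hi)
        print(f"{label} signal-to-noise power ratio: {ratio:.1f}")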
I recently watched a discussion with a neuroscientist who was explaining the left brain-right brain difference, as it's currently understood. He said the left hemisphere evolved to focus its attention on a small field of perception, mainly to find food. The right hemisphere evolved to detect threats from the entire field of perception.
The way we train LLMs, to my understanding, is to predict the next word. That's a narrow task, and in my interpretation is a left brain dominant task.
What can happen with the left brain is that it can become so focused on the task at hand, and so wedded to the mental models used to accomplish that task, that it loses touch with reality as broadly perceived by the right brain. The neuroscientist claimed schizophrenia is a disorder of too little inhibition of left brain inferences by the right brain. (I am taking him at his word on this.) The left brain's mental models lose touch with ground truth. This can result in hallucinations.
If I am correct that LLMs act like the left brain in that they are relying on models that they developed during training and focusing on a single task, absent sensory experience, then hallucination may actually be a good term.
I do think that the use of anthropomorphic terms is problematic because it suggests the same phenomenon, rather than an analogous phenomenon.
> The neuroscientist claimed schizophrenia is a disorder of too little inhibition of left brain inferences by the right brain.
This sounds very "bicameral mind"-y[1] to me, and it's worth noting that schizophrenia's causes haven't been pinned down accurately yet. There are a ton of hypotheses around how schizophrenia develops/works, but none are conclusive.
I bring it up because when you hear relatively simple explanations about the left and right brain and the tasks they're "assigned" or "designed" to process, or how their perceived differences contribute to mental illness, it should be taken with a grain of salt.
Similarly, I'd hesitate to compare LLMs or any NNs to actual human brains. Their similarities are superficial: beyond the very basic topology of NNs, which was inspired by certain types of neurons, the similarities end.
Heavyset_go wrote a good answer from one point of view; I would like to add another, namely that the word "hallucination" means quite different things for a human and an LLM. For a human, a hallucination roughly refers to a sensory experience of something that isn't there.
For an LLM it refers to making something up when answering a question. A closer phenomenon to that would be a false memory, but even that isn't quite the same thing.
The thing is that we want LLMs to learn patterns; we want them to generalize from their training data to a certain extent. Hallucinations are basically undesirable instances of this: the user expects to hear about something that does exist, but the answer they get is about something the LLM has extrapolated might exist.
Outside of the title, "hallucination" is used only once in the whole article. Further, it is now a widely accepted term of art describing a specific behavior exhibited by an LLM, separate from the biological one.
It seems like anytime someone brings up these models being "anthropomorphized", they have nothing relevant to say but feel the need to say something, and fall back on a tired platitude.
Well I think there's a contradiction of wants in having AI that's supposed to be the cutting edge of emulating us yet insisting that it shouldn't be anthropomorphized no matter how close it gets to emulating us.
When we give information we don't know to be true (which is what the model does when it builds low-confidence tokens on each other), we call it guessing. Why would this be any different?
Not sure, as I'm not yet familiar enough with the types of errors occurring. I do think a somewhat more accurate, possibly technical set of terms would be more useful for discussions about it than "hallucination", though.
These are token prediction models. Their token predictions are accurate insofar as the probabilities of those tokens in those sequences are representative of the kind of thing they were trained on.
Is it plausible that a document exists that explains that George Washington was actually an amateur magician? How is a GPT model to know that that is not the document it is predicting the next tokens for? Why wouldn't it explain that he used to perform for children's parties, and do tricks involving making doves and rabbits appear?
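As a toy illustration (the numbers here are made up): generation is just repeated sampling from a next-token distribution, and a merely plausible continuation gets probability mass right alongside the true ones.

    import random

    # Made-up next-token distribution after the prefix "George Washington was a".
    # The model only "knows" what is plausible in its training distribution,
    # not what is true; sampling will happily pick either kind of continuation.
    next_token_probs = {
        "statesman": 0.40,
        "general":   0.35,
        "farmer":    0.15,
        "magician":  0.10,   # plausible-sounding but false continuations get mass too
    }

    tokens, weights = zip(*next_token_probs.items())
    print("George Washington was a", random.choices(tokens, weights=weights)[0])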
Exactly right. These are statistical models that operate at a completely different level of abstraction from the human brain: tokens/words/etc for text models, and individual (or perhaps small groups) of pixels for image models. Plausible output is obtained through sheer brute-force, and there is no mechanism to correct what we perceive as errors.
Yes, I agree now. "High-loss example", while not catchy, might be closer. Although it's a bit backwards, as you only think about creating the target once you sense the answer is not great.
1. Once you use the vector embeddings to grab the most relevant chunks, are you just injecting the actual 400 (in this example) token prose text snippets into the LLM query? So under the hood does that query from the article end up as something like "Who was Benito Mussolini? Please use the following texts to inform your answer: [snippet 1, snippet 2, snippet 3]"?
2. I understand the use case for knowledge that isn't cooked into the LLM because it's too recent etc., but I wonder about using it with historic knowledge. I assume all(?) LLMs would have used Wikipedia for training and would therefore already have this Mussolini information from those same articles, so what's the point of priming it with duplicate "external" information? Would that really improve accuracy?
This is just "citing the sources" of the retrieved documents or am I missing something? I was expecting to read some novel approach like Hypothetical Document Embeddings.
1. Chunkify the knowledge database into smallish chunks,
2. When you get a question, find the chunks most similar to the question
3. Prompt the LLM to use [the found chunks] to answer [the question] (a rough sketch below)
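Something like this, with embed() and llm() as stand-ins for whatever embedding and completion APIs you're using (a sketch, not any particular library):

    import numpy as np

    # Sketch of steps 1-3. embed() and llm() are placeholders, not a real library.
    def answer(question: str, documents: list[str], embed, llm, top_k: int = 3) -> str:
        # 1. Chunkify the knowledge base (naively, by fixed character windows).
        chunks = [d[i:i + 1200] for d in documents for i in range(0, len(d), 1200)]

        # 2. Find the chunks most similar to the question (cosine similarity).
        C = np.array([embed(c) for c in chunks])
        q = np.array(embed(question))
        sims = C @ q / (np.linalg.norm(C, axis=1) * np.linalg.norm(q) + 1e-9)
        best = [chunks[i] for i in np.argsort(-sims)[:top_k]]

        # 3. Prompt the LLM to use the found chunks to answer the question.
        context = "\n\n".join(best)
        return llm(f"Use the following texts to answer.\n\n{context}\n\nQuestion: {question}")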
If you've ever posted a question to Stack Overflow or a similar site, think about the noise reduction “these posts may be relevant to you” feature. When you're actually asking an interesting and/or open ended question, you get a looooot of false positives. That's in the nature of text similarity search, it really hinges on strong nouns and verbs to anchor the discussion. It's no coincidence that the examples given here are uncommon names like Mussolini and Ketanji.
So while this is not terribly useful, it is interesting that it might severely reduce the hallucination rate, which is a sort of false positive rate. Does it only reduce the false positives for facts that it "knows" via the external database? The really interesting prompts and responses, where you ask it about figures who do not exist or data too new for it to have and see whether it still hallucinates, would have been very nice to see in this blog post. Missed opportunity.
Milvus (https://milvus.io) and Vespa (https://vespa.ai) are great choices if you're looking for hardened, scalable, and production-ready vector databases.
We (Milvus) also have `milvus-lite` if you'd like something pip installable:
python3 -m pip install milvus
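A minimal sketch of what that looks like in use (the exact attribute names here are an assumption, so check the milvus-lite docs):

    # Sketch of the embedded-server workflow; attribute names are assumptions,
    # see the milvus-lite docs for the exact API.
    from milvus import default_server
    from pymilvus import connections, utility

    default_server.start()                  # embedded Milvus, no Docker needed
    connections.connect(host="127.0.0.1", port=default_server.listen_port)
    print(utility.get_server_version())
    default_server.stop()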
Not sure which one, but if you are after a vector search engine (not just a database) then I can recommend this: https://github.com/marqo-ai/marqo. Includes inference, transformations, schemas, multi-modal search, multi-modal queries, multi-modal representations, text chunking and more.
Are there any memory-based projects on GitHub that I can try out myself without breaking the bank? Let's say I want to load up a documentation and simply do some queries with it and see how well it performs.
We tried a few different versions of this over a recent hackathon; results weren't great. Though maybe that says more about the documents that we trained it on than the approach!