Using vectors for retrieval augmentation seems really powerful at first, but in practice we've found it very finicky. How long should each chunk be? How should you split? Are vector embeddings actually a good space/distance for your domain?
One improvement is to use refinement (i.e. iterate over the relevant chunks, refining your answer as you go). This is more expensive and slower, but less lossy than stuffing one batch of vector-retrieved chunks into a single prompt. LlamaIndex does refinement well.
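Roughly, the refine pattern looks like this (a sketch only; LlamaIndex import paths and argument names have moved around between versions, so check against whatever release you're on):

    # Sketch of LlamaIndex's refine mode; exact imports/kwargs vary by version.
    from llama_index import SimpleDirectoryReader, VectorStoreIndex

    documents = SimpleDirectoryReader("./docs").load_data()
    index = VectorStoreIndex.from_documents(documents)

    # "refine" walks the retrieved chunks one at a time, asking the LLM to
    # revise its previous answer in light of each new chunk.
    query_engine = index.as_query_engine(response_mode="refine", similarity_top_k=5)
    print(query_engine.query("Who was Benito Mussolini?"))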
Would be curious if people have found better patterns. We've been playing around with using one LLM chain to do the retrieval, then passing the retrieved chunks into the main LLM call. Seems to work better for certain contexts.
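The shape of that two-step pattern, with a hypothetical llm() callable standing in for whatever completion API you're actually using:

    # Sketch of the "LLM does the retrieval" pattern. llm() is a placeholder
    # for your chat/completion call; chunks is the corpus already split up.
    def retrieve_then_answer(question: str, chunks: list[str], llm) -> str:
        # Stage 1: one call picks the passages that matter.
        numbered = "\n".join(f"[{i}] {c}" for i, c in enumerate(chunks))
        selection = llm(
            "Which of the numbered passages below are relevant to the question?\n"
            f"Question: {question}\n{numbered}\n"
            "Reply with the relevant passage numbers only, comma separated."
        )
        picked = [
            chunks[int(i)]
            for i in selection.split(",")
            if i.strip().isdigit() and int(i) < len(chunks)
        ]

        # Stage 2: the main call answers using only the selected passages.
        context = "\n---\n".join(picked)
        return llm(
            "Answer the question using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )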
I posed a similar question to a postdoc at CSAIL: generative image models like DALL-E consistently screw up the number of eyes and fingers, so could you use a knowledge graph in conjunction with the drawing algorithm to imbue more realism into the generation? At the time they said they weren't aware of research in that direction. Fair enough. Still interested in seeing if these purely generative models can reference knowledge and apply it.
Human features such as faces and fingers are easy for us humans to parse because our attention is instantly drawn to them, and we are highly sensitive to "noise" in that particular domain. AI-generated content of a three-toed animal pictured with four toes, for example, might go unnoticed by a lot of people.
I unfortunately don't see how a vector database would be able to help with this. The knowledge bases they provide are meant to enable large-scale retrieval and vector search. One way to fix the problem you mention is perhaps to increase the proportion of features that draw human attention (e.g. faces) in the training dataset.
> faces and fingers are easy for us humans to parse because...
because we have dedicated neural hardware for faces -- the "fusiform face area", a relatively small volume of the brain which is used by non-autistic humans for facial recognition/processing. A lot of our sensitivity to human faces is neural structure that's in our DNA, and yes, that pairs with a lifelong obsession with human faces, spending huge amounts of energy learning more about them. But after infant brain development we're not relying on a blank slate for human faces.
> One case study of agnosia provided evidence that faces are processed in a special way. A patient known as C. K., who suffered brain damage as a result of a car accident, later developed object agnosia. He experienced great difficulty with basic-level object recognition, also extending to body parts, but performed very well at recognizing faces. A later study showed that C. K. was unable to recognize faces that were inverted or otherwise distorted, even in cases where they could easily be identified by normal subjects. This is taken as evidence that the fusiform face area is specialized for processing faces in a normal orientation.
It seems like generative image models can't count very well, and we notice whenever the number of something is supposed to be fixed. They also screw up the black keys on a piano.
Generally, the models seem to struggle with fine details that correlate with broader features. Each finger typically looks ok, it's the hand that's off... remote controls might have appropriately detailed buttons with a nonsense layout. I wonder if the training process can "weight" various visual features instead of optimizing loss equally over the whole image.
You're close, but there's something even more fundamentally wrong that goes overlooked: the noise model, namely the use of Gaussian per-pixel noise. The spectrum of the noise distribution gets in the way of exactly the fine details before the model has any chance to learn them. It has no chance to learn the layout of TV remote buttons because in the first iteration of the diffusion there's already noise at the same length scale as the buttons you want it to learn.
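A toy numpy sketch of that length-scale argument (my own illustration, not from the article): white Gaussian noise has equal power at every spatial frequency, while the image concentrates its power at low frequencies, so the fine-detail band is the first to drown.

    import numpy as np

    # Toy illustration: the standard diffusion forward process adds i.i.d.
    # (white) Gaussian noise, which has equal power at every spatial frequency.
    # A natural-ish image concentrates its power at low frequencies, so the
    # high-frequency band (button layouts, finger boundaries) disappears first.
    rng = np.random.default_rng(0)

    smooth = rng.standard_normal((64, 64)).cumsum(0).cumsum(1)
    smooth = (smooth - smooth.mean()) / smooth.std()      # coarse structure
    img = smooth + 0.15 * rng.standard_normal((64, 64))   # plus weak fine detail

    def band_power(a, lo, hi):
        """Mean power of a's FFT coefficients with spatial frequency in [lo, hi)."""
        f = np.fft.fft2(a)
        fy, fx = np.meshgrid(np.fft.fftfreq(64), np.fft.fftfreq(64), indexing="ij")
        r = np.hypot(fy, fx)
        return (np.abs(f[(lo <= r) & (r < hi)]) ** 2).mean()

    noise = 0.3 * rng.standard_normal(img.shape)          # one early-ish noise level
    for lo, hi, label in [(0.0, 0.1, "coarse layout"), (0.3, 0.5, "fine detail  ")]:
        ratio = band_power(img, lo, hi) / band_power(noise, lo, hi)
        print(f"{label} signal-to-noise power ratio: {ratio:.1f}")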
I recently watched a discussion with a neuroscientist who was explaining the left brain-right brain difference, as it's currently understood. He said the left hemisphere evolved to focus its attention on a small field of perception, mainly to find food. The right hemisphere evolved to detect threats from the entire field of perception.
The way we train LLMs, to my understanding, is to predict the next word. That's a narrow task, and in my interpretation is a left brain dominant task.
What can happen with the left brain is that it can become so focused on the task at hand, and so wedded to the mental models used to accomplish that task, that it loses touch with reality as broadly perceived by the right brain. The neuroscientist claimed schizophrenia is a disorder of too little inhibition of left brain inferences by the right brain. (I am taking him at his word on this.) The left brain's mental models lose touch with ground truth. This can result in hallucinations.
If I am correct that LLMs act like the left brain in that they are relying on models that they developed during training and focusing on a single task, absent sensory experience, then hallucination may actually be a good term.
I do think that the use of anthropomorphic terms is problematic because it suggests the same phenomenon, rather than an analogous phenomenon.
> The neuroscientist claimed schizophrenia is a disorder of too little inhibition of left brain inferences by the right brain.
This sounds very "bicameral mind"-y[1] to me, and it's worth noting that schizophrenia's causes haven't been pinned down accurately yet. There are a ton of hypotheses around how schizophrenia develops/works, but none are conclusive.
I bring it up because when you hear relatively simple explanations about the left and right brain and the tasks they're "assigned" or "designed" to process, or how their perceived differences contribute to mental illness, it should be taken with a grain of salt.
Similarly, I'd hesitate to compare LLMs or any NNs to actual human brains. Their similarities are superficial: beyond the very basic topology of NNs, which was inspired by certain types of neurons, the similarities end.
Heavyset_go wrote a good answer from one point of view; I would like to add another, namely that the word "hallucination" means quite different things for a human and an LLM. For a human, a hallucination roughly refers to a sensory experience of something that isn't there.
For an LLM it refers to making something up when answering a question. A closer phenomenon to that would be a false memory, but even that isn't quite the same thing.
The thing is that we want LLMs to learn patterns; we want them to generalize from their training data to a certain extent. Hallucinations are basically undesirable instances of this: the user expects to hear about something that does exist, but the answer they get is about something the LLM has extrapolated might exist.
Outside of the title, "hallucination" is used only once in the whole article. Further, it is now a widely accepted term of art describing a specific behavior exhibited by an LLM, separate from the biological one.
It seems like anytime someone brings up these models being "anthropomorphized", they have nothing relevant to say but feel the need to say something, and fall back on a tired platitude.
Well I think there's a contradiction of wants in having AI that's supposed to be the cutting edge of emulating us yet insisting that it shouldn't be anthropomorphized no matter how close it gets to emulating us.
When we give information we don't know to be true (which is what the model does when it builds low-confidence tokens on each other), we call it guessing. Why would this be any different?
Not sure, as I'm not yet familiar enough with the types of errors occurring. I do think a somewhat more accurate, possibly technical set of terms would be more useful for discussions about it than "hallucination", though.
These are token prediction models. Their token predictions are accurate insofar as the probabilities of those tokens in those sequences are representative of the kind of thing they were trained on.
Is it plausible that a document exists that explains that George Washington was actually an amateur magician? How is a GPT model to know that that is not the document it is predicting the next tokens for? Why wouldn't it explain that he used to perform for children's parties, and do tricks involving making doves and rabbits appear?
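As a toy illustration (the numbers here are made up): generation is just repeated sampling from a next-token distribution, and a merely plausible continuation gets probability mass right alongside the true ones.

    import random

    # Made-up next-token distribution after the prefix "George Washington was a".
    # The model only "knows" what is plausible in its training distribution,
    # not what is true; sampling will happily pick either kind of continuation.
    next_token_probs = {
        "statesman": 0.40,
        "general":   0.35,
        "farmer":    0.15,
        "magician":  0.10,   # plausible-sounding but false continuations get mass too
    }

    tokens, weights = zip(*next_token_probs.items())
    print("George Washington was a", random.choices(tokens, weights=weights)[0])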
Exactly right. These are statistical models that operate at a completely different level of abstraction from the human brain: tokens/words/etc for text models, and individual (or perhaps small groups) of pixels for image models. Plausible output is obtained through sheer brute-force, and there is no mechanism to correct what we perceive as errors.
Yes, I agree now. "High-loss example", while not catchy, might be closer. Although it's a bit backwards, as you only think about creating the target once you sense the answer is not great.
1. Once you use the vector embeddings to grab the most relevant chunks, are you just injecting the actual 400 (in this example) token prose text snippets into the LLM query? So under the hood does that query from the article end up as something like "Who was Benito Mussolini? Please use the following texts to inform your answer: [snippet 1, snippet 2, snippet 3]"?
2. I understand the use case for knowledge that isn't cooked into the LLM because it's too recent etc., but I wonder about using it with historic knowledge. I assume all(?) LLMs would have used Wikipedia for training and would therefore already have this Mussolini information from those same articles, so what's the point of priming it with duplicate "external" information? Would that really improve accuracy?
This is just "citing the sources" of the retrieved documents or am I missing something? I was expecting to read some novel approach like Hypothetical Document Embeddings.
1. Chunkify the knowledge database into smallish chunks,
2. When you get a question, find the chunks most similar to the question
3. Prompt the LLM to use [the found chunks] to answer [the question] (a rough sketch below)
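Something like this, with embed() and llm() as stand-ins for whatever embedding and completion APIs you're using (a sketch, not any particular library):

    import numpy as np

    # Sketch of steps 1-3. embed() and llm() are placeholders, not a real library.
    def answer(question: str, documents: list[str], embed, llm, top_k: int = 3) -> str:
        # 1. Chunkify the knowledge base (naively, by fixed character windows).
        chunks = [d[i:i + 1200] for d in documents for i in range(0, len(d), 1200)]

        # 2. Find the chunks most similar to the question (cosine similarity).
        C = np.array([embed(c) for c in chunks])
        q = np.array(embed(question))
        sims = C @ q / (np.linalg.norm(C, axis=1) * np.linalg.norm(q) + 1e-9)
        best = [chunks[i] for i in np.argsort(-sims)[:top_k]]

        # 3. Prompt the LLM to use the found chunks to answer the question.
        context = "\n\n".join(best)
        return llm(f"Use the following texts to answer.\n\n{context}\n\nQuestion: {question}")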
If you've ever posted a question to Stack Overflow or a similar site, think about the noise reduction “these posts may be relevant to you” feature. When you're actually asking an interesting and/or open ended question, you get a looooot of false positives. That's in the nature of text similarity search, it really hinges on strong nouns and verbs to anchor the discussion. It's no coincidence that the examples given here are uncommon names like Mussolini and Ketanji.
So while this is not terribly useful, it is interesting that it might severely reduce the hallucination rate, which is a sort of false positive rate. Does it only reduce the false positives for facts that it "knows" via the external database? The really interesting prompts and responses, where you ask it about figures who do not exist or data too new for it to have and see whether it still hallucinates, would have been very nice to see in this blog post. Missed opportunity.
Milvus (https://milvus.io) and Vespa (https://vespa.ai) are great choices if you're looking for hardened, scalable, and production-ready vector databases.
We (Milvus) also have `milvus-lite` if you'd like something pip installable:
python3 -m pip install milvus
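A minimal sketch of what that looks like in use (the exact attribute names here are an assumption, so check the milvus-lite docs):

    # Sketch of the embedded-server workflow; attribute names are assumptions,
    # see the milvus-lite docs for the exact API.
    from milvus import default_server
    from pymilvus import connections, utility

    default_server.start()                  # embedded Milvus, no Docker needed
    connections.connect(host="127.0.0.1", port=default_server.listen_port)
    print(utility.get_server_version())
    default_server.stop()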
Not sure which one, but if you are after a vector search engine (not just a database) then I can recommend this: https://github.com/marqo-ai/marqo. Includes inference, transformations, schemas, multi-modal search, multi-modal queries, multi-modal representations, text chunking and more.
Are there any memory-based projects on GitHub that I can try out myself without breaking the bank? Let's say I want to load up a documentation and simply do some queries with it and see how well it performs.
We tried a few different versions of this over a recent hackathon; results weren't great. Though maybe that says more about the documents that we trained it on than the approach!