LLMs are currently trained on actual language patterns, and they pick up facts that are repeated consistently across all sorts of different contexts, not one-off statements.
Adding a bunch of unnatural "From Wikidata, <noun> <verb> <noun>" sentences to the training data, severed from any kind of context, seems like it would run the risk of:
- Not increasing factual accuracy because there isn't enough repetition of them
- Not increasing factual accuracy because these facts aren't being repeated consistently across other contexts, so they result in a walled-off part of the model that doesn't affect normal writing
- And if they are massively repeated, all sorts of problems with overtraining and learning exact sentences rather than the conceptual content
- Either way, introducing linguistic confusion to the LLM, leading it to think that reciting long lists of "From Wikidata, ..." statements is a normal way of talking
If this is a technique that actually works, I'll believe it when I see it.
(Not to mention that I don't think most of the stuff people are asking LLMs for is represented in Wikidata anyway. Wikidata-type facts are already pretty decently handled by regular Google.)
Well that's not actually how it works - they are just getting a model (WikiSP & EntityLinker) to write a query that responds with the fact from Wikidata. Did you read the post or just the headline?
Besides, let's not forget that humans are also trained on language data, and although humans can also be wrong, if a human memorised all of Wikidata (by reading sentences/facts in 'training data') it would be pretty good in a pub-quiz.
Also, we obviously can't see anything inside how OpenAI trains GPT, but I wouldn't be surprised if sources with higher authority (e.g. Wikidata) can be given a higher weight in the training data, and also if sources such as Wikidata could be used with reinforcement learning to ensure that answers within the dataset are 'correctly' answered without hallucination.
Ah, I did misunderstand how it worked, thanks -- I was looking at the flow chart and just focusing on the part that said "From Wikidata, the filming location of 'A Bronx Tale' includes New Jersey and New York" that had an arrow feeding it into GPT-3...
I'm not really sure how useful something this simple is, then. If it's not actually improving the factual accuracy in the training of the model itself, it's really just a hack that makes the whole system even harder to reason about.
> For more complex and knowledge-intensive tasks, it's possible to build a language model-based system that accesses external knowledge sources to complete tasks. This enables more factual consistency, improves reliability of the generated responses, and helps to mitigate the problem of "hallucination".
> Meta AI researchers introduced a method called Retrieval Augmented Generation (RAG) to address such knowledge-intensive tasks. RAG combines an information retrieval component with a text generator model. RAG can be fine-tuned and its internal knowledge can be modified in an efficient manner and without needing retraining of the entire model.
> RAG takes an input and retrieves a set of relevant/supporting documents given a source (e.g., Wikipedia). The documents are concatenated as context with the original input prompt and fed to the text generator which produces the final output. This makes RAG adaptive for situations where facts could evolve over time. This is very useful as LLMs' parametric knowledge is static.
> RAG allows language models to bypass retraining, enabling access to the latest information for generating reliable outputs via retrieval-based generation.
> Lewis et al., (2021) proposed a general-purpose fine-tuning recipe for RAG. A pre-trained seq2seq model is used as the parametric memory and a dense vector index of Wikipedia is used as non-parametric memory (accessed using a neural pre-trained retriever). [...]
> RAG performs strongly on several benchmarks such as Natural Questions, WebQuestions, and CuratedTrec. RAG generates responses that are more factual, specific, and diverse when tested on MS-MARCO and Jeopardy questions. RAG also improves results on FEVER fact verification.
> This shows the potential of RAG as a viable option for enhancing outputs of language models in knowledge-intensive tasks.
So, with various methods, I think having ground facts in the process somehow should improve accuracy.
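To make that concrete, here's a rough retrieve-then-generate sketch in Python. The TF-IDF retriever, the toy document list, and the call_llm placeholder are my own illustrative choices, not the actual RAG implementation (Lewis et al. use a dense neural retriever over a Wikipedia index):

```
# Minimal retrieve-then-generate sketch, not the real RAG system.
# Assumes scikit-learn is installed; call_llm() is a placeholder for any text generator.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "A Bronx Tale was filmed in New Jersey and New York.",
    "Xylene is an aromatic hydrocarbon commonly used as a solvent.",
]

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)

def retrieve(query, k=2):
    """Return the k documents most similar to the query (toy TF-IDF retriever)."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in ranked[:k]]

def call_llm(prompt):
    """Placeholder for whatever LLM API you use."""
    raise NotImplementedError

def answer(question):
    context = "\n".join(retrieve(question))
    prompt = (
        "Answer using only the context below. If the answer is not in the "
        f"context, say you don't know.\n\nContext:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)
```

The point is that the facts live outside the model's weights, so updating them is a data change rather than a retraining run.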
In this context, these are more expert systems vs LLMs, and as you enumerate, they can work well if built carefully. For example, Google surfaces search engine results directly. This is similar, but more powerful, because Wikimedia Foundation can actually improve results, gaps, and overall performance while Google DGAF.
I would expect as the tide rises with regards to this tech, self hosting of training and providing services to prompts becomes easier. For Wikimedia, it'll just be another cluster and data pipeline system(s) at their datacenter.
Is it possible to finetune an LLM on the factual content without altering its linguistic characteristics?
With Stable Diffusion, you're able to use LoRAs to introduce specific characters, objects, concepts, etc. while maintaining the same visual qualities of the base model.
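LoRA adapters exist for LLMs too (e.g. via Hugging Face's peft library), though I can't say whether they cleanly separate facts from linguistic characteristics. A hedged sketch of what a fact-focused adapter fine-tune might look like; the model id and hyperparameters are placeholders of mine, not a known-good recipe:

```
# Hedged sketch: LoRA adapters on a causal LM, analogous to Stable Diffusion LoRAs.
# The base model id and hyperparameters below are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType

base = "some-small-causal-lm"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                 # small rank, so the frozen base weights still dominate
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # which modules to adapt depends on the architecture
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
# ...train on factual statements, then load or merge the adapter at inference time.
```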
Don't underestimate though the utility of "being just as good as regular Google" at retrieving facts. For one thing, getting such things wrong is very frequently cited by LLM detractors as a major drawback of trusting them, even a little bit. If it's possible to reduce the "accidentally tells you false information" occurrence from "so frequent that people expect it" to "a total rarity" then for one thing, it would signal to many people that maybe when looking for simple answers, using something not called Google is a better use of their time. This would be very, very important to basically every big company out there (especially MS, Google, and OpenAI). Today I know asking Siri is an idiotic way to get an answer because it's slow and barely even understands the query itself. An LLM is already great by comparison at both of those metrics, and if it's good at being accurate in response, it's very intriguing.
And in the reverse, Wikidata has a lot of gaps in its annotations, where human labelling could be augmented by LLMs. I wrote some stuff on both ground response and adding more stuff to WikiData a while ago https://friend.computer/jekyll/update/2023/04/30/wikidata-ll...
The cost of now having unknown false data in there would completely ruin the value of the whole effort.
The entire value of the data (which is already everywhere anyway) is the "cost" contributors paid via heavy moderation. If you do not understand why that is diametrically opposed to adding/enriching/augmenting/whichever euphemism with LLMs, I don't know what else to say.
100%. I live in a hamlet of a larger town in the US, and was curious what the population of my hamlet is.
There’s a Wikipedia page for the hamlet, but it’s empty. No population data, etc.
I’d much rather see no data than an LLM’s best guess. I’m guessing an LLM using the data would also perform better without approximated or “probably right” information.
Did they ever create the bridge from Wikipedia to Wikidata? I remember hearing talk about it as a way of helping the lack of data. The problem I had with Wikidata a couple years ago was that it was usually an incomplete subset of Wikipedia's infoboxes.
You get a lot more useful data, like the dipole moment and solubility (kinda important for a solvent like Xylene), and tons of other properties that Wikidata just doesn't have. All in the infobox.
It's weird that they don't just copy the Wikipedia infobox for the chemicals in Wikidata. It's already there and organized. And frequently cited.
Maybe it's more useful for other fields, but I can't think of a good use I'd get from the chemical section of Wikidata over the databases it cites or Wikipedia itself...
I'm not that familiar with the subject, but I did read[1] that Wikidata's adoption has been slowed by the fact that triples can only be used on one page (per localisation). There is some support for using it with infoboxes though[2].
Yeah, adding directly with an LLM is a bad idea. Instead, this would basically be making suggestions, linked back to the Wikipedia snippet, that a person could approve, edit, or reject. This is a flow for scaling up data annotation that works pretty well, since it's frustrating to have a ton of gaps in the structured data when the information is sitting right there in the linked Wikipedia page.
It would be really cool if there was a tool that could help extract data from, say, a news article and then populate Wikidata with it after human review (sketched below).
I find the task of adding simple fields like date founded to be too many clicks with the default GUI.
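A sketch of what such a human-in-the-loop tool could look like; the prompt, the call_llm placeholder, and the review loop are all illustrative assumptions, not an existing Wikidata tool:

```
# Hypothetical flow: an LLM proposes triples from an article and a human approves
# them before anything is written to Wikidata. call_llm() is a placeholder.
import json

EXTRACTION_PROMPT = """Extract factual statements from the article below as a JSON
list of objects with keys "subject", "predicate", "object", and "source_sentence".
Only include facts stated explicitly in the article.

Article:
{article}
"""

def call_llm(prompt):
    raise NotImplementedError  # plug in whatever model you like

def propose_triples(article):
    return json.loads(call_llm(EXTRACTION_PROMPT.format(article=article)))

def review(triples):
    """Show each suggestion with its source sentence; keep only what a human approves."""
    approved = []
    for t in triples:
        print(t["subject"], "-", t["predicate"], "-", t["object"])
        print("  source:", t["source_sentence"])
        if input("approve? [y/N] ").strip().lower() == "y":
            approved.append(t)
    return approved  # only these would be turned into Wikidata edits
```

Nothing gets written without the human step, so the "unknown false data" risk raised upthread stays with the reviewer rather than the model.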
Yup. This is my gut feeling about where LLMs will really explode. Let them augment data just a bit, train on the improved data, augment more, train again - etc., etc. If we take things slow, I suspect in the long run it'll be really beneficial to multiple paradigms.
I know people say training bots on bot data is bad, but A: it's happening anyway, and B: it can't be worse than the actual garbage they get trained on in a lot of cases anyway.. can it?
Training on bot data can be bad when it's ungrounded and basically hallucinations on hallucinations.
Having LLMs help curate something grounded is generally reasonable. Functionally, it's somewhat similar to how some training is using generated subtitles of videos for training video/text pairs; it's very feasible to also go and clean those up, even though it is bot data.
Strong doubt. The problem is LLMs don’t have a robust epistemology (they can’t), and are structurally unable to provide a basis over which they’ve “reasoned”.
Humans, when probed, don't have a robust epistemology either.
Our knowledge (and reality) is grounded in heuristics we reify into natural order, and it's easy for us to forget that our conclusions exist as a veneer on our perceptions. Nearly every bit of knowledge we hold has an opposite twin that we hold as well. We favor completeness over consistency.
When pressed, humans tend to justify their heuristics rather than reexamine them, because our minds have a clarity bias - i.e., we would rather feel like things are clear even if they are wrong. Oftentimes we can't go back and test whether they are wrong, which biases epistemological justifications even more.
So no, our rationality is, the vast proportion of the time, used to rationalize rather than to conclude.
Using retrieval to look up a fact and then citing that fact in response to a query (with attribution) is absolutely within the capabilities of current LLMs.
LLMs "hallucinate" all the time when pulling data from their weights (because that's how that works, it's all just token generation). But if the correct data is placed within their context they are very capable of presenting the data in natural language.
Don't get too hung up on the present technical definition of LLM. Perhaps it is possible to find new architectures that are more suited to ground their claims.
Hmm, there is a lot of opinion in Wikidata - so I would not call all of them facts, although some items are. Even if it were all factual, the statistical nature of LLMs would still invent things from the input, as per the nature of the technology.
"Us the information from the following list of facts to answer the questions I ask without qualifications. answer authoritatively. If the question can't be answered with the following facts just say I don't know.
If you tried to make a customer-facing chatbot, I wouldn't let it generate responses directly. I would have it pick from a list of canned responses and/or have it contact a rep to intervene on complicated questions. But there's no reason this tech couldn't be used for some commercial situations now.
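A rough sketch of that "pick from canned responses or escalate" pattern; the response list and the call_llm placeholder are purely illustrative:

```
# The model only chooses among pre-written responses; it never free-generates
# text to the customer. call_llm() is a placeholder for any LLM API.
CANNED_RESPONSES = {
    "refund_policy": "You can request a refund within 30 days of purchase.",
    "shipping_time": "Orders typically ship within 2-3 business days.",
    "escalate": None,  # hand off to a human rep
}

def call_llm(prompt):
    raise NotImplementedError

def contact_rep(message):
    return "I've passed your question along to a human representative."

def respond(customer_message):
    prompt = (
        "Classify the customer message into exactly one of these labels: "
        + ", ".join(CANNED_RESPONSES)
        + ". If you are unsure or the question is complicated, answer 'escalate'.\n\n"
        + "Message: " + customer_message + "\nLabel:"
    )
    label = call_llm(prompt).strip()
    reply = CANNED_RESPONSES.get(label)  # unknown labels also fall through to a human
    return reply if reply is not None else contact_rep(customer_message)
```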
Isn't this usually done by having something that takes in a query, finds possibly relevant info in a database, and adds it to the LLM prompt? That allows the use of very large databases without trying to store them inside the LLM weights.
What if wiki articles are written using LLMs from now on? That would be "AI incest" if it's used as training/ground-truth data.
I foresee data created before AI/LLMs being very valuable going forward, in much the same way that steel produced before the detonation of the first atomic bomb is prized for use in nuclear devices/MRIs/etc.
'Facts' based on citations that no longer exist, or if they do exist, they remain on Archive.org's Wayback Machine. And then when you visit the resource in question, the author is not credible enough to be believed and their 'facts' are on shaky ground. It's turtles all the way down.
I question the sentiment. I think people CAN argue the basis of a fact; however, being pragmatic and holistic can help provide some understanding. Truth is always relative and always has been. However, the human perspective holds real, tangible, recordable, and testable evidence. We rely on the multitude of perspectives to fully flesh out reality and determine the details and TRUTH of reality at multiple scales. The value of diverse human perspectives is similar to the value of perceiving an idea, concept, or object at different scales.
I feel that if you use a pre-trained model to do these things without knowing the intersection of the test set and the training dataset, it's very tough to know whether the inference is within the transitive closure of the generated text the models were trained on or whether they really improved.
Me too. But if OpenCYC has been completely absent from the public discourse about A.I., does that mean there's a super secret collaboration going on ? Or... hmm, maybe the NSA gets to throw a few hundred million bucks at the problem ?
Wikidata is such a treasure. There is quite a learning curve to master the SPARQL query language but it is really powerful. We are testing it to provide context to LLMs when generating audio-guides and the results are very impressive so far.
Indeed, and hopefully - if there was a structured way of doing it - people might want to make that effort in relation to doing reviews or meta-analyses, to make the underlying data available for others and make it easier to reproduce or update the results over time.
Right, such a result would need to be marked with a new predicate (verb) like:
```
Subject - Transformer's Paper
Predicate - Score
Object - BLEU (28.4)
```
One of the trickiest things about using a semantic triple store like this is that there are a lot of ways of phrasing the data, with lots of ambiguity. LLMs help in this case by being able to more gracefully handle cases like having both 'Score' and 'Benchmark' predicates, merging the two together.
One of my favorite things about ChatGPT is that I pretty much never have to write SPARQL myself anymore. I’ve had zero problems with the resulting queries either, except in cases where I’ve prompted it incorrectly.
Yeah, it works so well, I wonder if it's just a natural fit due to the attention mechanism and graph databases sharing some common semantic triple foundations
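For anyone who hasn't tried it, here's roughly what running one of those queries against the public Wikidata endpoint looks like; the specific IDs (Q60 / P1082, which I believe are New York City / population) and the plain requests call are my own example:

```
# Query Wikidata's public SPARQL endpoint directly.
# Q60 / P1082 are assumed here to be "New York City" / "population".
import requests

SPARQL = """
SELECT ?population WHERE {
  wd:Q60 wdt:P1082 ?population .
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "example-script/0.1"},  # Wikidata asks clients to identify themselves
    timeout=30,
)
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["population"]["value"])
```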
With this source, you can also select articles that are viewed the most, which is another important factor in validating facts. An article which has no views might not be the best source of information.
Part of the issue is to select the right Wikipedia article. Wikidata offers a way to know for sure that you query the LLM with the right data. Also, the Wikipedia txtai dataset is for English only.
If I had the funds, I'd run the whole training set (GPT-4 used 13 trillion tokens) through an LLM to mine factual statements, then do reconciliation, or even better, save a summary description of the diverse results. In the end we'd end up with a universal KB. Even for controversial topics, it would at least model the distribution of opinions, and be able to confirm if a statement doesn't exist in the database.
Besides mining KB triplets I'd also use the LLM with contextual material to generate Wikipedia-style articles based off external references. It should write 1000x more articles covering all known names and concepts, creating trillions of synthetic tokens of high quality. This would be added to the pre-training stage.
I know for a fact that there are a lot of unreverted vandal edits in Wikidata, because Wikidata's bots enter data so fast that it is too fast for Special:Recentchanges to monitor. Even Wikipedia still regularly gets 15+ year old hoaxes added to their hoax list.
Having an indexed database of facts - not half-facts, half-truths, or untruths - is the only way AI is ever going to be useful; and until it can fact-check for itself, these databases will need to be the training wheels.
But if you think this will stem the tide of LLM hallucinations, you're high too. LLMs' primary function is to bullshit.
In chess many games play out with the same opening but within a few moves become a game no one has played before. Being outside the dataset is the default for any sufficiently long conversation.
When a 100+B model hallucinates, that's a problem, but does a Mistral 7B (Q4_K quantization, around a 4 GB file) even have the parameters to encode enough information to be hallucination-proof, since LLMs cannot know what they do not know?
So maybe we should be building smaller models where we use their generation abilities, not their facts, and instead teach them to query another knowledge-base system (reverse RAG) for the facts.
The linked tweet has a diagram where you can pretty quickly see that this isn't just about using wikidata as a training set. The paper linked from the tweet also gives a good summary on its first page.
Nope. Training data for the big LLMs is a corpus of text, not structured data. As far as I understand, structured data would involve much more dimensionality with regard to parameterization.
We've banned this account for posting unsubstantive comments. Can you please not create accounts to break HN's rules with? It will eventually get your main account banned as well.
But their example uses GPT-3, a completely outdated model which was prone to hallucinations. GPT-4 has got much better in that regard, so I wonder what the marginal benefit of Wikidata is for really huge LLMs such as GPT-4.
GPT-4 is not immune to making things up, and a smaller model that doesn't have as much garbage and nonsense in its training data might achieve results that are nearly as good for much less cost.
I'm not going to change the title because this entire thread was determined by it.
Submitters: "Please submit the original source. If a post reports on something found on another site, submit the latter." - https://news.ycombinator.com/newsguidelines.html