Wikidata, with 12B facts, can ground LLMs to improve their factuality (arxiv.org)
219 points by raybb on Nov 17, 2023 | hide | past | favorite | 84 comments



Url changed from https://twitter.com/WikiResearch/status/1723966761962520824, which points to this.

I'm not going to change the title because this entire thread was determined by it.

Submitters: "Please submit the original source. If a post reports on something found on another site, submit the latter." - https://news.ycombinator.com/newsguidelines.html


Can it though?

LLMs are currently trained on actual language patterns, and pick up facts that are repeated consistently, not one-off things -- and within all sorts of different contexts.

Adding a bunch of unnatural "From Wikidata, <noun> <verb> <noun>" sentences to the training data, severed from any kind of context, seems like it would run the risk of:

- Not increasing factual accuracy because there isn't enough repetition of them

- Not increasing factual accuracy because these facts aren't being repeated consistently across other contexts, so they result in a walled-off part of the model that doesn't affect normal writing

- And if they are massively repeated, all sorts of problems with overtraining and learning exact sentences rather than the conceptual content

- Either way, introducing linguistic confusion, teaching the LLM that making long lists of "From Wikidata, ..." is a normal way of talking

If this is a technique that actually works, I'll believe it when I see it.

(Not to mention the fact that I don't think most of the stuff people are asking LLMs for is stuff represented in Wikidata. Wikidata-type facts are already pretty decently handled by regular Google.)


Well that's not actually how it works - they are just getting a model (WikiSP & EntityLinker) to write a query that returns the fact from Wikidata. Did you read the post or just the headline?

Besides, let's not forget that humans are also trained on language data, and although humans can also be wrong, if a human memorised all of Wikidata (by reading sentences/facts in 'training data') they would be pretty good in a pub quiz.

Also, we obviously can't see anything inside how OpenAI trains GPT, but I wouldn't be surprised if sources with a higher authority (e.g. Wikidata) can be given a higher weight in the training data, and also if sources such as Wikidata could be used with reinforcement learning to ensure that answers within the dataset are 'correctly' answered without hallucination.
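A rough sketch of that query-first flow (the `link_entities` and `generate_sparql` helpers here are placeholders standing in for the paper's EntityLinker and WikiSP models, so treat this as illustrative rather than their exact pipeline):

```python
import requests

WDQS = "https://query.wikidata.org/sparql"
HEADERS = {"User-Agent": "wikidata-grounding-sketch/0.1"}

def answer(question, llm, link_entities, generate_sparql):
    # 1. Link mentions in the question to Wikidata QIDs/PIDs (EntityLinker).
    entities = link_entities(question)
    # 2. Have a fine-tuned model write a SPARQL query (WikiSP).
    query = generate_sparql(question, entities)
    # 3. Run the query against the public Wikidata endpoint.
    resp = requests.get(WDQS, params={"query": query, "format": "json"},
                        headers=HEADERS, timeout=30)
    rows = resp.json()["results"]["bindings"]
    if rows:
        # 4. Let the LLM verbalise the retrieved facts instead of guessing.
        facts = "; ".join(str(r) for r in rows)
        return llm(f"From Wikidata: {facts}\nQuestion: {question}\nAnswer:")
    # 5. No query result: fall back to the LLM alone.
    return llm(question)
```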


Ah, I did misunderstand how it worked, thanks -- I was looking at the flow chart and just focusing on the part that said "From Wikidata, the filming location of 'A Bronx Tale' includes New Jersey and New York" that had an arrow feeding it into GPT-3...

I'm not really sure how useful something this simple is, then. If it's not actually improving the factual accuracy in the training of the model itself, it's really just a hack that makes the whole system even harder to reason about.


The objectively true data part?

Also there's Retrieval Augmented Generation (RAG) https://www.promptingguide.ai/techniques/rag :

> For more complex and knowledge-intensive tasks, it's possible to build a language model-based system that accesses external knowledge sources to complete tasks. This enables more factual consistency, improves reliability of the generated responses, and helps to mitigate the problem of "hallucination".

> Meta AI researchers introduced a method called Retrieval Augmented Generation (RAG) to address such knowledge-intensive tasks. RAG combines an information retrieval component with a text generator model. RAG can be fine-tuned and its internal knowledge can be modified in an efficient manner and without needing retraining of the entire model.

> RAG takes an input and retrieves a set of relevant/supporting documents given a source (e.g., Wikipedia). The documents are concatenated as context with the original input prompt and fed to the text generator which produces the final output. This makes RAG adaptive for situations where facts could evolve over time. This is very useful as LLMs' parametric knowledge is static.

> RAG allows language models to bypass retraining, enabling access to the latest information for generating reliable outputs via retrieval-based generation.

> Lewis et al., (2021) proposed a general-purpose fine-tuning recipe for RAG. A pre-trained seq2seq model is used as the parametric memory and a dense vector index of Wikipedia is used as non-parametric memory (accessed using a neural pre-trained retriever). [...]

> RAG performs strongly on several benchmarks such as Natural Questions, WebQuestions, and CuratedTrec. RAG generates responses that are more factual, specific, and diverse when tested on MS-MARCO and Jeopardy questions. RAG also improves results on FEVER fact verification.

> This shows the potential of RAG as a viable option for enhancing outputs of language models in knowledge-intensive tasks.

So, with various methods, I think having ground-truth facts in the process somehow should improve accuracy.
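A minimal sketch of the RAG pattern (the `retrieve` and `generate` callables are placeholders, not the exact Lewis et al. setup):

```python
def rag_answer(question, retrieve, generate, k=3):
    # Retrieve the k most relevant passages from an external source
    # (e.g. a vector index over Wikipedia or a Wikidata dump).
    passages = retrieve(question, k=k)
    # Concatenate them as context ahead of the original prompt.
    context = "\n".join(f"- {p}" for p in passages)
    prompt = (
        "Answer using only the context below. If the context does not "
        "contain the answer, say you don't know.\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return generate(prompt)
```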


In this context, these are more expert systems layered on top of LLMs, and as you enumerate, they can work well if built well. For example, Google surfaces search engine results directly. This is similar, but more powerful, because the Wikimedia Foundation can actually improve results, fill gaps, and tune overall performance, while Google DGAF.

I would expect that as the tide rises with this tech, self-hosting the training and serving the prompts becomes easier. For Wikimedia, it'll just be another cluster and data pipeline system(s) at their datacenter.


Isn't repetition essentially a way of adding weight? If you could increase the inherent weight of Wikidata, wouldn't that provide the same effect?


If you want to increase the likelihood that answers will read like Wikipedia entries, sure.


Is it possible to finetune an LLM on the factual content without altering its linguistic characteristics?

With Stable Diffusion, you're able to use LoRAs to introduce specific characters, objects, concepts, etc. while maintaining the same visual qualities of the base model.

Why can't something similar be done with an LLM?


Fine tuning. You can autogenerate all kinds of factual questions with one-word answers based on these triplets.
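Something like this (the question templates are just examples; the property IDs are the usual Wikidata ones):

```python
# Turn (subject, property, value) triples into short QA pairs for fine-tuning.
TEMPLATES = {
    "P1082": "What is the population of {s}?",   # population
    "P571":  "When was {s} founded?",            # inception
    "P36":   "What is the capital of {s}?",      # capital
}

def qa_pairs(triples):
    for s, p, o in triples:
        template = TEMPLATES.get(p)
        if template:
            yield {"question": template.format(s=s), "answer": o}

# e.g. list(qa_pairs([("France", "P36", "Paris")]))
# -> [{"question": "What is the capital of France?", "answer": "Paris"}]
```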


> pretty decently handled by regular Google

Don't underestimate though the utility of "being just as good as regular Google" at retrieving facts. For one thing, getting such things wrong is very frequently cited by LLM detractors as a major drawback of trusting them, even a little bit. If it's possible to reduce the "accidentally tells you false information" occurrence from "so frequent that people expect it" to "a total rarity" then for one thing, it would signal to many people that maybe when looking for simple answers, using something not called Google is a better use of their time. This would be very, very important to basically every big company out there (especially MS, Google, and OpenAI). Today I know asking Siri is an idiotic way to get an answer because it's slow and barely even understands the query itself. An LLM is already great by comparison at both of those metrics, and if it's good at being accurate in response, it's very intriguing.


And in the reverse, Wikidata has a lot of gaps in its annotations, where human labelling could be augmented by LLMs. I wrote some stuff on both grounding responses and adding more stuff to Wikidata a while ago https://friend.computer/jekyll/update/2023/04/30/wikidata-ll...


please no!

The cost of now having unknown false data in there would completely ruin the value of the whole effort.

The entire value of the data (which is already everywhere anyway) is the "cost" contributors paid via heavy moderation. If you do not understand why that is diametrically opposite of adding/enriching/augmenting/whichever euphemism with LLMs, I don't know what else to say.


100%. I live in a hamlet of a larger town in the US, and was curious what the population of my hamlet is.

There’s a Wikipedia page for the hamlet, but it’s empty. No population data, etc.

I’d much rather see no data than an LLM’s best guess. I’m guessing an LLM using the data would also perform better without approximated or “probably right” information.


Did they ever create the bridge from Wikipedia to Wikidata? I remember hearing talk about it as a way of helping the lack of data. The problem I had with Wikidata a couple years ago was that it was usually an incomplete subset of Wikipedia's infoboxes.

Checking again for m-xylene, https://m.wikidata.org/wiki/Q3234708

You get physical property data and citations.

Now compare that to the chem infobox in wikipedia: https://en.m.wikipedia.org/wiki/M-Xylene

You get a lot more useful data, like the dipole moment and solubility (kinda important for a solvent like Xylene), and tons of other properties that Wikidata just doesn't have. All in the infobox.

It's weird that they don't just copy the Wikipedia infobox for the chemicals in Wikidata. It's already there and organized. And frequently cited.

Maybe it's more useful for other fields, but I can't think of a good use I'd get from the chemical section of Wikidata over the databases it cites or Wikipedia itself...


I'm not that familiar with the subject, but I did read[1] that Wikidata's adoption has been slowed by the fact that triples can only be used on one page (per localisation). There is some support for using it with infoboxes though[2].

[1]: https://meta.wikimedia.org/wiki/Help:Array#Wikidata

[2]: https://en.wikipedia.org/wiki/Help:Wikidata#In_infoboxes


Yeah, adding directly with an LLM is a bad idea. Instead, this would basically be making suggestions, linked back to the Wikipedia snippet, that a person could approve, edit, or reject. This is a flow for scaling up annotation of data that works pretty well, since it also sucks to have a ton of gaps in the structured data when the information is sitting right there in the linked Wikipedia page.


It would be really cool if there was a tool that could help extract data from, say, a news article and then populate Wikidata with it after human review. I find the task of adding simple fields like date founded to be too many clicks with the default GUI.
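Roughly what I'm imagining, as a sketch (the extraction prompt and review loop are made up; approved rows would still go into Wikidata by hand, e.g. via QuickStatements):

```python
def suggest_statements(article_text, llm):
    """Ask an LLM for candidate Wikidata statements; nothing is written
    anywhere without a human approving each suggestion first."""
    prompt = (
        "Extract factual statements from the article below as "
        "subject | property | value triples, one per line, and include "
        "the sentence that supports each.\n\n" + article_text
    )
    suggestions = llm(prompt).splitlines()
    approved = []
    for line in suggestions:
        if input(f"Add to Wikidata? {line} [y/N] ").lower() == "y":
            approved.append(line)
    return approved
```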


Yup. This is my gut feeling about where LLMs will really explode. Let them augment data just a bit, train on improved data, augment more, train again - etc etc. If we take things slow, I suspect in the long run it'll be really beneficial to multiple paradigms.

I know people say training bots on bot data is bad, but A: it's happening anyway, and B: it can't be worse than the actual garbage they get trained on in a lot of cases anyway.. can it?


Training on bot data can be bad when it's ungrounded and basically hallucinations on hallucinations.

Having LLMs help curate something grounded is generally reasonable. Functionally, it's somewhat similar to how some training is using generated subtitles of videos for training video/text pairs; it's very feasible to also go and clean those up, even though it is bot data.


Pixart-alpha provides an example of C: The bot labels can be dramatically better than the human labels.

Even though they used LLaVA, and LLaVA isn't all that good compared to gpt-4.


> Yup. This is my gut feeling about where LLMs will really explode

Yes, indeed. This is one place where LLMs can make it look like a bomb went off.


We can do polynomial regression on data sets that look equally plausible, but it's not real data.


> A: it's happening anyway

This is never a valid defense for doing more of something.


Would be a good idea to create an annotation model, like DALL-E 3 did.


That word "augmented" is doing a lot of heavy lifting. LLMs don't "augment" data, they generate/hallucinate it. Sometimes they recall it.


Only if it is human validated and even then not really.


Strong doubt. The problem is LLMs don’t have a robust epistemology (they can’t), and are structurally unable to provide a basis over which they’ve “reasoned”.


Humans, when probed, don't have a robust epistemology either.

Our knowledge (and reality) is grounded in heuristics we reify into natural order, and it's easy for us to forget that our conclusions exist as a veneer on our perceptions. Nearly every bit of knowledge we hold has an opposite twin that we hold as well. We favor completeness over consistency.

When pressed, humans tend to justify their heuristics rather than reexamine them, because our minds have a clarity bias - i.e., we would rather feel like things are clear even if they are wrong. Oftentimes we can't go back and test whether they are wrong, which biases epistemological justifications even more.

So no, our rationality, the vast proportion of the time, is used to rationalize rather than to conclude.


Using retrieval to look up a fact and then citing that fact in response to a query (with attribution) is absolutely within the capabilities of current LLMs.

LLMs "hallucinate" all the time when pulling data from their weights (because that's how that works, it's all just token generation). But if the correct data is placed within their context they are very capable of presenting the data in natural language.


Like some kind of pre-processing DB lookup?


If you read the article, that’s literally what this is talking about. There’s a simple diagram and everything.


Don't get too hung up on the present technical definition of LLM. Perhaps it is possible to find new architectures that are more suited to ground their claims.


> Don't get too hung up on the present technical definition of LLM.

The paper is literally about LLMs. Speculation about future model architectures is irrelevant.


Hmm, there is a lot of opinion in Wikidata - so I would not call all of them facts, although some items are. Even if it were all factual, the statistical nature of LLMs means they would still invent things from the input, as per the nature of the technology.


You just need to tell it to use the facts:

"Us the information from the following list of facts to answer the questions I ask without qualifications. answer authoritatively. If the question can't be answered with the following facts just say I don't know.

Absolute Facts:

The sky is purple.

The sun is red and green

When it rains animals fall from the sky."


If you tried to make a customer facing chatbot I wouldn't let it generate responses directly. I would have it pick from a list of canned responses. and/or have it contact a rep to intervene on complicated questions. But there's no reason this tech couldn't be used for some commercial situations now.
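Something like this sketch, where the model only ever selects a label (the labels and canned responses here are made up):

```python
CANNED = {
    "opening_hours": "We're open 9am-5pm, Monday to Friday.",
    "refund_policy": "Refunds are available within 30 days with a receipt.",
    "escalate":      "Let me connect you with a representative.",
}

def reply(message, llm):
    # The LLM only picks a label; it never writes the customer-facing text.
    label = llm(
        "Classify the message into one of: "
        + ", ".join(CANNED) + ". Reply with the label only.\n" + message
    ).strip()
    return CANNED.get(label, CANNED["escalate"])
```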


The sky is not one color and changes color depending on weather, sun, and global location.


Isn't this usually done by having something that takes in a query, finds possibly relevant info in a database, and adds it to the LLM prompt? That allows the use of very large databases without trying to store them inside the LLM weights.


Yes, it's called retrieval augmented generation.


What if wiki articles are written using LLMs from now on? That would be "AI incest" if it's used as training/ground truth data.

I foresee data created before AI/LLMs being very valuable going forward, in much the same way that steel produced before the detonation of the first atomic bomb is used for nuclear devices/MRIs/etc.


There is even a XKCD for that: https://xkcd.com/978/

s/a user's brain/llm/g


'Facts' based on citations that no longer exist, or if they do exist, they remain on Archive.org's Wayback Machine. And then when you visit the resource in question, the author is not credible enough to be believed and their 'facts' are on shaky ground. It's turtles all the way down.


I question the sentiment. I think people CAN argue the basis of a fact, however, being pragmatic and holistic can help provide some understanding. Truth is always relative and always has been. However, the human perspective holds real, tangible, recordable, and testable evidence. We rely on the multitude of perspectives to fully flesh out reality and determine the details and TRUTH of reality at multiple scales. The value of diverse human perspectives is similar to the value of perceiving an idea, concept, or object at different scales.


Sounds like an algorithmically solvable problem.


I feel that if you use a pre-trained model to do these things without knowing the intersection of the test set and that dataset, it's very tough to know weather the inference is in the transitive closure of the generated text the models were trained on or weather they really improved.

There was another approach to grounding LLMs the other day from Normal Computing: https://blog.normalcomputing.ai/posts/2023-09-12-supersizing... in which they use Mosaic, but they also did not mention that this was actually done.

Sentient or not, I feel there should be a standard for aggressively filtering out overlap between training and test datasets for approaches like this.


It's "whether". (weather is eg sunny or raining)


I’ve been waiting for the OpenCYC knowledge ontology to be used for this purpose as well.


Me too. But if OpenCYC has been completely absent from the public discourse about A.I., does that mean there's a super secret collaboration going on? Or... hmm, maybe the NSA gets to throw a few hundred million bucks at the problem?


That ontology is quite a mess actually.


You would also need a facts base, and in my understanding OpenCYC is small compared to Wikidata.


Wikidata is such a treasure. There is quite a learning curve to master the SPARQL query language but it is really powerful. We are testing it to provide context to LLMs when generating audio-guides and the results are very impressive so far.
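For anyone who hasn't tried it, a typical query looks roughly like this via the SPARQLWrapper Python client (the property ID is from memory, so double-check it before relying on it):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://query.wikidata.org/sparql",
                       agent="wikidata-sparql-demo/0.1")
# Filming locations of "A Bronx Tale"; P915 should be "filming location".
sparql.setQuery("""
SELECT ?locationLabel WHERE {
  ?film rdfs:label "A Bronx Tale"@en ;
        wdt:P915 ?location .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["locationLabel"]["value"])
```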


I wish there was a way to add results from scientific papers to Wikidata - imagine doing meta-analyses via SPARQL queries.


You totally can! - https://www.wikidata.org/wiki/Q30249683

It's just pretty sparse, so you would need a focused effort to fill out predicates of interest.


Indeed, and hopefully - if there were a structured way of doing it - people might want to make that effort in relation to doing reviews or meta-analyses, to make the underlying data available for others and make it easier to reproduce or update the results over time.


Am I missing something? I do not see any results indicated in the statements of that entity.


Right, such a result would need to be marked with a new predicate (verb) like:

```
Subject   - Transformer's Paper
Predicate - Score
Object    - BLEU (28.4)
```

One of the trickiest things about using a semantic triple store like this is that there are a lot of ways of phrasing the data, lots of ambiguity. LLMs help in this case by being able to more gracefully handle cases like having both 'Score' and 'Benchmark' predicates, merging the two together.


One of my favorite things about ChatGPT is that I pretty much never have to write SPARQL myself anymore. I’ve had zero problems with the resulting queries either, except in cases where I’ve prompted it incorrectly.


Yeah, it works so well that I wonder if it's just a natural fit, due to the attention mechanism and graph databases sharing some common semantic-triple foundations.


Any recommendations to learn SPARQL? I've looked into it and decided against it about as often as Nix.


I think a better approach is using retrieval augmented generation with Wikipedia.

This data source is designed for that: https://huggingface.co/NeuML/txtai-wikipedia.

With this source, you can also select articles that are viewed the most, which is another important factor in validating facts. An article which has no views might not be the best source of information.


Part of the issue is selecting the right Wikipedia article. Wikidata offers a way to know for sure that you are querying the LLM with the right data. Also, the Wikipedia txtai dataset is English-only.


But then it needs extra filters so it doesn't accidentally say something based.


I think this typo works


I don’t think it was a typo.


If I had the funds I'd run the whole training set (GPT-4 used 13 trillion tokens) through an LLM to mine factual statements, then do reconciliation, or even better, save a summary description of the diverse results. In the end we'd end up with a universal KB. Even for controversial topics, it would at least model the distribution of opinions, and be able to confirm if a statement doesn't exist in the database.

Besides mining KB triplets, I'd also use the LLM with contextual material to generate Wikipedia-style articles based on external references. It should write 1000x more articles covering all known names and concepts, creating trillions of synthetic tokens of high quality. This would be added to the pre-training stage.
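The mining step itself wouldn't be exotic; roughly (prompt wording made up):

```python
def mine_triples(document, llm):
    # Ask the model for explicit (subject, predicate, object) claims,
    # keeping the supporting sentence so claims can be reconciled later.
    prompt = (
        "List the factual claims in the text as "
        "subject | predicate | object | supporting sentence, one per line.\n\n"
        + document
    )
    triples = []
    for line in llm(prompt).splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 4:
            triples.append(tuple(parts))
    return triples

# Reconciliation pass: count how often each (s, p, o) shows up across
# documents, and keep the distribution rather than forcing one "truth".
```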


I know for a fact that there are a lot of unreverted vandal edits in Wikidata, because Wikidata's bots enter data so fast that Special:RecentChanges can't keep up. Even Wikipedia still regularly gets 15+ year old hoaxes added to its hoax list.


Having an indexed database of facts, not half-facts, half-truths, or untruths, is the only way AI is ever going to be useful; and until it can fact-check for itself, these databases will need to be the training wheels.


Curating and presenting facts is a form of narrative and is not at all objective.


I still remember the illogical statements made by adults back in the day "can't trust Wikipedia, anyone can edit it" etc etc.

They were (are) so wrong.


If only Wikidata were in Prolog.


On facts in the Wikidata dataset? Sure.

But if you think this will stem the tide of LLM hallucinations, you're high too. LLMs' primary function is to bullshit.

In chess many games play out with the same opening but within a few moves become a game no one has played before. Being outside the dataset is the default for any sufficiently long conversation.


For some more reading on using facts for ML, check out this discussion: https://news.ycombinator.com/item?id=37354000


When a 100+B model hallucinates, that's a problem, but does a Mistral 7B (q4_k, around a 4GB file) even have the parameters to encode enough information to be hallucination-proof, since LLMs cannot know what they do not know?

So maybe we should be building smaller models and using their generation abilities, not their facts, and instead teach them to query another knowledge base system (reverse RAG) for facts.


Please don't. Wikipedia long abandoned neutrality. They aren't the bearers of truth.


Do existing LLMs not already train on this data?


The linked tweet has a diagram where you can pretty quickly see that this isn't just about using wikidata as a training set. The paper linked from the tweet also gives a good summary on its first page.


Nope. Training data for the big LLMs is a corpus of text, not structured data. As far as I understand, structured data would add much more dimensionality with regard to parameterization.


Is it AI, or just a look up table?


Ahh.. feed the LLM a special sauce. Then it will speak the Truth


We've banned this account for posting unsubstantive comments. Can you please not create accounts to break HN's rules with? It will eventually get your main account banned as well.

If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules when posting here, we'd appreciate it.


But their example uses GPT-3, a completely outdated model which was prone to hallucinations. GPT-4 has got much better in that regard, so I wonder what the marginal benefit of Wikidata is for really huge LLMs such as GPT-4.


GPT-4 is not immune to making things up, and a smaller model that doesn't have as much garbage and nonsense in its training data might achieve results that are nearly as good for much less cost.


Clearly you read my comment wrong, I said "GPT-4 has got much better in that regard".



