> There is still an order of magnitude more organic text.
Taking this as a thought experiment, I agree we still have more data to go. But the fact that we are even wondering about this suggests the current approach may be inadequate, i.e. it should not take petabytes of data for an LLM to match the performance of a high school student (for the LLM = AGI folks).
> One "simple" application would be to build a full index of facts in the whole training corpus. Just pass each document to GPT and ask it to extract the facts.
Agree, KG+LLM is a good next step to explore and should address some hallucination issues (see DRAGON from the Leskovec and Liang groups). But then we're already talking about architectural changes, as I posited.
In any case, where do we get such knowledge graphs (or index of facts)? Some already exist (e.g. Wiki, UMLS) and were created by humans but are clearly inadequate in coverage.
The proposition of using GPT-like models to generate these (i.e. GraphGPT) seems conceptually flawed, as GPT does not itself know whether a statement is factual, which is a hard problem even for humans.
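To make the proposal concrete, here is a minimal sketch of that "index of facts" pipeline: pass each document to a model, extract (subject, relation, object) triples, and index them. `extract_triples` is a hypothetical stand-in for the LLM call (stubbed here with a trivial pattern); note that nothing in this loop can tell a factual triple from a hallucinated one, which is exactly the flaw above.

```python
# Sketch (not a real system): naive fact-index construction.
# extract_triples is a hypothetical stand-in for an LLM extraction call.
from collections import defaultdict

def extract_triples(doc: str) -> list[tuple[str, str, str]]:
    """Hypothetical LLM-backed extractor, stubbed with a trivial pattern."""
    triples = []
    for sentence in doc.split("."):
        words = sentence.strip().split()
        if len(words) == 3:  # naive "subject verb object" shape
            triples.append((words[0], words[1], words[2]))
    return triples

def build_fact_index(corpus: list[str]) -> dict[str, list[tuple[str, str]]]:
    """Index triples by subject; provenance and conflict handling omitted."""
    index = defaultdict(list)
    for doc in corpus:
        for subj, rel, obj in extract_triples(doc):
            index[subj].append((rel, obj))
    return dict(index)

print(build_fact_index(["Water boils hot. Ice melts slowly."]))
```

The stub happily indexes whatever the extractor emits, true or not, which is the point: the verification step is the hard part, and it is missing.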
Settled vs. controversial is orders of magnitude more complex: how on earth do we do this without human annotation? You can't rely on frequency (some things were treated as facts for 100 years and then suddenly weren't, and that shift is not, by definition, controversial).
The only reason LLMs work as well as they do now is that the sheer volume of data (and next-token prediction) hides the noise; by definition, an autoregressive model should be somewhat impervious to singular factoids (versus a model grounded in the garbage dump that is CommonCrawl/the internet).
> At least the model won't hallucinate outside the known facts.
Not sure this is a given. Even if a model acts as a natural-language database of factoids, it will probably hallucinate links between them unless you strictly ground its output, in which case we've just built a colossally over-engineered IR/STS (information retrieval / semantic textual similarity) tool.
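The reductio above can be made explicit: if every answer must be a verbatim member of a fixed fact store, the "model" collapses into plain retrieval. A minimal sketch, using nothing fancier than bag-of-words cosine similarity (names like `fact_store` are illustrative, not from any real system):

```python
# Sketch: strictly grounded answering degenerates to nearest-neighbor
# retrieval over the fact store -- i.e. a plain IR/STS tool.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def grounded_answer(query: str, fact_store: list[str]) -> str:
    """Return the stored fact most similar to the query; no generation at all."""
    q = Counter(query.lower().split())
    return max(fact_store, key=lambda f: cosine(q, Counter(f.lower().split())))

facts = ["water boils at 100 C", "the sky appears blue"]
print(grounded_answer("at what temperature does water boil", facts))
```

By construction this can never hallucinate outside the store, but it also never composes facts, which is the entire capability people want from an LLM.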
> One "simple" application
I think what you've posited is actually harder to build than anything that's been achieved thus far with LLMs.