I can see this being useful if the content is generated on demand and then discarded.
Publishing AI generated material is generally speaking a horrible idea and does nobody any good (at least until accuracy levels get much much better.)
Even if they do it well and truthfully (which they don't) current LLMs can only summarize, digest, and restate. There is no non-transient value add. LLMs may have a place to help query, but there is no reason to publish LLM regurgitations alongside the ground truth used to generate them.
I think bootstrapping documentation with LLM output is a great practice. It's a wiki, people can update it from a baseline, just as long as they can see what was LLM generated to know that it shouldn't be taken as absolute truth.
The hardest part of good documentation is getting started. Once there are docs in place it's usually much easier to revise and correct than it would have been to write correctly by hand the first time. Think of it like automating a rough draft.
Maybe the generated text could be a slightly different colour until it's verified. But you'd have to make sure there's no easy way of verifying everything mindlessly without having read it.
Technique I've found helpful personally: get the LLM to generate text in small chunks (e.g. a paragraph at a time). After generating each chunk, it is immediately reviewed by a human, who can edit it manually to correct any mistakes, ask the LLM to try again, or prompt the LLM to make specified changes. When the human is satisfied with that chunk, it is saved, and we move on to the next one.
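A minimal sketch of that loop, assuming a placeholder call_llm() wired up to whatever model or API you happen to use:

    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # placeholder: plug in your model or API here

    def review_chunks(sections: list[str]) -> list[str]:
        approved = []
        for section in sections:
            prompt = f"Write one paragraph documenting: {section}"
            draft = call_llm(prompt)
            while True:
                print(draft)
                action = input("[a]ccept / [e]dit / [r]etry / [f]eedback: ").strip()
                if action == "a":
                    approved.append(draft)  # human is satisfied, save and move on
                    break
                if action == "e":
                    draft = input("Corrected paragraph: ")  # manual fix
                elif action == "r":
                    draft = call_llm(prompt)  # ask the LLM to try again
                elif action == "f":
                    note = input("What should change? ")
                    draft = call_llm(f"{prompt}\n\nDraft:\n{draft}\n\nRevise it so that: {note}")
        return approved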
Sometimes, the output the LLM generates is correct and I'm just approving it. Other times, it is mostly right, and I can easily identify and correct its errors. Yet other times, it is totally wrong, but often typing out why it is wrong is a good start to actually generating correct text. Often (but not always), the kinds of false assumptions which LLMs make are similar to those a human reader would make, so stuff like "A and B sound very similar but, in the context of this system, actually have completely different meanings" is a useful addition to documentation anyway.
The worst case scenario is the LLM generates something which is subtly wrong, and the human review fails to pick up on the subtle error. But, that's something which can happen even with no LLMs involved at all. It isn't uncommon for people to make subtle errors in documents they write (often because they are misremembering something) and for those subtle errors not to be picked up during the review process. I'm not convinced the odds of this happening with an LLM-assisted workflow are significantly greater than with a purely human workflow.
> current LLMs can only summarize, digest, and restate. There is no non-transient value add.
Though, at a stretch, Wikipedia itself could be considered based around summarization, digesting, and restating/citing things said elsewhere, given its policy of verifiability: "Even if you are sure something is true, it must have been previously published in a reliable source before you can add it." Now, LLMs aren't well known for their citation skills, to be fair.. :-)
I remember one time, I wanted to update a Wikipedia article with some more recent developments, but was having trouble working out the best way of wording them. I found a newspaper article discussing those developments, so I scribbled down some notes summarising the article's content, and then asked ChatGPT to reword them for me more eloquently. And its output was good enough for me to paste into the Wikipedia article with only minor adjustments, along with a cite to the newspaper article. I'm sure I'm not the only person to have done something like that.
Yeah, when AIs can comprehensively cite their sources I might change my opinion on that.
Though note that there still isn't any need to publish static content. The power of LLMs is that they can be dynamic and responsive!
Even if we hypothesize that it were possible for an LLM to write a high-quality wikipedia-like output, generating the whole thing statically in advance like existing Wikipedia would be relatively pointless. It'd be much more interesting to generate arbitrary (and infinite!) pages on demand.
Excellent virtue signalling here -- however, commercial publishers, competitive attorneys, advertising sales teams and others are literally falling over themselves to do exactly what you (politely) advise against.
This moment reminds me very much of the late 1990s, when it was common knowledge that claim-jumping a domain name was very rude and not advisable, or the common knowledge among intellectuals that "ads will ruin the Internet". Yes, polite people did not start companies to claim-jump domain name registrations or push annoying and repetitive ads on the Internet.
I would love for something like this to be attached to LibGen, where it could read the millions of scientific papers. In my opinion, human knowledge today is more than a group of people can handle, let alone individuals. There is a lot of domain-specific knowledge that would translate to and be useful in other domains, but unless a human with specialties in both domains happens to see it, it will not get ported or assimilated into the second domain.
Obviously LLMs aren't deterministic, but they generally can be modeled as a simple, stateless function: given text, generate text. Depending on your views towards wasting energy, why store something that can be "trivially" re-generated?
One reason to NOT want to persist an LLM output is to avoid contaminating the internet with LLM text that could be confused for human text. This is useful for historical reasons (what did people say/think) and for future LLM training purposes.
In the future, there might be other reasons to not persist LLM-generated text. If you store the inputs instead of outputs then it can be "replayed" with a new LLM. I can generate some text via an LLM running on my phone, but if I later wanted a higher-quality output, it'd be easier to re-generate the data on my laptop later (or the cloud, obviously). One related idea that comes to mind is personalization - let the LLM use language each reader is familiar with, or focusing on different bits of info to different people. Maybe if an LLM knows I have a PhD in a certain subject, it'd summarize a research paper differently than it'd summarize it for my neighbor who has different preexisting knowledge.
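As a rough sketch of that "store the input, replay it later" idea (call_llm and the file path here are just illustrative placeholders, not any particular API):

    import json

    def call_llm(model: str, prompt: str) -> str:
        raise NotImplementedError  # placeholder: on-device model, cloud API, etc.

    def summarize_for(reader_profile: str, paper_text: str,
                      model: str = "small-on-device-model") -> str:
        prompt = (f"Summarize this paper for a reader with the following background: "
                  f"{reader_profile}\n\n{paper_text}")
        # Persist the input rather than the output, so the same request can be
        # replayed later with a bigger model or a different reader profile.
        with open("llm_requests.jsonl", "a") as f:
            f.write(json.dumps({"model": model, "prompt": prompt}) + "\n")
        return call_llm(model, prompt)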
This is categorically untrue. Publishing material generated like this is going to be generally better than human generated content. It takes less time, can be systematically tested and rigorous, and you can specifically avoid the pitfalls of bias and prejudice.
A system like this is multilayered, with prompts going through the whole problem solving process, considering the information presented, assuring quality and factuality, assigning the necessary citations and documentation for claims.
Accuracy isn't a problem. The way in which AI is used creates the problem - ChatGPT and most chat based models are single pass, query/response type interactions with models. Sometimes you get a second pass with a moderation system, doing a review to ensure offensive or illegal things get filtered out. Without any additional testing and prompt engineering, you're going to run into hallucinations, inefficient formulations, random "technically correct but not very useful" generations, and so forth. Raw ChatGPT content shouldn't be published without significant editing and going through the same quality review process any human written text should go through.
What Storm accomplishes is an algorithmic and methodical series of problem solving steps, each of which can be tested and verified and validated. This is synthesized in a particular way, intended as a factual reference article. Presumably you could insert debiasing and checks for narrative or political statements, ensuring attribution and citation occur for quotations, and rephrasing anything generated by the AI as a neutral, academic statement of fact with no stylistic and artistic features.
This is significantly different from the almost superficial interactions you get with chatbots, unless you specifically engineer your prompts and cycle through similar problem solving methods.
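For illustration only (this is not STORM's actual code), the kind of multi-pass pipeline being described could be sketched roughly like this, with call_llm standing in for any model call:

    def call_llm(prompt: str) -> str:
        raise NotImplementedError  # placeholder model call

    def write_reference_article(topic: str, sources: list[str]) -> str:
        numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, 1))
        # Pass 1: plan the article from the gathered sources.
        outline = call_llm(f"Outline a neutral, factual reference article on {topic}, "
                           f"based only on these sources:\n{numbered}")
        # Pass 2: draft it, requiring citations by source index.
        draft = call_llm(f"Write the article following this outline, citing sources "
                         f"as [n] for every claim:\n{outline}\n\nSources:\n{numbered}")
        # Pass 3: a separate verification pass over the draft.
        issues = call_llm(f"List unsupported claims, biased or narrative wording, and "
                          f"missing citations in:\n{draft}\n\nSources:\n{numbered}")
        # Pass 4: revise against the reviewer's findings.
        return call_llm(f"Revise the article to fix these issues:\n{issues}\n\nArticle:\n{draft}")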
Tasks like this are well within the value add domain of current AI capabilities.
Compared to the absolute trash of SEO-optimized blog posts, the agenda-driven, ulterior-motive-laden rants and rambles in social media, and the "I'm oh-so-cleverly influencing the narrative" articles posted to Wikipedia by humans, content like this is a clear winner in quality, in my opinion.
AI isn't at the point where it's going to spit out well grounded novel answers to things like "what's the cure for cancer?" but it can absolutely produce a principled and legible explanation of a phenomenon or collection of facts about a thing.
This could be achieved by generating embeddings of suitable representations of the categories once, and then embedding the content at runtime, before using some distance metric to find matching categories for the content embedding.
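A minimal sketch of that, with embed() standing in for whatever sentence-embedding model you pick:

    import numpy as np

    def embed(texts: list[str]) -> np.ndarray:
        raise NotImplementedError  # placeholder: any sentence-embedding model

    def build_category_index(categories: list[str]) -> np.ndarray:
        # One-off step: embed a short representation of each category.
        return embed(categories)

    def match_categories(content: str, categories: list[str],
                         category_vecs: np.ndarray, top_k: int = 3) -> list[str]:
        # Runtime step: embed the content and rank categories by cosine similarity.
        v = embed([content])[0]
        sims = category_vecs @ v / (np.linalg.norm(category_vecs, axis=1) * np.linalg.norm(v))
        return [categories[i] for i in np.argsort(-sims)[:top_k]]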
> current LLMs can only summarize, digest, and restate. There is no non-transient value add.
No, you're wrong. LLMs create new experiences after deployment, either by assisting humans, or by solving tasks they can validate, such as code or game play. In fact, any deployed LLM gets embedded in a larger system - a chat room, a code-running environment, a game, a simulation, a robot, or a company - and it can learn from iterative tasks because each subsequent iteration carries some kind of real-world feedback.
Besides that, LLMs trivially learn new concepts and even new skills with a short explanation or demonstration, they can be pulled out of their training distribution and collect experiences doing new things. If OpenAI has 100M users and they consume 10K tokens/user/month, that makes for 1 trillion tokens of human-AI interaction rich with new experiences and feedback.
In the text modality LLMs have consumed most of the high quality human text, that is why all SOTA models are roughly on par, they trained on the same data. That means easy time is over, AI has caught up with all human language data. But from now on AI models need to create experiences of their own, because learning from your own mistakes is much faster. The more they get used, the more feedback and new information they collect. The environment is the teacher, not everything is written in books.
And all that text - the trillions of tokens they are going to speak to us - in turn contributes to scientific discoveries and progress, and percolates back into the next training set. LLMs have massive impact at the language level on people, so by extension on the physical world and culture. They have already influenced language and the arts.
LLMs can create new experiences, learn new skills, and have a significant impact through widespread deployment and interaction. There is "value add" if you look at the grand picture.
The explanation is simple - either learn from past experience which is human text for now, or learn from present time experience which comes from the environment. The environment is where LLMs can do novel things.
I looked into this to see where it was getting new information, and as far as I can tell, it is searching wikipedia exclusively. Useful for sure, but not exactly what I was expecting based on the title.
There are Wikipedias in other languages - maybe this framework could be adapted to translate the search terms, fetch multilingual sources, translate them back, and use those as comparisons.
I've found a lot of stuff out through similar by-hand techniques that would be difficult to discover on English search. I'd be curious to see how much differential there is between accounts across language barriers.
As a base for researching the idea, Wikipedia seems like a decent data source.
For broader implementation you would want to develop the approach further. The idea of sampling other-language Wikipedia mentioned in a sibling comment seems to be a decent next step.
Extending it to bring in wider sources would be another step. I doubt it would be infallible, but it would be really interesting to see how it compares to humans performing the same task. Especially if there were an additional ability to verify written articles and make corrections.
> As a base for researching the idea, Wikipedia seems like a decent data source.
If your goal is to generate a wiki article, you can't assume one already exists. That's begging the question. If you could just search wikipedia for the answer, you wouldn't need to generate an article.
As long as the LLM Moderator deems it safe discourse let the best idea win! I'd love a debate between 2 highly-accurate and context-aware LLMs - if such a thing existed.
Otherwise it would be like reading HN or Reddit debates where 2 egomaniacs who are both wrong continually straw man each other with statements peppered with lies and parroted disinfo; ain't got time for that.
> While the system cannot produce publication-ready articles that often require a significant number of edits, experienced Wikipedia editors have found it helpful in their pre-writing stage.
So it can't produce articles that require many edits? Meaning it can produce publication-ready articles that don't need lots of edits? Or it can't produce publication-ready articles, and the articles produced require lots of edits? I can't make sense of this statement.
An AI assistant app that mixes AI features with traditional personal productivity. The AI can work in the background to answer multiple chats, handle tasks, and stream/feed entries.
I don’t know how well this works (demo is broken on mobile), but I like the idea.
Imagine an infinite wiki where articles are generated on the fly (from reputable sources - with links), including links to other articles (which are also generated) etc.
I actually like this sort of interface more than chat.
From my experiments, this thing is pretty bad. It mixes up things that have similar names, it pulls in entirely unrelated concepts, the articles it generates are mind-numbingly repetitive and verbose (although notably with slightly different "facts" each time things are restated), its citations are often completely unrelated to the topic at hand, and facts are cited by references that don't back them up.
I mean, the spelling and syntax of the sentences is mostly correct, just like any LLM content. But there's ultimately still no coherence to the output.
I guess this is a good thing for increasing coverage of neglected areas. But given how cleverly LLMs can hide hallucinations, I feel like at least a few different auditor bots should also sign off on edits to ensure everything is correct.
What's the point of a tool that helps you research a topic if said tool has to approve your topic first? It refused to research my topic because it was sensitive.
I saved a full snapshot of Wikipedia (and Stack Overflow) in the weeks before ChatGPT launched, and every day I'm more glad that I did. They will become the Low Background Steel of text.
You can just download it yourself. Wikimedia publishes regular dumps in easily accessible formats: https://dumps.wikimedia.org/enwiki/20240320/ (the most recent for english Wikipedia)
The thing is that the Wiki mods will need to be more diligent with uncited things. I also see 2 massive opportunities here. The first is that they can have agents check the cited source and verify whether the source backs up what's said to a reasonable degree. The second opportunity is fitting in things found only in other-language Wikis, which could either be incorporated into the English one or help introduce new articles. Believe it or not, LLMs can't generate English answers for things answered only in Russian (or any language) in the training data.
> The first is that they can have agents check the cited source and verify whether the source backs up what's said to a reasonable degree.
This is a hard and, to my knowledge, unsolved NLP/IR problem, and data access is an issue.
> The second opportunity is fitting in things found only in other-language Wikis, which could either be incorporated into the English one or help introduce new articles.
This has been attempted via machine translation in the past, and it failed because you need native speakers to verify and correct the translations and this wasn't the sort of work that people were jumping to volunteer to do.
I speak multiple languages. It'd be fun to be given a link saying "this article has content present that is not in the English version" and see if I can update either of the articles using what's in the other copy. But I'm not going to go read wikipedia articles in two languages to find them.
> LLMs can't generate English answers for things answered only in Russian in the training data.
For multilingual LLMs? Why do you think that?
An LLM can translate inputs of arbitrary Russian text. If there were an English question about something only in the training data as Russian, I would expect an answer - with the quality being on par with its general translation capabilities.
You know that Wikipedia keeps revisions of all articles. I'm sure you could put together a script to grab a copy of each page as it stood at a certain point in time.
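Something along these lines should work, using the MediaWiki revisions API (a sketch; the parameter names are from memory and worth double-checking against the API docs):

    import requests

    def page_as_of(title: str, timestamp: str) -> str:
        """Fetch the wikitext of a page as it stood at a given UTC timestamp."""
        params = {
            "action": "query",
            "prop": "revisions",
            "titles": title,
            "rvprop": "content",
            "rvslots": "main",
            "rvlimit": 1,
            "rvstart": timestamp,   # e.g. "2022-11-01T00:00:00Z"
            "rvdir": "older",       # newest revision at or before that time
            "format": "json",
            "formatversion": 2,
        }
        r = requests.get("https://en.wikipedia.org/w/api.php", params=params)
        page = r.json()["query"]["pages"][0]
        return page["revisions"][0]["slots"]["main"]["content"]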
This is important, as it collects and reports its references. (a) It's the correct paradigm for using LLMs. (b) Through human interactions, it can learn from its mistakes.
Oh dear lord... the subheading states: Storm - Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models
Good luck with this storm, wikis the world over.
Just a thought but ... maybe someone should ask an org like the Internet Archive to snap-shot Wikipedia asap and label it Pre-Storm and After-Storm
LLM mediocrity is just a reflection of human mediocrity, and my bet is on the average LLM to get way better much faster than the average human doing the same.
Agree with you, but on mediocrity: Mistral barely passes as usable, GPT-4 is barely better than Googling, and nothing else I've tried is even ready for production. So there's some element of the model's design, weights/embeddings, and training data that matters a lot.
Only fine-tuned models are producing impressive work, because when we say something is impressive it by definition means not like the status quo - the model must be tuned toward some bias or other, whether it's aesthetic or otherwise, in order to stand out from the rest. And generic models like GPT or Stable Diffusion will always be generic, they won't have a bias toward certain truths - they'll be mostly unbiased which we want for general research or internet search.
So it's interesting: in order to get incredible quality of work out of AI, you have to make it specific, but in order to do that, you have to train it on the work of humans. I think for this reason AI will always ultimately be behind humans, though it will of course displace a lot of the work we do, which is significant.
On the one hand, a tool is as good or bad as the person wielding it. Smart folks with the right intentions will certainly be able to use this stuff to increase the rate and quality of their output (because they're smart, so they'll verify rather than trust. Hopefully.)
On the other, moderation is an unsolved problem. The general mess of the internet is probably not quite ready to be handed a footgun of this caliber.
As with many things tech, some of the outcome falls to us, the techies. We can build systems to help steer this.
To be clear - I'm with you that these systems can absolutely be a force for vast good (at least, I think that was what you were getting at unless there was a missing '/s'). I use them daily to pretty astounding effect.
I'll admit to being a little put off by being labeled dogmatic - it's not something I consider myself to be.
It was a half sentence; for that I apologize. And I don't remember entirely what I meant.
However, I do see a lot of one-sentence "truisms" being thrown around, like "garbage in, garbage out" and the like.
These are not correct. We can just look at the current state of the art with LLMs, which have vast amounts of garbage going in - it seems like the value is in the vastness of the data rather than its quality.
> On the one hand, a tool is as good or bad as the person wielding it.
I see this as being a dogma: smart people make good LLMs, dumb people do not. But this is an open question. It seems like the biggest wallet will be the winner of the LLM game.
Ah, I see. What I meant by that was, "A tool is as good or evil as the person wielding it".
There are most definitely good and bad tools, in terms of more or less effective. Machine learning models are for sure outclassing a whole swath of tools in a number of domains and will more than likely continue to overtake purposes over time.
Whether this is a good thing for society is what I thought we were questioning - which is what I meant by steering. We can build tooling to do things like establish veracity, enable interrogation of models, and provide reasoning about internals (which we should do).
Open sourcing as much of this effort as possible will further lead to Good Things (because why are people working on these for free if they're not creating something of actual use) - whilst leaving all ML development to large corporations will inevitably ensure that the only thing you can trust an ML model to do would be spy on you and try to get you to buy stuff, because money.
God the grammar in that last sentence was terrible. But I think you get the point.
And even if we accept the premise (as flawed as it might be) that AI is not able to create original knowledge, most of what's online is dissemination and does not represent new information, just old information rewritten to be understandable by a certain segment.
Just like my first teachers said I should absolutely not use Wikipedia.
LLMs were popularized less than 2 years ago.
I think it is safe to assume that it will be as trustworthy as you see Wikipedia today, and probably even more so, as you can embed reasoning techniques into the LLMs to correct misunderstandings.
There's an important difference between wikipedia and the LLMs that are actually useful today.
Wikipedia is open, like completely open.
GPT is not.
Unless we manage to crack the distributed training / incremental improvement barriers, LLMs are a lot more likely to follow the Google path (that is, start awesome and gradually enshittify as capitalist concerns pollute the decision matrix) than they are the Wikipedia path (gradual improvement as more eyes and minds work to improve them).
It also carves into the question of what constitutes model openness.
Most people agree that just releasing weights is not enough.
But I don't think reproducing model training will ever be feasible, especially when factoring in branching and merging of models.
For me this is an open and super interesting question.
Here's what I envision (note: impossible with current state of the art)
A model that can be incrementally trained (this is the bit we're missing) hosted by a nonprofit, belonging to "we the people" (like wikipedia).
The training process could be done a little like wikipedia talk pages are now - datasets are proposed and discussed out in the open and once generally approved, trained into the model.
Because training currently involves backpropagation, this isn't possible. Hinton was working on a structure called "forward-forward" that would have overcome this (if it worked) before he decided humanity couldn't be trusted [1]. It is my hope that someone smarter than me picks up this thread of research - although in the spirit of personal responsibility I've started picking up my old math books to try and get to a point where I grok the implementation enough to experiment myself (I'm not super confident I'm gonna get there but you can't win if you don't play, right?)
It's hard to tell when (if?) we're ever going to have this - if it does happen, it'll be because a lot of people do a lot of really smart unpaid work (after seeing OpenAI do what it did, I don't have a ton of faith that even non-profit orgs have the will or the structure to pull it off. Please prove me wrong.)
How could it? LLMs hallucinate false information. Even if hallucinations are improved, the false information they've generated is now part of the body of text they will be trained on.
I mean, putting a bullet to someone's head can extirpate a brain tumor they hadn't been alerted to before, while leaving a grateful person owing you kudos. What if?
The concern is not just a vaguely cynical hand-wringing about how bad AI is. Feeding AIs their own output as training material is a bad thing for mathematical reasons, and feeding AIs the output of other very similar AIs is close enough for it to also be bad. The reasons are subtle and hard to describe in plain English, and I'm not enough of an expert to even try, so pardon if I don't. But given that it is hard to determine if output is from an AI, AI really does face a crisis of having a hard time coming across good training material in the future.
>Feeding AIs their own output as training material is a bad thing for mathematical reasons
Most model collapse studies explore degenerate cases to determine the potential limits of the training process of the same model. No wonder you will get terrible results if you recursively recompress a JPEG 100 times! In the real world it's nowhere near that bad, because models are never trained on their output alone and are always guaranteed to receive a certain amount of external data, starting from the manual dataset curation (yes, that's also fresh data in itself).
Meanwhile, synthetic datasets are entirely common. I suspect this is a non-issue that is way overblown by people misinterpreting these studies.
I suspect it's overblown today. Hopefully it'll be overblown indefinitely.
However, if AIs become as successful as Nvidia stock price implies, it could indeed become difficult to find text that is guaranteed to not be AI. It is conceivable that in 20 years it will be very difficult to generate a training set at any scale that isn't 90% already touched by AIs.
Of course, it's conceivable that in 20 years we'll have AIs that don't need the equivalent of millennia of training to come up to their full potential. The problem is much more tractable if one merely needs to produce megabytes of training data to obtain a decent understanding of English rather than many gigabytes.
I'd go with "no", because people just consuming the output of other people is a big ongoing problem. Input from the universe needs to be added in order to maintain alignment with the universe, for whichever "universe" you are considering. Without frequent reference to reality, people feeding too much on people will inevitably depart from reality.
In another context, you may know this as an "echo chamber". Not quite exactly the same concept, but very, very similar.
I do like to remind people that the AI of today and LLMs are not the whole of reality. Perhaps someday there will be AIs that are also capable of directly consulting the universe, through some sort of body they can use. But the current LLMs, which are trained on some sort of human output, need to exclude AI-generated input or they too will converge on some sort of degenerate attractor.
Yep, then we are back to "vaguely cynical hand-wringing about how bad AI is."
Currently we have mostly LLMs in the mix, but there is no reason that the AI mix will not contain embodied agents that also publish stuff on the internet (think search-and-rescue bots that automatically write a report).
Now AI is connected to reality without people in the mix.
Hmm something about this title containing the word 'research' disturbs me. I associate that word with rigorous scientific methods that leads to fact based knowledge or maybe some new hypothesis, not some LLM hallucinating sources, references, quotes and all the other garbage they spit out when challenged over a point of fact. Horrifying to think peeps might turn towards these tools for factual information.
This anthropomorphism really bothers me. These tools are useful for what they’re good for, but I really dislike the agency people keep trying to give to them.
It should also bother marketers in the AI industry because it confuses people on what the incredible value is.
So many people think LLM means chatbot, even here on HN. So many people think agent means mentally humanoid.
But we have others, like Stable Diffusion's Web UI and Leonardo.AI - these are just tools with interfaces and the text entry for prompting is not presented as though it's a conversation between 2 people.
Someone shared an AI songmaker here recently... And there's a number of promising RAG tools for improving workflows for: Doctors, mechanics, researchers, lawyers.
I agree with you and expect the "AI character" use case to narrow significantly.
I think there's always been fine line between anthropomorphism as a metaphorical way to indicate complexity versus a pitfall where people (especially outside of a field) start acting like it's a literal statement.
Ex: "the gyroscope is trying to stay upright", or "the computer complains because the update is broken" or "evolution will give the birds longer beaks".
That said, I agree that the problem is dramatically more severe when it comes to "AI".
Yes, I came to the comments to say the same thing. The LLM is not doing research - it is aggregating data associated with terms and reorganizing text based on what previous responses to a similar prompt would look like.
At the most generous level of scrutiny, the only part that could be related to research would be the aggregation of sources - but that is only a precursor to research and likely is too generalized to be as accurate as a specialist preparing data for actual research.