Hacker News
LLMs Are This Close to Destroying the Internet (boehs.org)
77 points by cdme 9 months ago | 106 comments



LLMs are just a red herring. All the way back in 2004, Digg was started with the premise of "there is too much content on the web, let's make a way for users to collaboratively find the good stuff". Variations of this are how most content is consumed today. People still google stuff, but a lot of traffic is driven by successful posts on Facebook, LinkedIn, Twitter, Reddit, or even niche sites like HN, because they serve as a filter for quality content.

The real issue isn't that LLMs make it much easier to generate garbage content, it's that the filter mechanisms we have used for two decades are all failing, and have been massively degrading for a decade now. Just like google is dominated by SEO spam, most of the platforms I mentioned above are overrun by bots and click farms. Upvotes, hearts and reshares can be bought for pennies. Or just buy a preprogrammed bot that posts "subtle" hints to your Onlyfans everywhere.

Of course on an individual level there are strategies to avoid that content. And those won't really be changed by LLMs.


For that matter, Yahoo! was so big in the 90s because it was a more or less successful human curation of the Web as it existed at the time that offered better discovery than e.g. WebCrawler. And the first two letters of its name stood for "Yet Another" because it wasn't a new idea then, either.


If it leads to people (read: me) spending less time online, that's probably a good thing (even though it's bad for our industry).


SEO, conventional spam, propaganda, rage bait, click bait, addictionware platforms like Instagram and TikTok, and malware sites already murdered the web. LLMs are just giving it last rights and shoveling in the dirt.


s/rights/rites/g

(Not trying to be snarky or pedantic, "last rights" threw me for a loop until I figured it out.)


I think that's precisely the point of the article...


If you live off of formulaic click bait, you are in trouble.


Did you miss the portion of the article that specifically referenced a literary journal having to stop accepting submissions due to the volume of AI generated entries? Or the deluge of academic research papers that are AI nonsense?


And the part where I say it’s very difficult to search for high quality stuff on the internet.

I don’t want to read clickbait. Whenever I see clickbait, I click off. That means for half my searches I need to kludge through many, many results to find something by a human who actually cares


My hope/goal is simply to get people off the Internet, rather than improve it. Life wasn't bad pre-web. The Internet provided a lot of utility/benefits, but we've slingshotted to using it way, way beyond the point of those benefits. Some pull back and less engagement online is good.


I'm a bit less apocalyptic, but I wonder about this, the "autophagy" TFA refers to.

AI producing and consuming its own output, and the resulting "model collapse", with human beings increasingly out of the loop in this game of telephone.

I've noticed some "summaries" of content here on HN, posted as comments, with the telltale signs of LLMs. Adding no value whatsoever. Could it someday be the case that a large number of posts even here are AI bots interacting with each other?

I'm not sure about the "rebels" at the end of the article, how will they find each other in sufficient numbers, how will word of mouth spread in sufficient numbers?


Same on twitter (as in the screenshot in the article) -- lots of regurgitative comments, adding words but stating nothing / rephrasing the parent tweet / etc.

I think the next step is actually a big leap backwards - human-curated knowledge bases, islands in the sea of AI-generated noise.


It will be ironically funny if the original Yahoo! model of human-curated search results wins in the end.

Also, the LLMs aren't necessarily doomed. Presumably, the humans running them are looking out for the problems of inbreeding or LLMs-eating-their-own-excrement, and will implement some combination of not feeding that crap, and/or freezing the model at a peak point where the results are still primarily based on human creations. Perhaps ChatGPT 4.3 will be the peak?

EDIT: typos


LLMs will be quick to catch on to the islands. They'd have to be private islands to fend off the corruption.


Do you mean silos that LLMs and bots cannot access? Like the walled gardens of Slack, Discord, Facebook, etc?

In that case the cure will be worse than the disease, in my opinion. I don't want a siloed internet. Silos exist and are bad enough now, imagine if in the future the only good, human-produced content is secreted away within them.


Yes, they'll most likely be private islands. Honestly, it won't be too bad now that I think about it.


I don't think rebel numbers are an issue. It didn't take long for Facebook to get massive penetration using other modes of communication.

I think this will be a radical change for the better. Once the baseline quality and trust drops low enough, there is an incentive for curation, walled gardens, and reputation-based networks. I hope for the return of webrings, moderated forums, and quality standards for participation.

I also think it may finally break the hold of social media obsession. Outrage loses much of its luster when people realize the other end of it is likely a bot. This is a step toward asking "who is this person, and why do I care what they think."


This is what I was getting at, but you formulated it so astutely!


The only true solution to this is seldom mentioned: Real life. "Word of mouth" can still happen (and should happen) by reals words, uttered in the physical world, by real mouths, in the physical world.

I suspect many of our current societal ills are caused by too much information spreading too quickly, not forced to go through a multitude of brains or stand the tests of time.


> I'm not sure about the "rebels" at the end of the article, how will they find each other in sufficient numbers, how will word of mouth spread in sufficient numbers?

Friends of friends networks, personal recommendations sent over email, communicators, Discord.


Would this scale, when it cannot outcompete garbage content now?


I don't know, but I think we need to give the consumers/recipients of information better tools for discovery, ideally at source instead of disseminating it through intermediaries who exist only to provide real estate for ads.


Yeah, seems like email was already destroyed as anything useful before LLMs kicked in.


An interesting thought: why don't we ever apply the concept of Model Collapse to the human population, i.e., that all our complex thoughts just devolve into the same mindless thing?

And even saying that sounds absurd...because it is absurd. We find ways to get new insights from essentially the same human minds from thousands of years ago interacting with each other to produce data.

Why would sufficiently advanced LLMs be any different?


For two reasons:

1. Humans get input from real life, not just other humans. An AI can’t go outside and smell the flowers. As the % of AI input from other AI approaches 100% they will be cut off.

2. Current LLMs are not “sufficiently advanced”.

They even talked about #1 in the article, if you had bothered to read it


Offtopic: the article finishes with a list of links bulleted with, to my eyes, some weird symbols. If anyone else is curious about them, they are Georgian numerals:

https://en.wikipedia.org/wiki/Georgian_numerals#Numeric_valu...


I’m the author, just a fun little stylistic decision. I was looking at the style options for OL and I thought these both looked cool and had cool history. I quite like culture and embedding little bits of it into my site — another one is if your preferred language is set to Japanese, my name will appear in Katakana, which is typically used to write out English words in Japan


Title says "internet" but blog post only discusses "web".

Practically speaking, it's arguably impossible to create a new internet, i.e., the physical infrastructure for one.

But it's possible to create alternatives to "the" web. And nothing requires alternatives to be anything like the web that LLMs are trained on.


Doesn't basic PageRank for search engines solve the problem of search results filling up with AI junk?

As in, a page should only rank well if the page is linked to from websites that have good reputations.

So it's unlikely reputable news websites are going to link to and boost the search rank of sites full of fake articles and fake images, and any news website that did would quickly start to lose reputation/trust and their own ranking.

Before AI, the web was already full of unlimited junk, spun articles, clickbait and spam, with backlink farms trying to promote it. What's different here?
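The PageRank intuition in this thread can be sketched in a few lines. This is a toy, hypothetical graph (not how any real search engine ranks today): two reputable sites link to each other, while a backlink farm points at an AI junk page that receives no reputable inbound links. Under plain power iteration with damping, the junk page can never accumulate much rank.

```python
import numpy as np

# Hypothetical link graph: nodes 0-1 are reputable sites linking to each
# other; nodes 2-3 are a backlink farm pointing at node 4, an AI junk page
# that nothing reputable links to. Node 4 has no outlinks (dangling).
links = {0: [1], 1: [0], 2: [4], 3: [4], 4: []}
n, d = 5, 0.85  # d = damping factor

# Column-stochastic transition matrix: M[dst, src] = 1/outdegree(src)
M = np.zeros((n, n))
for src, dsts in links.items():
    for dst in dsts:
        M[dst, src] = 1.0 / len(dsts)

# Power iteration; dangling pages spread their rank uniformly
r = np.full(n, 1.0 / n)
for _ in range(100):
    dangling = r[[s for s, ds in links.items() if not ds]].sum()
    r = d * (M @ r + dangling / n) + (1 - d) / n

print(np.round(r, 3))  # reputable pair ranks well above the farm and its target
```

The farm pages only hold the baseline teleport rank, so the junk page they boost stays below anything with genuine inbound links — which is exactly the property the comment is asking about.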



> The site itself is mostly an AI generated mess, on top of that - its rankings are manipulated by articles like this, ... Most of these articles look like a traditional "link pyramid" network.

So the problem is not being able to detect link pyramids? Where is the link pyramid getting its PageRank reputation from and can't you ban the lot of them once the manipulation is discovered?


I think there could also be the converse problem: quality content hosted on some small website could be hard to discover in the first place, since no one links to it.


It's now even cheaper to build a backlink farm, and telling whether content is artificially generated has become much harder.


Where would these backlink farms get their reputable backlinks from to make this work though?

It's not like e.g. the BBC are going to start linking to AI generated articles without checking their accuracy. And if they did, the BBC would risk their PageRank influence and their own ranking.

I get backlink farms can work right now and Google doesn't strictly follow PageRank, but I'm asking if the PageRank concept used in the right way would help.


Probably from buying up expired domain names that used to host reputable content and that have earned links in the past.


Maybe you then run into a social problem: while the PageRank algorithm would dictate downranking the BBC in that case, would the search engine provider actually be willing to downrank such a popular organization? Think even about places that spread misinformation and conspiracy theories. Those host poor-quality content but generate traffic, so AFAIK they don't get penalized in search results.


Nah, the internet will always be there and always be changing. New tech rarely kills old tech, just reduces its usage. Ebooks didn't kill printed books, tv didn't kill radio and so on. People move on to what they like though - you can still buy paper newspapers and vinyl records but many don't. I've read far more "oh no LLMs" content by humans than actual LLM content myself as I don't really like the latter.


If you lower the barriers to producing large volumes of garbage text it follows that you'll get more of it — who wants to read any of it, I don't know. AI generated text isn't interesting to read, the video isn't interesting to watch and the music is painfully boring.


Who wants to read it is Google and Microsoft, to further train their generative models. That's why it's called the Habsburg problem.


The AI industry, bringing innovations to dogfooding.


What does this recipe spam really tell us? It’s a little harder to find a good recipe and it’s more of a hint generator than authoritative advice. But most recipes don’t need to be followed all that precisely. Cooks will make their own adjustments based on local availability of ingredients, what equipment they have, and their own preferences. For an experienced cook, these hints are usually good enough. If you get a somewhat different dish, but it’s still tasty, so what? Authenticity is overrated.

There has long been a similar situation for song lyrics. It doesn’t matter if you get a song’s lyrics exactly right. If you sing it in a little different way, that’s fine.

I’m also reminded of the situation with folk music where there are many variations on any given tune, where the tune came from is lost in history, and it’s fine.

This is how cultural evolution works. The remixes used to be done by people and now they’re increasingly done by machine. LLM’s are new, but Wikipedia is a remix, too, and so are social media and forums like this one.

When you really care about getting it right, you need to go into research mode. Follow the citations and find the primary sources. Stack Overflow is a useful source of hints, but also read the documentation, the source code, and do your own testing.

It’s more work. Most of the time we don’t need to do it because the hints are good enough, and when they aren’t, we can recognize that.

Search engines are still quite useful for research when you find the right keywords. You might need to switch to a more specialized search engine, though? They’re still useful even if they don’t have the enormous cultural impact of the default choice.


The recipe problem mentioned isn't that the recipes are bad but that reading them requires scrolling through pages of autogenerated fake backstory about how the "author" first had this particular paella in Majorca, and that the site's only real goal is to get you to scan that screen space because it also has ads on it, and that now they can essentially make as much of that filler as they want for free.


I wish


What is dead may never die


  That is not dead which can eternal spam,
  And with strange aeons death gives not a damn.


Model collapse doomers are such an odd bunch. It's as if they think that the humans training the models have no agency over the data they use, or how that data is processed/weighted/etc.


Your "hahahaha those doomers" comment is either intentionally cruel or Pollyannish. Your "people have agency" "criticism" doesn't address the key points of the article:

1. The damage in the form of the destruction of news sources and replacement of writers with AI has happened and continues to happen. Writers are already jobless. News sources are already dead. But sure, these jobless people with no hope for their industry are "doomers". The article doesn't mention artists, but they are also getting hammered.

2. The worst players already have agency over their data choices, they are using that to build silos and destroy their competitors who don't have that ability.

3. Some social places have already become unusable. Twitter for instance. Threads may be able to push back on this, but they are already a dangerous silo.


News is a really weird place to start with AI, because it obviously doesn't work. The AI cannot go out into the world and do investigative journalism. It still can't access the tons of documents hidden away in archives and libraries around the world, and it cannot sit through endless hours of courtroom sessions (even if that might be a fairly good place to start). The only "news" AIs can report on is press releases, plus rehashed articles written by journalists.

Due to the volume of half-arsed news the current AIs can generate, your average reader might not notice that some is missing, though I question that. People are starting to notice that their newspapers don't actually have news anymore (and that's just due to cost optimization and competition from ad-supported online news).

I fear that we're entering a world where some of us pay for news written by real journalists, while the masses consume garbage "news" which is more tailored to them clicking ads, rather than learning about the world.


A lot of in-depth news comes from organizations that are funded by paid subscriptions, which gets copied elsewhere. The better websites cite their sources when re-reporting it. So, I guess we can thank the subscribers?

We’re all copying other people’s homework, especially in social media. How much of what you know about the world comes from personal observation? Most people haven’t traveled to most places.


> News is a really weird place to start with AI, because it obviously doesn't work. The AI cannot go out into the world and do investigative journalism.

The news business are almost entirely an ad business now. Only a few sites do real investigative journalism and only with a few journalists.

Many news businesses are now owned or majority controlled by billionaires and have the expected editorial biases that also don't really support hard hitting journalism.


1. I don't care. People lose jobs from technological progress, they've been able to see the writing on the wall for a while, and it's still not even a done deal, so frankly if journalists aren't figuring out how to get paid now, it's kind of on them. If their answer to how do I get paid is lawsuits, they deserve the suffering they're going to get.

2. Open source data sets can have curation too, this seems like a silly strawman to make.

3. Twitter isn't just unusable because of AI. It was never more than barely usable to begin with, and many of its users didn't help the situation.


> 1. I don't care. People lose jobs from technological progress, they've been able to see the writing on the wall for a while, and it's still not even a done deal, so frankly if journalists aren't figuring out how to get paid now, it's kind of on them. If their answer to how do I get paid is lawsuits, they deserve the suffering they're going to get.

I got it from your original post: you don't give a shit. I was just calling you out on it. At least you own it.

> 2. Open source data sets can have curation too, this seems like a silly strawman to make.

It's not a strawman, but "curated open source data sets" are a red herring: those data sets are not what is going to be controlling the online experience of the vast majority of those online.


Why should I care about journalists vs anyone else? The pain that journalists suffer will be outweighed by the benefit to society of the tools - assuming people like you don't throw up their hands in defeat and choose to give all their power away to soulless megacorps when choosing which AI tools to use.

News is going to break, and people who break it accurately with personality and a unique take are going to do well regardless of whether newspapers and other bastions of old-world journalism continue to exist.


Writing has always been a tough gig and maybe it’s gotten worse, but it doesn’t seem quite that bad? Aren’t some newer writers making money from subscriptions?

From a reader’s point of view, it seems like there’s plenty to read.


And there are plenty of successful writers too. The internet has just enabled unsuccessful ones to complain about it at scale.


Taken literally this comment suggests that OpenAI et al actually want racist 4chan rants in ChatGPT's training data, or illegal child pornography in DALL-E's training data. Obviously that's not the case: the problem is that it's impossible to effectively audit the amount of data required to make generative AI work.

If Big Tech can't even catch pornography or explicit uses of the n-word, there's no way they'll be able to filter subtle LLM / art generator hallucinations.


Going one level deeper, even AI-generated training data which doesn't contain hallucinations or other obvious errors could still negatively affect the performance of a new model. GenAI models have a tendency to repeat the same cliches, in sentence structure or the composition of images or whatever else, so even if their output is "fine" in the sense that it's factually correct and everyone has the right number of fingers, a model trained on it will still internalize those cliches and make them even more likely to show up again. That's even harder to audit out because it would only become apparent in aggregate when you realise your new model has learned to interpret "fantasy artwork" as "fantasy artwork in the generic Midjourney RLHF style" since it was trained on so many examples of that.
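This narrowing effect can be shown with a deliberately crude toy (not a real training pipeline): each "generation" samples from the previous model, keeps only the most typical half of its output — a stand-in for generators favoring high-likelihood, cliched text — and refits a simple Gaussian "model" to what's left. The fitted distribution shrinks generation after generation, an analogue of models converging on their own cliches.

```python
import random
import statistics

# Toy model-collapse sketch under stated assumptions: the "model" is just
# a Gaussian, and "preferring typical output" is modeled by keeping only
# the half of the samples closest to the mean before refitting.
random.seed(42)
mu, sigma = 0.0, 1.0  # generation 0: the original human data distribution
history = [sigma]
for gen in range(8):
    samples = [random.gauss(mu, sigma) for _ in range(2000)]
    samples.sort(key=lambda x: abs(x - mu))  # most "typical" first
    kept = samples[:1000]                    # keep the central half
    mu = statistics.mean(kept)
    sigma = statistics.stdev(kept)
    history.append(sigma)

print([round(s, 3) for s in history])  # the spread shrinks every generation
```

The diversity loss happens even though every individual sample is "fine" on its own — which mirrors the point above: nothing in any single output flags the problem; it only shows up in aggregate.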


That's really just a feature of the default system prompt, although ubiquitous. You can character-prompt local LLMs as you like, but no one seems to do that.


ChatGPT doesn't take character prompts very well, that's why people don't know. Claude on the other hand is a champ with characters.


Not so much that, as OpenAI has finite resources and is allocating them to areas that are more valuable.


Has it been demonstrated that there is illegal CSAM in the DALL-E training?


I don't think it has, if only because OpenAI aren't transparent about where they get their image training data from, but the most widely used public image generation dataset was found to contain CSAM: https://www.theverge.com/2023/12/20/24009418/generative-ai-i...


Why would you need to?

Obviously generalizing over pictures of humans and pictures of porn will enable CSAM generation ...

Plus a LOT of things have been declared illegal pornographic material in various places: all porn (China), drawn CSAM, any kind of drawn porn with unclear ages, ...


> "Why would you need to?"

If someone is going to suggest that there is "illegal child pornography in DALL-E's training data" I think it needs to be backed up - that's a pretty big accusation.


It hasn't been demonstrated in DALL-E, but this was a problem in older OpenAI text-video models, and it's in plenty of CommonCrawl/etc data. There is anecdata from contractors suggesting that DALL-E generated CSAM imagery during RLHF, which is circumstantial evidence that similar images are somewhere in the training data. I think it's fair to say we can't rely on OpenAI's technical competence to remove this stuff from their training data, despite claims to the contrary w.r.t DALL-E 2. (Likewise it seems Copilot Designer churns out CSAM like candy.)

Your attitude is naive and completely at odds with (honest) AI safety research. The rational thing for users is to assume DALL-E 2 is contaminated by CSAM unless OpenAI can publicly demonstrate otherwise. Their leadership is too stupid and dishonest for people to take their word.

https://ieeexplore.ieee.org/abstract/document/9423393

https://arxiv.org/pdf/2110.01963.pdf?trk=public_post_comment...

https://www.theguardian.com/technology/2023/aug/02/ai-chatbo...


You're saying my attitude is "naive". I'm saying that it's irresponsible for you to state that there is CSAM in DALL-E's training data unless you can back that up.

I find OpenAI's refusal to document what's in the training data infuriating, but that doesn't lead me to assume they weren't able to filter out CSAM from it.

If they don't care about this issue at all, why are they exposing low paid workers in Kenya to these conditions?


I feel like the take there is that they're destroying the internet from the perspective of people who want to make money off of ad revenue. Cynically, it's like they want the internet to be kind of bad - as in distraction from ads, begging for money and content arranged to support that, but not so bad it puts people off entirely or drives the incremental value of putting out "content" to zero because of automation.

For people writing on the internet for motivations other than ad-driven monetization, and those who read their stuff, there is basically no issue.


I feel like it's the exact opposite, LLMs are great for ad revenue because you can create a ton of garbage and monetize it in a fraction of the time it takes actual content creators to make their content. Meanwhile, the actual good content is drowned in a sea of garbage and much harder to find


You can only monetize garbage for a short time before people catch on and the entire model is destroyed (the good content along with the garbage).


Grifters rarely think ahead, as noted in the article


Yeah, that was my argument. It’s incredibly difficult to find high quality writing at this point (regardless of if it’s for profit or not). The most successful place for me has been hacker news, but that’s not a search engine.


> For people writing on the internet for motivations other than ad-driven monetization, and those who read their stuff, there is basically no issue.

I doubt these users will avoid a drop in quality, as most authors will roll out LLM-laden content that adds more noise to the signal.


> For people writing on the internet for motivations other than ad-driven monetization, and those who read their stuff, there is basically no issue.

This is the key, I think. But it's small--The Small Web. I think we'll be better off, though.

Have a day job and produce quality content in your spare time, is my motto. A few billion people producing content in their spare time is still more than we can ever hope to consume.


That seems like it would mean you're restricted to a dataset from before 2022ish or you have to use data that's significantly more expensive to clean.

That seems like it might make the training process untenable.


Isn't all of recorded human history up until 2022 enough to train a pretty smart ai? It'll miss out on future trends, but some targeted training on trusted news sources should be enough. Eventually languages will change enough that the old training data will seem dated, but that's going to take a while to happen.


Yeah, but basically all of that comes from post 2000.

Information production seems to be on an exponential curve, so it won't be that long before you're training on a minority of all data.


Is it really true though? If they scrape HN today I don’t think their sampling would be that “polluted” compared to 2022.

If you were to teach a LLM to play chess today - wouldn’t you use fully AI generated data because that’s by far the best data ?


Adding to this, the Internet is extremely noisy and polluted with all manner of crap, this has been the status quo for over a decade.

The notion that you could just grab data off the internet completely unvetted, and then train a useful model off that data seems very unrealistic. To get anything useful out of real world crawl data, you need a lot of filtering and processing to select the good bits. This is true if you're training an AI or building a search engine.


They are only humans and will face the same issues that humans trying to control the quality of the search engine results did in the past and those policing social media networks do today. They too have admitted they cannot keep up and started using automated tools a long time ago.


The difference is that there are perverse incentives keeping garbage in search and social. With AI models, the incentive is accuracy.


It will change. Sooner or later AI models will be altered to produce output that embeds advertising/paid content/paid messaging.


There are open source models, so I don't think people are going to be able to get away with that.


Money will choose the model. They need lots of expensive compute power so there will have to be a business model and ad-supported is the way to go on the internet.


The barriers/costs for pushing AI generated noise are so low, I agree there's the potential for it to overwhelm human written content. In a sense it effectively happened to email, again with lots of $doom, but the timescale turned out to be a decade long frog boil. Nobody, or a close approximation to, runs their own email server any more.

In an ironic twist the gatekeepers of the semi-walled gardens/islands of email that arose are now harvesting it for training content. I do so despise those weekly machine generated synopses telling me how I am doing.


Maybe it's the whole processing and weighting thing that gives people cause for concern.


Well then welcome to being a human, because the gatekeepers of knowledge have been doing this since the beginning of time. What do you think the priestly class's original job was?


Slightly related, yesterday someone posted this AI-generated song:

https://www.udio.com/songs/p66uVGEgifEBLdoR5Ttyue

It's not that it's perfect, it's cheesy and commercial and obviously lacking some coherence- but it also has some pretty good parts and... performances... and if passed on radio it would certainly attract attention. A casual listener would hardly imagine that it's AI-generated.

The scariest part is the lyrics: "lorem ipsum dolor sit amet...". Nonsense, just pure mindless nonsense, sung with feeling and expression. I think this is a good example of the things to come.


what radio station plays music like that? it's not hiphop/rap/r&b, it's not country, and it's not mix of 70s/80s/90s. it's also not sports or politics talk. it's also not Tejano. if there's a radio station in your area that plays anything other than what I've listed, you have a much better market than the one I'm in


Good enough for rock stations. Americans probably don't have one because your guitar-music market is full of country; here in Europe we barely recognize the genre at all.


BBC radio plays some cool music late at night.


Sounds like terrible, generic power metal. I don't think we need it generated at scale. Nightwish sans what little made their work interesting.


If you're a lyricist you can use these tools to make good songs that do have a message.

If you're a song maker you can probably use LLMs to put together a decent lyric or help in the process of turning your ideas into something that makes sense.

To think that you just press a button and something comes out of it and you're happy with that, eh, I mean it's possible but that's never what art's been about. You can get lucky, but you can have consistently decent results by just putting some more effort into the creation process.


Yes, the human is still in the loop, at least for choosing the best generations. But I don't think it will be there for long, how old is this tech now? A year, max? Give it a few more.

Then, before deciding that we like a song that we like, we'll have to investigate whether it's real or not. If the emphasis and expression and tonal changes actually correspond to the intents and emotions of a performer or are just created from nothing.


Meh. Uncreative mass-produced music has existed forever.

I don't see this sort of thing as much of a threat to the mainstream music industry, since the mainstream music industry isn't really about selling music, it's about selling concert tickets.


Not at this scale. There was never a time in history when it was so easy to produce SO MANY songs.

Before, you actually had to pay people to compose songs and those people had to sit down at a piano and write the songs, then other people had to perform the songs.

Now, I can buy a certain number (?) of the latest NVIDIA GPUs, build a cluster, and then churn out songs literally 24/7. Endlessly, song after song after song, each minute a new song. Just press the button, and a new song is created -- _no other humans involved_.

Yes, slop, kitsch and mass produced music were always produced. Never at this scale.


Though I should add that AI probably will have a big effect on bespoke and cinematic music. Look forward to hearing lots of this stuff in Hollywood, especially lower-budget TV shows.


We have had non-AI music that fits your entire description for some time now.

See also the k-pop revolution (as stanned by people who don’t understand a word of Korean)


I actually like K-POP because I don't understand a word of Korean.


Yes, that’s why the parent comment is relevant.


LLMs are destroying themselves by adding new censorship daily


[flagged]


Like what?


It's what enables you to even be here to hear OP. Somewhere along the line the magic got lost, in the name of higher level human problems, but it is quite literally one of humanity's most important inventions.


porn, black markets, file exchangers, blockchains


[flagged]


What is web3's promise? Only the one about content being undeletable?


No, it’s about making it easier for users to use useful software and for developers to produce useful software at web-scale.


What is the meaning of "web-scale"? Do you mean some hosting ability of the blockchain?

Thanks for the laughter BTW, "easier for users who use some useful software" is the answer I expected to receive.



