
Get concerned when you see a real product in the market that has a sustainable business model.

The man behind the curtain here has an army of engineers, unlimited cloud nodes and basically has harvested all the data currently available in the world.

It doesn't get any better than this right now.

What's next? They'll ping you later on LinkedIn with this awesome idea that you need to make sure runs on a $1 microcontroller with a rechargeable battery that is supposed to last at least all day.

The actual scary stuff is the dilution of expertise. For a long time we shared our knowledge for internet points (Stack Overflow, open source projects, etc.), and it has already been harvested by the AIs. Anyone who pays tens of dollars a month for access to these services can bootstrap really quickly and do what would have needed years of expertise before.

It will dilute our current service value little by little, but you know what, it has always been like this; it is just faster now.

In the meantime, learn to automate the automator; that's the way to get ahead.




Man, we shared our knowledge via books long before the internet. And a lot of those AI models train off of thousands of books as a base before they try to incorporate less accurate knowledge from the wild internet. The cat was out of the bag on that long ago.


I saw Musk saying a couple of days ago that we've "hit the limit of peak data" for training AI. My immediate reaction was no, surely you have not trained on every copyrighted textbook on every subject ever written. You hit the peak of easily accessible internet data that you could quickly steal to train your models.


Meta famously used libgen to train, right? That is basically a source for all copyrighted textbooks and more.


You might not know it, but there is no data for AI in robotics.

Everyone has to collect their own data and pool it together or else there won't be any progress.


The 82TB Meta trained on is still a lot of textbooks.


I can’t help but think that’s the real reason he wants five bullet points from every federal worker every week. Free, hot, and fresh data!


Eventually humans’ ability to create new fresh data will be the justification for UBI. Fo shizzle


> It doesn't get any better than this right now.

And it won't ever get any worse.


You sure about that? Google Search is backed by a pretty big, serious ML model, and it's gotten a lot worse in just the last few years.


There are other search engines that are on par with Google search from a few years ago. Brave search is particularly good.

These were developed without the big bucks, so the tech has improved for smaller players at least.


Valid point there for sure.

But yes, in general, models won't get worse than they are now (or if they do, they won't stay that way). At Google, search has been enshittified for business reasons, not technical ones.


"enshitification" suggests otherwise


> The actual scary stuff is the dilution of expertise. For a long time we shared our knowledge for internet points (Stack Overflow, open source projects, etc.), and it has already been harvested by the AIs. Anyone who pays tens of dollars a month for access to these services can bootstrap really quickly and do what would have needed years of expertise before.

What scares me more is the opposite of that: information scarcity leading to less accessible intelligence on newer topics.

I’ve completely stopped posting on Reddit since the API changes, and I was extremely prolific before[1] because I genuinely love writing about random things that interest me. I know I’m not the only one: anecdotally, the overall quality of content on Reddit has nosedived since the change, and while there doesn’t seem to be a drop in traffic or activity, data seems to indicate that the vast majority of activity these days is disposable meme content[2]. This seems to be because they’re desperately attempting to stick recommendation algorithms everywhere they can, which are heavily weighted toward disposable content since people view more of it. So even if there were just as many long discussion posts as before, they’re not getting surfaced nearly as often.

And discussion quality, when it does happen, has noticeably dipped as well: the Severance subreddit has regularly gotten posts and comments where people question things that have already been fully explained in the series itself (not subtext kind of things, but “a character looked at the camera and blatantly said that in the episode you’re talking about having just watched” things). Those would have been heavily downvoted years ago; now they’re the norm.

But if LLMs learn from the in-depth posting that used to be prominent across the Internet, and that kind of in-depth posting is no longer present, a new problem presents itself. If, let’s say, a new framework releases tomorrow and becomes the next big thing, where is ChatGPT going to learn how that framework works? Most new products and platforms seem to centralize their discussion on Discord, and that’s not being fed into any LLMs that I’m aware of. Reddit post quality has nosedived. Stack Overflow keeps trying to replace different parts of its Q&A system with weird variants of AI because “it’s what visitors expect to see these days.” So we’re left with whatever documentation is available on the open Internet, and a few mediocre-quality forum posts and Reddit threads.

An LLM might be able to pull together some meaning out of that data combined with the existing data it has. But what about the framework after that? And the language after that? There’s less and less information available each time.

“Model collapse” doesn’t seem to have panned out: as long as you have external human raters, you can use AI-generated information in training. (IIRC the original model collapse discussions were the result of AI attempting to rate AI generated content and then feed right back in; that obviously didn’t work since the rater models aren’t typically any better than the generator models.) But what if the “data wells” dry up eventually? They can kick the can down the road for a while with existing data (for example LLMs can relate the quirks of new languages to the quirks of existing languages, or text to image models can learn about characters from newer media by using what it already knows about how similar characters look as a baseline), but eventually quality will start to deteriorate without new high-quality data inputs.
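A minimal sketch of what that human-rater filter amounts to (the names, data, and 4.0 cutoff are made-up illustrations, not anyone's actual pipeline):

    # Hypothetical sketch: only synthetic samples that external human raters
    # scored highly go back into the training corpus; the generator never
    # rates itself. All names and the threshold are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class SyntheticSample:
        text: str
        human_score: float  # average human rating on a 1-5 scale

    def filter_for_training(samples, min_score=4.0):
        """Keep only the samples humans judged to be high quality."""
        return [s for s in samples if s.human_score >= min_score]

    candidates = [
        SyntheticSample("model-written explainer A", human_score=4.6),
        SyntheticSample("model-written explainer B", human_score=2.1),
    ]
    print(len(filter_for_training(candidates)))  # -> 1

The point is just that some signal humans actually produced sits between generation and the next round of training.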

What are they gonna do then when all the discussion boards where that data would originate are either gone or optimized into algorithmic metric farms like all the other social media sites?

[1]: https://old.reddit.com/user/Nathan2055

[2]: I can’t find it now, but there was an analysis about six months ago that showed that since the change a significant majority of the most popular posts in a given month seem to originate from /r/MadeMeSmile. Prior to the API change, this was spread over an enormous number of subreddits (albeit with a significant presence by the “defaults” just due to comparative subscriber counts). While I think the subreddit distribution has gotten better since then, it’s still mostly passive meme posts that hit the site-wide top pages since the switchover, which is indicative of broader trends.


> What are they gonna do then when all the discussion boards where that data would originate are either gone or optimized into algorithmic metric farms like all the other social media sites?

As people use AI more and more for coding and problem solving, the providing company can keep records and train on them. I.e., if person 1 solved the problem of doing task 2 on product 3, then when person 4 tries to do the same, it can either already be trained into the model or the model can look up similar problems and solutions. This way the knowledge isn't gone or isolated; it's saved and reused. Ideally it requires permission from the user, but price cuts can be motivating: all the main players today have free versions that can collect interaction data. With millions of users, that's much more than online forums ever had.
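A toy sketch of that "look up similar problems and solutions" step (the similarity measure, the recorded examples, and every name here are made-up illustrations, not any provider's real system):

    # Toy sketch: return the previously recorded solution whose problem
    # description best overlaps with the new question. Pure illustration.
    def tokens(text):
        return set(text.lower().split())

    def similarity(a, b):
        """Jaccard overlap between two problem descriptions."""
        ta, tb = tokens(a), tokens(b)
        return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

    # Interactions the provider recorded (with permission) from earlier users.
    recorded = [
        ("configure webhook retries on product 3", "set retry_policy in the dashboard"),
        ("rotate API keys on product 3", "call the key-rotation endpoint"),
    ]

    def lookup(new_problem):
        problem, solution = max(recorded, key=lambda pair: similarity(new_problem, pair[0]))
        return solution

    print(lookup("how do I configure retries for webhooks on product 3"))
    # -> "set retry_policy in the dashboard"

In practice a provider would presumably use embeddings rather than word overlap, but the shape is the same: record the interaction once, retrieve it for the next person with the same problem.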



