
Get concerned when you see a real product in the market that has a sustainable business model.

The man behind the curtain here has an army of engineers, unlimited cloud nodes and basically has harvested all the data currently available in the world.

It doesn't get any better than this right now.

What's next? They'll ping you later on LinkedIn with this awesome idea that you need to make sure runs on a $1 microcontroller with a rechargeable battery that is supposed to last at least all day.

The actual scary stuff is the dilution of expertise. For a long time we shared our knowledge for internet points (Stack Overflow, open source projects, etc.), and it has already been harvested by the AIs. Anyone who pays tens of dollars a month for access to these services can bootstrap really quickly and do what would have needed years of expertise before.

It will dilute our current service value little by little, but you know what, it has always been like this; it is just faster now.

In the meantime, learn to automate the automator; that's the way to get ahead.




Man, we shared our knowledge via books long before the internet. And a lot of those AI models train off of thousands of books as a base before they try to incorporate less accurate knowledge from the wild internet. The cat was out of the bag on that long ago.


I saw Musk saying a couple of days ago that we've "hit the limit of peak data" for training AI. My immediate reaction was no, surely you have not trained on every copyrighted textbook on every subject ever written. You hit the peak of easily accessible internet data that you could quickly steal to train your models.


Meta famously used libgen to train, right? That is basically a source for all copyrighted textbooks and more.


You might not know it, but there is no data for AI in robotics.

Everyone has to collect their own data and pool it together or else there won't be any progress.


The 82TB Meta trained on is still a lot of textbooks.


I can’t help but think that’s the real reason he wants five bullet points from every federal worker every week. Free, hot, and fresh data!


Eventually humans’ ability to create new fresh data will be the justification for UBI. Fo shizzle


> It doesn't get any better than this right now.

And it won't ever get any worse.


You sure about that? Google Search is backed by a pretty big, serious ML model, and it's gotten a lot worse in just the last few years.


There are other search engines that are on par with Google search from a few years ago. Brave search is particularly good.

These were developed without the big bucks, so the tech has improved for smaller players at least.


Valid point there for sure.

But yes, in general, models won't get worse than they are now (or if they do, they won't stay that way). At Google, search has been enshittified for business reasons, not technical ones.


"enshitification" suggests otherwise


> The actual scary stuff is the dilution of expertise. For a long time we shared our knowledge for internet points (Stack Overflow, open source projects, etc.), and it has already been harvested by the AIs. Anyone who pays tens of dollars a month for access to these services can bootstrap really quickly and do what would have needed years of expertise before.

What scares me more is the opposite of that: information scarcity leading to less accessible intelligence on newer topics.

I’ve completely stopped posting on Reddit since the API changes, and I was extremely prolific before[1] because I genuinely love writing about random things that interest me. I know I’m not the only one: anecdotally, the overall quality of content on Reddit has nosedived since the change, and while there doesn’t seem to be a drop in traffic or activity, data seems to indicate that the vast majority of activity these days is disposable meme content[2]. This seems to be because they’re desperately attempting to stick recommendation algorithms everywhere they can, which are heavily weighted toward disposable content since people view more of it. So even if there were just as many long discussion posts as before, they’re not getting surfaced nearly as often.

And discussion quality, when it does happen, has noticeably dipped as well: the Severance subreddit has regularly gotten posts and comments where people question things that have already been fully explained in the series itself (not subtext kind of things, but “a character looked at the camera and blatantly said that in the episode you’re talking about having just watched” things). Those would have been heavily downvoted years ago; now they’re the norm.

But if LLMs learn from the in-depth posting that used to be prominent across the Internet, and that kind of in-depth posting is no longer present, a new problem presents itself. If, let’s say, a new framework releases tomorrow and becomes the next big thing, where is ChatGPT going to learn how that framework works? Most new products and platforms seem to centralize their discussion on Discord, and that’s not being fed into any LLMs that I’m aware of. Reddit post quality has nosedived. Stack Overflow keeps trying to replace different parts of its Q&A system with weird variants of AI because “it’s what visitors expect to see these days.” So we’re left with whatever documentation is available on the open Internet, and a few mediocre-quality forum posts and Reddit threads.

An LLM might be able to pull together some meaning out of that data combined with the existing data it has. But what about the framework after that? And the language after that? There’s less and less information available each time.

“Model collapse” doesn’t seem to have panned out: as long as you have external human raters, you can use AI-generated information in training. (IIRC the original model collapse discussions were the result of AI attempting to rate AI generated content and then feed right back in; that obviously didn’t work since the rater models aren’t typically any better than the generator models.) But what if the “data wells” dry up eventually? They can kick the can down the road for a while with existing data (for example LLMs can relate the quirks of new languages to the quirks of existing languages, or text to image models can learn about characters from newer media by using what it already knows about how similar characters look as a baseline), but eventually quality will start to deteriorate without new high-quality data inputs.
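A minimal sketch of what that human-rater filter amounts to (the names, data, and 4.0 cutoff are made-up illustrations, not anyone's actual pipeline):

    # Hypothetical sketch: only synthetic samples that external human raters
    # scored highly go back into the training corpus; the generator never
    # rates itself. All names and the threshold are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class SyntheticSample:
        text: str
        human_score: float  # average human rating on a 1-5 scale

    def filter_for_training(samples, min_score=4.0):
        """Keep only the samples humans judged to be high quality."""
        return [s for s in samples if s.human_score >= min_score]

    candidates = [
        SyntheticSample("model-written explainer A", human_score=4.6),
        SyntheticSample("model-written explainer B", human_score=2.1),
    ]
    print(len(filter_for_training(candidates)))  # -> 1

The point is just that some signal humans actually produced sits between generation and the next round of training.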

What are they gonna do then when all the discussion boards where that data would originate are either gone or optimized into algorithmic metric farms like all the other social media sites?

[1]: https://old.reddit.com/user/Nathan2055

[2]: I can’t find it now, but there was an analysis about six months ago that showed that since the change a significant majority of the most popular posts in a given month seem to originate from /r/MadeMeSmile. Prior to the API change, this was spread over an enormous number of subreddits (albeit with a significant presence by the “defaults” just due to comparative subscriber counts). While I think the subreddit distribution has gotten better since then, it’s still mostly passive meme posts that hit the site-wide top pages since the switchover, which is indicative of broader trends.


> What are they gonna do then when all the discussion boards where that data would originate are either gone or optimized into algorithmic metric farms like all the other social media sites?

As people use AI more and more for coding and problem solving, the providing company can keep records and train on them. I.e., if person 1 solved the problem of doing task 2 on product 3, then when person 4 tries to do the same, it can either already be trained into the model or the model can look up similar problems and solutions. This way the knowledge isn't gone or isolated; it's saved and reused. Ideally it requires permission from the user, but price cuts can be motivating: all the main players today have free versions that can collect interaction data. With millions of users, that's much more than online forums ever had.
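A toy sketch of that "look up similar problems and solutions" step (the similarity measure, the recorded examples, and every name here are made-up illustrations, not any provider's real system):

    # Toy sketch: return the previously recorded solution whose problem
    # description best overlaps with the new question. Pure illustration.
    def tokens(text):
        return set(text.lower().split())

    def similarity(a, b):
        """Jaccard overlap between two problem descriptions."""
        ta, tb = tokens(a), tokens(b)
        return len(ta & tb) / len(ta | tb) if (ta | tb) else 0.0

    # Interactions the provider recorded (with permission) from earlier users.
    recorded = [
        ("configure webhook retries on product 3", "set retry_policy in the dashboard"),
        ("rotate API keys on product 3", "call the key-rotation endpoint"),
    ]

    def lookup(new_problem):
        problem, solution = max(recorded, key=lambda pair: similarity(new_problem, pair[0]))
        return solution

    print(lookup("how do I configure retries for webhooks on product 3"))
    # -> "set retry_policy in the dashboard"

In practice a provider would presumably use embeddings rather than word overlap, but the shape is the same: record the interaction once, retrieve it for the next person with the same problem.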



