
This kind of "social experiment" just makes everything worse for everyone.



The only datasets that will be useful for training LLMs in the future will be the ones generated before 2022. Any content generated after that date will be analogous to steel forged after 1945: inevitably contaminated by the "radioactivity" of LLMs.

The good news is that the supply of data to train more and more powerful models will soon be gone; the bad news is that it will take the internet as we know it with it.

It will be a sad day when most HN posts are AI generated, but that day will come; it's pretty much inevitable. The post above us is just a drop in an ocean of garbage generators that are only starting to pop up all around the old human web we used to "love". We'll probably miss old Twitter someday, as ridiculous as that sounds.


The good news is that this will mostly affect English; most other languages are likely to keep being written mainly by humans. This could even encourage people to use their own languages more on the internet, which I think is a win for human cultural diversity.

I don't know if there is any escape from this for native English speakers, though.


Most other languages (at least the ones I know) are already hugely polluted by useless content that was (badly) machine translated from English. Such spam sites now make up a majority of the search results I get on DuckDuckGo or Google.


That's not wrong, although it's often very easy to spot.


How so? LLMs like GPT-4 have no issue generating text in Spanish, for example.


The larger languages will probably be affected somewhat as well (I can't test Spanish, but I've used GPT-3.5 in French without issues), though not as much, I think. Such automated attacks seem to be targeted at English most often: if you're doing something like that, English is both easier to use and gives better returns (whatever they are), since there are many more English readers on the internet.

On smaller languages, though, GPT is often not good enough to use without a lot of supervision. It can give a good impression of West Flemish, but it can't sustain an actual conversation on an actual topic. Even plain Dutch is kind of hit-and-miss.


GPT-4 tends to screw up the grammar in other languages, I imagine in inverse proportion to the language's prevalence in the training data.

I often work with GPT-4 in Polish. I don't think it has ever given me an answer in Polish without at least one grammatical mistake every two or three paragraphs. The text itself is still superb, and its command of vocabulary is better than that of the median native speaker, but it reveals itself by confusing genders or forgetting the grammatical case suffixes.


Spanish is probably the second easiest one due to the sheer amount of data you can train it on. The less common the language, the shittier the output becomes.

It is utterly useless at generating pretty much anything in my native language (Bosnian/Croatian/Montenegrin/Serbian, however you want to call it). You don't even have to try to trick it; even the simplest of prompts will produce instantly dismissible garbage.

It's technically not wrong, it's (mostly) grammatically correct, but it produces sentences in such a robotic way that no human ever would. Hell, even writing the prompt in English and then running the result through Google Translate sounds more natural than giving it a prompt in my language directly. We don't need those AI detection tools; you can take one glance at a text and know with 100% certainty it wasn't written by a human.


As someone who has moderated several popular message boards over the years, I can assure you that the problem of machine generated spam is nearly as old as HTML itself.


It’s interesting what’s possible, and perhaps it shows off how low-thought a lot of human discussion is.

But in the end it’s noise and it pollutes human communication channels. It’s already hard enough to have an honest discussion when there are profit motives and agendas at play. Now we have effectively added probabilistic noise to the mix.

I don’t particularly fault the author for doing it, I’m sure it was fun and intellectually rewarding, and they’re unlikely to be the only one. But still.


Well, the current implementation only comments on posts that already have a lot of upvotes, so it's unlikely that most of its posts are read by humans. As far as I can tell, there's no clear path to making any money with it short of becoming a foreign agent or selling upvotes, neither of which I'm interested in. So I'll shut it off soon, because there's not really much else to do with it.
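For the curious, the filter is nothing fancier than checking a score threshold before replying. A minimal sketch of that kind of bot (not the actual code; it assumes PRAW for Reddit access, and generate_reply() is a hypothetical stand-in for whatever text-generation step the real bot uses, with placeholder credentials and numbers):

    # Minimal sketch, not the actual implementation. Assumes PRAW; the
    # threshold, subreddit, and credentials are placeholders for illustration.
    import praw

    UPVOTE_THRESHOLD = 500  # only reply to posts that are already popular

    def generate_reply(title: str, body: str) -> str:
        # Placeholder: call a language model here and return its completion.
        raise NotImplementedError

    reddit = praw.Reddit(
        client_id="...",
        client_secret="...",
        username="...",
        password="...",
        user_agent="comment-bot-sketch/0.1",
    )

    for submission in reddit.subreddit("AskReddit").hot(limit=50):
        if submission.score < UPVOTE_THRESHOLD:
            continue  # skip posts that haven't taken off yet
        submission.reply(generate_reply(submission.title, submission.selftext))

The point of the threshold is just that the bot only ever shows up late on threads that are already busy, which is also why its comments tend to sink unseen.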


The replies are believably human, but kind of banal. If anything, this might indicate you've captured the gestalt of the median social media user.

This one was my favorite, and happened to be the only one that got more than one upvote:

> Why are old people so obsessed with collecting things like spoons, thimbles, and shot glasses? It's like they want to have a tiny version of every object in the world.


Looks like OP replied here

https://old.reddit.com/r/IAmA/comments/13tgscb/im_hasard_f16...

I mean personally I don’t mind. It is better than some of the human comments I’ve seen…


Just imagine that every comment you responded to was someone else performing the same kind of "experiment." Does it change how much you want to engage with the community?



