
The only datasets that will be useful for training LLMs in the future will be the ones generated before 2022. Any content generated after that date will be analogous to steel forged after 1945: inevitably contaminated by the "radioactivity" of LLM output, the way post-war steel is contaminated by fallout from nuclear testing.

The good news is that the supply of clean data for training ever more powerful models will soon dry up; the bad news is that it will take the internet as we know it down with it.

It will be a sad day when most HN posts are AI-generated, but that day will come; it's pretty much inevitable. The post above us is just a drop in an ocean of garbage generators that are only now starting to pop up all around the old human web we used to "love". We'll probably miss old Twitter someday, as ridiculous as that sounds.




The good news is that this will mostly affect English; most other languages are likely to keep being written largely by humans. This could even encourage people to use their own language more on the internet, which I think is a win for human cultural diversity.

I don't know if there is any escape from this for native English speakers, though.


Most other languages (at least the ones I know) are already hugely polluted by useless content that was (badly) machine-translated from English. Such spam sites now make up the majority of my search results on DuckDuckGo or Google.


That's not wrong, although it's often very easy to spot.


How so? LLMs like GPT-4 have no issue generating text in Spanish, for example.


The larger languages will probably be affected somewhat as well (I can't test Spanish, but I've used GPT-3.5 in French without issues), though not as much, I think; such automated attacks seem to be aimed mostly at English. I suppose if you're doing something like that, English is both easier to use and gives better returns (whatever those are), since there are many more English readers on the internet.

For smaller languages, though, GPT is often not good enough to use without a lot of supervision. It can give a good impression of West Flemish, but it can't sustain an actual conversation on an actual topic. Even plain Dutch is hit-and-miss.


GPT-4 tends to screw up the grammar in other languages, I imagine in inverse proportion to the language's prevalence in the training data.

I often work with GPT-4 in Polish. I don't think it has ever given me an answer in Polish without at least one grammatical mistake every two or three paragraphs. The text itself is still superb, and its command of vocabulary better than that of the median native speaker, but it reveals itself by confusing genders or forgetting grammatical case suffixes.


Spanish is probably the second easiest one, due to the sheer amount of data you can train on. The less common the language, the shittier the output becomes.

It is utterly useless at generating pretty much anything in my native language (Bosnian/Croatian/Montenegrin/Serbian, however you want to call it). You don't even have to try to trick it: even the simplest of prompts will produce instantly dismissible garbage.

It's technically not wrong, and it's (mostly) grammatically correct, but it phrases sentences in such a robotic way that no human ever would. Hell, even generating the text in English and then running it through Google Translate sounds more natural than prompting it in my language directly (a rough sketch of that pipeline is below). We don't need AI detection tools; one glimpse at a text is enough to know with 100% certainty it wasn't written by a human.
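
For illustration, a minimal sketch of that generate-in-English-then-translate pipeline, assuming the openai and deep_translator Python packages; the model name, prompt, and target language code are placeholders of mine, not anything the commenter specified:

    # Hypothetical sketch: generate in English, then machine-translate.
    # Assumes OPENAI_API_KEY is set; model and target language are placeholders.
    from openai import OpenAI
    from deep_translator import GoogleTranslator

    client = OpenAI()

    def generate_then_translate(prompt: str, target_lang: str = "hr") -> str:
        # Ask the model for an English answer first...
        english = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # ...then machine-translate it, which (per the comment above) often
        # reads more naturally than prompting in the target language directly.
        return GoogleTranslator(source="en", target=target_lang).translate(english)

    print(generate_then_translate("Write a short paragraph about autumn."))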


As someone who has moderated several popular message boards over the years, I can assure you that the problem of machine-generated spam is nearly as old as HTML itself.



