
The only datasets that will be useful for training LLMs in the future will be the ones generated before 2022. Any content generated after that date will be analogous to steel forged after 1945: inevitably contaminated by the "radioactivity" of LLM output, the way post-war steel is contaminated by fallout from nuclear testing.

The good news is that the supply of clean data for training ever more powerful models will soon dry up; the bad news is that it will take the internet as we know it down with it.

It will be a sad day when most HN posts are AI-generated, but that day will come; it's pretty much inevitable. The post above us is just a drop in an ocean of garbage generators that are only now starting to pop up all around the old human web we used to "love". We'll probably miss old Twitter someday, as ridiculous as that sounds.




The good news is that this will mostly affect English; most other languages are likely to keep being written largely by humans. This could even encourage people to use their own language more on the internet, which I think is a win for human cultural diversity.

I don't know if there is any escape from this for native English speakers, though.


Most other languages (at least the ones I know) are already hugely polluted by useless content that was (badly) machine-translated from English. Such spam sites now make up the majority of my search results on DuckDuckGo or Google.


That's not wrong, although it's often very easy to spot.


How so? LLMs like GPT-4 have no issue generating text in Spanish, for example.


The larger languages will probably be affected somewhat as well (I can't test Spanish, but I've used GPT-3.5 in French without issues), though not as much, I think; such automated attacks seem to be aimed mostly at English. I suppose if you're doing something like that, English is both easier to use and gives better returns (whatever those are), since there are many more English readers on the internet.

For smaller languages, though, GPT is often not good enough to use without a lot of supervision. It can give a good impression of West Flemish, but it can't sustain an actual conversation on an actual topic. Even plain Dutch is hit-and-miss.


GPT-4 tends to screw up the grammar in other languages, I imagine in inverse proportion to the language's prevalence in the training data.

I often work with GPT-4 in Polish. I don't think it has ever given me an answer in Polish without at least one grammatical mistake every two or three paragraphs. The text itself is still superb, and its command of vocabulary better than that of the median native speaker, but it reveals itself by confusing genders or forgetting grammatical case suffixes.


Spanish is probably the second easiest one, due to the sheer amount of data you can train on. The less common the language, the shittier the output becomes.

It is utterly useless at generating pretty much anything in my native language (Bosnian/Croatian/Montenegrin/Serbian, however you want to call it). You don't even have to try to trick it: even the simplest of prompts will produce instantly dismissible garbage.

It's technically not wrong, and it's (mostly) grammatically correct, but it phrases sentences in such a robotic way that no human ever would. Hell, even generating the text in English and then running it through Google Translate sounds more natural than prompting it in my language directly (a rough sketch of that pipeline is below). We don't need AI detection tools; one glimpse at a text is enough to know with 100% certainty it wasn't written by a human.
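
For illustration, a minimal sketch of that generate-in-English-then-translate pipeline, assuming the openai and deep_translator Python packages; the model name, prompt, and target language code are placeholders of mine, not anything the commenter specified:

    # Hypothetical sketch: generate in English, then machine-translate.
    # Assumes OPENAI_API_KEY is set; model and target language are placeholders.
    from openai import OpenAI
    from deep_translator import GoogleTranslator

    client = OpenAI()

    def generate_then_translate(prompt: str, target_lang: str = "hr") -> str:
        # Ask the model for an English answer first...
        english = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        # ...then machine-translate it, which (per the comment above) often
        # reads more naturally than prompting in the target language directly.
        return GoogleTranslator(source="en", target=target_lang).translate(english)

    print(generate_then_translate("Write a short paragraph about autumn."))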


As someone who has moderated several popular message boards over the years, I can assure you that the problem of machine-generated spam is nearly as old as HTML itself.



