Much of the Web Is Machine Translated: Insights from Multi-Way Parallelism (arxiv.org)
95 points by yorwba 12 months ago | 75 comments



I hate how automatic translations are treated on the web as if they were respectable translations, provided by professionals!

I can read three languages, and I'd like my browser to pick the "main" version of a page if it happens to be in any of those three languages, and only resort to translated ones if necessary. But there is no way to get this behaviour!

And I'd like so much for Google to serve me Wikipedia pages in Italian or French if the subject concerns Italy or France, and only resort to English as a fallback, but no way, again!

If I set my main language to something other than English, I'll always get poorly translated MSDN documentation, and ugly web surfing in general. As such, I'm stuck with poorly translated YouTube video titles, and English Wikipedia when searching about a monument in my hometown.


In theory the Accept-Language header should be enough to get the behavior you want, but of course that requires every single server to implement support, so in practice many sites will keep redirecting you to a different version based on IP geolocation.

But <link rel="alternate" hreflang="..." href="..."/> tags are pretty common for SEO, so maybe an extension could parse those to check whether you would prefer one of the other versions.
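
As a rough illustration of the extension idea above, here is a minimal Python sketch of the parsing step. It assumes the page exposes its alternates through `hreflang` link tags; the `HreflangParser` and `pick_alternate` names and the example.com URLs are made up for illustration, and a real extension would run in the browser rather than in Python.

```python
from html.parser import HTMLParser


class HreflangParser(HTMLParser):
    """Collects <link rel="alternate" hreflang="..." href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.alternates = {}  # language tag (lowercased) -> URL

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        rel = (a.get("rel") or "").lower().split()
        if "alternate" in rel and a.get("hreflang") and a.get("href"):
            self.alternates[a["hreflang"].lower()] = a["href"]


def pick_alternate(html, preferred):
    """Return (lang, url) for the first preferred language offered, else None."""
    parser = HreflangParser()
    parser.feed(html)
    for lang in preferred:
        lang = lang.lower()
        for offered, url in parser.alternates.items():
            # "it" matches "it" as well as region variants such as "it-it"
            if offered == lang or offered.startswith(lang + "-"):
                return offered, url
    return None


# Hypothetical page head; the URLs are invented for this example.
page = """
<head>
<link rel="alternate" hreflang="en" href="https://example.com/en/"/>
<link rel="alternate" hreflang="it-IT" href="https://example.com/it/"/>
</head>
"""

print(pick_alternate(page, ["fr", "it", "en"]))
# ('it-it', 'https://example.com/it/')
```

Note that this only tells you which versions exist, not which one is the original, so it does not by itself solve the "prefer the source over the translation" problem.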


> In theory the Accept-Language header should be enough to get the behavior you want…

I don't think so. Accept-language doesn't allow you to differentiate between high quality and low quality content in the same language. With accept-language you could say that you prefer Italian or French over English where available, but that means you might also be served a crappy Italian/French machine translation on occasion, even though in that case you'd have preferred the English text.

(Microsoft for example seems to honour the accept-language header when looking at its documentation pages, but unfortunately that means that by default I get served the German page, which quite likely feels somewhat off due to having been machine-translated.)


There’s the Q factor in the Accept-Language header that allows you to indicate your preference for the quality level of the translation.

In practice, however, it is difficult to obtain such a quantification. Further, if you could quantify the quality of the translations, you could also just improve them instead.


> There’s the Q factor in the Accept-Language header that allows you to indicate your preference for the quality level of the translation.

???

The q factor is just for ordering the language preferences from top to bottom (i.e. it doesn't matter whether I write "de,en;q=0.8" or "en;q=0.8,de" because both of them mean that I prefer German over English because German is implicitly q=1.0 and en is explicitly only q=0.8) and doesn't say anything about the quality of the content.
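
To make the ordering point concrete, here is a small Python sketch of how a server might parse the header into a ranked list. `parse_accept_language` is a hypothetical helper, not any particular framework's API, and treating a malformed weight as 0 is an assumption of this sketch.

```python
def parse_accept_language(header):
    """Return (tag, q) pairs ordered by descending q-value (stable for ties)."""
    entries = []
    for position, part in enumerate(header.split(",")):
        part = part.strip()
        if not part:
            continue
        tag, *params = [p.strip() for p in part.split(";")]
        q = 1.0  # a missing q parameter means q=1
        for param in params:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # assumption: a malformed weight means "not acceptable"
        # Sort by weight first, then by original position to keep ties stable.
        entries.append((-q, position, tag, q))
    entries.sort()
    return [(tag, q) for _, _, tag, q in entries]


print(parse_accept_language("de,en;q=0.8"))  # [('de', 1.0), ('en', 0.8)]
print(parse_accept_language("en;q=0.8,de"))  # [('de', 1.0), ('en', 0.8)]
```

Both spellings of the header produce the same ranking, which is the point made above.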


They're called quality values because they're intended to negotiate the tradeoff between different ways to serve the same content in different formats of varying quality, e.g. highly compressed JPEG with artifacts vs lossless PNG.

Applying this to Accept-Language, "de,en;q=0.8" means "I prefer German content unless it's less than 80% as good as the English alternative."

Of course in practice servers tend to treat all content they have as quality 1 and all content they don't have as quality 0 and then the lovingly crafted fractional quality values collapse to a simple ordered list. ("We serve what we have and you eat what you get.")
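
A minimal Python sketch of that reading, assuming the server somehow had a per-language quality estimate between 0 and 1 (which, as noted above, it usually does not). The `negotiate` function and the multiply-the-two-weights rule are an illustration of this interpretation, not a description of what any particular server does.

```python
def negotiate(client_q, server_quality):
    """Pick the language that maximises client weight x server-side quality."""
    best_lang, best_score = None, 0.0
    for lang, quality in server_quality.items():
        score = client_q.get(lang, 0.0) * quality
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


client = {"de": 1.0, "en": 0.8}  # i.e. "de,en;q=0.8"

# German version is a poor machine translation: 1.0 * 0.7 < 0.8 * 1.0
print(negotiate(client, {"de": 0.7, "en": 1.0}))  # en

# German version is nearly as good as the English one: 1.0 * 0.9 > 0.8 * 1.0
print(negotiate(client, {"de": 0.9, "en": 1.0}))  # de

# Server treats everything it has as quality 1: collapses to a plain ordering
print(negotiate(client, {"de": 1.0, "en": 1.0}))  # de
```

The last call is the "we serve what we have" case: with every available version rated 1, only the client's ordering matters.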


I had to check the standard [0] on this and there seems to be some ambiguity, maybe simply between usage and intent. The MDN entry states it is simply a priority or preference [1].

"12.4.2. Quality Values

The content negotiation fields defined by this specification use a common parameter, named "q" (case-insensitive), to assign a relative "weight" to the preference for that associated kind of content. This weight is referred to as a "quality value" (or "qvalue") because the same parameter name is often used within server configurations to assign a weight to the relative quality of the various representations that can be selected for a resource."

[0] https://www.rfc-editor.org/rfc/rfc9110.html#quality.values

[1] https://developer.mozilla.org/en-US/docs/Glossary/Quality_va...


> In theory the Accept-Language header should be enough

No, it is not enough, because there is no way to tell: "Pick the source in my 3rd language rather than the translation in my 1st language".


I use the Accept-Language per site add-on [1] for this. It allows me to map top-level domains to preferred languages. Most of the time, this has worked well for me on sites which respect the header.

For example, `.de` sites tend to be originally written in German, so I map the `*.de` pattern to the German language.

[1]: https://addons.mozilla.org/en-US/firefox/addon/accept-langua...
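
For readers wondering what such a mapping amounts to, here is a rough Python sketch of the rule lookup. It is not how the add-on is implemented; the `RULES` table, the `DEFAULT` value and the `accept_language_for` helper are all hypothetical.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rules; the first matching host pattern wins.
RULES = [
    ("*.de", "de,en;q=0.8"),
    ("*.fr", "fr,en;q=0.8"),
    ("*.it", "it,en;q=0.8"),
]
DEFAULT = "en"  # assumed fallback header for all other hosts


def accept_language_for(url):
    """Return the Accept-Language value to send for the given URL."""
    host = (urlsplit(url).hostname or "").lower()
    for pattern, header in RULES:
        if fnmatchcase(host, pattern):
            return header
    return DEFAULT


print(accept_language_for("https://www.example.de/artikel"))  # de,en;q=0.8
print(accept_language_for("https://example.com/"))            # en
```

The top-level domain is only a heuristic for the original language, so this works best on sites that also respect the header.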


The browser UI is missing. There might be extensions that scratch that itch.


There are extensions that let you set per-site scripting policies (NoScript) and per-site cookie policies (Cookie AutoDelete, others) but I took a quick look on the Firefox add-on site and there aren't any extensions that let you set a per-site browser locale, only ones for quickly switching it globally.


I haven’t used it myself, but someone elsewhere in the comments linked this extension:

https://addons.mozilla.org/en-US/firefox/addon/accept-langua...


I would suggest that companies should stop machine translating any language that's actually present in the Accept-Language header.

If my Accept-Language header is "no-nb, en-us, en-gb" it means that I have declared that I _know_ all these languages, but if you have a high quality version of Norwegian I'd prefer it. Do not try to be "helpful" by forcing a machine translation from English to Norwegian on me. However if there is a useful review in French, providing it machine translated is perfectly fine.

All Google products are notorious for this. Google Play, Google Maps Reviews etc. will be poorly translated into Norwegian, instead of being left as is.
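
The rule proposed above is simple enough to sketch in a few lines of Python. This assumes the site knows the original language of each piece of content; `primary` and `should_machine_translate` are hypothetical names, and comparing only the primary subtag is a simplification.

```python
def primary(tag):
    """Primary subtag of a language tag, e.g. 'en-US' -> 'en'."""
    return tag.strip().lower().split("-")[0]


def should_machine_translate(original_lang, accepted):
    """Translate only if the reader has not declared the original language."""
    known = {primary(tag) for tag in accepted}
    return primary(original_lang) not in known


# The header from the comment above, split into its tags.
accepted = ["no-nb", "en-us", "en-gb"]

print(should_machine_translate("en-US", accepted))  # False: leave English as is
print(should_machine_translate("fr", accepted))     # True: a French review may be translated
```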


The web, like all software, is designed mostly for well off white American men who live in Northern California. They don’t tend to even speak Spanish.

Multilinguals face UX issues most professionals in the space are completely ignorant of.


I kind of wonder what it is that gives North American English speakers the false impression that machine translation is something that works, or is something less bad than human translation, when it isn't quite.

By "wonder", I don't necessarily mean it as a euphemism for "it is infuriating to me that X is Y". I've seen translation horror stories: there allegedly are weird dogmatically motivated English translators that constantly abuse their power to just use foreign content as vehicles for their agendas, while original authors being unable to proofread themselves or chalking up noted oddities to their own skill issues until all is too late.

That kind of weird translation is not a typical experience for me when relying on translations into my primary language, nor, I suspect, for most speakers of other non-English languages. But the recent big, broad push for MT, as well as the high praise for GPT translations, seems to be roughly in line with those abusive-translator horror stories.

And here I wonder; is _that_ it, or am I overthinking it?


> there allegedly are weird dogmatically motivated English translators that constantly abuse their power to just use foreign content as vehicles for their agendas, while original authors being unable to proofread themselves or chalking up noted oddities to their own skill issues until all is too late.

I've seen references to this as well, but I've not seen a source that isn't itself trying to stir up some sort of drama or bait a particular fandom. Could you elaborate?

I suspect the underlying phenomenon is simply English monolingualism - educated people around the world are generally expected to understand at least one second language, which is usually English due to the Internet and Hollywood. So they're familiar with being able to read both sides of a translation and judge it. While the monoglots simply regard the production of a piece of text they can't read by a machine as job done.


Not GP, but an example that comes to mind is the trend of Victorian-era (British) translators bowdlerizing texts to align them with typical Victorian values and writing quirks. Although obviously not machine translated in those days :P


As a native Bulgarian speaker and web surfer, I am experiencing this first-hand with most local web content. Apart from the stories on local politics and the occasional crime-related article, most of the other content is poorly-translated text from SEO-driven English sources. It's not just the web, either. Crappy American books that were trendy decades ago are translated and marketed as a cool thing. Local writers and content producers have adopted the same artificial way of writing.


Polish one here.

I just skip almost all our content now and go directly to English-language sources. Press, TV, radio, online articles, books, YouTube videos -- a big part of them is more or less "inspired" or straight up licensed. Very difficult to come across truly original content, even when it's claimed to be truly original (e.g. dancers like to say that their shows are original, but they take "master classes" from foreign teachers anyway).


Culturally, it's annoying and damaging particularly to the younger generation. We have YouTube "podcasters" dressed in loose hoodies, skimming over such articles and making a one-hour commentary show. They can never write cohesively structured text in English, yet their Bulgarian vocabulary is invaded by Americanisms used in Reddit – "cringe", "suspense", "guys", etc.


What you are hearing as "suspense" is likely "sus". "sus" is short for "suspicious", and was popularized by the viral video game 'Among Us'.


When I learned French in school 25 years ago, Americanisms were a major part of the vocabulary, especially in cool guy slang.


I'm sorry to tell you this, but it may just be that you are intelligent enough to notice, and many others are not.

I bet you a lot of people even like it and would defend it. That could be why it is prolific.


I dunno. I think people are smarter than you give them credit for.


I believe in the immense potential of hardworking individuals, and usually assume people are way smarter than they are.

My idealist bubble is often burst when I'm out actually interacting with them.


Today’s top news story about Iowa is enough to confirm.


Many such cases.


If anything, recent years have shown people to be even dumber than my most pessimistic assumptions.


Will machine translation help low-headcount languages or be their death knell? I see multiple forces at play: machine translation lessens the pressure to properly learn whatever happens to be the lingua franca in the greater region. But it also increases competition for writers who might want to write in their small language, increasing pressure to directly address the much greater market. We might see a return to something resembling the "all educated writing is in Latin" of European history until after Newton, just with a much lower threshold.

This has some truly weird implications, e.g. once language model translations switch from subtitles to dubbing it will effectively stop shifts in pronunciation, because the models won't be re-trained with an ear to the street.


> This has some truly weird implications, e.g. once language model translations switch from subtitles to dubbing it will effectively stop shifts in pronunciation, because the models won't be re-trained with an ear to the street.

This has already happened to a great degree thanks first to radio and then television, as well as the early 20th century movement to “received pronunciation” in many countries. E.g. TV finally killed off thee and thou in the 1960s.

It often feels like there are regional accent variations but they are usually quite minimal these days thanks to the spread of technology.

In Europe, the explicit suppression or standardization of language began, AFAIK, with Louis XIII and his deliberate formation of “France” (as opposed to just a collection of regions controlled by one person). In China I believe the same thing was instigated by the (by coincidence contemporary!) Qing dynasty, but it might have been a lot earlier. In any case it really zoomed throughout the world in the 20th century when communication technologies and practices were adopted by the emerging nationalist movements.

So machine translation will simply continue a longstanding process.


Chinese language unification has been a massive, longstanding, incomplete project for basically all the history of China. https://www.globalasia.org/v12no2/feature/chinas-long-strugg...

Quoting this para because it's so good as a statement of requirements:

“What we need today is a readable, audible, singable, speakable, dictatable language which we can read aloud without the need to translate into the spoken language, with the help of which we can take notes without the need to translate into the literary language, which we can [use] at the speaker’s desk as well as on the stage, and which even village grannies, women and children can understand if we read it to them. Any language that does not meet these requirements is not a living language, and can under no circumstances become the national language of our country.”


This seems like a "see three, pick two" problem. Compulsory education in the standard language will eventually solve most of these concerns. Ready understandability by elders seems quite hard to achieve, and might only be possible if the new standard is quite similar to an already widespread register. And that's not even addressing the issue that the daily life of some ethnic subgroups might happen in a completely different language, and that those groups might also be actively resistant to assimilating into the national culture.


Though the languages of China were quite diverse well beyond the Chinese Revolution.


Rapid communication over long distances is flattening regional differences, but that doesn't stop language change. It probably speeds it up instead, because any new trend can reach the entire population much more quickly.

Rapid communication over long distances in time might be able to put a stop to that, e.g. if teens end up interacting more with simulacra of long-dead actors than others of their own age, there could be some weird effects. But I think that's unlikely to happen, since someone is bound to come up with a more popular version that has all the newest slang.


Also true, though I was responding to gp’s comment about pronunciation flattening.

However now you mention it I wonder if automatic translation might also cramp or otherwise affect minority languages spoken by a small population.

Also I wonder if multi-step automatic translation will cause weirdness in spoken languages (again, mainly in minority languages)


> TV finally killed off thee and thou in the 1960s

Can someone provide a source for this? Could make for some interesting discussion points, but I don't want to propagate unverified information


It is also said that TV was more harmful to Italian dialects and had more long-lasting effects than the efforts of the Fascist Party. Also in German-speaking countries, which normally have a much more friendly attitude towards their dialects, the effects of TV usage on linguistic diversity can be easily seen.


When I am back in Australia (from the US) friends/family tell me "you haven't lost your accent" which I interpret as them telling me "I watch a lot of American television".

Losing that accent was deliberate, if sad, but IME Americans aren't really tolerant of non-US accents, even when they find them cute. And the speech recognition systems are definitely intolerant.


Look into the English dialects of Yorkshire. I don’t think this is a particularly obscure fact, though of course by the 60s it was a linguistic remnant of the elderly.


May have killed off thee and thou, but ye and yous(e) are alive in spoken English in Ireland.


It's really disappointing that the translation almost only ever goes one way: we're not going to see the greatest Bulgarian content disseminated to the English-speaking market.

(Big exception: Japan. China could have gone this route but more or less has chosen not to because of the internal political need to suppress its creative industries)


There is actually a sizeable industry translating Chinese light novels into English https://www.wuxiaworld.com/about

It wouldn't surprise me if there were some great translations of great Bulgarian content that are simply not as prominent among the flood of other content available in English.


I’m currently enjoying The Three Body Problem, written in Chinese around 2005 and translated around 2018.


10+ years ago Google was seen as the savior of local languages, at least in some Eastern European countries - Google announced that they would fully support all official EU languages in all their software, including Android etc. Now we can safely say that they managed to kill a lot of local language usage. Local scientists claim that most kids younger than 20 have seen very few good-quality texts - the majority is low-quality machine-translated stuff.


Not even eastern Europe. The German translation of MS Teams lists my Co-workers as free of charge...


Not even just non-English languages. The British English version of Windows was in the spotlight not so long ago for infamously describing `.zip` files as Postcode files.


Too bad the translators hadn’t thought about the FOSS ‘free’ distinction.


I was thinking more of an innuendo on sex work but maybe that's just me


I currently don't see it as a death knell of the "small" language, but as an enshittification of the societal discourse in relatively small nations and societies. It fosters an artificial understanding of the surrounding world. It brings in poorly-understood explanatory models from outside. It stimulates young people to shallowly imitate ghetto culture from over the ocean, not understanding their own culture. It is an impediment to the articulation of local problems that need local solutions.


> It is an impediment to the articulation of local problems that need local solutions.

The extent to which US culture war topics drive UK politics infuriates me. Everyone wants to be part of the bigger, flashier show rather than deal with real things.


I'd say the impact on computer programming languages over the next 10 years will be a useful predictor for the larger impact on human languages.

The decisions about using Python vs JavaScript vs Go vs .NET vs Rust vs Java etc etc are going to quickly become moot when businesses realize they can get value without focusing on any single tech stack. The next logical step after that is reducing the inefficiencies of having so many tech stacks to maintain - I would guess by having more and more programming languages fall to the wayside, probably starting with the ones that aren't popular and ending with the ones that aren't easily ML-usable.


I reckon you’re a manager or VC if you think it is just a matter of translating between languages.


Why are there a multitude of human languages? Well, Tower of Babel. Or at least human isolation, if you like. The point being: for human interaction, a multitude of languages is not the best solution, but for a long time it was the only solution.

The benefits of choosing a common language are obvious and historically tested, as already pointed out in parent comment. I’ll add that today, English is the language of business. Many people have learned English in addition to their native tongues. Not because English is so wonderful, but because it gains benefits in the real world of commerce, education, politics, etc. This is a familiar concept historically. There’s a reason for the phrase Lingua franca, and for Latin being the language of the medieval church.

Now, why are there a multitude of programming languages? I know, the right tool for the right job. I say that all the time too, but that is missing the point. For computer programming, a multitude of languages was really the only solution for a long time. Now we have a prospect of a human taking a concept, telling AI, and having AI provide what is necessary for the computer to do the needed work. We’re in the midst, just like for a time a culture might learn several languages, in addition to their own, in a trade route. But the trends over time seem obvious to me.


Same in Lithuania, but half of the websites are of Russian origin. I wish there was a search engine that let me exclude all the low-effort machine-translated websites from the results.


Maybe there will be soon. Actually, I see it as inevitable that the so-called "search engines" will become even more content-aware regarding synthetic text.


Same in Croatia with Russian content.


Can you explain? I'm curious as to why this is the case. It's more evident to me in the case of Lithuania.


Not sure I understand the question.

Croatia is a small country and we don't have vast amounts of content in our own language, so when I search I often get bombarded with machine-translated content of Russian origin. I suppose it is Russian, since there are images with text written on them that is in Cyrillic.

If it were Serbian (they also use Cyrillic) there would be no need for translation, since we basically speak the same language and a lot of Serbian content is in the Latin alphabet too.


I'm just genuinely curious. I also live in a small country with a limited number of L1+2 speakers and I never see any machine translated content from Russian sources. It's not immediately obvious why Croatia would have Russian and not, say, German or English sources for machine translated content.


This doesn't just extend to text; all media has become corrupted by activity on the Internet. People have started arguing with nonsense/outright lies in the real world as if they are on Twitter, TV shows are now structured in terms of short "game puzzle" like segments where you have to collect the girl from behind enemy lines or fight the end of level boss. Everything has become more like YouTube with faster cuts and repeated phrases "again again again". Teletubbies has a lot to answer for, or maybe I am just getting old :-)


Yesterday I was looking at a credit card sized multi tool which had a perfect description in Hungarian except one thing, the auto translation picked "emperor" for ruler and not another word which should have meant the measuring thing :)


To be fair, a lot of the shit I buy these days from China has equivalent mistakes in “English”.


YouTube, the second-biggest site [0], has ~14 billion videos [1], and most of them have machine-translated subtitles for all of the languages YouTube supports. I could see how, for a niche language, most of that language's content on the web could be auto-translated YouTube subs.

[0] https://www.similarweb.com/top-websites/

[1] https://tubestats.org/


Not only does YT machine translate subtitles, it also machine translates video titles to your search language! I recently ran into a video about a cop painting pistols on some teens at a billiard party if you know what I (or YT) mean...


And you cannot disable it. I watch videos in several languages, but titles are translated into English unless I switch to another language (and then they are all in that language), but I can't tell YouTube to just present the titles in their original language.

Really annoying and limiting.


Multilingual support sucks almost universally, and that is understandable when most of the innovation comes from countries with low rates of bilingualism, but it is becoming increasingly problematic as platforms like YouTube default to translating everything into a single language that takes several clicks to change. How hard would it be to implement a boolean option for "don't serve me translations by default"? Did they forget to add a column for the default language in their video database? It is particularly mind-boggling when some platforms offer a myriad of accessibility settings that cater to a very small percentage of the population, yet fail to account for people who speak more than one language. It is (or at least was at some point) possible to disable translations of YouTube video titles through a browser plugin or userscript, but the fact that no thought was put into making this toggleable in the first place is insane. Am I the only one who would rather have to translate a few things on my own than to see translations by default? I guess ultimately, it's user attention metrics that lead to such decisions. I imagine it will only get worse with LLMs.


The majority of people are proficient in either one language, or that language plus English. We are in a minority not worth catering to if selling eyeballs is the business model, because we are by definition proficient in a language.


The web has gotten so bad in English (and will only get worse), I can’t imagine how bad it must be in languages that only get the worst of it, badly translated.


Markup to declare machine-generated content should be quickly standardized and made semi-compulsory (e.g. with strong search engine penalties if you're caught not using it), if we're to prevent the web from becoming completely useless junk.


I started reading some smaller China related topics on the English Wikipedia. Every single one felt machine translated. You know, those sentences that are not even close to correct, with only the confidence of an LLM, peppered between perfectly fine English sentences.


Conversely, a lot of en Wiki articles on India-related topics have the article-dropping (i.e. no a or the before definite nouns) characteristic of Indian English. Vive (la?) difference, I say


Searching for a lot of China-related content online (in English) can be like that. I wanted to know about a specific type of tofu, every result was clearly a poor automatic translation.


Probably a mix of users of varying English ability adding bits and pieces to the same article. Some using machine translation, some not.


I just built a translate pipeline for MittaAI: https://github.com/MittaAI/mitta-community/tree/main/cookboo.... The pipelines crawl and translate any publicly available page. The `translate` pipeline uses Gemini, but could be changed to another model for use.


This is why we should continue standardizing on English as a global language. There are serious costs to language barriers, and even if many countries maintain fluency in native tongues (China, Europe), promoting English fluency has huge economic benefits, in no small part for the developing world.



