I hate how automatic translations are treated on the web as if they were respectable translations, provided by professionals!
I can read three languages, and I'd like my browser to pick the "main" version of a page if it happens to be in any of those three languages, and only resort to translated ones if necessary. But there is no way to get this behaviour!
And I'd like so much for google to serve me Wikipedia pages in Italian or French if the subject concerns Italy or France, and only resort to English as a fallback, but no way, again!
If I set my main language to something other than English, I'll always get poorly translated MSDN documentation, and ugly web surfing in general. As such, I'm stuck with poorly translated youtube video titles, and English Wikipedia when searching about a monument in my hometown.
In theory the Accept-Language header should be enough to get the behavior you want, but of course that requires every single server to implement support, so in practice many sites will keep redirecting you to a different version based on IP geolocation.
But <link rel="alternate" hreflang="..." href="..."/> tags are pretty common for SEO, so maybe an extension could parse those to check whether you would prefer one of the other versions.
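The core of such an extension could be quite small. Here is a minimal sketch, in Python rather than actual WebExtension JavaScript, of the idea: collect the page's hreflang alternates and return the first one matching the user's language preference list (the example URLs are hypothetical):

```python
# Sketch: parse <link rel="alternate" hreflang=...> tags from a page
# and pick the first alternate matching the user's preferred languages.
from html.parser import HTMLParser

class HreflangParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.alternates = {}  # hreflang -> href

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        if a.get("rel") == "alternate" and "hreflang" in a and "href" in a:
            self.alternates[a["hreflang"].lower()] = a["href"]

def pick_alternate(html, preferred):
    """Return (lang, url) for the best available alternate, or None."""
    parser = HreflangParser()
    parser.feed(html)
    for lang in preferred:
        if lang in parser.alternates:
            return lang, parser.alternates[lang]
    return None

html = """
<link rel="alternate" hreflang="en" href="https://example.com/en/page"/>
<link rel="alternate" hreflang="it" href="https://example.com/it/page"/>
"""
print(pick_alternate(html, ["it", "fr", "en"]))
# ('it', 'https://example.com/it/page')
```

A real extension would additionally need to decide when to redirect (only if the current page's language is not in the preference list), but the hreflang tags give it everything it needs to know about what versions exist.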
> In theory the Accept-Language header should be enough to get the behavior you want…
I don't think so. Accept-language doesn't allow you to differentiate between high quality and low quality content in the same language. With accept-language you could say that you prefer Italian or French over English where available, but that means you might also be served a crappy Italian/French machine translation on occasion, even though in that case you'd have preferred the English text.
(Microsoft, for example, seems to honour the Accept-Language header on its documentation pages, but unfortunately that means that by default I get served the German page, which quite likely reads somewhat off due to having been machine-translated.)
There’s the Q factor in the Accept-Language header that allows you to indicate your preference for the quality level of the translation.
In practice, however, it is difficult to obtain such a quantification. Further, if you could quantify translation quality, you could just as well improve the translations instead.
> There’s the Q factor in the Accept-Language header that allows you to indicate your preference for the quality level of the translation.
???
The q factor is just for ordering the language preferences from top to bottom (i.e. it doesn't matter whether I write "de,en;q=0.8" or "en;q=0.8,de" because both of them mean that I prefer German over English because German is implicitly q=1.0 and en is explicitly only q=0.8) and doesn't say anything about the quality of the content.
They're called quality values because they're intended to negotiate the tradeoff between different ways to serve the same content in different formats of varying quality, e.g. highly compressed JPEG with artifacts vs lossless PNG.
Applying this to Accept-Language, "de,en;q=0.8" means "I prefer German content unless it's less than 80% as good as the English alternative."
Of course in practice servers tend to treat all content they have as quality 1 and all content they don't have as quality 0 and then the lovingly crafted fractional quality values collapse to a simple ordered list. ("We serve what we have and you eat what you get.")
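The negotiation described above can be sketched in a few lines. This is a toy model of the spec's intent, not how any real server implements it: the client's q-value for each language is multiplied by the server's own quality estimate for its representation, and the highest product wins.

```python
# Toy model of q-value negotiation as the spec intended it.
def parse_accept_language(header):
    """Parse 'de,en;q=0.8' into {'de': 1.0, 'en': 0.8}."""
    prefs = {}
    for part in header.split(","):
        pieces = part.strip().split(";")
        lang = pieces[0].strip().lower()
        q = 1.0  # a language with no q parameter is implicitly q=1.0
        for param in pieces[1:]:
            key, _, value = param.partition("=")
            if key.strip().lower() == "q":
                q = float(value)
        prefs[lang] = q
    return prefs

def negotiate(header, server_quality):
    """server_quality: lang -> server's own estimate of that version's quality."""
    prefs = parse_accept_language(header)
    scored = {lang: prefs.get(lang, 0.0) * sq
              for lang, sq in server_quality.items()}
    return max(scored, key=scored.get)

# Client prefers German but will take English at 80%: a good English
# original beats a rough German machine translation (0.8*1.0 > 1.0*0.3)...
print(negotiate("de,en;q=0.8", {"de": 0.3, "en": 1.0}))  # 'en'
# ...but a decent German version wins (1.0*0.9 > 0.8*1.0).
print(negotiate("de,en;q=0.8", {"de": 0.9, "en": 1.0}))  # 'de'
```

In practice, as noted, servers set every available version's quality to 1, at which point the multiplication degenerates into picking the client's top-ranked available language.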
I had to check the standard [0] on this and there seems to be some ambiguity, maybe simply usage vs intent. The MDN entry states it is simply priority or preference [1].
"12.4.2. Quality Values
The content negotiation fields defined by this specification use a common parameter, named "q" (case-insensitive), to assign a relative "weight" to the preference for that associated kind of content. This weight is referred to as a "quality value" (or "qvalue") because the same parameter name is often used within server configurations to assign a weight to the relative quality of the various representations that can be selected for a resource."
I use the Accept-Language per site add-on [1] for this.
It allows me to map top-level domains to preferred languages. Most of the time, this has worked well for me on sites which respect the header.
For example, `.de` sites tend to be originally written in German, so I map the `*.de` pattern to the German language.
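The core idea behind such an add-on is just a pattern-to-header mapping. A minimal sketch (hypothetical rules, not the add-on's actual implementation, which would rewrite the header via WebExtension APIs):

```python
# Sketch: map domain patterns to an Accept-Language header value,
# falling back to a global default. Rules here are made-up examples.
import fnmatch

RULES = [
    ("*.de", "de-DE,de;q=0.9,en;q=0.5"),
    ("*.fr", "fr-FR,fr;q=0.9,en;q=0.5"),
]
DEFAULT = "en-US,en;q=0.9"

def accept_language_for(host):
    """Return the Accept-Language value to send for a given host."""
    for pattern, value in RULES:
        if fnmatch.fnmatch(host, pattern):
            return value
    return DEFAULT

print(accept_language_for("www.heise.de"))  # 'de-DE,de;q=0.9,en;q=0.5'
print(accept_language_for("example.com"))   # 'en-US,en;q=0.9'
```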
There are extensions that let you set per-site scripting policies (NoScript) and per-site cookie policies (Cookie AutoDelete, others) but I took a quick look on the Firefox add-on site and there aren't any extensions that let you set a per-site browser locale, only ones for quickly switching it globally.
I would suggest that companies should stop machine translating any language that's actually present in the Accept-Language header.
If my Accept-Language header is "no-nb, en-us, en-gb" it means that I have declared that I _know_ all these languages, but if you have a high quality version of Norwegian I'd prefer it. Do not try to be "helpful" by forcing a machine translation from English to Norwegian on me. However if there is a useful review in French, providing it machine translated is perfectly fine.
All Google products are notorious for this. Google Play, Google Maps Reviews etc. will be poorly translated into Norwegian, instead of being left as is.
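The proposed policy is easy to state as code. A sketch (the function and its signature are illustrative, not any real API): only fall back to machine translation when the user understands none of the available versions.

```python
# Sketch of the proposed policy: never machine-translate into a language
# the user has already declared they understand in Accept-Language.
def choose_version(accept_langs, original_lang, human_translations):
    """accept_langs: languages the user declared, in preference order.
    original_lang: the language the content was written in.
    human_translations: set of languages with high-quality translations."""
    for lang in accept_langs:
        if lang == original_lang or lang in human_translations:
            return ("serve", lang)  # native or human-translated text
    # The user understands none of the available versions:
    # only now is a machine translation actually helpful.
    return ("machine_translate", accept_langs[0])

# Norwegian user, English original, no human Norwegian translation:
# serve the English as-is rather than a forced machine translation.
print(choose_version(["nb", "en"], "en", set()))  # ('serve', 'en')
# Norwegian user, French original they don't read: MT is fine here.
print(choose_version(["nb", "en"], "fr", set()))  # ('machine_translate', 'nb')
```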
I kind of wonder what it is that gives NA English speakers the false impression that machine translation is something that works, or something not much worse than human translators, when it isn't quite there.
By "wonder", I don't necessarily mean it as a euphemism for "it is infuriating to me that X is Y": I've seen translation horror stories claiming there allegedly are weird, dogmatically motivated English translators who constantly abuse their position to use foreign content as a vehicle for their agendas, while the original authors are unable to proofread the result, or chalk up any noted oddities to their own lack of skill until it's all too late.
That kind of weird translation is not a typical experience for me when relying on translations into my primary language, and I suspect it isn't for most non-English speakers either, but the recent big push on MT, as well as the high praise for GPT translations, seems roughly in line with those abusive-translator horror stories.
And here I wonder; is _that_ it, or am I overthinking it?
> there allegedly are weird, dogmatically motivated English translators who constantly abuse their position to use foreign content as a vehicle for their agendas, while the original authors are unable to proofread the result, or chalk up any noted oddities to their own lack of skill until it's all too late.
I've seen references to this as well, but I've not seen a source that isn't themselves trying to stir up some sort of drama or bait a particular fandom. Could you elaborate?
I suspect the underlying phenomenon is simply English monolingualism - educated people around the world are generally expected to understand at least one second language, which is usually English due to the Internet and Hollywood. So they're familiar with being able to read both sides of a translation and judge it. While the monoglots simply regard the production of a piece of text they can't read by a machine as job done.
Not GP, but an example that comes to mind is the trend of Victorian-era (British) translators bowdlerizing texts to align them with typical Victorian values and writing quirks. Although obviously not machine translated in those days :P
As a native Bulgarian speaker and web surfer, I am experiencing this first-hand with most local web content. Apart from the stories on local politics and the occasional crime-related article, most of the other content is poorly-translated text from SEO-driven English sources. It's not just the web, either. Crappy American books that were trendy decades ago are translated and marketed as a cool thing. Local writers and content producers have adopted the same artificial way of writing.
I just skip almost all our content now and go directly to English-speaking sources. Press, TV, radio, online articles, books, YT videos -- a big part of them is more or less "inspired" or straight-up licensed. It's very difficult to come across truly original content, even when it's claimed to be truly original (e.g. dancers like to say that their shows are original, but they take "master classes" from foreign teachers anyway).
Culturally, it's annoying and damaging, particularly to the younger generation. We have YouTube "podcasters" dressed in loose hoodies, skimming over such articles and making a one-hour commentary show. They can never write cohesively structured text in English, yet their Bulgarian vocabulary is invaded by Americanisms used on Reddit – "cringe", "suspense", "guys", etc.
Will machine translation help low-headcount languages or be their death knell? I see multiple forces at play: machine translation lessens the pressure to properly learn whatever happens to be the lingua franca in the greater region. But it also increases competition for writers who might want to write in their small language, increasing pressure to directly address the much greater market. We might see a return to something resembling the "all educated writing is in Latin" of European history until after Newton, just with a much lower threshold.
This has some truly weird implications, e.g. once language model translations switch from subtitles to dubbing it will effectively stop shifts in pronunciation, because the models won't be re-trained with an ear to the street.
> This has some truly weird implications, e.g. once language model translations switch from subtitles to dubbing it will effectively stop shifts in pronunciation, because the models won't be re-trained with an ear to the street.
This has already happened to a great degree, thanks first to radio and television, as well as the early 20th-century movement to "received pronunciation" in many countries. E.g. TV finally killed off thee and thou in the 1960s.
It often feels like there are regional accent variations, but they are usually quite minimal these days thanks to the spread of technology.
In Europe, the explicit suppression or unification of language began AFAIK with Louis XIII and his deliberate formation of "France" (as opposed to just a collection of regions controlled by one person). In China I believe the same thing was instigated by the (by coincidence contemporary!) Qing dynasty, but it might have been a lot earlier. In any case it really accelerated throughout the world in the 20th century, when communication technologies and practices were adopted by the emerging nationalist movements.
So machine translation will simply continue a longstanding process.
Quoting this para because it's so good as a statement of requirements:
“What we need today is a readable, audible, singable, speakable, dictatable language which we can read aloud without the need to translate into the spoken language, with the help of which we can take notes without the need to translate into the literary language, which we can [use] at the speaker’s desk as well as on the stage, and which even village grannies, women and children can understand if we read it to them. Any language that does not meet these requirements is not a living language, and can under no circumstances become the national language of our country.”
This seems like a "pick two out of three" problem. Compulsory education in the standard language will eventually solve most of these concerns. Ready understandability by elders seems quite hard to achieve, and might only be possible if the new standard is quite similar to an already widespread register. And that's not even addressing the issue that the daily life of some ethnic subgroups might happen in a completely different language, and those groups might also actively resist assimilating into the national culture.
Rapid communication over long distances is flattening regional differences, but that doesn't stop language change. It probably speeds it up instead, because any new trend can reach the entire population much more quickly.
Rapid communication over long distances in time might be able to put a stop to that, e.g. if teens end up interacting more with simulacra of long-dead actors than others of their own age, there could be some weird effects. But I think that's unlikely to happen, since someone is bound to come up with a more popular version that has all the newest slang.
It is also said that TV was more harmful to Italian dialects, with more long-lasting effects, than the efforts of the Fascist Party. In German-speaking countries too, which normally have a much friendlier attitude towards their dialects, the effects of TV on linguistic diversity can easily be seen.
When I am back in Australia (from the US) friends/family tell me "you haven't lost your accent" which I interpret as them telling me "I watch a lot of American television".
Losing that accent was deliberate, if sad, but IME Americans aren't really tolerant of non-US accents, even when they find them cute. And the speech recognition systems are definitely intolerant.
Look into the English dialects of Yorkshire. I don’t think this is a particularly obscure fact, though of course by the 60s it was a linguistic remnant of the elderly.
It's really disappointing that the translation almost only ever goes one way: we're not going to see the greatest Bulgarian content disseminated to the English-speaking market.
(Big exception: Japan. China could have gone this route but more or less has chosen not to because of the internal political need to suppress its creative industries)
It wouldn't surprise me if there were some great translations of great Bulgarian content that are simply not as prominent among the flood of other content available in English.
10+ years ago Google was seen as a savior of local languages, at least in some Eastern European countries - Google announced that they would fully support all EU official languages in all their software, including Android etc. Now we can safely say that they managed to kill a lot of local language usage. Local scientists claim that most kids younger than 20 have seen very little text of good quality - the majority is machine-translated low-quality stuff.
Not even just non-English languages. The British English version of Windows was in the spotlight not so long ago for infamously describing `.zip` files as Postcode files.
I currently don't see it as a death knell of the "small" language, but as an enshittification of the societal discourse in relatively small nations and societies. It fosters an artificial understanding of the surrounding world. It brings in poorly-understood explanatory models from outside. It stimulates young people to shallowly imitate ghetto culture from over the ocean, not understanding their own culture. It is an impediment to the articulation of local problems that need local solutions.
> It is an impediment to the articulation of local problems that need local solutions.
The extent to which US culture war topics drive UK politics infuriates me. Everyone wants to be part of the bigger, flashier show rather than deal with real things.
I'd say the impact on computer programming languages over the next 10 years will be a useful predictor for the larger impact on human languages.
The decisions about using Python vs JavaScript vs Go vs .NET vs Rust vs Java etc. are going to quickly become moot when businesses realize they can get value without focusing on any single tech stack. The next logical step after that is reducing the inefficiency of maintaining so many tech stacks - I would guess by having more and more programming languages fall by the wayside, probably starting with the ones that aren't popular and ending with the ones that aren't easily ML-usable.
Why are there a multitude of human languages? Well, Tower of Babel. Or at least human isolation, if you like. The point being: for human interaction, a multitude of languages is not the best solution, but for a long time it was the only solution.
The benefits of choosing a common language are obvious and historically tested, as already pointed out in parent comment. I’ll add that today, English is the language of business. Many people have learned English in addition to their native tongues. Not because English is so wonderful, but because it gains benefits in the real world of commerce, education, politics, etc. This is a familiar concept historically. There’s a reason for the phrase Lingua franca, and for Latin being the language of the medieval church.
Now, why are there a multitude of programming languages? I know, the right tool for the right job. I say that all the time too, but that misses the point. For computer programming, a multitude of languages was really the only solution for a long time. Now we have the prospect of a human taking a concept, telling an AI, and having the AI provide what is necessary for the computer to do the needed work. We're in a transition period, just as a culture along a trade route might for a time learn several languages in addition to its own. But the trends over time seem obvious to me.
Same in Lithuania, but half of the websites are of Russian origin. I wish there were a search engine that let me exclude all the low-effort machine-translated websites from the results.
Croatia is a small country and we don't have vast amounts of content in our own language, so when I search I often get bombarded with machine-translated content of Russian origin. I suppose it is Russian, since there are images with text on them written in Cyrillic.
If it were Serbian (they also use Cyrillic) there would be no need for translation, since we basically speak the same language, and a lot of Serbian content is in the Latin alphabet anyway.
I'm just genuinely curious. I also live in a small country with a limited number of L1+2 speakers and I never see any machine translated content from Russian sources. It's not immediately obvious why Croatia would have Russian and not, say, German or English sources for machine translated content.
This doesn't just extend to text: all media has become corrupted by activity on the Internet. People have started arguing with nonsense/outright lies in the real world as if they were on Twitter. TV shows are now structured in terms of short "game puzzle"-like segments where you have to collect the girl from behind enemy lines or fight the end-of-level boss. Everything has become more like YouTube, with faster cuts and repeated phrases: "again again again". Teletubbies has a lot to answer for, or maybe I'm just getting old :-)
Yesterday I was looking at a credit card sized multi tool which had a perfect description in Hungarian except one thing, the auto translation picked "emperor" for ruler and not another word which should have meant the measuring thing :)
YouTube, the second-biggest site [0], has ~14 billion videos [1], and most of them have machine-translated subtitles for all of the languages YouTube supports. I could see how most of a niche language's content on the web could end up being auto-translated YouTube subs.
Not only does YT machine translate subtitles, it also machine translates video titles to your search language! I recently ran into a video about a cop painting pistols on some teens at a billiard party if you know what I (or YT) mean...
And you cannot disable it. I watch videos in several languages, but titles are translated into English unless I switch to another language (and then they are all in that language), but I can't tell Youtube to just present the titles in their original language.
Multilingual support sucks almost universally, and that is understandable when most of the innovation comes from countries with low rates of bilingualism, but it is becoming increasingly problematic as platforms like YouTube default to translating everything into a single language that takes several clicks to change.
How hard would it be to implement a boolean option for "don't serve me translations by default"? Did they forget to add a column for the default language in their video database? It is particularly mind-boggling when some platforms offer a myriad of accessibility settings that cater to a very small percentage of the population, yet fail to account for people who speak more than one language.
It is (or at least was at some point) possible to disable translations of YouTube video titles through a browser plugin or userscript, but the fact that no thought was put into making this toggleable in the first place is insane.
Am I the only one who would rather have to translate a few things on my own than to see translations by default? I guess ultimately, it's user attention metrics that lead to such decisions. I imagine it will only get worse with LLMs.
The majority of people are proficient in either one language, or that language plus English. We are in a minority not worth catering to if selling eyeballs is the business model, because we are by definition proficient in a language.
The web has gotten so bad in English (and will only get worse), I can’t imagine how bad it must be in languages that only get the worst badly translated.
Markup to declare machine-generated content should be quickly standardized and made semi-compulsory (e.g. with strong search-engine penalties if you're caught not using it), if we're to prevent the web from becoming completely useless junk.
I started reading some smaller China related topics on the English Wikipedia. Every single one felt machine translated. You know, those sentences that are not even close to correct, with only the confidence of an LLM, peppered between perfectly fine English sentences.
Conversely, a lot of en Wiki articles on India-related topics have the article-dropping (i.e. no a or the before definite nouns) characteristic of Indian English. Vive (la?) difference, I say
Searching for a lot of China-related content online (in English) can be like that. I wanted to know about a specific type of tofu, every result was clearly a poor automatic translation.
I just built a translate pipeline for MittaAI: https://github.com/MittaAI/mitta-community/tree/main/cookboo.... The pipelines crawl and translate any publicly available page. The `translate` pipeline uses Gemini, but could be changed to another model for use.
this is why we should continue standardizing on english as a global language. there are serious costs to language barriers, and even if many countries maintain fluency in native tongues (China, Europe) promoting english fluency has huge economic benefits, in no small part for the developing world.