> Have you ever wondered how Chrome knows to ask you if you’d like a web page’s content to be translated? No? Okay, maybe it’s just me then. But it’s because of the lang attribute on the <html> element.
> Chrome uses CLD3 to perform language detection on all webpages. This language detection is generally very accurate. Chrome will override the language detection if there is a language attribute on the HTML tag or a language specified in the HTTP content-language header.
> However, both these signals are often incorrectly specified by the site/page, and in particular many non-English sites report the language as English (presumably based on default values from authoring tools, etc.) Currently a whitelist is used to determine if the detected language should always override the language attribute / content-language.
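For reference, the two signals the quote mentions look like this (the `hy` value is just an illustrative language code):

```html
<!-- Signal 1: the lang attribute on the root element -->
<html lang="hy">
  ...
</html>

<!-- Signal 2: the HTTP response header, shown here as a comment:
     Content-Language: hy -->
```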
Compact Language Detector v3 (CLD3) is a neural network model for language identification [1]. Sometimes it fails [2].
I don't know if it's still the case nowadays, but I seem to recall that in the old days they would check, and if their model strongly predicted a language different from the one you were claiming, they would serve the language they predicted.
Even if they no longer do this, I think it makes total sense and is what I would do, because programmers obviously make mistakes. If your page claims to be English but the detector gives a high probability of Armenian and a low probability of English, I would treat it as Armenian.
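The override heuristic described above could be sketched like this (the function name and threshold are hypothetical; Chrome's actual logic is more involved):

```python
# Hypothetical sketch of "trust the detector over the declared lang".
# The threshold and names are made up for illustration.
def effective_language(declared_lang, detected_lang, confidence,
                       override_threshold=0.9):
    """Return the language to treat the page as.

    declared_lang: value of the lang attribute (or None if absent)
    detected_lang: language predicted from the page text
    confidence:    detector's probability for detected_lang (0.0-1.0)
    """
    if declared_lang is None:
        return detected_lang
    if detected_lang != declared_lang and confidence >= override_threshold:
        # Page claims one language, but the model is highly confident
        # it is another: trust the model.
        return detected_lang
    return declared_lang
```

So a page declaring `lang="en"` whose text is predicted as Armenian with 97% confidence would be treated as Armenian.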
Google's crawler apparently ignores the "lang" attribute [0]:
> Google uses the visible content of your page to determine its language. We don’t use any code-level language information such as lang attributes, or the URL.
I learned a lot from the content; it will be useful if I need to do more localization/i18n work in the future.
Nice easter egg; this is the first time I've seen an emoji used in the URL fragment. It also loads a new one.
Edit:
As an idea to share: given the topic, I wonder whether different emojis can be localized too. An emoji can mean something different depending on the country.
One thing that's been omitted is the :dir() pseudo-class[1], which has been inexplicably ignored by WebKit and Blink. Yes, it can sort of be replicated with a descendant selector, but it would be so much more obvious and self-contained to select an element based on its own computed text direction, something that's currently only possible in Firefox.
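The difference looks roughly like this (selector support as described above; the class names are illustrative):

```css
/* Matches by computed direction, including direction inherited or
   set from CSS -- historically Firefox-only: */
blockquote:dir(rtl) {
  border-inline-start: 3px solid steelblue;
}

/* The approximate fallback: matches only elements under an ancestor
   that carries an explicit dir attribute: */
[dir="rtl"] blockquote {
  border-inline-start: 3px solid steelblue;
}
```

The fallback misses elements whose direction comes from anywhere other than an attribute in the markup, which is the self-containment the comment is pointing at.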
The article touches on having properties like borders and margins that accommodate the language, but all the examples are manually calculated. I recently saw a talk on YouTube mentioning that there is (or is coming?) support for a margin-start/end kind of syntax that will let browsers handle re-orientation of box properties depending on the language. Sadly, I can't find it; it was by a pair of people from the Chrome and Edge teams giving an update on some new features. Browser support and awareness will obviously take time to normalize these patterns, but they will help remove the need for many of these manual considerations, which means support for internationalization should improve in the coming years.
I’m confused. The feature you describe is logical properties, which is exactly what the section you’re talking about is about. I don’t know what you mean by “manually calculated”.
Indeed! I guess I was confused by the organization of the example[0] for that section. With the physical section coming first, it obscures the examples of the logical block style, so all you see are variations of:
Looking back, my comment was more due to my own inability to visualize the simplicity of the logical properties as presented versus what I've seen in other demonstrations, and to realizing I had lost sight of a resource that I think will be a helpful reference for me as this new method becomes commonplace.
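For anyone else who found the pairing hard to visualize, the physical/logical correspondence is roughly (class names here are illustrative):

```css
/* Physical: always the left edge, regardless of writing direction. */
.card-physical {
  margin-left: 1rem;
  border-left: 2px solid currentColor;
}

/* Logical: the edge where inline content starts -- left in English,
   right in Arabic or Hebrew, top in vertical Japanese. */
.card-logical {
  margin-inline-start: 1rem;
  border-inline-start: 2px solid currentColor;
}
```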
But this mixes content and style, so it would be difficult to update (for non-technical users, etc.). Besides, it's fairly bad practice and won't scale well to sentences.
> Have you ever wondered how Chrome knows to ask you if you’d like a web page’s content to be translated?
How difficult is it to automatically detect the language? This is probably naive, but how far can you get by counting how many common words on the page come from a particular language (e.g. "the of then because I" for English and "je pour des les" for French)?
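A toy version of that word-counting idea can get surprisingly far on longer texts (the word lists here are tiny and purely illustrative; real detectors like CLD3 use character n-grams and a neural network instead):

```python
# Toy language detector: count how many words of the text appear in a
# small common-word list for each language. Lists are illustrative.
STOP_WORDS = {
    "en": {"the", "of", "then", "because", "i", "and", "to", "is"},
    "fr": {"je", "pour", "des", "les", "et", "la", "le", "est"},
}

def guess_language(text):
    """Return the language whose common words appear most often in text."""
    words = text.lower().split()
    scores = {
        lang: sum(1 for w in words if w in vocab)
        for lang, vocab in STOP_WORDS.items()
    }
    return max(scores, key=scores.get)
```

It falls apart on short strings and on languages with shared vocabulary, which is part of why production detectors score character n-grams rather than whole words.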
Google translate already has an auto-detect language feature.
I suspect it would be reasonably simple to add a "Translate this" button somewhere in Chrome (perhaps it's already there, I don't have Chrome installed on this machine.)
Perhaps automatically doing it for every site would be a little bit intrusive on the privacy front.
That's coming! It's not implemented in any browsers yet, but it should be soon. It was introduced in "CSS Intrinsic & Extrinsic Sizing Module Level 4".
It will use the property "aspect-ratio" with values like "16/9" or "1/1".
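Based on the draft, usage would look roughly like this (selectors and values here are illustrative, and the syntax could still change before it ships):

```css
/* Keep a video placeholder at 16:9 regardless of its width. */
.video-frame {
  width: 100%;
  aspect-ratio: 16 / 9;
}

/* A square avatar. */
.avatar {
  aspect-ratio: 1 / 1;
}
```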
In modern browsers, whether or not lang is provided, the browser is smart enough to pick the correct encoding and glyphs.
In the olden times, browsers were not smart (i.e., they would still pick the wrong charset). You may have come across, for example, an Asian-language site (Japanese, Korean, Chinese) showing a lot of text as "?"s and "□"s, like ??? □□□. This was common in the '90s and still through the early 2000s. I'm glad you no longer have to spend time matching the charset up by hand, but the option was readily available in the browser's menu.
I think the lang tag is useful here so you can explicitly tell the browser. And if you are designing the page, you have more control, targeting localization by the international language code.
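One concrete way to target by language code once lang is set is the :lang() selector (values here are illustrative):

```css
/* Use guillemets for quotations inside French content. */
q:lang(fr) {
  quotes: "« " " »";
}

/* Pick a font stack suited to Japanese text. */
:lang(ja) {
  font-family: "Hiragino Sans", "Yu Gothic", sans-serif;
}
```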
I think you are confusing language (English, French, Arabic, etc) with encoding (ASCII, UTF-8, UTF-16, Latin1, etc).
You do sometimes see mojibake in web pages (question marks and □□ in place of the real text). It is caused by incorrect encodings. The web server tells the browser which encoding to use via the HTTP Content-Type header or the <meta charset="UTF-8"> HTML element. You should always set the encoding rather than relying on the browser guessing.
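Both declaration points mentioned above, for reference:

```html
<!-- In the HTML itself, early in <head> (the HTML spec requires the
     declaration within the first 1024 bytes of the document): -->
<meta charset="UTF-8">

<!-- Or in the HTTP response header, shown here as a comment:
     Content-Type: text/html; charset=UTF-8 -->
```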
I was referencing Asian languages as one set of examples for encoding issues. Of course setting the encoding is important (whether in HTML or in a program dealing with strings; I've had my share of hard bugs that came down to just that, but that's another story)!
Here I was just sharing my encounters with the browser rendering ?? and □□ marks because it did not know the charset. The browser had an "Auto Detect" charset mode, so one always had to toggle it (and remember to revert it when viewing another page). Exactly for the reason you indicated, the encoding "should always be set", but in those days (and maybe even today) it was not always set.
[0] MDN reference: https://developer.mozilla.org/en-US/docs/Web/CSS/ruby-positi...
[1] W3C in-depth article: https://w3c.github.io/i18n-drafts/articles/ruby/styling.en.h...
[2] https://www.w3.org/International/articles/ruby/markup.en