spotify> For example it is hard to see the difference between Ω and Ω even thoug...

bloak · on Aug 21, 2018

Unicode was originally proposed as a universal character set to replace all existing character sets. For it to have any chance of acceptance it had to be possible to convert from JIS/whatever into Unicode then back again without any loss of information. So if there were any daft duplicates in legacy character sets those had to be duplicated in Unicode. I don't know if that explains those three physical units, but that's what I'd guess happened.

akira2501 · on Aug 21, 2018

> So...what's special about Ohms, Kelvins, and ångström

Nothing other than misguided thinking in the early versions of the standard.

The other problem with these special symbols is that if you call tolower() or similar on them, you get back the "normal" character they're based on, not a lowercase form of the special symbol. So toupper(tolower(char)) != char.
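A minimal sketch of that round trip, using Python's str.lower()/str.upper() as a stand-in for tolower()/toupper() (my choice of language; round_trip is just an illustrative helper name):

```python
import unicodedata

# The three "unit" code points that duplicate ordinary letters.
SPECIAL = ["\u2126", "\u212A", "\u212B"]  # OHM SIGN, KELVIN SIGN, ANGSTROM SIGN


def round_trip(ch: str) -> str:
    """Lowercase, then uppercase, a single character."""
    return ch.lower().upper()


for ch in SPECIAL:
    back = round_trip(ch)
    # Show where each symbol starts and where it ends up after the round trip.
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
          f"U+{ord(back):04X} {unicodedata.name(back)}")
```

Each symbol comes back as the plain letter (Greek capital omega, Latin K, Latin A with ring above) rather than as itself.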

r_c_a_d · on Aug 21, 2018

Does tolower() or toupper() even make sense with general Unicode characters? I wouldn't expect it to... but I've never really thought about it before :-)

tialaramex · on Aug 21, 2018

Mostly, we're used to defining tolower() and toupper() to return a lowercase or uppercase variant if one exists; otherwise you get back what you put in. For most Unicode code points no such variant exists, so you just get back whatever you fed in. Some alphabets distinguish uppercase from lowercase, but most writing systems don't.

However, lower(upper(X)) is not defined to be the same as lower(X), and there's no promise that transforming a string with lower() or upper() does what you hoped, because that isn't how language actually works (e.g. in English, case sometimes marks proper nouns: "May" is the Prime Minister of the UK, but "may" is just an auxiliary verb).
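A small illustration of lower(upper(X)) differing from lower(X), again assuming Python's built-in str methods rather than C's tolower()/toupper(); lower_of_upper is a made-up helper name:

```python
def lower_of_upper(s: str) -> str:
    """Apply upper() and then lower(), i.e. lower(upper(X))."""
    return s.upper().lower()


# German sharp s uppercases to two letters, so the round trip cannot restore it.
# Greek final sigma shares its capital with the ordinary sigma.
# A CJK character has no case variants and passes through unchanged.
for x in ["\u00df", "\u03c2", "\u6f22"]:
    print(repr(x), "lower:", repr(x.lower()), "lower(upper):", repr(lower_of_upper(x)))
```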

Where standards tell you something is case-insensitive, but it's also allowed to be Unicode rather than ASCII, you can and probably should "case crush" it with tolower() and then never worry about this problem. In a few places you have to be careful because a standard says something in particular is case-insensitive, but not everything that goes in that slot is case-insensitive. For example MIME content type names like "text/plain", "TEXT/PLAIN" and "Text/Plain" are case-insensitive, but

  multipart/mixed; boundary="ABCDEFGHIJKL"
  multipart/mixed; boundary="abcdefghijkl"
  multipart/mixed; boundary="AbcDefGhiJkl"

... declare three different boundary tokens, and none of them matches the sequence abCdeFghIjkL.
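One way to see the split between the case-insensitive type name and the case-sensitive boundary, sketched with Python's standard email package (parse_content_type is a hypothetical wrapper, not part of any standard):

```python
from email.message import Message


def parse_content_type(header_value: str):
    """Return (content type, boundary) for a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() lowercases the type name;
    # get_boundary() returns the parameter value with its case untouched.
    return msg.get_content_type(), msg.get_boundary()


for value in ('multipart/mixed; boundary="ABCDEFGHIJKL"',
              'MULTIPART/MIXED; boundary="abcdefghijkl"',
              'Multipart/Mixed; boundary="AbcDefGhiJkl"'):
    print(parse_content_type(value))
```

The type compares equal in all three cases, while the three boundaries stay distinct, so case crushing the whole header would corrupt the boundary.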

majewsky · on Aug 21, 2018

What's worse, tolower() and toupper() are locale-dependent. In most locales,

  tolower("I") = "i"

but in Turkish,

  tolower("I") = "ı"

The same holds in the other direction, because there is also a capital I with a dot (İ).
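For contrast, Python's str.lower() and str.upper() do not consult the locale at all, so the Turkish behaviour has to be layered on by hand. A rough sketch, where turkish_lower and turkish_upper are hypothetical helpers that only cover the dotted/dotless i pairs:

```python
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı


def turkish_lower(s: str) -> str:
    """Lowercase with the Turkish I/ı and İ/i pairing."""
    # Handle the two Turkish-specific pairs first, then use the default rules.
    return s.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()


def turkish_upper(s: str) -> str:
    """Uppercase with the Turkish i/İ and ı/I pairing."""
    return s.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I").upper()


print("I".lower(), turkish_lower("I"))   # default rules vs. Turkish rules
print("i".upper(), turkish_upper("i"))
```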

gotodengo · on Aug 21, 2018

At this point it's a backwards-compatibility issue. As you say, for Ohm they now recommend using the omega symbol[1], but there's still code out there using the Ohm symbol.

Solving that wouldn't have helped in the Spotify case, though, since there are a ton of other edge cases, like combining characters ('e' + ' ́') vs. precomposed characters ('é'), which still create the need for an idempotent canonicalization of usernames.
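To make "idempotent canonicalization" concrete, here is a rough sketch using Python's unicodedata module. canonical_username is a hypothetical function, NFKC plus casefold() is my assumption about a reasonable recipe rather than what Spotify actually does, and the loop only spot-checks idempotence on a few samples instead of proving it:

```python
import unicodedata


def canonical_username(name: str) -> str:
    """Normalize, case-fold, then normalize again (folding can denormalize)."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)


SAMPLES = [
    "e\u0301",   # 'e' followed by COMBINING ACUTE ACCENT
    "\u00e9",    # precomposed 'é'
    "\u2126",    # OHM SIGN
    "\u03a9",    # GREEK CAPITAL LETTER OMEGA
    "\u1d2e\u1d35\u1d33\u1d2e\u1d35\u1d3f\u1d30",  # modifier capital letters
]

for s in SAMPLES:
    once = canonical_username(s)
    # Idempotence: canonicalizing an already canonical name must change nothing.
    assert canonical_username(once) == once
    print(repr(s), "->", repr(once))
```

The property that matters is the assert in the loop: if a second pass can change the result, two different stored names can collide after someone re-canonicalizes one of them.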

Not to get too far from the topic at hand, but I came across the Spotify article earlier this week while looking to support Unicode usernames in an application. After consideration I've decided to just lock things down to ASCII for now. It's just too big a case to consider and there are bigger fish to fry.
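Roughly what the ASCII lock-down looks like, as a sketch only: the allowed characters and the 32-character limit are assumptions for illustration, and both function names are made up.

```python
import re

# Assumed policy: ASCII letters, digits, '_', '.', '-', length 1 to 32.
_USERNAME_RE = re.compile(r"[A-Za-z0-9_.-]{1,32}")


def is_ascii_username(name: str) -> bool:
    """True if the whole name consists only of the allowed ASCII characters."""
    return _USERNAME_RE.fullmatch(name) is not None


def canonical_ascii_username(name: str) -> str:
    """Reject anything outside the allowed set, then case-crush."""
    if not is_ascii_username(name):
        raise ValueError(f"unsupported username: {name!r}")
    # With ASCII-only input, lower() is idempotent and has no locale surprises.
    return name.lower()


print(canonical_ascii_username("BigBird"))
print(is_ascii_username("\u212Aelvin"))  # KELVIN SIGN is not ASCII 'K'
```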

[1] https://en.wikipedia.org/wiki/Ohm#Ohm_symbol