
spotify> For example it is hard to see the difference between Ω and Ω even though one is obviously a Greek letter and the other is a unit for electrical resistance and in unicode they indeed have different code points

This surprised me, because the correct Ohm symbol is in fact the Greek letter, so why does Unicode have a special code point for it?

Unicode also does this for Kelvin, where the correct symbol is a capital K but Unicode has a separate code point for it, and for ångström where the correct symbol is a capital A with a circle above it but Unicode gives it a separate code point.

They do not do this for Newtons (capital N), Joules (capital J), Watts (capital W), or anything else I can see where the standard symbol is an ordinary letter or group of letters.

In all three of these cases the Unicode Consortium recommends NOT using the separate code point.

So...what's special about Ohms, Kelvins, and ångström that (1) gives them their own place in Unicode, and (2) what is the point since we are not, according to the Unicode Consortium, supposed to use them?
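For what it's worth, the "don't use them" advice shows up in normalization: each of the three signs is canonically equivalent to the ordinary letter, so normalizing folds it away. A quick sketch with Python's standard unicodedata module (the UNIT_SIGNS name is mine, purely for illustration):

```python
import unicodedata

# The three unit signs and the ordinary letters they are canonically equivalent to.
UNIT_SIGNS = {
    "\u2126": "\u03a9",  # OHM SIGN -> GREEK CAPITAL LETTER OMEGA
    "\u212a": "K",       # KELVIN SIGN -> LATIN CAPITAL LETTER K
    "\u212b": "\u00c5",  # ANGSTROM SIGN -> LATIN CAPITAL LETTER A WITH RING ABOVE
}

for sign, letter in UNIT_SIGNS.items():
    normalized = unicodedata.normalize("NFC", sign)
    # Print the character names before and after, and whether we got the plain letter.
    print(unicodedata.name(sign), "->", unicodedata.name(normalized), normalized == letter)
```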




Unicode was originally proposed as a universal character set to replace all existing character sets. For it to have any chance of acceptance it had to be possible to convert from JIS/whatever into Unicode then back again without any loss of information. So if there were any daft duplicates in legacy character sets those had to be duplicated in Unicode. I don't know if that explains those three physical units, but that's what I'd guess happened.


> So...what's special about Ohms, Kelvins, and ångström

Nothing other than misguided thinking in the early versions of the standard.

The other problem with these special symbols is that if you call tolower() or similar on them, you get the lowercase form of the "normal" character they're based on. So toupper(tolower(char)) != char.
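You can see the broken round trip in Python, whose str.lower() and str.upper() follow the Unicode case mappings rather than C's tolower(). The round_trip helper is just a name I made up for the demonstration:

```python
def round_trip(ch):
    """Lowercase then uppercase a character, like toupper(tolower(ch))."""
    return ch.lower().upper()

for sign in ("\u2126", "\u212a", "\u212b"):  # OHM SIGN, KELVIN SIGN, ANGSTROM SIGN
    result = round_trip(sign)
    # Show the code point going in, the code point coming out, and whether they match.
    print(f"U+{ord(sign):04X} -> U+{ord(result):04X}", result == sign)
```

In each case the sign comes back as the ordinary letter (omega, K, A with ring), not the code point you started with.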


Does tolower() or toupper() even make sense with general unicode characters? I wouldn't expect it to... but I've never really thought about it before :-)


Mostly, we're used to defining tolower() and toupper() to return either a lower or upper case variant if one exists, otherwise you get back what you put in. For most Unicode code points no such variants exist, so you just get back whatever you fed in. Some alphabets distinguish uppercase and lowercase, but most writing systems don't.

However, lower(upper(X)) is not defined to be the same as lower(X), and there's no promise that transforming a string with lower() or upper() does what you hoped, because that isn't how language actually works (e.g. in English the case sometimes marks proper nouns, so "May" is the Prime Minister of the UK, but "may" is just an auxiliary verb).
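Two characters where lower(upper(X)) and lower(X) visibly disagree, shown with Python's built-in string methods (the helper name is my own):

```python
def lower_of_upper(text):
    """Compute lower(upper(text)) so it can be compared with plain lower(text)."""
    return text.upper().lower()

# LATIN SMALL LETTER SHARP S and GREEK SMALL LETTER FINAL SIGMA
for ch in ("\u00df", "\u03c2"):
    print(repr(ch.lower()), "vs", repr(lower_of_upper(ch)))
```

The sharp s uppercases to "SS" and comes back as "ss"; the final sigma uppercases to a capital sigma and comes back as the ordinary lowercase sigma.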

Where standards tell you something is case-insensitive, but it's also allowed to be Unicode rather than ASCII, you can and probably should "case crush" it with tolower() and then never worry about this problem. In a few places you have to be careful because a standard says something in particular is case-insensitive, but not everything that goes in that slot is case-insensitive. For example MIME content type names like "text/plain", "TEXT/PLAIN" and "Text/Plain" are case-insensitive, but

  multipart/mixed; boundary="ABCDEFGHIJKL"
  multipart/mixed; boundary="abcdefghijkl"
  multipart/mixed; boundary="AbcDefGhiJkl"

... declare three different boundary tokens, and none of them matches the sequence abCdeFghIjkL.
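Python's standard email package happens to draw the same line: it lowercases the type name but hands the boundary back untouched. A small sketch (content_type_and_boundary is a hypothetical helper, not part of the library):

```python
from email.message import Message

def content_type_and_boundary(header_value):
    """Return (type name, boundary) for a raw Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_type() lowercases the type; get_boundary() preserves case.
    return msg.get_content_type(), msg.get_boundary()

for value in (
    'MULTIPART/MIXED; boundary="ABCDEFGHIJKL"',
    'Multipart/Mixed; boundary="AbcDefGhiJkl"',
):
    print(content_type_and_boundary(value))
```

So case-crushing the whole header before parsing would be the mistake here; only the type/subtype part is safe to fold.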


What's worse, tolower() and toupper() are locale-dependent. In most locales,

  tolower("I") = "i"
but in Turkish,

  tolower("I") = "ı"
The same holds in the other direction, because Turkish also has a capital I with a dot (İ).
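To make the Turkish rule concrete: Python's str.lower() and str.upper() ignore the locale, so a language-aware mapping has to be layered on top. A hand-rolled sketch (turkish_lower and turkish_upper are my own illustrative names, not library functions, and they only handle the two I pairs):

```python
def turkish_lower(text):
    """Lowercase with Turkish rules: dotted capital -> i, plain I -> dotless."""
    return text.replace("\u0130", "i").replace("I", "\u0131").lower()

def turkish_upper(text):
    """Uppercase with Turkish rules: i -> dotted capital, dotless -> plain I."""
    return text.replace("i", "\u0130").replace("\u0131", "I").upper()

print("I".lower(), turkish_lower("I"))  # default mapping vs Turkish mapping
print("i".upper(), turkish_upper("i"))
```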


At this point it's a backwards compatibility issue. As you say, for Ohm they now recommend using the omega symbol[1], but there's still code out there using the Ohm symbol.

Solving that wouldn't have helped in the Spotify case though, since there are a ton of other edge cases, like combining characters 'e' + ' ́' vs precomposed characters 'é', which still require an idempotent canonicalization of usernames.
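A rough sketch of what I mean, using only Python's standard library. The function names and the choice of NFKC plus casefold() are my assumptions, not what Spotify actually does; the point is just that the result gets re-checked so a non-idempotent input is rejected instead of silently accepted:

```python
import unicodedata

def canonicalize(name):
    """One pass: compatibility-normalize, case-fold, normalize again."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)

def canonical_username(name):
    """Return the canonical form, refusing names where a second pass changes it."""
    once = canonicalize(name)
    if canonicalize(once) != once:
        raise ValueError("canonical form is not stable for this name")
    return once

# Combining and precomposed spellings collapse to the same key.
print(canonical_username("e\u0301") == canonical_username("\u00e9"))
```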

Not to get too far from the topic at hand, but I came across the Spotify article earlier this week while looking to support Unicode usernames in an application. After consideration I've decided to just lock things down to ASCII for now. It's just too big a case to consider and there are bigger fish to fry.

[1] https://en.wikipedia.org/wiki/Ohm#Ohm_symbol
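For the ASCII lock-down, something along these lines is enough. The allowed character set and the 3 to 30 length limit are a made-up policy for illustration, not a recommendation:

```python
import re

# Hypothetical policy: 3 to 30 characters, ASCII letters, digits, underscore.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,30}")

def normalize_ascii_username(name):
    """Reject anything outside the ASCII policy, then fold case."""
    if not USERNAME_RE.fullmatch(name):
        raise ValueError("username must be 3-30 ASCII letters, digits or underscores")
    # Safe for ASCII: lowercasing is a plain one-to-one mapping here.
    return name.lower()

print(normalize_ascii_username("Big_Bird"))
```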



