
Sorry for the unclear wording; I meant "an array of fixed-width values, each of which stores a single Unicode code-point". For example, any program that stores text in UCS-4: an array of 32-bit values, each holding a single code point.
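
To make that concrete, here's a minimal sketch in Python 3 (the names are just illustrative, not from any library): encode a string with the UTF-32 codec, so every code point occupies exactly four bytes and random access is a simple offset calculation.

    import struct

    text = u"G\u00e9\U0001d11e"       # 'G', e-acute, MUSICAL SYMBOL G CLEF
    raw = text.encode("utf-32-be")    # fixed width: exactly 4 bytes per code point
    assert len(raw) == 4 * 3
    # O(1) indexing by code point is just arithmetic on byte offsets:
    third = struct.unpack(">I", raw[8:12])[0]
    print(hex(third))                 # 0x1d11e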

This is how way too many people think UTF-16 works: each code point gets 16 bits, you have an array of them, and so you can count characters, do O(1) random indexing, and so on. This is a harmful myth, of course. Code points do not correspond neatly to glyphs, and UTF-16 is a variable-width encoding. Most text stays inside the Basic Multilingual Plane, though, so a lot of people can get away with pretending it's fixed-width, right up until they can't.
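
For a concrete illustration (a small Python 3 sketch, not anything from the discussion above): a character outside the BMP costs two UTF-16 code units, so code-unit counts and code-point counts stop agreeing exactly where the myth breaks.

    bmp = u"\u00e9"           # inside the BMP: one UTF-16 code unit
    astral = u"\U0001d11e"    # outside the BMP: encoded as a surrogate pair
    print(len(bmp.encode("utf-16-le")) // 2)     # 1 code unit for 1 code point
    print(len(astral.encode("utf-16-le")) // 2)  # 2 code units for 1 code point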

The most maddening instance of this confusion that I've seen so far is in Python's Unicode string handling. Guess what happens when you run this Python code to find the length of a string containing a single Unicode code-point:

    print len(u"\U0001d11e")
This will print either 1 or 2, depending on what flags the Python interpreter was compiled with! Compiled one way (the default on Mac OS X), it uses an internal string representation that it sometimes treats as UTF-16 and sometimes as UCS-2, and you get 2. With the other set of flags (the default on Ubuntu, IIRC) it uses UCS-4, does the Right Thing, and prints 1. Java at least behaves consistently for the same task, but it forces you to write the character explicitly as a UTF-16 surrogate pair, "\uD834\uDD1E", and String.length() then counts code units (2 here), not code points; String.codePointCount() is what counts the latter.
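
If you want to know at runtime which kind of build you got, sys.maxunicode gives it away. This is a sketch assuming a pre-3.3 interpreter, since later versions dropped the narrow/wide distinction entirely:

    import sys

    # 0xFFFF on a "narrow" (UTF-16-ish) build, 0x10FFFF on a "wide" (UCS-4) build.
    if sys.maxunicode == 0xFFFF:
        print("narrow build: len(u'\\U0001d11e') == 2")
    else:
        print("wide build: len(u'\\U0001d11e') == 1")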

The redeeming virtue both share is that they will do the right thing if you treat everything as variable-width, use the provided methods for encoding and decoding, and avoid the hairy parts left over from the days when people naively assumed that UTF-16 and UCS-2 were the same thing and that 16 bits ought to be enough for anybody.
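
As a sketch of what "treat everything as variable-width" looks like in practice (the helper below is mine, not a library function), here's a code-point counter that gives the same answer whether or not the underlying representation stores surrogate pairs:

    def code_point_count(s):
        # Walk the string as a variable-width sequence: a high surrogate
        # followed by a low surrogate counts as a single code point.
        i, count = 0, 0
        while i < len(s):
            if (u"\ud800" <= s[i] <= u"\udbff" and i + 1 < len(s)
                    and u"\udc00" <= s[i + 1] <= u"\udfff"):
                i += 2
            else:
                i += 1
            count += 1
        return count

    print(code_point_count(u"\U0001d11e"))  # 1 on narrow and wide builds alike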

Thanks, that's a great clarification.