A few things I wish I could tell all newbies about staying out of trouble with Unicode:
* Use UTF-8 for external text, whenever possible. If your collaborators have other ideas, bribe them with tasty cookies or something, because this right here solves a lot of hassles. There are some circumstances in which a different encoding might have its advantages, but it is tremendously reassuring to be able to say "Ah, text! I shall decode it as UTF-8!" and be right. This has the advantage of being compatible with ASCII input, and avoiding the perennial UTF16/UCS2 confusion.
* Make it explicit that you're using UTF-8. For example, if you're making a web page, be sure to set "Content-Type: text/html; charset=utf-8" in the HTTP headers, to make the browser's content encoding detection trivially correct.
* When dealing with strings in your favorite programming language, always know whether it's an array of Unicode code-points, or of bytes in UTF-8, or some third messed-up thing. Not all strings are the same kind of thing! Unless your programming language has this distinction enforced by its type system, of course.
* Be aware when you're crossing the boundary between Unicode code points and Unicode in some external encoding. Decoding can fail, so be prepared. It's best to reject invalid text as early as possible (see the sketch just after this list).
* When in doubt, use other people's code for Unicode handling. For most of the crazy crap you run into in the wild, there is well-tested crazy-crap-handling code.
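To make that boundary-crossing point concrete, here's a minimal Python 3 sketch of strict decoding at the edge of a program (the byte values are made up for illustration):

    raw = b"hello \xff world"        # bytes from the outside; 0xff can never appear in valid UTF-8
    try:
        text = raw.decode("utf-8")   # strict by default: invalid input fails loudly...
    except UnicodeDecodeError:
        text = None                  # ...so reject it here, at the boundary, not deep inside

The important part is that the failure happens once, at decode time, instead of surfacing later as garbled text.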
The last point should be stronger: Unicode manipulation is complex. Some things that may appear trivial, like comparing two strings in a case-independent way, aren't at all. A developer who writes his own fast conversion function is simply adding a bug.
And really. If a developer doesn't understand everything regarding Unicode, tell him it's simply mandatory to use UTF-8. Other encodings are for people who really really know what they're doing.
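As a quick illustration of how non-trivial case-independent comparison is, here's a Python 3 sketch using full case folding:

    # A naive lowercase comparison misses Unicode case folding rules
    print("Straße".lower() == "strasse".lower())        # False: ß survives .lower()
    print("Straße".casefold() == "strasse".casefold())  # True: casefold() maps ß to "ss"

And even casefold() isn't the whole story; locale-sensitive rules like the Turkish dotless i need heavier machinery such as ICU.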
> * When dealing with strings in your favorite programming language, always know whether it's an array of Unicode code-points, or of bytes in UTF-8, or some third messed-up thing.
What is an 'array of Unicode code-points'. I actually reread the article thinking that my unicode chops were getting rusty if I didn't know what that meant, but after rereading it, I still don't know what you mean.
[I wrote a half a dozen 'do you mean _____' suggestions, but I think I'll just let you explain. :) ]
Sorry for the unclear wording; I meant "an array of fixed-width values, each of which stores a single Unicode code-point". For example, any program that stores text in UCS-4: an array of 32-bit values, each holding a single code point.
This is how way too many people think that UTF-16 works: each code point gets 16 bits, and you have an array of them, so you can count characters, do O(1) random indexing, and so on. This is a harmful myth, of course. Code points do not correspond neatly to glyphs, and UTF-16 is a variable-width encoding, although most people don't use code points outside the basic multilingual plane, so a lot of people can get away with pretending that it's fixed-width, until they can't.
The most maddening instance of this confusion that I've seen so far is in Python's Unicode string handling. Guess what happens when you run this Python code to find the length of a string containing a single Unicode code-point:
print len(u"\U0001d11e")
This will print either 1 or 2, depending on what flags the Python interpreter was compiled with! If it was compiled one way (the default on Mac OS X), then it uses an internal string representation that it sometimes treats as UTF-16 and sometimes as UCS-2. With another set of flags (the default on Ubuntu, IIRC) it will use UCS-4 and do the Right Thing. Java is at least consistent here: String.length() always counts UTF-16 code units (so it reports 2) and codePointCount() gives the real answer, but it requires you to write the string explicitly as a UTF-16 surrogate pair: "\uD834\uDD1E".
The redeeming virtue that both share is that they will do the right thing if you treat everything as variable-width encoded, use the provided methods for encoding and decoding, and avoid the hairy parts left over from when people naively assumed that UTF-16 and UCS-2 were the same thing and that they ought to be enough for anybody.
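For what it's worth, Python 3.3+ (PEP 393) made this build-flag lottery go away; a quick sketch of the now-consistent behavior:

    s = "\U0001d11e"                         # MUSICAL SYMBOL G CLEF, outside the BMP
    print(len(s))                            # 1: Python 3 always counts code points
    print(len(s.encode("utf-16-le")) // 2)   # 2: in UTF-16 it's a surrogate pair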
Amusingly enough, Joel's company does not follow Joel's preaching.
I tried to convince Fog Creek to abandon the obsolete, non-standard and proprietary 8-bit Windows-CP1252 character encoding in the E-mails that FogBugz sends. They refused, reason given: "joelonsoftware.com is a blog. Fog Creek Software is a business."
FWIW, CP1252 was picked as a kind of super-ASCII (or super-ISO-8859-1 if you prefer; it's almost a strict superset). I'm not aware of any encoding-aware software that is unable to process CP1252.
And HTML5 actually requires "ISO 8859-1" to be interpreted as Windows-1252.
(Some old Mac/Linux browsers resisted doing so and would show 'moronic' private 1252 characters as missing characters, even if a correct font was installed.)
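The difference is exactly the 0x80-0x9F range: C1 control characters in ISO-8859-1, printable punctuation in CP1252. A Python 3 illustration:

    # 0x93/0x94 are unprintable C1 controls in ISO-8859-1, but curly quotes in CP1252
    print(b"\x93smart quotes\x94".decode("cp1252"))   # "smart quotes", with curly quotes

Interpreting those bytes strictly as ISO-8859-1 yields invisible control characters, which is the missing-character behavior described above.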
Received: from mxny1.fogcreek.com (mxp11.fogcreek.com [64.34.80.172]) by […]
Received: from hbny4 (hbny4.fogcreek.local [10.2.0.48]) by mxny1.fogcreek.com (Postfix) with ESMTP id 2386295B33 for […]; Tue, 10 Jan 2012 19:18:45 -0500 (EST)
Priority: normal
Mime-Version: 1.0
Content-Type: text/plain; charset=windows-1250
Content-Transfer-Encoding: quoted-printable
Message-Id: <20120111001845.2386295B33@mxny1.fogcreek.com>
It sends CP1252 emails if all the characters used fit. If they don't, it uses UTF-8. That way, people with old mailers can still read them (more realistically, people reading the raw email can read them out of the database or from traffic dumps), because .NET has a tendency to base64-encode the body the instant you set the charset to UTF-8, and I don't un-base64 in my head.
Try including the funny character of your choice in an email. It should show up just fine.
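A minimal Python 3 sketch of the fallback logic described above (pick_charset is a hypothetical helper, not FogBugz's actual code):

    def pick_charset(body):
        # Prefer CP1252 when every character fits; otherwise fall back to UTF-8
        try:
            return body.encode("cp1252"), "windows-1252"
        except UnicodeEncodeError:
            return body.encode("utf-8"), "utf-8"

    payload, charset = pick_charset("Zażółć gęślą jaźń")   # Polish text forces the UTF-8 branch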
I'd like to know what they mean by that. Are they saying that Joel's blog is merely his opinion and that business is not directly subject to his whims and opinions, or are they saying that the advice only applies to blogs and non-serious things while businesses have other priorities?
What is Joel preaching? To use UTF-8? I don't think so. This post is explaining the difference between character sets and encoding. It was very helpful for me.
I remember when this was first posted. RSS was new, Movable Type 2.x was hot shit, the future of the web was XML and real hackers were parsing it with regular expressions and duct tape. Everything was supposed to be Unicode, except the webdev community was 99.999% North American and everybody knew plaintext "is supposed to be ASCII" anyway.
Fast forward 8 years... I joined Stack Overflow about a week ago and my first accepted answer was about Unicode and string handling in Python 2.x. Just yesterday I was thinking I should reread this exact post, to refresh a few points and keep shooting fish in that barrel. I guess I should be grateful Python 3 has gone full-Unicode... except that now people ask how to emit ASCII with it. And the Python community is among the most clued-up on the subject (probably as a reaction to how badly it was handled in 2.x).
"EBCDIC is not relevant to your life. We don't have to go that far back in time."
I wish this were the case. And for 99% of you it might be. But there's still a lot of EBCDIC (and antique COBOL) out in the wild that increasingly has to interact with the real world. One of the chunks of telephony software I'm responsible for has to interface with an EBCDIC system, requiring some basic translation to UTF-16. (And then eventually has to pass some of this now-UTF data back to the EBCDIC system.) Not exactly difficult until you get to the various numeric encodings and no one can decide if they want binary-coded decimal, PIC 9s, pure ints, signs, which endianness (if they even know about endianness), et cetera.
EBCDIC wasn't dead in 2003 when Joel wrote this and it certainly (and unfortunately) isn't dead in 2012.
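The character-translation half of that job is at least well-trodden; Python ships EBCDIC codecs (this sketch assumes cp037, the common US EBCDIC variant; your mainframe's may differ):

    # Round-trip text through EBCDIC using the stdlib cp037 codec
    ebcdic_bytes = "HELLO".encode("cp037")
    print(ebcdic_bytes)                  # b'\xc8\xc5\xd3\xd3\xd6'
    print(ebcdic_bytes.decode("cp037"))  # HELLO

It's the packed-decimal and PIC 9 fields that no codec will save you from.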
Yup, if you have to work with old iSeries servers, or just about anything dealing with old retailers in eCommerce, you've probably touched it one time or another.
Fortunately, the recode program makes it easy to switch between different encodings.
I don't mind the repost. Every time I read this I've learned more, gotten more context and encountered more problems with character encoding, so every time I get more and more out of the essay.
I wouldn't even think about reading it unless I saw it posted on HN. However, as long as I've got to deal with character encoding problems like this:
FTA: "The traditional store-it-in-two-byte methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2."
Written in 2003, 7 years after UCS-2 was obsoleted by UTF-16 because Unicode 2.0 was too big for just 16 bits. UCS-2 is not UTF-16. Windows NT wasn't UTF-16 either, IIRC.
Of course, Microsoft kept telling everyone that 16 bits per character is Unicode.
If you do a lot of string manipulations, you're better off with either UTF-32 or dumbified (16-bit fixed) UTF-16, otherwise you will have to count characters from the beginning of the string every time you need to access the nth character within the string. Moreover, if you deal with a text with a lot of characters between 0x0800 and 0xFFFF (e.g. East-Asian languages) you're much better off with UTF-16 as you will save a whole byte per character.
How often do you need random access by code point? In every case I can recall, if random access by UTF-8 offset wasn't the right thing (which it usually was), then random access by code point wouldn't be either. Almost all string offsets you'll ever have to deal with come from having a program look at a string, and in that case, you can just use UTF-8 byte offsets. What sort of text processing are you doing where this isn't the case?
As for East Asian text, you have a point: it will usually be shorter in UTF-16 than UTF-8. Before making this decision, though, ask yourself how much that extra space is worth to you. Is it worth dealing with possible encoding hassles? (The answer to this may be yes, but it's a question that should be asked.) Also, on a lot of data, there are many characters from the ASCII range mixed in with the East Asian text. I did an experiment a while back where I downloaded some random web pages in Chinese, Japanese, Korean, and Farsi, and compared their size in UTF-8 and UTF-16. Because of the amount of those documents that was HTML tags, all four pages ended up smaller in UTF-8.
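You can rerun a crude version of that experiment in a couple of lines of Python 3 (the sample string is made up; real pages will vary):

    sample = '<a href="/tokyo">東京</a>'     # 2 CJK characters buried in 21 bytes of ASCII markup
    print(len(sample.encode("utf-8")))      # 27: markup costs 1 byte/char, CJK costs 3
    print(len(sample.encode("utf-16-le")))  # 46: everything costs 2 bytes/char

On markup-free CJK prose the balance flips back toward UTF-16, which is exactly why it's worth measuring your actual data first.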
Maybe I'm missing something, but I'm not sure I understand your question. How would you write even a simple parser without being able to access the contents of the string randomly by character index?
You can access the contents randomly. Just use the index in bytes, rather than characters. Let's look at a really simple parsing task as an example: splitting tab-delimited strings, in UTF-8. First, you find the indexes (in bytes) of the tabs in the string, then you use those to split out substrings. This is exactly the same code you would use with plain ASCII text, and in fact a lot of programs designed to process ASCII work unmodified with UTF-8.
Another example: for a lex-and-yacc type of parser, you can use regular expressions to split a string into tokens, and then use a parser on that stream-of-tokens representation. None of this requires character indexing; just byte indexing.
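A small Python 3 sketch of why byte-level splitting is safe: in UTF-8, every byte of a multi-byte character is 0x80 or above, so an ASCII delimiter like tab can never appear in the middle of a character.

    line = "naïve\tcafé\t東京".encode("utf-8")     # UTF-8 bytes, mixed scripts
    fields = line.split(b"\t")                     # pure byte-level split, no decoding needed
    print([f.decode("utf-8") for f in fields])     # ['naïve', 'café', '東京']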
I think both of those things are true. There are only about 1.1M possible code points (U+0000 through U+10FFFF), and these fit in at most 4 UTF-8 bytes. However, the original UTF-8 design could encode values beyond that range, using the 5- and 6-byte sequences that are now off-limits.
Yes, the original UTF-8 spec was much more forgiving. The most recent spec, RFC 3629, restricted the range of code points and made the decoding of invalid sequences a MUST NOT requirement: http://tools.ietf.org/html/rfc3629#section-12
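Overlong sequences (a character encoded in more bytes than necessary) are among the forms RFC 3629 forbids, and modern decoders reject them; a Python 3 sketch:

    # b"\xc0\xaf" is an overlong encoding of '/' (0x2F), forbidden by RFC 3629
    b"\xc0\xaf".decode("utf-8")   # raises UnicodeDecodeError

Accepting overlong forms was a real security problem (an overlong '/' could slip past path-traversal checks), which is part of why the MUST NOT language went in.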
Sometimes it misses, although in some cases I think it might be related to age: after a certain time it's fun to revisit, and given that the user base changes, no doubt some current readers weren't around when it was last posted.
Perl actually has some of the best Unicode support around (second only to ICU). This guy worked on it, so he's listing all the edge cases (the bulk of the post), explaining how to get rid of backward-compatible brokenness (maybe Perl needs a quicker way to saner defaults), and explaining how to do work that requires explicit handling.
I suspect the reason so many people don't know about it is that they don't care enough. (I don't care enough.) This is not an interesting enough topic for most devs to bother with unless something has broken or isn't working right and Unicode "problems" are suspected to be the reason.
…and this is exactly why this essay should be regularly reposted. If you write any software that is used by people other than yourself, you should care, because they will most likely find it broken.
Also, if you write software for money, ignoring Unicode is just plain incompetent. I can't count the number of times I've had packages shipped to "BiaÅ‚y KamieÅ„" or "Bia&#322;y Kamie&#324;" street instead of "Biały Kamień". If you expose even a single name or address field in your software, you need to handle Unicode.
And yes, you need to handle Unicode even if you want to limit yourself only to the US market. Your customer might have an umlaut or an accent in his name.
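The first of those garblings is classic mojibake, reproducible in one line of Python 3 (using the street name above as the sample):

    # UTF-8 bytes wrongly decoded as CP1252 produce the classic "Å‚"-style garbage
    print("Biały Kamień".encode("utf-8").decode("cp1252"))   # BiaÅ‚y KamieÅ„

The second is the opposite failure: the right characters were there, but got run through an HTML entity encoder on their way into a plain-text context.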
> If you write any software that is used by people other than yourself, you should care,
I agree and I should know better. I even opened the link. But after reading for 10 seconds, I closed it. I just wasn't motivated enough. Even had to fix bugs and issues related to this just last month. But every time I do, I just go and find out enough to solve the problem and then never really dig deeper. Don't really know why it is this way.
You're just being more honest than most developers by admitting it. A large percentage of programmers feel that way. In fact Paul Graham shares your attitude and, consequently, limited awareness of internationalization issues. His "we can always tack it on later" approach to Unicode use in Arc combined with his "Arc strings are just a list of characters" claim gave it away. For many years, I made good money cleaning up after developers who felt just like you guys who were, nevertheless, brilliant at other aspects of programming. There's a lot to be said for using a team of specialists in different areas.
How did you manage to learn anything in your profession then? Really, that is a short and well-written post, which shouldn't take more than minutes from your schedule.
> How did you manage to learn anything in your profession then?
I cared about it. I will forget to eat and sleep if I am studying something I am motivated about. The rest I just learn as needed, only if I am forced to (read: "when stuff breaks"). Not a very good approach, I guess.
If you don't care much about Unicode, then either you haven't written any complicated software that deals with other people's text data, or you have been scrupulous in avoiding Unicode problems, or you haven't encountered any of the many Unicode problems in your code. Personally, I've spent way too much time dealing with subtle Unicode bugs, in my own code and especially in other people's code, and so I care vehemently. This is a major source of headaches.
> you haven't written any complicated software that deals with other people's text data
Actually you are right. Most input and output of our software is not text. And we mostly work for the US military (so the UI is in English). This probably explains why. However, in general I feel I should know more about it. It is sort of like when we were talking about interview questions and someone brought up Pascal's triangle. I felt I should have known what it is, but I couldn't remember. It is not something I need to know for work, but rather something I felt embarrassed for not knowing.
Many presumably know about the issue, but reluctantly decide that the scope of their project doesn't justify the effort in worrying about it, especially when there won't conceivably be much non-English text interacting with their project. This is particularly true if an integral part of your environment screws things up: e.g. your programming language, web server, targeted device/browser, etc. does text wrong.
While you're enjoying this classic article (and I do re-read it every time it's posted), you may wish to also take in Tim Bray's - in my opinion equally classic - article about Unicode.
I will just add - Unicode isn't hard - until you have to display or otherwise deal with Right-To-Left languages and their invisible control characters.
I find that even working, as a user, with text that is right-to-left in places and left-to-right in others is hard. It's quite easy to deal with programmatically, but only until you have to display it somehow.
I can't say it qualifies as the absolute minimum everyone should know, but I highly recommend reading the book Unicode Demystified if you're the kind of programmer who wants to understand how everything works.
When I was beginning to learn HTML, this article helped me get a grasp over encodings etc... Thanks @spolsky. I started following your blog after reading this :)
"When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough."
This is FUD. It's really not difficult to use the mb_* family of functions to deal with unicode in PHP. If you're writing a new app, this is trivial. If you're working with an existing app, it's obviously more difficult, but far, far from impossible.
Of course you can manipulate UTF-8 in PHP, or it would have died long ago. But as a matter of fact PHP 6 was a failure, and Unicode is still an afterthought that you must hack around with special functions in PHP 5.
Well, there's almost no string handling built-in to PHP as a language at all, it's just provided as part of the standard library. So instead of using one part of the library (the standard string functions), you use another (the mb_* string functions.) I don't really see how that's a hack. The one thing that could go wrong is if you don't have the mb_* extension, but that's easily rectified, and hasn't been something I've seen in the wild in the past few years.
I'm not saying PHP's UTF-8 handling is great by any means, but the claim was that it's "nearly impossible." I'm suggesting that one should instead say "Building a UTF-8 compliant site in PHP is annoying, and requires more work than one would prefer, but if you do a bit of research, it's not that hard."
Because if you switch encodings you need to modify the code. Languages that actually support UTF-8 use the very same functions whatever the encoding is; at most you need to declare that you're using UTF-8, and that's all.