I've glanced at internationalization API's at various times over the years, and ...

guns · on Aug 11, 2012

> You have encodings, Unicode, ASCII, UTF-8, ISO 9660, Latin-1, code pages, UTF-16, byte order masks, gettext macros, po files, ... the terminology and model of the problem domain are extremely complex and difficult to understand.

In the beginning, there was ASCII [1]. It was a simple encoding that mapped a byte stream to standard American letters, numerals, punctuation marks, as well as some common non-printing control codes.

ASCII only used the lower 7 bits of the 8-bit byte, reserving the upper 128 positions for any non-American characters needed for national encodings.

And indeed, many dozens of national character encodings appeared that used ASCII for its lower 128 positions and implemented their own character table in the upper 128. One very popular encoding was Latin-1 [2]. This became the standard encoding in much Western software because it adequately handled the most widely used Western languages.

One major problem with these 8-bit national encodings is that the upper 128 codes are mutually exclusive with one another. Confusingly, they almost all shared the base 128 ASCII codes, so programmers and users began to equate "plain text" and "sane encoding" with 7-bit ASCII, since one could effectively communicate universally by restricting the characters used to those in the printable ASCII table.

As it became clear that the proliferation of 8-bit encodings was untenable, there emerged Unicode. Unicode is not an encoding, but a standard that provides a table of universal code points, along with some recommendations about how to combine and display certain code points. [3]

Unicode is implemented in the modern day by UTF-8, UTF-16, and UTF-32, which are primarily distinguished, as you might guess, by the base size of the code unit.

UTF-32 is a simple encoding that simply maps every Unicode code point to a 32-bit sequence. This is simple to parse, but potentially very wasteful, so is rarely used.

UTF-16 uses 16-bit code units, and is able to simply translate the most commonly used portion of Unicode, the Basic Multilingual Plane. For code points above U+ffff, a scheme is used to span the code point along two code units. This encoding is used frequently in Windows, and in Java.

UTF-8 is a variable length Unicode encoding like UTF-16, but defaults to a small one-byte code unit and has a famously elegant algorithm, so it appeals strongly to miserly Unix hackers.

> For example, one time I downloaded some tarball (I forget what it was) that had a few bytes of binary garbage at the beginning of every file. After some research I found out that it's called the BOM and has something to do with international text, and I ended up having to WRITE A SCRIPT WHICH GOES THROUGH AND DELETES THE FIRST FEW BYTES OF EVERY FILE IN A TREE in order to use the tarball's contents.

The Byte Order Mark is clunky solution to the fundamental problem of divining the character encoding of an arbitrary byte stream. It's great if all your tools transparently support it, but annoying if not. However, some sort of convention or metadata is necessary to correctly encode your data. Python, Ruby, and other scripting languages have begun to coalesce around the magic encoding comment for source files (i.e. `# encoding: utf-8` as the first or second line).

Most everybody falls back to ASCII if no encoding is specified, or cannot be inferred from the stream itself. The better fallback is UTF-8, because breakages like yours are less likely to occur, which is why it is encouraged as the default system encoding in most cases.

> Another time, I downloaded some Java source which contained the author's name in comments. The author was German and his name contains an "o" with two dots over it. That was the only non-ASCII character in the files. Eclipse and command-line javac WOULD NOT PROCESS THE FILE and I ended up removing his name from all comments; after that it compiled without a hitch. This was the official Oracle (then Sun) javac. A fricking SOURCE TO BYTECODE COMPILER SHOULD NOT DEPEND ON YOUR SYSTEM'S NATIONALITY SETTINGS -- OR ANY LOCAL SYSTEM SETTINGS! -- TO DO ITS JOB. But it does.

These tools likely have ways of setting the encoding without inheriting from the environment, but they do fall back on the environment as a simple convention.

The trouble is that there is no reason any longer to assume that all text _must_ be 7-bit ASCII. Unix and programming languages are evolving to handle this new multilingual digital world. The only obstacle that really remains are programmers, so I think it's fair to spend a little time learning the basics of the subject.

[1]: There were other antediluvian encodings (like EBCDIC)

[2]: a.k.a. ISO-8859-1. Windows used a slightly modified version of this and called it Windows-1252 in order to complicate matters

[3]: The actual display of composite glyphs is left the implementor. For instance, both ready-made composite glyphs like é are provided, as well as a "non-spacing" acute accent mark

masklinn · on Aug 11, 2012

And you didn't even touch on fixed and variable-width asian character sets, like Shift-JIS (variable 1 or 2 bytes) or Big5 + extensions (ETEN or CP950, fixed 2 bytes)

derleth · on Aug 11, 2012

> UTF-8 is a variable length Unicode encoding like UTF-16, but defaults to a small one-byte code unit and has a famously elegant algorithm

... that is 100% compatible with ASCII, and, therefore, is the only Unicode encoding scheme you can safely use in filenames and to send text through applications that don't know about Unicode at all.

The primary reason is that in UTF-32 and UTF-16, the byte 0x00 may appear in the encoding of characters other than '\0'. This is not something pre-Unicode applications can deal with, because in ASCII the byte 0x00 always means '\0', which was always used for string termination.

Therefore, any application that processes ASCII and leaves non-ASCII alone (common in the real world) is instantly compatible with UTF-8.

Here's a fascinating discussion about character encodings in filenames, including comments from Linus Torvalds and Theodore Ts'o:

http://yarchive.net/comp/linux/utf8.html

The upshot is, Linux, like most if not all Unix-like OSes, speaks bytestreams, and doesn't try to interpret characters. The only rules you need, therefore, are that filenames can't contain the byte 0x2f ('/' in ASCII) except as a path separator and can't contain the byte 0x00 at all. Therefore, you need a character encoding that will not use the bytes 0x2f or 0x00 to represent characters other than '/' and '\0', respectively. The only Unicode encoding that meets that constraint is UTF-8.

Also: The elegant algorithm means that it's statistically highly, highly improbable that any text that can be interpreted as valid UTF-8 isn't UTF-8. Given that it's entirely valid to treat ASCII as UTF-8, defaulting to UTF-8 is, as you say, a good thing to do.