People still flame how MS uses 16-bit characters, which IMO is a bit of an unfair criticism. (Full disclosure: I used to work for them. But I was never a koolaid drinker.) I don't think they have any religious aversion to UTF-8, it's just history. By adopting Unicode as early as they did, they made this decision before UTF-8 existed and they stuck with it.
Probably the bigger crime is not switching to UTF-8 as the default "ANSI"/"multi-byte" codepage (to use the Windows terms). This means C programmers who are not Windows experts often end up writing non-Unicode-safe software because they expect every string to be char*.
(Also, as has been mentioned, UTF-16 is not fixed-width. And even in UTF-32 there are cases where a single glyph takes multiple code points - decomposed accent marks are the one example I know, possibly there are others?)
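To make both points concrete, here's a minimal C sketch (the arrays and numbers are mine, nothing platform-specific assumed): a character outside the BMP takes two UTF-16 code units, and e-acute can be written as either one code point or two, so a single glyph can span multiple code points even in UTF-32.

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        /* U+1F600 in UTF-16: a surrogate pair, i.e. two 16-bit code units. */
        uint16_t emoji_utf16[] = { 0xD83D, 0xDE00 };

        /* e-acute precomposed: a single code point, U+00E9. */
        uint32_t precomposed[] = { 0x00E9 };

        /* e-acute decomposed: U+0065 'e' + U+0301 combining acute accent.
           One glyph, two code points - true even in UTF-32. */
        uint32_t decomposed[] = { 0x0065, 0x0301 };

        printf("UTF-16 units for U+1F600: %zu\n",
               sizeof emoji_utf16 / sizeof emoji_utf16[0]);   /* 2 */
        printf("code points, precomposed: %zu\n",
               sizeof precomposed / sizeof precomposed[0]);   /* 1 */
        printf("code points, decomposed:  %zu\n",
               sizeof decomposed / sizeof decomposed[0]);     /* 2 */
        return 0;
    }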
IMHO it's completely fair to criticize others for the externalities that they create. UTF-16 is a design flaw, and Microsoft repeatedly decides to release a new version of Windows that fails to meaningfully address that flaw. To date, MS hasn't communicated a vision for a future of Windows where nobody has to deal with UTF-16, so we can only conclude that their vision is that for the next 100 years, the rest of us will still be paying to support their use of UTF-16.
This isn't just about UTF-16 either. Think about how much IE6 has cost everyone who's made a website in the last 10 years. Microsoft's penchant for building and marketing high-friction platforms is a huge drain on innovation (and a big barrier-to-entry for people who want to learn programming), and the company deserves an enormous amount of criticism for it.
Can you explain how utf-16 is a design flaw? This sounds totally kooky to me. Are you thinking of the more limited UCS-2? Utf-16 represents the same set of chars as utf-8. It's true that lots of errors can occur when people assume 1 wchar = 1 codepoint = 1 glyph but utf-8 has similar complexity and I've seen plenty of people screw it up too. It sounds much more to me like you are saying that "not working how I am used to" is the same as "design flaw".
Edit: put another way, the NT kernel has, since its inception, represented all strings as 16-bit units and continues to do so. Saying that they need to "meaningfully address this" is like saying Linux should migrate away from 8-bit strings; there is no reason to do it. A lossless conversion exists and I fail to see it as a big deal, it's just a historical thing because NT's initial development predates UTF-8.
Imagine a world where we don't have to convert between multiple charsets, or even think about them, beyond "text" vs "binary". That world is perfectly achievable, it's basically already happened outside the Microsoft ecosystem.
OSX and Linux default to using UTF-8 everywhere, Ruby on Rails is UTF-8 only, the assumed source encoding in Python 3 is UTF-8, URI percent-encoding is done exclusively using UTF-8, HTML5 defaults to UTF-8... The list goes on and on. "UTF-8" is becoming synonymous with "text".
In an all-UTF-8 world, if you want to build a URI parser, or a CSV parser, or an HTML parser, or really anything that does any kind of text processing (except rendering), you can just assume ASCII and everything will work as long as you're 8-bit clean. Even non-US codepages are all more-or-less supersets of ASCII. The only major exceptions are EBCDIC and UTF-16.
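To show what that 8-bit-clean assumption buys you, here's a toy C sketch (my own example, no quoting or escaping handled): a field splitter that only ever compares against ASCII bytes, and UTF-8 text passes straight through because no byte of a multi-byte UTF-8 sequence ever collides with an ASCII delimiter.

    #include <stdio.h>

    /* Split a line on ',' in place. Works unchanged for UTF-8 input because
       every byte of a multi-byte UTF-8 sequence is >= 0x80 and can never be
       mistaken for ',' or any other ASCII delimiter. */
    static void split_fields(char *line) {
        char *field = line;
        for (char *p = line; ; p++) {
            if (*p == ',' || *p == '\0') {
                int last = (*p == '\0');
                *p = '\0';
                printf("field: %s\n", field);
                if (last)
                    break;
                field = p + 1;
            }
        }
    }

    int main(void) {
        /* "naive" with i-diaeresis, "cafe" with e-acute, and two kanji,
           spelled out as raw UTF-8 bytes. */
        char line[] = "na\xC3\xAFve,caf\xC3\xA9,\xE6\x97\xA5\xE6\x9C\xAC";
        split_fields(line);
        return 0;
    }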
Because of UTF-16, we have to build charset conversion/negotiation into every single interface boundary, where we could otherwise just be 8-bit clean except in places where text actually has to be rendered.
It's completely unnecessary friction that undermines one of the big motivations for creating Unicode in the first place: coming up with a single standard for text representation, so that we don't have to deal with the mess of handling multiple charsets everywhere.
None of these decisions is a big deal in isolation, but it's a death by 1000 cuts. When you have mountains of unnecessary complexity, it only serves to make programming inaccessible to ordinary people, which is how we get awful policies like SOPA, software patents, and the like.
There's no inherent, fundamental reason in the universe other than momentum that says you've got to use an 8-bit encoding. You could stand to be a bit more honest about how arbitrary that call really is and where it comes from. In fact many higher-level languages even outside the MSFT ecosystem happily use UTF-16 everywhere natively without problems; they just do a small conversion step at a syscall to run on your beloved Unix. AFAIK the JVM works this way, for example. This "imagine a world" game can just as easily go the other way: "imagine a world in which everything is UTF-16...". I think it's disingenuous to claim it matters one way or the other, that your favorite is better and all others, even ones that are a 1-to-1 mapping with your favorite, constitute a "design flaw". Information theory does not care; it is just a different encoding for the same damn thing. In truth it makes little difference as long as you are consistent, and you are whining simply because not everybody picked the same thing as your favorite.
Your claim that UTF-16 is a "design flaw" has not been validated, just that you don't like encodings ever to change and UTF-16 isn't your favorite. It's very hard to consider that anything other than whining. I thought we software types are supposed to be big on abstractions and coming up with clever ways of managing complexity? It seems rather rigid to say you'll only ever deal with one text encoding.
Lastly even in 2013 this is a total lie:
> OSX and Linux default to using UTF-8 everywhere,
If that's true then how come on virtually every Unix-like system I've set up for the last ~15 years one of the first things I've had to do is edit ~/.profile to futz around with LC_CTYPE or whatever to ask for UTF-8? I am pretty sure every Unix-like system I have set up gave me Latin1 by default, even quite recently.
I think you are underestimating the extent to which UTF-8 is a crude hack designed to avoid rewriting ancient C programs that took very much the wrong approach to localization. There was a time, before UTF-8 existed and became popular, when it was a fairly common viewpoint that proper Unicode support involved making a clean break with the old char type. You are right to say that UTF-8 "won" in most places, but the fact that the NT kernel or the JVM use 16-bit chars reflects that prior history. I think a more mature attitude would be to accept this, that it came from a time and a place and is a different way of working, rather than call it "wrong".
> There's no inherent, fundamental reason in the universe other than momentum that says you've got to use an 8-bit encoding.
"Other than momentum"? That's a double-standard: Momentum is the only reason why UTF-16 is still relevant today. Ignoring momentum, UTF-8 still has a bunch of advantages over UTF-16, namely that endianness isn't an issue, it's self-synchronizing over byte-oriented communication channels, and it's more likely to be implemented correctly (bugs related to variable-length encoding are much less likely to get shipped to users, because they start to occur as soon as you step outside the ASCII range, rather than only once you get outside the BMP). What advantages does UTF-16 have, ignoring momentum?
I don't want to debate abstract philosophy with you, anyway. Momentum may be the reason, but there's no plausible way that UTF-16 is ever going to replace octet-oriented text. The idea that UTF-8 and UTF-16 are equivalent in practice is a complete fantasy, and I'm arguing that we should pick one, rather than always having to manage multiple encodings.
> I thought we software types are supposed to be big on abstractions and coming up with clever ways of managing complexity?
The best way to manage complexity is usually to adopt practices that tend to eliminate it over time, rather than adding more complexity in an attempt to hide previous complexity. It doesn't matter how "clever" that sounds, but it's generally accepted that it takes more skill and effort to make things simpler than it does to make them more complex.
> It seems rather rigid to say you'll only ever deal with one text encoding.
It seems rather rigid to say you'll only ever deal with two's-complement signed integer encoding.
It seems rather rigid to say you'll only ever deal with IEEE 754 floating-point arithmetic.
It seems rather rigid to say you'll only ever deal with 8-bit bytes.
It seems rather rigid to say you'll only ever deal with big-endian encoding on the network.
It seems rather rigid to say you'll only ever deal with little-endian encoding in CPUs.
It seems rather rigid to say you'll only ever deal with TCP/IP.
Why not eventually only ever deal with one text encoding? There's no inherent value in paying engineers to spend their time thinking about multiple text encodings, everywhere, forever.
Remember that we're talking about the primary interfaces for exchanging text between software components. Sure, there are occasions where someone needs to deal with other representations, but the smart thing to do is to pick a standard representation and move the conversion/negotiation stuff into libraries that only need to be used by the people who need them. This allows the rest of us to quit paying for the unnecessary complexity, and incentivizes people to move toward the standard representation if their need for backward compatibility doesn't outweigh the cost of actually maintaining it.
> I am pretty sure every Unix-like system I have set up gave me Latin1 by default, even quite recently.
Ubuntu, Debian, and Fedora all default to UTF-8, and have for several years now. You're going to have to name names, or I'm going to assume that you don't know what you're talking about.
> I think you are underestimating the extent to which UTF-8 is a crude hack designed to avoid rewriting ancient C programs that did very much the wrong approach to localization.
Really? How would you have done it so that UTF-16 wouldn't have broken your program? Encode all text strings as length-prefixed binary data, even inside text files? It's ironic that you say it's immature to call UTF-16 "wrong", but you've basically just claimed that structured text in general is "wrong".
Let's not forget that nearly every important pre-Unicode text representation was at least partly compatible with ASCII: ISO-8859-* & EUC-CN were explicitly ASCII supersets, Shift-JIS & Big5 aren't but still preserve 0x00-0x3F, and even in EBCDIC, NUL is still NUL. Absent an actual spec, it was no less reasonable to expect that an international text encoding would be ASCII-compatible than to expect one that would break compatibility with everything. Trying to anticipate the latter would rightly have been called out as overengineering, anyway.
In that environment, writing a CSV or HTML parser that handles a minimal number of special characters and is otherwise 8-bit clean is exactly the right approach to localization.
Also, "ancient C programs"? Seriously? Are you really saying that C and its calling convention were/are irrelevant?
> There was a time before UTF-8 existed and became popular when it was a fairly common viewpoint that proper Unicode support involved making a clean break with the old char type.
Sure, and it was a fairly common viewpoint that OSI was the right approach, and that MD5 was collision-resistant. Then, we learned that all of these viewpoints turned out to be wrong, and they were supplanted by better ideas. UTF-8 became popular because it worked better than trying to "redefine ALL the chars".
> You are right to say that UTF-8 "won" in most places but the fact that the NT kernel or the JVM use 16-bit chars reflects that prior history.
The JVM isn't comparable, because its internal representation is invisible to Java developers. I can write Java code without ever thinking about UTF-16, and a new version of the JVM could come out that changed the internal representation, and it wouldn't affect me. Python used a similar internal representation and recently changed it; most Python developers won't even notice.
If NT used UTF-16 internally, but provided a UTF-8 system call interface, I wouldn't care. The problem is that, in 2013, people writing brand new code on Windows still have to concern themselves with UTF-16 vs ANSI vs UTF-8. This is a pattern of behavior at Microsoft, and that's what I'm criticizing.
> I think a more mature attitude would be to accept this, that it came from a time and a place and is a different way of working, rather than call it "wrong".
Look, I understand that mistakes will be made. My criticism isn't that the mistakes are made in the first place, but that Microsoft doesn't appear to have any plan to ever rectify them. The result is a consistent increase in friction over time, the cost of which is mostly paid for by entities other than Microsoft.
One could argue that Microsoft's failure to manage complexity in this way is one of the reasons why Linux is eating their lunch in the server market. Anecdotally, it's just way easier to build stuff on top of Linux, because there's a culture of eliminating old cruft---or at least moving it around so that only the people who want it end up paying for it.
As for your ad hominem arguments about "attitude" and "maturity", I could do without them, thanks. They contribute nothing to the conversation, and only serve to undermine your credibility. Knock it off.
> The idea that UTF-8 and UTF-16 are equivalent in practice is a complete fantasy,
Except for that most obscure of details: they represent the same characters.
> Ubuntu, Debian, and Fedora all default to UTF-8, and have for several years now. You're going to have to name names,
In the room where I'm sitting now I have Debian, OpenBSD and Arch systems. All of these defaulted to Latin-1 when I installed them, in the current decade.
> I can write Java code without ever thinking about UTF-16
Patently false. A char in Java is 16 bits. String.length() and String.charAt() work in UTF-16 code units, meaning anything outside the BMP (a surrogate pair) counts as two.
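For concreteness, here's the same arithmetic sketched in C (the counting helper is mine; String.length() and codePointCount() are the real Java methods being mirrored): counting 16-bit units over-counts anything encoded as a surrogate pair.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Count code points in a NUL-terminated UTF-16 buffer by not counting the
       low half of each surrogate pair (0xDC00..0xDFFF). */
    static size_t utf16_codepoints(const uint16_t *s, size_t *units_out) {
        size_t units = 0, cps = 0;
        for (; s[units]; units++)
            if (s[units] < 0xDC00 || s[units] > 0xDFFF)
                cps++;
        *units_out = units;
        return cps;
    }

    int main(void) {
        /* "A" followed by U+1D11E MUSICAL SYMBOL G CLEF (a surrogate pair). */
        const uint16_t s[] = { 0x0041, 0xD834, 0xDD1E, 0x0000 };
        size_t units, cps = utf16_codepoints(s, &units);
        /* Prints "3 units, 2 code points" - length() would report 3 here. */
        printf("%zu units, %zu code points\n", units, cps);
        return 0;
    }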
> As for your ad hominem arguments about "attitude" and "maturity", I could do without them, thanks.
I am sorry, sometimes I am blunt and overstated about these characterizations, but it really did seem like the shoe fits. You seem to have an impulsive defense of UTF-8 and an inability to see that there might be merits or tradeoffs in an alternative. Throughout all of this, I am not saying that UTF-8 is a bad encoding, I am just saying it's goofy to "attack" UTF-16 for being different.
Standard Ruby is not UTF-8-only and not likely to be any time soon, because converting text into Unicode is lossy due to Han unification. So like it or not, you're going to need to deal with non-Unicode text for a while yet.
- Unless I'm building a site that caters to CJK languages where Han unification is unacceptable, I'm not going to need to deal with non-unicode text. In 4 years of Rails development, I never set $KCODE to anything except 'UTF8'.
- Any solution to the Han unification problem is almost certainly going to happen within Unicode, or at least in some Unicode private-use area that can be encoded using UTF-8.
- As a last resort, it's still easier to use something like "surrogateescape"/"UTF-8b" to pass arbitrary bytes through the system than it is to support multiple text encodings at every single place where text is handled.
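For reference, the "surrogateescape"/"UTF-8b" trick boils down to a tiny mapping. Here's a hedged C sketch of just that mapping (my own function names, not Python's actual implementation; cf. PEP 383): a byte that fails UTF-8 decoding rides through as a lone surrogate and is restored on the way back out.

    #include <stdint.h>

    /* Escape: an undecodable byte b (0x80..0xFF) becomes the lone surrogate
       code point U+DC00 + b, i.e. something in U+DC80..U+DCFF. */
    static uint32_t escape_byte(uint8_t b) {
        return 0xDC00u + b;
    }

    /* Unescape: if cp is an escaped byte, write the original byte back out. */
    static int unescape_codepoint(uint32_t cp, uint8_t *out) {
        if (cp >= 0xDC80u && cp <= 0xDCFFu) {
            *out = (uint8_t)(cp - 0xDC00u);
            return 1;
        }
        return 0;   /* an ordinary code point */
    }

    int main(void) {
        uint8_t orig = 0xFF, back = 0;     /* 0xFF is never valid in UTF-8 */
        uint32_t cp = escape_byte(orig);   /* U+DCFF */
        return (unescape_codepoint(cp, &back) && back == orig) ? 0 : 1;
    }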
Very interesting, I guess it's just a function of not knowing much about that part of the world but I'd never heard of this problem before. http://en.wikipedia.org/wiki/Han_unification
Reading that article it's a wonder they didn't come up with a mode-switching character, similar to what the article says about ISO/IEC 2022 or (cringe) Unicode bi-di chars.
Unicode is one of those things that naively sounds great and everyone talks about as solving every problem, but it ends up having lots of warts...
You need two versions of functions. You need two modes for lots of text programs. It doubles the space needed for things like storing identifiers in programming languages, which would normally fit in 7 bits per character (for a big percentage of languages).
UTF-8 can work with all 7-bit ASCII characters, and that's what's great about it.
For Windows this is entirely a compatibility thing. One could imagine a world in which that was not strictly necessary. A good best practice for a Win32 app is to always use 16-bit strings when calling system APIs and pretend the "ANSI" versions don't exist; I would not recommend anything else.
In NT the 8-bit versions generally do nothing but convert to 16-bit and call the "real" function. In recent versions (I think Win7 was the first to do this), AFAIK the 8-bit shims typically exist in another module, so if you don't use them they don't get loaded. In NT on ARM the 8-bit shims are not even there.
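For the curious, the shim pattern described above looks roughly like this. This is a sketch, not actual Windows source: MultiByteToWideChar and SetCurrentDirectoryW are real Win32 APIs, but the wrapper, its fixed-size buffer, and its error handling are simplified.

    #include <windows.h>

    /* Sketch of what an "ANSI" entry point typically does: convert the 8-bit
       string using the current ANSI codepage and call the wide ("real")
       version. Real shims handle longer paths and report errors properly. */
    static BOOL MySetCurrentDirectoryA(const char *path) {
        wchar_t wide[MAX_PATH];
        int n = MultiByteToWideChar(CP_ACP, 0, path, -1, wide, MAX_PATH);
        if (n == 0)
            return FALSE;                  /* conversion failed or too long */
        return SetCurrentDirectoryW(wide); /* the 16-bit "real" function */
    }

    int main(void) {
        /* Passing CP_UTF8 instead of CP_ACP above is how a single 8-bit entry
           point could accept UTF-8 rather than the locale codepage. */
        return MySetCurrentDirectoryA("C:\\Windows") ? 0 : 1;
    }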
> It doubles the space for things like storing identifiers for programming languages
Which is why GetProcAddress() still takes an 8-bit string. Just because the kernel (and hence the syscall interface) uses 16-bit everywhere doesn't mean you can't use 8-bit strings in your own process, or that you can't selectively pick what makes sense for your use.
(By the way, none of what I'm saying is a criticism of UTF-8. I think it's a very clever encoding.)
If they were using UTF-8, wouldn't just one version of each function suffice? That is, instead of using a shim, just determine which version to call based on the encoding.
Another example of UTF-16 brokenness: it is byte-order dependent, that is, the programmer must care about big-endian vs little-endian encoding of 16-bit numbers. This led to another hack: using a non-text BOM (byte-order mark) character to denote which endianness is used.
I think it's very ugly, and it may be OK for the Windows/x86 world, but it's not acceptable for the variety of platforms connected to the Internet.
A legit complaint. Though I'll say, in early Unicode before UTF-8 existed, this was probably seen as less of a problem than using the older non-Unicode charsets, and probably rightly so.
Also, if you keep a clean separation between serialization and in-memory representation, the BOM hack is not such a big deal. Serialization code can write a BOM, and deserialization can do the byte swapping and drop the BOM, so in memory it's always host byte order with no BOM. Or you can just make the on-disk/over-the-wire format UTF-8, which is common today on platforms where UTF-16 is used in RAM. (I'll repeat a point that seems to be lost on a lot of people in this thread: Microsoft adopted 16-bit chars for NT before UTF-8 existed, so that was not an option.)
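A sketch of the deserialize side just described (my own helper, nothing platform-specific): look at the first two bytes, strip the BOM, and assemble host-order 16-bit units regardless of the stream's byte order.

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Turn a UTF-16 byte stream into host-order 16-bit units: detect the BOM
       (FF FE = little-endian, FE FF = big-endian), drop it, and build each
       unit arithmetically so the result is host order on any machine.
       Sketch only: assumes a BOM is present and an even byte count. */
    static size_t utf16_deserialize(const uint8_t *in, size_t nbytes,
                                    uint16_t *out) {
        if (nbytes < 2) return 0;
        int le;
        if (in[0] == 0xFF && in[1] == 0xFE)      le = 1;
        else if (in[0] == 0xFE && in[1] == 0xFF) le = 0;
        else return 0;                           /* no BOM: caller decides */

        size_t n = 0;
        for (size_t i = 2; i + 1 < nbytes; i += 2)
            out[n++] = le ? (uint16_t)(in[i] | (in[i + 1] << 8))
                          : (uint16_t)((in[i] << 8) | in[i + 1]);
        return n;                                /* BOM stripped */
    }

    int main(void) {
        /* "Hi" as UTF-16LE on the wire, BOM first: FF FE 48 00 69 00. */
        const uint8_t wire[] = { 0xFF, 0xFE, 0x48, 0x00, 0x69, 0x00 };
        uint16_t host[4];
        size_t n = utf16_deserialize(wire, sizeof wire, host);
        printf("%zu units: U+%04X U+%04X\n", n,
               (unsigned)host[0], (unsigned)host[1]);   /* U+0048 U+0069 */
        return 0;
    }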
I still remember the day that the "View > Encoding" menu of browsers was absolutely critical in browsing the web if you browsed enough sites that were outside of your normal locale (your default codepage in Windows). The menu is still there, at least in Firefox & Chrome, but I haven't used it in a few years, as most things Just Work™ for modern websites, largely thanks to UTF-8.
The eventual browser default is still (sadly) locale-dependent, so the menu exists for sites designed for a different locale. I and some others still hope that one day we can get rid of it (likely replaced with per-domain defaults, primarily by TLD).
In IE, you mean? A standards-compliant browser shouldn't need to guess at the encoding. It can be found in the <meta> tag or (if that wasn't supplied or we're not dealing with HTML) in the Content-Type header. If it isn't found there, I believe HTML says, "it's ISO-8859-1 (latin1)".
That said, certain browsers have been known to guess.
IE, Firefox, Chrome, Opera, Safari all have locale-dependent defaults for HTML. The process used is essentially: user-override, BOM, higher-level metadata (e.g., Content-Type in HTTP), meta pre-parsing, and then a locale-dependent default (Windows-1252 in most locales). Anything labelled "ISO-8859-1" is actually treated as Windows-1252 in browsers (they differ only in ISO-8859-1's C1 range, so Windows-1252 is a graphical superset).
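The precedence just described, as a hedged sketch (hypothetical function and parameter names, not any browser's actual code): the first source that yields an answer wins, and the locale default is only the fallback of last resort.

    #include <stdio.h>

    /* Sketch of the encoding-detection precedence described above; each
       argument is a null pointer when that source has no answer. */
    static const char *pick_encoding(const char *user_override,
                                     const char *bom,
                                     const char *transport_charset,
                                     const char *meta_charset,
                                     const char *locale_default) {
        if (user_override)     return user_override;
        if (bom)               return bom;
        if (transport_charset) return transport_charset;  /* e.g. HTTP Content-Type */
        if (meta_charset)      return meta_charset;       /* meta pre-parsing */
        return locale_default;                            /* e.g. "windows-1252" */
    }

    int main(void) {
        /* No user override, no BOM, HTTP header says UTF-8: */
        printf("%s\n", pick_encoding(0, 0, "UTF-8", 0, "windows-1252"));
        return 0;
    }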