Too few programmers know that UTF-8 is a variable-length encoding: I've heard plenty assert that in UTF-8 every character takes two bytes, while simultaneously claiming they could encode every possible character in it.
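A quick Python 3 illustration of the variable length (the characters are just arbitrary examples):

  # UTF-8 uses 1 to 4 bytes per character, depending on the code point.
  for ch in ['a', 'é', '€', '𐍈']:
      print(ch, len(ch.encode('utf-8')), 'byte(s)')
  # a 1 byte(s)
  # é 2 byte(s)
  # € 3 byte(s)
  # 𐍈 4 byte(s)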
A bit broader: too few programmers understand the difference between a character set and a character encoding.
That article starts out OK and then suddenly tries to argue that you can use the terms interchangeably. You cannot, and you will drown in confusion if you try to. Just imagine that tomorrow China introduces its own character set next to Unicode, but uses a UTF-8-style encoding to minimize the number of bytes it takes to represent its language (which makes sense, because character frequency drops off pretty fast and some characters are much more common than others, so you'd like to represent those with one byte).
The fact that the HTTP RFC speaks of 'charset=utf-8' is explained by this part of the spec:
Note: This use of the term "character set" is more commonly
referred to as a "character encoding." However, since HTTP and
MIME share the same registry, it is important that the
terminology also be shared.
Why does MIME use the 'wrong' terminology? Perhaps because the registry is old and the difference between set and encoding was less obvious and relevant back then. Perhaps it was simply a mistake, a detail meant to be corrected. Perhaps the person who drew it up was inept. Who knows. It doesn't matter; it is still wrong. And don't get me started on the use of 'character set' in MySQL...
Unicode is a character set, and the only character set really worth speaking of. The Unicode character set includes almost every character in every writing system on Earth. A string is a piece of text, i.e. an ordered sequence of characters all taken from the same character set.
A character encoding is a mapping/function/algorithm/set of rules which can be used to convert a string into a sequence of bytes and back again.
A character set may have multiple encodings. UTF-8 and UTF-16 are two possible encodings of the Unicode character set.
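To make that concrete, a small Python 3 sketch: one string, i.e. one sequence of Unicode code points, turned into two different byte sequences depending on the encoding you pick (the example string is arbitrary):

  s = 'héllo'                # one string: a sequence of Unicode code points
  s.encode('utf-8')          # b'h\xc3\xa9llo' (6 bytes)
  s.encode('utf-16-le')      # b'h\x00\xe9\x00l\x00l\x00o\x00' (10 bytes)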
Actually, it's just stored internally as UCS-2. I'm not sure why that really matters, though. You shouldn't care how your Unicode code points are stored (i.e., the size of the integer) as long as they encode to a UTF encoding correctly.
Edit: Ah, guys, it's actually UTF-16; the configure flag is just named ucs2. False alarm.
I doubt that they are encoded in UCS-2, as UCS-2 isn't able to encode every Unicode code point (or even just the majority of them).
You are right, though (and this is why I upvoted you back to 1), that you shouldn't care. In fact, you not knowing the internal encoding is proof of that. In Python (I'm talking Python 3 here, which has done this right), you don't care how a string is stored internally.
The only place where you care about this is when your strings interact with the outside world (I/O). Then your strings need to be converted into bytes, and thus the internal representation must be encoded using some kind of encoding.
This is what the .decode and .encode methods are used for.
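A minimal Python 3 sketch of that boundary (the text and the choice of UTF-8 are just assumptions for illustration):

  text = 'straße'                       # how this is stored internally: don't care
  data = text.encode('utf-8')           # leaving the program: str -> bytes
  assert data.decode('utf-8') == text   # coming back in: bytes -> str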
In Python 2.x, strings are encoded in UCS-2, not UTF-16, at least by default (I'm not sure about Python 3.x, but I assume it's the same). If you want to support every single possible Unicode code point, you can tell Python to do so at compile time (via a ./configure flag).
In practice, the characters that aren't in UCS-2 tend to be characters that don't exist in modern languages, e.g. the character sets for Linear B, domino tiles, and cuneiform, so they're not supported, since they're not of practical use to most people. There's a fairly good list at http://en.wikipedia.org/wiki/Plane_(Unicode) . In terms of that list, Python by default doesn't support anything outside the BMP.
Things outside of the BMP aren't just dead languages anymore. You have to be able to support characters outside the BMP if you want to sell your software in China:
UTF-16 relates to UCS-2 the way UTF-8 relates to ASCII. Meaning: they share the character set. UTF-16 extends UCS-2 by using some reserved code points (the surrogates) to indicate that what follows should be interpreted according to UTF-16 rules. Just like UTF-8 does.
Meaning: every UCS-2 document is also a UTF-16 document, but not the reverse (just like every ASCII document is also a UTF-8 document).
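A small Python illustration of the ASCII/UTF-8 half of that (the reverse direction fails as soon as a byte >= 0x80 shows up):

  b'plain old ASCII'.decode('ascii') == b'plain old ASCII'.decode('utf-8')  # True
  b'caf\xc3\xa9'.decode('utf-8')      # 'café': valid UTF-8
  # b'caf\xc3\xa9'.decode('ascii')    # would raise UnicodeDecodeError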
But as I said below: it doesn't matter, and it could even be a totally proprietary character set, as long as Python's string operations work on that character set and as long as there's a way to decode input data into that set and encode output data from that set.
You should very much care about that, because if your tool stores text as UCS-2, it means it doesn't support Unicode at all; UCS-2 stopped being a valid encoding a long time ago.
You are completely right, I'm sorry about my previous comment.
The strange thing is that I couldn't find any reference to surrogate pairs in the Python documentation, so I was assuming that the elements of a unicode string were complete code points. But this is not the case:
>>> list(u'\U00010000')
[u'\ud800', u'\udc00']
If I had Python compiled with the UTF-32 option, this would return a single element, so Python is leaking an implementation detail that can change across builds. That's really, really bad...
No, that's the correct behavior. list() only incidentally returns single characters for ASCII strings -- it's not required to. You shouldn't be using list on raw unicode strings.
u'\U00010000'.encode('utf-8')
should produce the same result on every Python version.
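For what it's worth (hedging only on versions I haven't tried), both narrow and wide builds are supposed to join the surrogate pair when encoding, so you should get the same four bytes either way:

  u'\U00010000'.encode('utf-8')
  # '\xf0\x90\x80\x80' on both 16-bit and 32-bit builds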
> You shouldn't be using list on raw unicode strings.
Why? I am using list only to show what the values of s[0] and s[1] are.
What I am saying is that it returns the list of characters of the underlying representation, so a list of 16-bit chars (possibly surrogates) if compiled with the UTF-16 option, or a list of 32-bit characters if compiled with the UTF-32 option.
Are you suggesting that all string processing (including iteration) should be done on a str encoded in UTF-8 instead of using the native unicode type?
If you want to deal with characters with high code point values, you need to know this code point stuff. For example, String.length() returns the number of two-byte chars, not the number of real (possibly four-byte) characters, which may confuse someone.
Exactly. A Java char is not synonymous with a Unicode code point. But the majority of the time they coincide, older documentation claimed that they were the same, and this is the meme that many Java programmers (in my experience) still carry around.
That's actually my point. Python supports Unicode code points and the UTF encodings. If your output encoding is UTF-8, you actually get variable-length characters. What's important is your encoded output, not the internal code point representation.
It leaks through in some places. For example, len(u'\U0001D310') (from the Tai Xuan Jing Symbols) returns 1 on 32-bit wide Python builds, and returns 2 on the default 16-bit wide builds.
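If you need a count that doesn't leak the build width, a hedged sketch (assuming the only non-BMP artifacts you see are well-formed surrogate pairs) is to fold the pairs back together yourself:

  def codepoint_len(u):
      # Count code points, treating a high+low surrogate pair (as produced by
      # 16-bit builds for non-BMP characters) as a single code point.
      count, i = 0, 0
      while i < len(u):
          if (u'\ud800' <= u[i] <= u'\udbff' and i + 1 < len(u)
                  and u'\udc00' <= u[i + 1] <= u'\udfff'):
              i += 2
          else:
              i += 1
          count += 1
      return count

  codepoint_len(u'\U0001D310')   # 1 on both narrow and wide builds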
The Windows NT development team made the decision to standardise on UTF-16. Every release of Windows since the original NT uses UTF-16 internally for all its "wide character" API calls (e.g., wcslen() and FindWindowW()).
I don't know about "preferring", but anyone manipulating strings in JavaScript is effectively using UTF-16 (or more precisely is using arrays of 16-bit integers which a web browser will interpret as UTF-16-encoded Unicode if you tell it that the array contains text).
As a consequence, at least Gecko and WebKit both use UTF-16 for their string classes, though there has been talk of trying to switch Gecko over to UTF-8. The problem then would be implementing the JS string APIs efficiently on top of UTF-8 strings.
Strings are big-endian UTF-16 by default even in Cocoa (stored in an array of unsigned shorts). Worst of all, GCC defines wchar_t as a 4-byte int unless you specify -fshort-wchar.
As far as I know, wchar_t is meant to be an internal-only representation, so it's good that it is 32 bits--that way you are in one-code-point-per-word territory. It's a mistake to think you can just overlay some Unicode binary data with a wchar_t pointer--you need to convert into and out of wchar_t from UTF-8/UTF-16/whatever. Otherwise you aren't handling code points above 16 bits correctly.
This is a common misconception about UTF-8 vs. UTF-16. You're missing two important facts.
1. Most UTF-8 string operations can operate on a byte at a time. You just have to realize that functions like strlen will be telling you byte length instead of character length, and this usually doesn't even matter. (It's still important to know.) See the sketch after point 2.
2. UTF-16 is still a variable-width encoding. It was originally intended to be fixed-width, but then the Unicode character set grew too large to be represented in 16 bits.
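A hedged Python sketch of point 1 (the strings are arbitrary examples): byte-level lengths and offsets differ from character-level ones, but byte-level searching still works, because no byte of a multi-byte UTF-8 sequence can be mistaken for an ASCII byte.

  s = u'naïve café'
  data = s.encode('utf-8')
  len(s)                               # 10 characters
  len(data)                            # 12 bytes ('ï' and 'é' take two each)
  data.find(u'café'.encode('utf-8'))   # 7: a byte offset, not a character offset
  s.find(u'café')                      # 6: the character offset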
If I have no surrogate-range code points in a string, it is far easier to work with UTF-16 than UTF-8 at the byte level, because all chars are a constant size. For UTF-8 that only applies to ASCII. And surrogate-range characters are extraordinarily rare, while non-ASCII chars are extremely common. So my programs ensure at the entry points that the string is UCS-2 compatible, and then all subsequent string manipulations are far less complex to handle than with UTF-8.
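Something along these lines is what I mean by the entry-point check (a hedged sketch; the function name is made up). It rejects anything outside the BMP, including the surrogates a 16-bit build produces for astral characters:

  def is_ucs2_compatible(s):
      # Every code point must fit in one 16-bit code unit and must not be a
      # surrogate (narrow builds store non-BMP characters as surrogate pairs).
      return all(ord(ch) <= 0xFFFF and not (0xD800 <= ord(ch) <= 0xDFFF)
                 for ch in s)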
UTF-16 was never intended to be a fixed-width encoding; it was created precisely to support the characters outside the BMP which aren't covered by UCS-2.
The environments that jumped on Unicode early, before it was realized that 2 bytes wouldn't be enough, all chose UCS-2 for obvious reasons. In particular, that includes Windows and Java.
Probably because they figured they could just ignore endianness issues and that ASCII compatibility would be Somebody Else's Problem.
There were always problems with UCS-2. UTF-8 would have had a number of advantages over it even if Unicode had never grown beyond the BMP (Basic Multilingual Plane, the first and lowest-numbered 16-bit code space).
> ASCII compatibility would be Somebody Else's Problem
for many of those outside the "A" in ASCII (euphemism for America :) there were already a ton of problems, so endianness was the least of them (i personally never hit this problem)
// disclaimer: i'm not that serious about predominance of Latin script, this is sorta irony
Depending on the level of abstraction you're living at - and that depends on the overall goal, performance constraints, environmental integration, OS / machine heterogeneity etc. - it may or may not be a problem.
It's easy to dismiss if you have all the time in the world and a deep stack of abstractions.
If you're doing deep packet analysis on UTF-16 text in a router, things may be different.
thanks, my question was exactly about the issues met by people living at other levels of abstraction.
i'm not a native english speaker and a newb to HN, so sorry that i phrased my sincere question in a way that made it look like an arrogant statement: 'there are no issues, what are you talking about, i don't even know what LE and BE mean'.
> for many of those outside the "A" in ASCII (euphemism for America :)
Abbreviation for 'American', in fact. No euphemisms needed.
(ASCII = American Standard Code for Information Interchange)
> there were already a ton of problems, so endianness was the least of them
I can appreciate this. However, UTF-8 also has desirable properties like 'dropping a single byte only means you lose one character, as opposed to potentially losing the whole file', and 'you can often tell if a multi-byte UTF-8 sequence has been corrupted without doing complex analysis'.
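A quick Python sketch of the 'damage stays local' property (the word is an arbitrary example):

  good = u'straße'.encode('utf-8')   # 7 bytes; 'ß' is the two-byte sequence C3 9F
  bad = good[:4] + good[5:]          # drop one byte of that sequence
  bad.decode('utf-8', 'replace')     # u'stra\ufffde' -- only one character lost
  # A strict decode raises UnicodeDecodeError instead, so the corruption is
  # detectable without any complex analysis.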
> i'm not that serious about predominance of Latin script, this is sorta irony
Heh. ASCII can't even encode the entirety of the Latin script: Ask a Frenchman how he spells 'café', or a German how he spells 'straße', and notice how important characters are missing from ASCII.