Too few programmers know that UTF-8 is a variable-length encoding: I've heard plenty assert that in UTF-8 every character takes two bytes, while simultaneously claiming they could encode every possible character in it.
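A quick Python 3 illustration of the variable length (the characters are just arbitrary examples):

  # UTF-8 uses 1 to 4 bytes per character, depending on the code point.
  for ch in ['a', 'é', '€', '𐍈']:
      print(ch, len(ch.encode('utf-8')), 'byte(s)')
  # a 1 byte(s)
  # é 2 byte(s)
  # € 3 byte(s)
  # 𐍈 4 byte(s)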
A bit broader: too few programmers understand the difference between a character set and a character encoding.
That article starts out OK and then suddenly tries to argue that you can use the terms interchangeably. You cannot, and you will drown in confusion if you try to. Just imagine that tomorrow China introduces its own character set next to Unicode, but uses a UTF-8-style encoding to minimize the number of bytes it takes to represent its language (which makes sense, because character frequency drops off pretty fast and some characters are much more common than others, so you'd like to represent those with one byte).
The fact that the HTTP RFC speaks of 'charset=utf-8' is explained by this part of the spec:
Note: This use of the term "character set" is more commonly
referred to as a "character encoding." However, since HTTP and
MIME share the same registry, it is important that the
terminology also be shared.
Why does MIME use the 'wrong' terminology? Perhaps because the registry is old and the difference between set and encoding was less obvious and relevant back then. Perhaps it was simply a mistake, a detail meant to be corrected. Perhaps the person who drew it up was inept. Who knows. It doesn't matter; it is still wrong. And don't get me started on the use of 'character set' in MySQL...
Unicode is a character set, and the only character set really worth speaking of. The Unicode character set includes almost every character in every writing system on Earth. A string is a piece of text, i.e. an ordered sequence of characters all taken from the same character set.
A character encoding is a mapping/function/algorithm/set of rules which can be used to convert a string into a sequence of bytes and back again.
A character set may have multiple encodings. UTF-8 and UTF-16 are two possible encodings of the Unicode character set.
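To make that concrete, a small Python 3 sketch: one string, i.e. one sequence of Unicode code points, turned into two different byte sequences depending on the encoding you pick (the example string is arbitrary):

  s = 'héllo'                # one string: a sequence of Unicode code points
  s.encode('utf-8')          # b'h\xc3\xa9llo' (6 bytes)
  s.encode('utf-16-le')      # b'h\x00\xe9\x00l\x00l\x00o\x00' (10 bytes)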
Actually, it's just stored internally as UCS-2. I'm not sure why that really matters, though. You shouldn't care how your Unicode code points are stored (i.e., the size of the integer) as long as they encode to a UTF encoding correctly.
Edit: Ah, guys, it's actually UTF-16; the configure flag is just named ucs2. False alarm.
I doubt that they are encoded in UCS-2, as UCS-2 isn't able to encode every Unicode code point (or even just the majority of them).
You are right, though (and this is why I upvoted you back to 1), that you shouldn't care. In fact, you not knowing the internal encoding is proof of that. In Python (I'm talking Python 3 here, which has done this right), you don't care how a string is stored internally.
The only place where you care about this is when your strings interact with the outside world (I/O). Then your strings need to be converted into bytes, and thus the internal representation must be encoded using some kind of encoding.
This is what the .decode and .encode methods are used for.
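A minimal Python 3 sketch of that boundary (the text and the choice of UTF-8 are just assumptions for illustration):

  text = 'straße'                       # how this is stored internally: don't care
  data = text.encode('utf-8')           # leaving the program: str -> bytes
  assert data.decode('utf-8') == text   # coming back in: bytes -> str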
In Python 2.x, strings are encoded in UCS-2, not UTF-16, at least by default (I'm not sure about Python 3.x, but I assume it's the same). If you want to support every single possible Unicode code point, you can tell Python to do so at compile time (via a ./configure flag).
In practice, the characters that aren't in UCS-2 tend to be characters that don't exist in modern languages, e.g. the character sets for Linear B, domino tiles, and cuneiform, so they're not supported, since they're not of practical use to most people. There's a fairly good list at http://en.wikipedia.org/wiki/Plane_(Unicode) . In terms of that list, Python by default doesn't support anything outside the BMP.
Things outside of the BMP aren't just dead languages anymore. You have to be able to support characters outside the BMP if you want to sell your software in China:
UTF-16 relates to UCS-2 the way UTF-8 relates to ASCII. Meaning: they share the character set. UTF-16 extends UCS-2 by using some reserved code points (the surrogates) to indicate that what follows should be interpreted according to UTF-16 rules. Just like UTF-8 does.
Meaning: every UCS-2 document is also a UTF-16 document, but not the reverse (just like every ASCII document is also a UTF-8 document).
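A small Python illustration of the ASCII/UTF-8 half of that (the reverse direction fails as soon as a byte >= 0x80 shows up):

  b'plain old ASCII'.decode('ascii') == b'plain old ASCII'.decode('utf-8')  # True
  b'caf\xc3\xa9'.decode('utf-8')      # 'café': valid UTF-8
  # b'caf\xc3\xa9'.decode('ascii')    # would raise UnicodeDecodeError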
But as I said below: it doesn't matter, and it could even be a totally proprietary character set, as long as Python's string operations work on that character set and as long as there's a way to decode input data into that set and encode output data from that set.
You should very much care about that, because if your tool stores text as UCS-2, it means it doesn't support Unicode at all; UCS-2 stopped being a valid encoding a long time ago.
You are completely right, I'm sorry about my previous comment.
The strange thing is that I couldn't find any reference to surrogate pairs in the Python documentation, so I was assuming that the elements of a unicode string were complete code points. But this is not the case:
>>> list(u'\U00010000')
[u'\ud800', u'\udc00']
If I had Python compiled with the UTF-32 option, this would return a single element, so Python is leaking an implementation detail that can change across builds. That's really, really bad...
No, that's the correct behavior. list() only incidentally returns single characters for ASCII strings -- it's not required to. You shouldn't be using list on raw unicode strings.
u'\U00010000'.encode('utf-8')
should produce the same result on every Python version.
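For what it's worth (hedging only on versions I haven't tried), both narrow and wide builds are supposed to join the surrogate pair when encoding, so you should get the same four bytes either way:

  u'\U00010000'.encode('utf-8')
  # '\xf0\x90\x80\x80' on both 16-bit and 32-bit builds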
> You shouldn't be using list on raw unicode strings.
Why? I am using list only to show what the values of s[0] and s[1] are.
What I am saying is that it returns the list of characters of the underlying representation, so a list of 16-bit chars (possibly surrogates) if compiled with the UTF-16 option, or a list of 32-bit characters if compiled with the UTF-32 option.
Are you suggesting that all string processing (including iteration) should be done on a str encoded in UTF-8 instead of using the native unicode type?
If you want to deal with characters with high code point values, you need to know this code point stuff. For example, String.length() returns the number of two-byte chars, not the number of real (possibly four-byte) characters, which may confuse someone.
Exactly. A Java char is not synonymous with a Unicode code point. But the majority of the time they coincide, older documentation claimed that they were the same, and this is the meme that many Java programmers (in my experience) still carry around.
That's actually my point. Python supports Unicode code points and the UTF encodings. If your output encoding is UTF-8, you actually get variable-length characters. What's important is your encoded output, not the internal code point representation.
It leaks through in some places. For example, len(u'\U0001D310') (from the Tai Xuan Jing Symbols) returns 1 on 32-bit wide Python builds, and returns 2 on the default 16-bit wide builds.
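If you need a count that doesn't leak the build width, a hedged sketch (assuming the only non-BMP artifacts you see are well-formed surrogate pairs) is to fold the pairs back together yourself:

  def codepoint_len(u):
      # Count code points, treating a high+low surrogate pair (as produced by
      # 16-bit builds for non-BMP characters) as a single code point.
      count, i = 0, 0
      while i < len(u):
          if (u'\ud800' <= u[i] <= u'\udbff' and i + 1 < len(u)
                  and u'\udc00' <= u[i + 1] <= u'\udfff'):
              i += 2
          else:
              i += 1
          count += 1
      return count

  codepoint_len(u'\U0001D310')   # 1 on both narrow and wide builds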
The Windows NT development team made the decision to standardise on UTF-16. Every release of Windows since the original NT uses UTF-16 internally for all its "wide character" API calls (e.g., wcslen() and FindWindowW()).
I don't know about "preferring", but anyone manipulating strings in JavaScript is effectively using UTF-16 (or more precisely is using arrays of 16-bit integers which a web browser will interpret as UTF-16-encoded Unicode if you tell it that the array contains text).
As a consequence, at least Gecko and WebKit both use UTF-16 for their string classes, though there has been talk of trying to switch Gecko over to UTF-8. The problem then would be implementing the JS string APIs efficiently on top of UTF-8 strings.
Strings are big-endian UTF-16 by default even in Cocoa (stored in an array of unsigned shorts). Worst of all, GCC defines wchar_t as a 4-byte int unless you specify -fshort-wchar.
As far as I know, wchar_t is meant to be an internal-only representation, so it's good that it is 32 bits--that way you are in one-code-point-per-word territory. It's a mistake to think you can just overlay some Unicode binary data with a wchar_t pointer--you need to convert into and out of wchar_t from UTF-8/UTF-16/whatever. Otherwise you aren't handling code points above 16 bits correctly.
This is a common misconception about UTF-8 vs. UTF-16. You're missing two important facts.
1. Most UTF-8 string operations can operate on a byte at a time. You just have to realize that functions like strlen will be telling you byte length instead of character length, and this usually doesn't even matter. (It's still important to know.) See the sketch after point 2.
2. UTF-16 is still a variable-width encoding. It was originally intended to be fixed-width, but then the Unicode character set grew too large to be represented in 16 bits.
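A hedged Python sketch of point 1 (the strings are arbitrary examples): byte-level lengths and offsets differ from character-level ones, but byte-level searching still works, because no byte of a multi-byte UTF-8 sequence can be mistaken for an ASCII byte.

  s = u'naïve café'
  data = s.encode('utf-8')
  len(s)                               # 10 characters
  len(data)                            # 12 bytes ('ï' and 'é' take two each)
  data.find(u'café'.encode('utf-8'))   # 7: a byte offset, not a character offset
  s.find(u'café')                      # 6: the character offset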
If I have no surrogate-range code points in a string, it is far easier to work with UTF-16 than UTF-8 at the byte level, because all chars are a constant size. For UTF-8 that only applies to ASCII. And surrogate-range characters are extraordinarily rare, while non-ASCII chars are extremely common. So my programs ensure at the entry points that the string is UCS-2 compatible, and then all subsequent string manipulations are far less complex to handle than with UTF-8.
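Something along these lines is what I mean by the entry-point check (a hedged sketch; the function name is made up). It rejects anything outside the BMP, including the surrogates a 16-bit build produces for astral characters:

  def is_ucs2_compatible(s):
      # Every code point must fit in one 16-bit code unit and must not be a
      # surrogate (narrow builds store non-BMP characters as surrogate pairs).
      return all(ord(ch) <= 0xFFFF and not (0xD800 <= ord(ch) <= 0xDFFF)
                 for ch in s)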
UTF-16 was never intended to be a fixed-width encoding; it was created precisely to support the characters outside the BMP which aren't covered by UCS-2.
The environments that jumped on Unicode early, before it was realized that 2 bytes wouldn't be enough, all chose UCS-2 for obvious reasons. In particular, that includes Windows and Java.
Probably because they figured they could just ignore endianness issues and that ASCII compatibility would be Somebody Else's Problem.
There were always problems with UCS-2. UTF-8 would have had a number of advantages over it even if Unicode had never grown beyond the BMP (Basic Multilingual Plane, the first and lowest-numbered 16-bit code space).
> ASCII compatibility would be Somebody Else's Problem
for many of those outside the "A" in ASCII (euphemism for America :) there were already a ton of problems, so endianness was the least of them (i personally never hit this problem)
// disclaimer: i'm not that serious about predominance of Latin script, this is sorta irony
Depending on the level of abstraction you're living at - and that depends on the overall goal, performance constraints, environmental integration, OS / machine heterogeneity etc. - it may or may not be a problem.
It's easy to dismiss if you have all the time in the world and a deep stack of abstractions.
If you're doing deep packet analysis on UTF-16 text in a router, things may be different.
thanks, my question was exactly about the issues met by people living at other levels of abstraction.
i'm not a native english speaker and a newb to HN, so sorry that i phrased my sincere question in a way that made it look like an arrogant statement: 'there are no issues, what are you talking about, i don't even know what LE and BE mean'.
> for many of those outside the "A" in ASCII (euphemism for America :)
Abbreviation for 'American', in fact. No euphemisms needed.
(ASCII = American Standard Code for Information Interchange)
> there were already a ton of problems, so endianness was the least of them
I can appreciate this. However, UTF-8 also has desirable properties like 'dropping a single byte only means you lose one character, as opposed to potentially losing the whole file', and 'you can often tell if a multi-byte UTF-8 sequence has been corrupted without doing complex analysis'.
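A quick Python sketch of the 'damage stays local' property (the word is an arbitrary example):

  good = u'straße'.encode('utf-8')   # 7 bytes; 'ß' is the two-byte sequence C3 9F
  bad = good[:4] + good[5:]          # drop one byte of that sequence
  bad.decode('utf-8', 'replace')     # u'stra\ufffde' -- only one character lost
  # A strict decode raises UnicodeDecodeError instead, so the corruption is
  # detectable without any complex analysis.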
> i'm not that serious about predominance of Latin script, this is sorta irony
Heh. ASCII can't even encode the entirety of the Latin script: Ask a Frenchman how he spells 'café', or a German how he spells 'straße', and notice how important characters are missing from ASCII.