Hacker News

UTF-16 gives you the worst of both worlds: it wastes a lot of memory for ASCII-like languages (mostly ASCII plus some special characters), you have to deal with byte ordering, and you still don't get the advantage of directly addressing a specific character — because of surrogate pairs you still have to scan the whole string up to the character you want.

But.

If you widen the character even more, you'd probably still want it to be word-aligned, so you'd use 32 bits per character, which is enough storage for all of Unicode plus some. The cost, though, is obvious: you waste memory.

Depending on the language, you even waste a lot of memory: think ASCII or ASCII-like text (the Central European languages). In UTF-8 those need - depending on the language - barely more than one byte up to two bytes per character. Representing these languages with 4 bytes per character makes you use nearly four times the memory actually needed.
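A quick sketch of that point (the sample words are just illustrative):

```python
# Per-character byte cost in UTF-8 for ASCII and ASCII-like text.
samples = {
    "English": "hello world",   # pure ASCII: 1 byte per character
    "German": "Grüße",          # mostly ASCII, umlauts take 2 bytes
    "Russian": "привет",        # Cyrillic: 2 bytes per character
}
for lang, text in samples.items():
    encoded = text.encode("utf-8")
    print(f"{lang}: {len(text)} chars -> {len(encoded)} bytes "
          f"({len(encoded) / len(text):.2f} bytes/char)")
```

All of these stay well under the 4 bytes/char that UTF-32 would spend.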

This changes the farther east you move. Representing Chinese (no ASCII, many code points high up in the set) in UTF-8, you begin wasting a lot of memory due to the ASCII compatibility: encoding a BMP CJK code point in UTF-8 takes three bytes, one byte more than just storing the code point as a 16-bit integer.

So for international software running on potentially limited memory while targeting the eastern languages, you may again be better off using UTF-16, as it needs only two bytes for the BMP characters (U+0800 through U+FFFF) that UTF-8 encodes in three.
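The difference is easy to check, since CJK characters sit in the three-byte range of UTF-8 but the two-byte range of UTF-16:

```python
# BMP CJK characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
text = "中文编码"  # four CJK characters
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # "-le" variant avoids the 2-byte BOM
print(len(utf8))   # 12 bytes
print(len(utf16))  # 8 bytes
```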

Also, if you know your text stays within a fixed-width subset (plain ASCII in UTF-8, or the BMP with no surrogate pairs in UTF-16), you can optimize and access characters directly by index without parsing the whole string, giving you another speed advantage.

I don't know what the best way to go is. 32 bits is wasteful; UTF-16 is sometimes wasteful, has endianness issues and still needs parsing (but is less wasteful than 32 bits in most realistic cases); and UTF-8 is wasteful for high code points and always requires parsing, but has no endianness issues.
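The trade-off in one table, using the little-endian variants so no BOM is counted:

```python
# Byte cost of the same strings under the three encoding forms.
samples = ["hello", "Grüße", "中文编码"]
for text in samples:
    sizes = {enc: len(text.encode(enc))
             for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(f"{text!r}: {sizes}")
```

ASCII favors UTF-8, BMP CJK favors UTF-16, and UTF-32 never wins on size.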

I guess as always these are just tools and you have to pick what works in your situation. Developers have to adapt to what was picked.




> Representing Chinese (no ASCII, many character points high up in the set) in utf-8, you begin wasting a lot of memory due to the ASCII compatibility.

That's not "a lot": it's 50% more in the extreme case of purely CJK text (three bytes per character in UTF-8 vs. two in UTF-16).

OTOH any ASCII characters are 50% smaller. It starts to even out if you add a few foreign names or URLs to the text.
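A made-up mixed sentence shows the evening-out effect: once a URL is embedded, the UTF-8 version is smaller overall.

```python
# CJK text with an embedded URL: the 3-byte CJK penalty in UTF-8
# is more than offset by the 1-byte ASCII characters.
text = "请访问 https://example.com 了解详情"
utf8 = len(text.encode("utf-8"))
utf16 = len(text.encode("utf-16-le"))
print(utf8, utf16)  # 42 56
```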

And for HTML, UTF-16 is just crazy: it makes the pure-ASCII markup twice as expensive.

CJK websites don't use UTF-16; they historically used Shift-JIS or GBK, which are variable-width encodings and technically more like UTF-8.


How exactly does UTF-16 make HTML twice as expensive?

I guarantee you that there are very few apps that will double their memory usage if you switch to UTF-16 text. Even if you look at bandwidth, once you compress the text there is very little difference. (You are compressing your HTML, right?)
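That claim is easy to sanity-check with zlib (deflate, which is what gzip transfer encoding uses): the zero bytes in UTF-16 compress extremely well, so the on-the-wire gap is far smaller than 2x.

```python
import zlib

# Compare raw vs. deflate-compressed sizes of the same markup
# in UTF-8 and UTF-16LE.
html = "<html><body>" + "<p>hello world</p>" * 200 + "</body></html>"
for enc in ("utf-8", "utf-16-le"):
    raw = html.encode(enc)
    packed = zlib.compress(raw)
    print(enc, len(raw), len(packed))
```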

The case for UTF-8 saving memory makes a lot of sense if you're writing embedded software; however, in most stacks the amount of memory wasted by UTF-16 is trivial compared to the amount wasted by the GC, bloated libraries, interpreted code, etc.

If you're using .NET or the JVM, char is 16 bits wide anyway. The UTF-8 vs. UTF-16 debate is a perfect example of micro-benchmarking, where theoretically there is a great case for saving a resource, but in aggregate it makes very little difference.


> How exactly does UTF-16 make HTML twice as expensive?

    < \0 h \0 t \0 m \0 l \0 > \0   (every ASCII byte followed by a zero byte in UTF-16LE)

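Concretely, in Python:

```python
# "<html>" in UTF-16LE: each ASCII character is followed by a zero
# byte, so pure-ASCII markup doubles in size.
markup = "<html>"
encoded = markup.encode("utf-16-le")
print(encoded)                 # b'<\x00h\x00t\x00m\x00l\x00>\x00'
print(len(markup), len(encoded))  # 6 12
```
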
> If you're using .NET or the JVM char is 16 bits wide anyway.

Hopefully you don't need to worry about what .NET/the JVM have to do for legacy reasons, and you can use UTF-8 for all input and output.


> If you widen the character even more (...)

If you go to UTF-32 you waste a lot more space in any conceivable situation (extra-BMP characters are less than 0.1% of the text even in CJK languages), and you still don't get fixed-width "characters" in exchange, thanks to combining diacritics and ligatures. Thanks, Unicode Committee!
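The combining-character point in action: even at 4 bytes per code point, one user-perceived character can still span multiple code points.

```python
import unicodedata

# "é" can be one code point (U+00E9) or two: "e" + combining acute
# (U+0301). A fixed-width encoding doesn't make indexing by
# user-perceived character safe.
composed = "é"
decomposed = unicodedata.normalize("NFD", composed)
print(len(composed), len(decomposed))        # 1 2
print(len(decomposed.encode("utf-32-le")))   # 8 bytes for one visible glyph
```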


But as one of the replies in that thread pointed out, if you want to compress strings, use LZ! Switching to UTF-16 won't give you anywhere near as good a size reduction.
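A rough sketch of that comparison, using zlib as the LZ-family codec: for repetitive CJK text, compressed UTF-8 easily beats raw UTF-16.

```python
import zlib

# Raw UTF-16 vs. LZ-compressed UTF-8 for repetitive CJK text.
text = "中文编码测试 " * 100
utf16_raw = len(text.encode("utf-16-le"))
utf8_zipped = len(zlib.compress(text.encode("utf-8")))
print(utf16_raw, utf8_zipped)
```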



