Hacker News

UTF-16 gives you the worst of both worlds: it wastes a lot of memory for ASCII-like languages (mostly ASCII plus some special characters), you have to deal with byte ordering, and you still don't get the advantage of directly addressing a specific character — because of surrogate pairs you still have to scan the whole string up to the character you want.

But.

If you widen the character even more, you'd probably still want it to be word-aligned, so you'd use 32 bits per character, which is enough storage for all of Unicode plus some. The cost, though, is obvious: you waste memory.

Depending on the language, you even waste a lot of memory: think ASCII or ASCII-like text (the Central European languages). In UTF-8 those need - depending on the language - barely more than one byte up to two bytes per character. Representing these languages with 4 bytes per character makes you use nearly four times the memory actually needed.
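A quick sketch of that point (the sample words are just illustrative):

```python
# Per-character byte cost in UTF-8 for ASCII and ASCII-like text.
samples = {
    "English": "hello world",   # pure ASCII: 1 byte per character
    "German": "Grüße",          # mostly ASCII, umlauts take 2 bytes
    "Russian": "привет",        # Cyrillic: 2 bytes per character
}
for lang, text in samples.items():
    encoded = text.encode("utf-8")
    print(f"{lang}: {len(text)} chars -> {len(encoded)} bytes "
          f"({len(encoded) / len(text):.2f} bytes/char)")
```

All of these stay well under the 4 bytes/char that UTF-32 would spend.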

This changes the farther east you move. Representing Chinese (no ASCII, many code points high up in the set) in UTF-8, you begin wasting a lot of memory due to the ASCII compatibility: encoding a BMP CJK code point in UTF-8 takes three bytes, one byte more than just storing the code point as a 16-bit integer.

So for international software running on potentially limited memory while targeting the eastern languages, you may again be better off using UTF-16, as it needs only two bytes for the BMP characters (U+0800 through U+FFFF) that UTF-8 encodes in three.
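The difference is easy to check, since CJK characters sit in the three-byte range of UTF-8 but the two-byte range of UTF-16:

```python
# BMP CJK characters: 3 bytes each in UTF-8, 2 bytes each in UTF-16.
text = "中文编码"  # four CJK characters
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")  # "-le" variant avoids the 2-byte BOM
print(len(utf8))   # 12 bytes
print(len(utf16))  # 8 bytes
```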

Also, if you know your text stays within a fixed-width subset (plain ASCII in UTF-8, or the BMP with no surrogate pairs in UTF-16), you can optimize and access characters directly by index without parsing the whole string, giving you another speed advantage.

I don't know what the best way to go is. 32 bits is wasteful; UTF-16 is sometimes wasteful, has endianness issues and still needs parsing (but is less wasteful than 32 bits in most realistic cases); and UTF-8 is wasteful for high code points and always requires parsing, but has no endianness issues.
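The trade-off in one table, using the little-endian variants so no BOM is counted:

```python
# Byte cost of the same strings under the three encoding forms.
samples = ["hello", "Grüße", "中文编码"]
for text in samples:
    sizes = {enc: len(text.encode(enc))
             for enc in ("utf-8", "utf-16-le", "utf-32-le")}
    print(f"{text!r}: {sizes}")
```

ASCII favors UTF-8, BMP CJK favors UTF-16, and UTF-32 never wins on size.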

I guess as always these are just tools and you have to pick what works in your situation. Developers have to adapt to what was picked.




> Representing Chinese (no ASCII, many character points high up in the set) in utf-8, you begin wasting a lot of memory due to the ASCII compatibility.

That's not "a lot": it's 50% more in the extreme case of purely CJK text (three bytes per character in UTF-8 vs. two in UTF-16).

OTOH any ASCII characters are 50% smaller. It starts to even out if you add a few foreign names or URLs to the text.
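A made-up mixed sentence shows the evening-out effect: once a URL is embedded, the UTF-8 version is smaller overall.

```python
# CJK text with an embedded URL: the 3-byte CJK penalty in UTF-8
# is more than offset by the 1-byte ASCII characters.
text = "请访问 https://example.com 了解详情"
utf8 = len(text.encode("utf-8"))
utf16 = len(text.encode("utf-16-le"))
print(utf8, utf16)  # 42 56
```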

And for HTML, UTF-16 is just crazy: it makes the pure-ASCII markup twice as expensive.

CJK websites don't use UTF-16; they historically used Shift-JIS or GBK, which are variable-width encodings and technically more like UTF-8.


How exactly does UTF-16 make HTML twice as expensive?

I guarantee you that there are very few apps that will double their memory usage if you switch to UTF-16 text. Even if you look at bandwidth, once you compress the text there is very little difference. (You are compressing your HTML, right?)
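That claim is easy to sanity-check with zlib (deflate, which is what gzip transfer encoding uses): the zero bytes in UTF-16 compress extremely well, so the on-the-wire gap is far smaller than 2x.

```python
import zlib

# Compare raw vs. deflate-compressed sizes of the same markup
# in UTF-8 and UTF-16LE.
html = "<html><body>" + "<p>hello world</p>" * 200 + "</body></html>"
for enc in ("utf-8", "utf-16-le"):
    raw = html.encode(enc)
    packed = zlib.compress(raw)
    print(enc, len(raw), len(packed))
```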

The case for UTF-8 saving memory makes a lot of sense if you're writing embedded software; however, in most stacks the amount of memory wasted by UTF-16 is trivial compared to the amount wasted by the GC, bloated libraries, interpreted code, etc.

If you're using .NET or the JVM, char is 16 bits wide anyway. The UTF-8 vs. UTF-16 debate is a perfect example of micro-benchmarking, where theoretically there is a great case for saving a resource, but in aggregate it makes very little difference.


> How exactly does UTF-16 make HTML twice as expensive?

    < \0 h \0 t \0 m \0 l \0 > \0   (every ASCII byte followed by a zero byte in UTF-16LE)

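Concretely, in Python:

```python
# "<html>" in UTF-16LE: each ASCII character is followed by a zero
# byte, so pure-ASCII markup doubles in size.
markup = "<html>"
encoded = markup.encode("utf-16-le")
print(encoded)                 # b'<\x00h\x00t\x00m\x00l\x00>\x00'
print(len(markup), len(encoded))  # 6 12
```
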
> If you're using .NET or the JVM char is 16 bits wide anyway.

Hopefully you don't need to worry about what .NET/the JVM have to do for legacy reasons, and you can use UTF-8 for all input and output.


> If you widen the character even more (...)

If you go to UTF-32 you waste a lot more space in any conceivable situation (extra-BMP characters are less than 0.1% of the text even in CJK languages), and you still don't get fixed-width "characters" in exchange, thanks to combining diacritics and ligatures. Thanks, Unicode Committee!
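The combining-character point in action: even at 4 bytes per code point, one user-perceived character can still span multiple code points.

```python
import unicodedata

# "é" can be one code point (U+00E9) or two: "e" + combining acute
# (U+0301). A fixed-width encoding doesn't make indexing by
# user-perceived character safe.
composed = "é"
decomposed = unicodedata.normalize("NFD", composed)
print(len(composed), len(decomposed))        # 1 2
print(len(decomposed.encode("utf-32-le")))   # 8 bytes for one visible glyph
```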


But as one of the replies in that thread pointed out, if you want to compress strings, use LZ! Switching to UTF-16 won't give you anywhere near as good a size reduction.
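A rough sketch of that comparison, using zlib as the LZ-family codec: for repetitive CJK text, compressed UTF-8 easily beats raw UTF-16.

```python
import zlib

# Raw UTF-16 vs. LZ-compressed UTF-8 for repetitive CJK text.
text = "中文编码测试 " * 100
utf16_raw = len(text.encode("utf-16-le"))
utf8_zipped = len(zlib.compress(text.encode("utf-8")))
print(utf16_raw, utf8_zipped)
```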



