
I guess UTF-8 requires more computation while processing strings. If you work with just 2-byte chars, it (might) work faster.

// upvoted all replies, you're right




This is a common misconception about UTF-8 vs. UTF-16. You're missing two important facts.

1. Most UTF-8 string operations can operate on a byte at a time. You just have to realize that functions like strlen will tell you byte length instead of character length, and this usually doesn't even matter (it's still important to know). See the sketch after this list.

2. UTF-16 is still a variable-width encoding. It was originally intended to be fixed-width, but then the Unicode character set grew too large to be represented in 16 bits.
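
To make point 1 concrete, here is a minimal C sketch (mine, not from the comment above): strlen reports the byte length of a UTF-8 string, and counting code points only requires skipping continuation bytes, so you can still walk the string a byte at a time.

    #include <stdio.h>
    #include <string.h>

    /* Count code points in a UTF-8 string by skipping continuation
     * bytes, which always have the bit pattern 10xxxxxx. */
    static size_t utf8_codepoint_count(const char *s)
    {
        size_t count = 0;
        for (; *s; s++) {
            if (((unsigned char)*s & 0xC0) != 0x80)
                count++;
        }
        return count;
    }

    int main(void)
    {
        const char *s = "caf\xC3\xA9";  /* "café": the 'é' takes two bytes */
        printf("strlen: %zu bytes\n", strlen(s));               /* prints 5 */
        printf("code points: %zu\n", utf8_codepoint_count(s));  /* prints 4 */
        return 0;
    }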


If I have no surrogate-range code points in a string, it is far easier to work with UTF-16 than UTF-8 at the byte level, because all chars are a constant size. For UTF-8 that only applies to ASCII. And surrogate-range characters are extraordinarily rare, while non-ASCII chars are extremely common. So my programs ensure at the entry points that the string is UCS-2 compatible (see the sketch below), and then all subsequent string manipulations are far less complex to handle than with UTF-8.
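
A minimal sketch of the kind of entry-point check described above, assuming the text arrives as an array of 16-bit code units (the names are illustrative, not from the comment):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Return true if no code unit falls in the surrogate range
     * 0xD800-0xDFFF, i.e. the string is UCS-2 compatible and every
     * code point occupies exactly one 16-bit unit. */
    static bool is_ucs2_compatible(const uint16_t *units, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            if (units[i] >= 0xD800 && units[i] <= 0xDFFF)
                return false;  /* surrogate: part of a non-BMP code point */
        }
        return true;
    }

Once that check passes, indexing by code unit is the same as indexing by code point, which is what makes the later manipulations simpler.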


UTF-16 was never intended to be a fixed-width encoding; it was created in order to support characters outside the BMP, which aren't covered by UCS-2.
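
For reference, this is the arithmetic a surrogate pair uses to carry a code point above the BMP (a generic illustration, not anyone's production code):

    #include <stdint.h>
    #include <stdio.h>

    /* Encode a code point in U+10000..U+10FFFF as a UTF-16 surrogate pair. */
    static void encode_surrogate_pair(uint32_t cp, uint16_t out[2])
    {
        cp -= 0x10000;                   /* 20 bits remain */
        out[0] = 0xD800 | (cp >> 10);    /* high surrogate: top 10 bits */
        out[1] = 0xDC00 | (cp & 0x3FF);  /* low surrogate: bottom 10 bits */
    }

    int main(void)
    {
        uint16_t pair[2];
        encode_surrogate_pair(0x1F600, pair);  /* an emoji outside the BMP */
        printf("U+1F600 -> 0x%04X 0x%04X\n", pair[0], pair[1]);  /* 0xD83D 0xDE00 */
        return 0;
    }

So U+1F600 comes out as the pair 0xD83D 0xDE00, which is why a UTF-16 string's length in code units can differ from its length in code points.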


Okay, you're right. I shortened what I was trying to say too much. UTF-16 evolved from UCS-2, so its roots are in a 16-bit, fixed-width encoding.


But UTF-16 is a variable-width encoding, too?



