
I guess UTF-8 requires more computation while processing strings. If you work with just 2-byte chars, it (might) work faster.

// upvoted all replies, you're right




This is a common misconception about UTF-8 vs. UTF-16. You're missing two important facts.

1. Most UTF-8 string operations can operate on a byte at a time. You just have to realize that functions like strlen will tell you byte length instead of character length, and this usually doesn't even matter (it's still important to know). See the sketch after this list.

2. UTF-16 is still a variable-width encoding. It was originally intended to be fixed-width, but then the Unicode character set grew too large to be represented in 16 bits.
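
To make point 1 concrete, here is a minimal C sketch (mine, not from the comment above): strlen reports the byte length of a UTF-8 string, and counting code points only requires skipping continuation bytes, so you can still walk the string a byte at a time.

    #include <stdio.h>
    #include <string.h>

    /* Count code points in a UTF-8 string by skipping continuation
     * bytes, which always have the bit pattern 10xxxxxx. */
    static size_t utf8_codepoint_count(const char *s)
    {
        size_t count = 0;
        for (; *s; s++) {
            if (((unsigned char)*s & 0xC0) != 0x80)
                count++;
        }
        return count;
    }

    int main(void)
    {
        const char *s = "caf\xC3\xA9";  /* "café": the 'é' takes two bytes */
        printf("strlen: %zu bytes\n", strlen(s));               /* prints 5 */
        printf("code points: %zu\n", utf8_codepoint_count(s));  /* prints 4 */
        return 0;
    }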


If I have no surrogate-range code points in a string, it is far easier to work with UTF-16 than UTF-8 at the byte level, because all chars are a constant size. For UTF-8 that only applies to ASCII. And surrogate-range characters are extraordinarily rare, while non-ASCII chars are extremely common. So my programs ensure at the entry points that the string is UCS-2 compatible (see the sketch below), and then all subsequent string manipulations are far less complex to handle than with UTF-8.
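
A minimal sketch of the kind of entry-point check described above, assuming the text arrives as an array of 16-bit code units (the names are illustrative, not from the comment):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Return true if no code unit falls in the surrogate range
     * 0xD800-0xDFFF, i.e. the string is UCS-2 compatible and every
     * code point occupies exactly one 16-bit unit. */
    static bool is_ucs2_compatible(const uint16_t *units, size_t len)
    {
        for (size_t i = 0; i < len; i++) {
            if (units[i] >= 0xD800 && units[i] <= 0xDFFF)
                return false;  /* surrogate: part of a non-BMP code point */
        }
        return true;
    }

Once that check passes, indexing by code unit is the same as indexing by code point, which is what makes the later manipulations simpler.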


UTF-16 was never intended to be a fixed-width encoding; it was created in order to support characters outside the BMP, which aren't covered by UCS-2.
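
For reference, this is the arithmetic a surrogate pair uses to carry a code point above the BMP (a generic illustration, not anyone's production code):

    #include <stdint.h>
    #include <stdio.h>

    /* Encode a code point in U+10000..U+10FFFF as a UTF-16 surrogate pair. */
    static void encode_surrogate_pair(uint32_t cp, uint16_t out[2])
    {
        cp -= 0x10000;                   /* 20 bits remain */
        out[0] = 0xD800 | (cp >> 10);    /* high surrogate: top 10 bits */
        out[1] = 0xDC00 | (cp & 0x3FF);  /* low surrogate: bottom 10 bits */
    }

    int main(void)
    {
        uint16_t pair[2];
        encode_surrogate_pair(0x1F600, pair);  /* an emoji outside the BMP */
        printf("U+1F600 -> 0x%04X 0x%04X\n", pair[0], pair[1]);  /* 0xD83D 0xDE00 */
        return 0;
    }

So U+1F600 comes out as the pair 0xD83D 0xDE00, which is why a UTF-16 string's length in code units can differ from its length in code points.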


Okay, you're right. I shortened what I was trying to say too much. UTF-16 evolved from UCS-2, so its roots are in a 16-bit, fixed-width encoding.


But UTF-16 is a variable-width encoding, too?



