Here's my follow-up: The Absolute Minimum Every Software Developer Must Know About Unicode encodings:

Just use UTF-8.




It's really not that simple.

If you do a lot of string manipulation, you're better off with either UTF-32 or "dumbified" UTF-16 (i.e. fixed-width 16-bit UCS-2); otherwise you'll have to count characters from the beginning of the string every time you need to access the nth character. Moreover, if you deal with text containing a lot of characters between U+0800 and U+FFFF (e.g. East Asian languages), you're much better off with UTF-16, since those characters take 3 bytes in UTF-8 but only 2 in UTF-16, so you save a whole byte per character.
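A rough Rust sketch of the tradeoff I mean (the sample string is just an illustration): indexing by code point in UTF-8 forces a scan from the start, while decoding once to a fixed-width representation buys O(1) indexing at four bytes per character.

    fn main() {
        // "é" takes 2 bytes and "語" takes 3 bytes in UTF-8.
        let s = "héllo 語";

        // Nth code point in UTF-8: a linear scan from the start, O(n).
        let third = s.chars().nth(2);
        println!("{:?}", third); // Some('l')

        // Fixed-width alternative: decode once to a UTF-32-like Vec<char>,
        // then indexing is O(1), at a cost of 4 bytes per character.
        let fixed: Vec<char> = s.chars().collect();
        println!("{:?}", fixed[2]); // 'l'
    }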


How often do you need random access by code point? In every case I can recall, if random access by UTF-8 offset wasn't the right thing (which it usually was), then random access by code point wouldn't be either. Almost all string offsets you'll ever have to deal with come from having a program look at a string, and in that case, you can just use UTF-8 byte offsets. What sort of text processing are you doing where this isn't the case?

As for East Asian text, you have a point: it will usually be shorter in UTF-16 than in UTF-8. Before making this decision, though, ask yourself how much that extra space is worth to you. Is it worth dealing with the possible encoding hassles? (The answer may be yes, but it's a question that should be asked.) Also, a lot of real-world data mixes plenty of ASCII-range characters in with the East Asian text. I did an experiment a while back: I downloaded some random web pages in Chinese, Japanese, Korean, and Farsi and compared their sizes in UTF-8 and UTF-16. Because so much of each document was ASCII HTML markup, all four pages ended up smaller in UTF-8.
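You can check this effect for yourself; here's a rough Rust sketch (the sample strings are invented, not from my experiment):

    fn main() {
        // Pure CJK text: 3 bytes per character in UTF-8, 2 in UTF-16.
        let cjk = "日本語のテキスト";
        // The same text inside ASCII-heavy markup, as on a real web page.
        let html = "<p class=\"body\">日本語のテキスト</p>";

        for s in [cjk, html] {
            let utf8_bytes = s.len();                       // UTF-8 size
            let utf16_bytes = s.encode_utf16().count() * 2; // UTF-16 size
            println!("utf8: {utf8_bytes} bytes, utf16: {utf16_bytes} bytes");
        }
    }

The bare CJK string is smaller in UTF-16, but once the ASCII markup is included the UTF-8 version wins.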


Maybe I'm missing something, but I'm not sure I understand your question. How would you write even a simple parser without being able to randomly access the contents of the string by character index?


You can access the contents randomly; just use byte indexes rather than character indexes. Take a really simple parsing task as an example: splitting tab-delimited strings in UTF-8. First you find the byte indexes of the tabs, then you use those to split out substrings. This is exactly the same code you would use with plain ASCII text, and in fact a lot of programs designed to process ASCII work unmodified with UTF-8.
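Here's a rough sketch in Rust (field contents are made up). The reason this is safe is a design property of UTF-8: no byte of a multi-byte sequence ever falls in the ASCII range, so a 0x09 byte is always a real tab.

    // Split a tab-delimited UTF-8 line using only byte indexes.
    fn split_tabs(line: &str) -> Vec<&str> {
        let bytes = line.as_bytes();
        let mut fields = Vec::new();
        let mut start = 0;
        for (i, &b) in bytes.iter().enumerate() {
            if b == b'\t' {
                fields.push(&line[start..i]); // slice by byte offsets
                start = i + 1;
            }
        }
        fields.push(&line[start..]);
        fields
    }

    fn main() {
        let line = "名前\tage\tcité";
        println!("{:?}", split_tabs(line)); // ["名前", "age", "cité"]
    }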

Another example: for a lex-and-yacc type of parser, you can use regular expressions to split a string into tokens, and then use a parser on that stream-of-tokens representation. None of this requires character indexing; just byte indexing.
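To make that concrete, here's a rough tokenizer sketch. It assumes the third-party regex crate (regex = "1" in Cargo.toml) and an invented token grammar, so treat it as an illustration rather than a recipe:

    use regex::Regex;

    fn main() {
        let src = "prix = 42 + naïveté";
        // One alternation per token class: identifiers, numbers, operators.
        let token = Regex::new(r"[\p{Alphabetic}_]\w*|\d+|[=+\-*/]").unwrap();

        for m in token.find_iter(src) {
            // m.start()/m.end() are byte offsets, not character counts,
            // and that's all the downstream parser ever needs.
            println!("{:>3}..{:<3} {:?}", m.start(), m.end(), m.as_str());
        }
    }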



