
> Size is always a trade-off and there won't be one standard for encoding.

I don't see why, on modern multi-core CPUs/GPUs where 99% of the time is spent idle, we can't just have "raw" text (UCS4) to manipulate in memory, and compressed text (using any standard stream-compression algorithm) on disk/in the DB/over the wire.

Anything that's not UCS4 is already variable-length-encoded, so you lose the ability to random-seek it anyway; and safely performing any complex text manipulation, e.g. upper-casing, requires temporarily converting the text to UCS4 anyway. At that point, you may as well go all the way and serialize it as efficiently as possible, if you're just going to spit it out somewhere else. I guess the only difference is that string-append operations would require un-compressing compressed strings and then re-compressing the result -- but you could defer that as long as necessary using a rope [1].

[1] http://en.wikipedia.org/wiki/Rope_(computer_science)
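
To make that concrete, here's a toy Python sketch of the "raw in memory, compressed everywhere else" split -- the function names are mine and zlib just stands in for whatever stream compressor you'd actually pick:

    import zlib

    def to_wire(text: str) -> bytes:
        """Serialize in-memory text as a compressed byte stream."""
        # UTF-32-LE is the fixed-width "raw" form; zlib stands in for
        # any standard stream compressor (hypothetical helper, my choice).
        return zlib.compress(text.encode("utf-32-le"))

    def from_wire(payload: bytes) -> str:
        """Inflate a compressed payload back into fixed-width text."""
        return zlib.decompress(payload).decode("utf-32-le")

    original = "naïve café"
    payload = to_wire(original)
    assert from_wire(payload) == original
    print(len(original), "chars ->", len(payload), "bytes on the wire")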




The oft-stated advantage of UTF-32/UCS4 is that you can do random access. But random character access is almost entirely useless for real text processing tasks. (You can still do random byte access for UTF-8 text, and if your regexp engine spits out byte offsets, you're fine.)
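
A quick Python illustration of the byte-offset point (a throwaway example of mine, nothing standard about it): run the regexp over the raw UTF-8 bytes and slice by byte offset, no character indexing needed.

    import re

    data = "prix: 42€, größe: 7".encode("utf-8")  # UTF-8 bytes, not str
    for m in re.finditer(rb"\d+", data):
        start, end = m.span()          # byte offsets into the UTF-8 buffer
        print(start, end, data[start:end].decode("utf-8"))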

Even when you're doing something "simple" like upcasing/downcasing, the advantages of UTF-32 are not great. You are still converting variable-length sequences to other variable-length sequences -- e.g., eszett (ß) upcases to SS.
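
For instance, in Python (just a quick demo of the eszett case):

    s = "straße"
    print(s.upper())                # STRASSE -- 'ß' upcases to 'SS'
    print(len(s), len(s.upper()))   # 6 code points -> 7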

Now the final piece to this is that for some language implementations, compilation times are dominated by lexical analysis. Sometimes significant speed gains can be had by dealing with UTF-8 directly rather than UTF-32, because the in-memory and on-disk representations are identical, and memory bandwidth affects parsing performance. This doesn't matter for most people, but it matters to the Clang developers, for example. Additional system-wide speed gains come from reducing memory pressure.
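
A toy illustration of why lexing UTF-8 directly works (my own sketch, obviously nothing like Clang's actual lexer): in UTF-8, every byte below 0x80 is a standalone ASCII character, so a lexer can split on ASCII delimiters without ever decoding.

    # Split a UTF-8 byte string on ASCII whitespace without decoding it.
    source = "let π = 3.14159 # ratio".encode("utf-8")

    tokens, start = [], 0
    for i, b in enumerate(source + b" "):
        if b in b" \t\n":              # ASCII bytes never appear inside
            if i > start:              # a multi-byte sequence
                tokens.append(source[start:i])
            start = i + 1
    print([t.decode("utf-8") for t in tokens])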

Sure, we have plenty of memory and processor power these days. But simpler code isn't always worth 3-4x the memory usage.

Text is not simple.


> I don't see why, in modern multi-core CPUs/GPUs where 99% of time is spent idle

We have three levels of cache and hyperthreaded cores because memory access is so ridiculously slow compared to the CPU. Quadrupling the amount of data that goes through this bottleneck isn't going to help.
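
For mostly-ASCII text the factor really is close to 4x -- quick Python check (my own throwaway example; the UTF-32 figure skips the BOM by using the LE variant):

    s = "The quick brown fox jumps über the lazy dog." * 100
    print(len(s.encode("utf-8")))      # ~1 byte per char for ASCII-heavy text
    print(len(s.encode("utf-32-le")))  # exactly 4 bytes per code point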

> Anything that's not UCS4 is already variable-length-encoded

You can't access the n-th character in UCS4 anyway, because Unicode has combining characters (e.g. ü may be u + a combining ¨).
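
Quick Python demo of that (my example, not the parent's):

    import unicodedata

    composed   = "\u00fc"     # ü as a single code point
    decomposed = "u\u0308"    # u followed by COMBINING DIAERESIS
    print(len(composed), len(decomposed))                        # 1 vs 2
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True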



