
> Size is always a trade-off and there won't be one standard for encoding.

I don't see why, on modern multi-core CPUs/GPUs where 99% of the time is spent idle, we can't just have "raw" text (UCS4) to manipulate in memory, and compressed text (using any standard stream-compression algorithm) on disk/in the DB/over the wire.

Anything that's not UCS4 is already variable-length-encoded, so you lose the ability to random-seek it anyway; and safely performing any complex text manipulation, e.g. upper-casing, requires temporarily converting the text to UCS4 anyway. At that point, you may as well go all the way and serialize it as efficiently as possible, if you're just going to spit it out somewhere else. I guess the only difference is that string-append operations would require un-compressing compressed strings and then re-compressing the result -- but you could defer that as long as necessary using a rope [1].

[1] http://en.wikipedia.org/wiki/Rope_(computer_science)
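
To make that concrete, here's a toy Python sketch of the "raw in memory, compressed everywhere else" split -- the function names are mine and zlib just stands in for whatever stream compressor you'd actually pick:

    import zlib

    def to_wire(text: str) -> bytes:
        """Serialize in-memory text as a compressed byte stream."""
        # UTF-32-LE is the fixed-width "raw" form; zlib stands in for
        # any standard stream compressor (hypothetical helper, my choice).
        return zlib.compress(text.encode("utf-32-le"))

    def from_wire(payload: bytes) -> str:
        """Inflate a compressed payload back into fixed-width text."""
        return zlib.decompress(payload).decode("utf-32-le")

    original = "naïve café"
    payload = to_wire(original)
    assert from_wire(payload) == original
    print(len(original), "chars ->", len(payload), "bytes on the wire")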




The oft-stated advantage of UTF-32/UCS4 is that you can do random access. But random character access is almost entirely useless for real text processing tasks. (You can still do random byte access for UTF-8 text, and if your regexp engine spits out byte offsets, you're fine.)
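
A quick Python illustration of the byte-offset point (a throwaway example of mine, nothing standard about it): run the regexp over the raw UTF-8 bytes and slice by byte offset, no character indexing needed.

    import re

    data = "prix: 42€, größe: 7".encode("utf-8")  # UTF-8 bytes, not str
    for m in re.finditer(rb"\d+", data):
        start, end = m.span()          # byte offsets into the UTF-8 buffer
        print(start, end, data[start:end].decode("utf-8"))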

Even when you're doing something "simple" like upcasing/downcasing, the advantages of UTF-32 are not great. You are still converting variable-length sequences to other variable-length sequences -- e.g., eszett (ß) upcases to SS.
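
For instance, in Python (just a quick demo of the eszett case):

    s = "straße"
    print(s.upper())                # STRASSE -- 'ß' upcases to 'SS'
    print(len(s), len(s.upper()))   # 6 code points -> 7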

Now the final piece to this is that for some language implementations, compilation times are dominated by lexical analysis. Sometimes significant speed gains can be had by dealing with UTF-8 directly rather than UTF-32, because the in-memory and on-disk representations are identical, and memory bandwidth affects parsing performance. This doesn't matter for most people, but it matters to the Clang developers, for example. Additional system-wide speed gains come from reducing memory pressure.
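
A toy illustration of why lexing UTF-8 directly works (my own sketch, obviously nothing like Clang's actual lexer): in UTF-8, every byte below 0x80 is a standalone ASCII character, so a lexer can split on ASCII delimiters without ever decoding.

    # Split a UTF-8 byte string on ASCII whitespace without decoding it.
    source = "let π = 3.14159 # ratio".encode("utf-8")

    tokens, start = [], 0
    for i, b in enumerate(source + b" "):
        if b in b" \t\n":              # ASCII bytes never appear inside
            if i > start:              # a multi-byte sequence
                tokens.append(source[start:i])
            start = i + 1
    print([t.decode("utf-8") for t in tokens])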

Sure, we have plenty of memory and processor power these days. But simpler code isn't always worth 3-4x the memory usage.

Text is not simple.


> I don't see why, in modern multi-core CPUs/GPUs where 99% of time is spent idle

We have three levels of cache and hyperthreaded cores because memory access is so ridiculously slow compared to the CPU. Quadrupling the amount of data that goes through this bottleneck isn't going to help.
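
For mostly-ASCII text the factor really is close to 4x -- quick Python check (my own throwaway example; the UTF-32 figure skips the BOM by using the LE variant):

    s = "The quick brown fox jumps über the lazy dog." * 100
    print(len(s.encode("utf-8")))      # ~1 byte per char for ASCII-heavy text
    print(len(s.encode("utf-32-le")))  # exactly 4 bytes per code point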

> Anything that's not UCS4 is already variable-length-encoded

You can't access the n-th character in UCS4 anyway, because Unicode has combining characters (e.g. ü may be u + a combining ¨).
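
Quick Python demo of that (my example, not the parent's):

    import unicodedata

    composed   = "\u00fc"     # ü as a single code point
    decomposed = "u\u0308"    # u followed by COMBINING DIAERESIS
    print(len(composed), len(decomposed))                        # 1 vs 2
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True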



