
Pretty ugly bug. Where did it happen? Anyway, congrats & long life!



Apparently this is in the code that decomposes strings so that they can be compared (necessary with Hebrew and Arabic seemingly?).

This is an interesting snippet from that code...:

        /* decompose the character if necessary, into 'base' characters
        * because I don't care about Arabic, I will hard-code the Hebrew
        * which I *do* care about! So sue me... */
        if (c1 != c2 && (!ireg_ic || utf_fold(c1) != utf_fold(c2)))
        {
            /* decomposition necessary? */
            mb_decompose(c1, &c11, &junk, &junk);
            mb_decompose(c2, &c12, &junk, &junk);
            c1 = c11;
            c2 = c12;
            if (c11 != c12 && (!ireg_ic || utf_fold(c11) != utf_fold(c12)))
                break;
        }
Apparently string comparison is harder than I previously thought.


String comparison with Unicode is pretty astonishingly complex, partially because equality is not as well defined as it seems to be on the surface. Should e and é be equal? If you're dealing with user input from people who are unlikely to know how to type é, then they probably should, but in many cases they shouldn't. A more complex case is é and é (precomposed vs decomposed forms), which nearly always should be equal, but a simple byte comparison will say they're different.
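To make the precomposed vs decomposed case concrete, here is a rough C sketch. The three-entry table and the names `decompose` and `equal_decomposed` are made up for illustration; this is not Vim's `mb_decompose` and not ICU, and it ignores canonical reordering of multiple combining marks, which real normalization has to handle.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, deliberately tiny decomposition table. A real
 * implementation would use the full Unicode data (for example via ICU). */
struct decomp { uint32_t composed, base, mark; };

static const struct decomp table[] = {
    { 0x00E9, 0x0065, 0x0301 }, /* e-acute -> e + COMBINING ACUTE ACCENT */
    { 0x00E8, 0x0065, 0x0300 }, /* e-grave -> e + COMBINING GRAVE ACCENT */
    { 0x00FC, 0x0075, 0x0308 }, /* u-diaeresis -> u + COMBINING DIAERESIS */
};

/* Expand src (n code points) into dst in decomposed form.
 * Returns the number of code points written, or (size_t)-1 if dst
 * (capacity cap) is too small. */
size_t decompose(const uint32_t *src, size_t n, uint32_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);
    size_t out = 0;
    for (size_t i = 0; i < n; i++) {
        const struct decomp *d = NULL;
        for (size_t j = 0; j < sizeof table / sizeof table[0]; j++) {
            if (table[j].composed == src[i]) {
                d = &table[j];
                break;
            }
        }
        size_t need = d ? 2 : 1;
        if (out + need > cap)
            return (size_t)-1;
        if (d) {
            dst[out++] = d->base;
            dst[out++] = d->mark;
        } else {
            dst[out++] = src[i];
        }
    }
    return out;
}

/* Returns 1 if the two code point sequences are equal once both are
 * decomposed, 0 otherwise (including when a buffer would overflow). */
int equal_decomposed(const uint32_t *a, size_t na,
                     const uint32_t *b, size_t nb)
{
    uint32_t da[64], db[64];
    size_t la = decompose(a, na, da, 64);
    size_t lb = decompose(b, nb, db, 64);
    if (la == (size_t)-1 || lb == (size_t)-1)
        return 0;
    return la == lb && memcmp(da, db, la * sizeof da[0]) == 0;
}
```

With this, U+00E9 and the pair U+0065 U+0301 compare equal, while their UTF-8 encodings ("\xC3\xA9" and "e\xCC\x81") differ under `strcmp`. Plain e (U+0065) still compares unequal to é, which is the separate policy question of whether accents should be ignored.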

Fortunately, there are ICU bindings for every non-toy language, which solve these sorts of problems for you (although ICU has the drawback of being absolutely huge).


There are a lot of things I have seen in Unicode that seem like they should not exist in the first place. MATHEMATICAL [BOLD|SANS-SERIF|DOUBLE-STRUCK|MONOSPACE] DIGIT for example... I guess those things potentially carry significant meaning in some mathematics texts though.

I guess the ICU stuff probably gives you an strtol equivalent that can handle that sort of stuff.
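For a feel of what such an strtol equivalent has to do, here is a hand-rolled C sketch. It is not ICU's API (I believe ICU exposes per-character digit values through `u_charDigitValue`, but nothing below uses it); `digit_value` and `parse_digits` are hypothetical names, and only three digit ranges are covered: ASCII, Arabic-Indic, and the fifty MATHEMATICAL digits at U+1D7CE..U+1D7FF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 0-9 for the code points this sketch recognizes as decimal
 * digits, or -1 for anything else. Many other digit ranges exist. */
int digit_value(uint32_t cp)
{
    if (cp >= 0x0030 && cp <= 0x0039)    /* ASCII 0-9 */
        return (int)(cp - 0x0030);
    if (cp >= 0x0660 && cp <= 0x0669)    /* ARABIC-INDIC DIGIT ZERO..NINE */
        return (int)(cp - 0x0660);
    if (cp >= 0x1D7CE && cp <= 0x1D7FF)  /* MATHEMATICAL digits: 5 styles x 10 */
        return (int)((cp - 0x1D7CE) % 10);
    return -1;
}

/* strtol-like helper: parses the leading run of decimal digits in s
 * (n code points) and stores how many code points were consumed.
 * No sign, whitespace, or overflow handling in this sketch. */
long parse_digits(const uint32_t *s, size_t n, size_t *consumed)
{
    assert(s != NULL || n == 0);
    long value = 0;
    size_t i = 0;
    for (; i < n; i++) {
        int d = digit_value(s[i]);
        if (d < 0)
            break;
        value = value * 10 + d;
    }
    if (consumed)
        *consumed = i;
    return value;
}
```

Feeding it MATHEMATICAL BOLD DIGIT FOUR (U+1D7D2), MATHEMATICAL DOUBLE-STRUCK DIGIT ZERO (U+1D7D8), an ASCII 2, and then the letter x gives 402 with three code points consumed. Whether mixing digit styles in one number should be accepted at all is a policy decision this sketch does not try to make.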


The LibreOffice guys have told me that ICU has security concerns and is, for all intents and purposes, no longer being developed. They are switching to another engine (hard buzz? Name escapes me).

Anyone know if this is true?


HarfBuzz and ICU are very different things. HarfBuzz is a small library for text shaping (basically, putting font glyphs together to form words, which can be quite complex for some scripts). ICU on the other hand can do pretty much everything that's vaguely related to internationalization. It's quite possible that LO was only using it for text shaping of course.


HarfBuzz [1] is for text shaping/layout, not Unicode support.

1. http://www.freedesktop.org/wiki/Software/HarfBuzz/


Drat it. iPad "corrected" my spelling and I didn't notice.



