Pretty ugly bug. Where did it happen? Anyway, congrats & long life!

jlgreco · on May 21, 2013

Apparently this is in the code that decomposes strings so that they can be compared (seemingly necessary for Hebrew and Arabic?).

This is an interesting snippet from that code...:

        /* decompose the character if necessary, into 'base' characters
        * because I don't care about Arabic, I will hard-code the Hebrew
        * which I *do* care about! So sue me... */
        if (c1 != c2 && (!ireg_ic || utf_fold(c1) != utf_fold(c2)))
        {
            /* decomposition necessary? */
            mb_decompose(c1, &c11, &junk, &junk);
            mb_decompose(c2, &c12, &junk, &junk);
            c1 = c11;
            c2 = c12;
            if (c11 != c12 && (!ireg_ic || utf_fold(c11) != utf_fold(c12)))
                break;
        }

Apparently string comparison is harder than I previously thought.

plorkyeran · on May 21, 2013

String comparison with Unicode is pretty astonishingly complex, partially because equality is not as well defined as it seems to be on the surface. Should e and é be equal? If you're dealing with user input from people who are unlikely to know how to type é, then they probably should, but in many cases they shouldn't. A more complex case is é and é (precomposed vs decomposed forms), which nearly always should be equal, but a simple byte comparison will say they're different.
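To make the two cases concrete, here is a minimal sketch using Python's standard `unicodedata` module (not ICU). The helper names `canonically_equal` and `strip_accents` are made up for illustration, and dropping combining marks is only a rough stand-in for "treat e and é as equal"; real accent-insensitive matching is locale-dependent.

```python
import unicodedata

precomposed = "\u00e9"   # é as one code point (LATIN SMALL LETTER E WITH ACUTE)
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def strip_accents(s):
    """Decompose to NFD and drop combining marks (hypothetical, crude helper)."""
    nfd = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))


# A plain comparison sees two different code point sequences.
print(precomposed == decomposed)
# After normalization the two forms compare equal.
print(canonically_equal(precomposed, decomposed))
# Accent-insensitive comparison: only do this when the use case calls for it.
print(strip_accents("caf\u00e9") == "cafe")
```

Normalizing both sides to the same form (NFC or NFD) handles the "nearly always should be equal" case; stripping marks is the opt-in, lossy step for the e vs. é case.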

Fortunately, there are ICU bindings for every non-toy language that solve these sorts of problems for you (although ICU has the drawback of being absolutely huge).

jlgreco · on May 21, 2013

There are a lot of things I have seen in Unicode that seem like they should not exist in the first place. MATHEMATICAL [BOLD|SANS-SERIF|DOUBLE-STRUCK|MONOSPACE] DIGIT for example... I guess those things potentially carry significant meaning in some mathematics texts though.

I guess the ICU stuff probably gives you an strtol equivalent that can handle that sort of stuff.
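I haven't checked what ICU itself offers here, so as a rough illustration of the idea, this is a sketch with Python's standard `unicodedata` module instead. `parse_unicode_int` is a hypothetical helper, not a library function; it relies on the MATHEMATICAL ... DIGIT characters carrying a decimal digit value in the Unicode database.

```python
import unicodedata


def parse_unicode_int(text):
    """Parse a non-negative integer from any Unicode decimal digits (hypothetical helper)."""
    if not text:
        raise ValueError("empty string")
    value = 0
    for ch in text:
        # unicodedata.decimal raises ValueError for characters with no decimal value.
        value = value * 10 + unicodedata.decimal(ch)
    return value


bold_42 = "\U0001D7D2\U0001D7D0"   # MATHEMATICAL BOLD DIGIT FOUR, then TWO
mono_7 = "\U0001D7FD"              # MATHEMATICAL MONOSPACE DIGIT SEVEN

print(unicodedata.name(bold_42[0]))
print(parse_unicode_int(bold_42))
print(parse_unicode_int(mono_7))
# Python's built-in int() also accepts these digits directly.
print(int(bold_42))
```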

chris_wot · on May 22, 2013

The LibreOffice guys have told me that ICU has security concerns and is, for all intents and purposes, no longer being developed. They are switching to another engine (hard buzz? Name escapes me).

Anyone know if this is true?

pdw · on May 22, 2013

HarfBuzz and ICU are very different things. HarfBuzz is a small library for text shaping (basically, putting font glyphs together to form words, which can be quite complex for some scripts). ICU on the other hand can do pretty much everything that's vaguely related to internationalization. It's quite possible that LO was only using it for text shaping of course.

lucian1900 · on May 22, 2013

HarfBuzz [1] is for text shaping/layout, not Unicode support.

1. http://www.freedesktop.org/wiki/Software/HarfBuzz/

chris_wot · on May 22, 2013

Drat it. iPad "corrected" my spelling and I didn't notice.