
Don't do this. Use a language (like C#) or library (like libunistring) that can do grapheme cluster segmentation. In .NET it's StringInfo.GetTextElementEnumerator(). In libunistring it's u8_grapheme_breaks(). In ICU4C it's icu::BreakIterator::createCharacterInstance(). In Ruby it's each_grapheme_cluster(). Other ecosystems with rich Unicode support should have similar functionality.



I pasted your comment into GPT-4o and asked for the Python equivalent. It suggested this, which seems to work well:

    import regex as re
    
    def grapheme_clusters(text):
        # \X is the regex pattern that matches a grapheme cluster
        pattern = re.compile(r'\X')
        return [
            match.group(0)
            for match in
            pattern.finditer(text)
        ]
https://chatgpt.com/share/481c9c94-0431-4fcb-82aa-a44a4f3c21...


Note that regex is not the re module from the stdlib; it's a separate third-party module that directly exposes more powerful PCRE-style features such as grapheme cluster matching.


That's a good callout. Here are the docs for \X in that regex module: https://github.com/mrabarnett/mrab-regex?tab=readme-ov-file#...


That’s fine unless you are a language or library creator, in which case doing it properly can’t be deferred to someone else. Porting someone else’s correct implementation may be good enough, but someone somewhere has to implement this. Unless the people who do that kind of work share their knowledge and experience, it will stay esoteric and locked away. Most of us are not those people, but some are.


As it turns out, I am writing my own language, and my language supports grapheme cluster segmentation. I just used libunistring (and before that, I used ICU). TFA is not doing this correctly at all; the Unicode specification provides the rules for grapheme cluster segmentation if you wish to implement it yourself[0]. There's nothing to be learned from TFA's hacky and fundamentally incorrect approach. OP's technique will freely split off combining code points that need to stay attached to their base character.

[0] https://unicode.org/reports/tr29/
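
To illustrate the failure mode with combining code points, here is a rough Python sketch that keeps a combining mark attached to the code point before it. The name naive_clusters is made up for this example, and this is emphatically not an implementation of the rules in [0]: it only looks at the canonical combining class from the stdlib unicodedata module and ignores Hangul syllables, emoji ZWJ sequences, regional indicators, CRLF and everything else the spec covers.

```python
import unicodedata


def naive_clusters(text: str) -> list[str]:
    """Group each code point with the combining marks that follow it.

    NOT a UAX #29 implementation (hypothetical helper for illustration):
    it only checks the canonical combining class of each code point.
    """
    clusters: list[str] = []
    for ch in text:
        # combining() returns a nonzero canonical combining class for marks
        # such as U+0301; attach those to the previous cluster if there is one.
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "e" + U+0301 COMBINING ACUTE ACCENT, then "a"
word = "e\u0301a"
print(len(word))                  # 3 code points
print(len(naive_clusters(word)))  # 2 groups: the accent stays with the "e"
```

Splitting word per code point would separate the accent from the "e", which is the kind of breakage described above. For anything real, use one of the libraries mentioned in this thread.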


Hi, I'm one of the people who are library authors in this area.

This article is very specific to Perl, and its approach is also questionable: it does not look efficient.

You would be better off reading the excellent Wikipedia page on UTF-8: https://en.wikipedia.org/wiki/UTF-8

Now, extended grapheme cluster enumeration is much more complex than finding (or counting) the next non-continuation byte. To do it correctly you will ultimately end up reading the official spec at unicode.org and perusing reference implementations such as ICU (which is painful to read) or the standard libraries and popular packages for Rust/Java/C#/Swift (the decent ones I'm aware of; do not look at C++).
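
For reference, the easy part (counting non-continuation bytes) can be sketched in a few lines of Python. The function name count_code_points is made up for this example, and it assumes the input is already valid UTF-8. A byte is a continuation byte when its top two bits are 10.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting non-continuation bytes.

    Hypothetical helper for illustration; does no validation of the input.
    """
    # Continuation bytes look like 0b10xxxxxx; every other byte starts a code point.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived character.
encoded = "e\u0301".encode("utf-8")

print(len(encoded))                # 3 bytes
print(count_code_points(encoded))  # 2 code points
```

This gives you code points, not grapheme clusters: the string above is 3 bytes and 2 code points, but a reader sees a single character, and bridging that last gap is where the spec comes in.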


"I had a problem, and here's my working solution for my specific case."

-"Don't do this. Instead use a completely different programming language."


In Perl (OP's chosen language) you can use the Unicode::Util package. That's why I was pretty clear that you can use a different language or a different library. This seems to be a pretty uncharitable reading of my post. Use the right tool for the job.



