Don't do this. Use a language (like C#) or library (like libunistring) that can do grapheme cluster segmentation. In .NET it's StringInfo.GetTextElementEnumerator(). In libunistring it's u8_grapheme_breaks(). In ICU4C it's icu::BreakIterator::createCharacterInstance(). In Ruby it's each_grapheme_cluster(). Other ecosystems with rich Unicode support should have similar functionality.
I pasted your comment into GPT-4o and asked for the Python equivalent. It suggested this, which seems to work well:
    import regex as re

    def grapheme_clusters(text):
        # \X is the regex pattern that matches a grapheme cluster
        pattern = re.compile(r'\X')
        return [
            match.group(0)
            for match in
            pattern.finditer(text)
        ]
Note that regex is not the re module from the stdlib; it's a separate third-party module that supports more powerful features directly, such as matching a grapheme cluster with \X.
That’s fine unless you are a language or library creator, in which case knowing how to do it properly can’t be deferred to someone else. Porting someone else’s correct implementation may be good enough, but someone somewhere has to implement this. Unless the people who do that kind of work share their knowledge and experience, it will always remain esoteric. Most of us are not those people, but some are.
As it turns out, I am writing my own language, and my language supports grapheme cluster segmentation. I just used libunistring (and before that, I used ICU). TFA is not doing this correctly at all; the Unicode specification provides the rules for grapheme cluster segmentation if you wish to implement it yourself[0]. There's nothing to be learned from TFA's hacky and fundamentally incorrect approach. OP's technique will freely chop combining code points that needed to be kept.
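To make the "chops combining code points" failure concrete, here is a minimal sketch using only Python's stdlib. The helper names (truncate_code_points, ends_mid_cluster) are made up for illustration, and the check relies on unicodedata.combining(), which only detects characters with a nonzero canonical combining class. That is a far narrower test than the real UAX #29 rules (it misses ZWJ sequences, variation selectors, emoji modifiers and more), so treat it as a demonstration of the bug, not as a segmentation algorithm.

```python
import unicodedata


def truncate_code_points(text: str, limit: int) -> str:
    """Naive truncation: keep the first `limit` code points."""
    return text[:limit]


def ends_mid_cluster(original: str, limit: int) -> bool:
    """Return True if cutting `original` at `limit` code points lands
    directly before a combining mark, i.e. the cut separates that mark
    from its base character.

    Assumption: only marks with a nonzero canonical combining class are
    detected; this is NOT the full grapheme cluster boundary test.
    """
    return limit < len(original) and unicodedata.combining(original[limit]) != 0


# "cafés" spelled with a decomposed accent: 'e' + U+0301 COMBINING ACUTE ACCENT
word = "cafe\u0301s"

cut = truncate_code_points(word, 4)      # drops the accent, leaving a bare 'e'
split_accent = ends_mid_cluster(word, 4)  # cut falls between 'e' and U+0301
clean_cut = ends_mid_cluster(word, 5)     # cut falls after the accent
```

A cut at 4 code points silently turns the "é" into "e", which is exactly the kind of damage a grapheme-aware API avoids.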
Now, extended grapheme cluster enumeration is much more complex than finding the next non-continuation byte (or counting such bytes), but to do it correctly you would ultimately end up reading the official spec at unicode.org and perusing reference implementations such as ICU (which is painful to read) or the standard libraries and popular packages for Rust/Java/C#/Swift (the decent ones I'm aware of; do not look at C++).
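For comparison, the "non-continuation byte" technique mentioned above can be sketched in a few lines of Python. The function names are illustrative, and the code assumes the input is already valid UTF-8. It finds code point boundaries only, which is the easy part; it says nothing about grapheme clusters.

```python
def is_continuation(byte: int) -> bool:
    """UTF-8 continuation bytes have the bit pattern 0b10xxxxxx."""
    return (byte & 0xC0) == 0x80


def count_code_points(data: bytes) -> int:
    """Count code points by counting bytes that are not continuation bytes.

    Assumes `data` is valid UTF-8; no validation is performed.
    """
    return sum(1 for b in data if not is_continuation(b))


def next_boundary(data: bytes, i: int) -> int:
    """Return the index of the next code point start after index `i`,
    or len(data) if there is none."""
    i += 1
    while i < len(data) and is_continuation(data[i]):
        i += 1
    return i


decomposed = "e\u0301".encode("utf-8")        # 'e' + combining acute accent
thumbs = "\U0001F44D\U0001F3FD".encode("utf-8")  # thumbs up + skin tone modifier

decomposed_count = count_code_points(decomposed)
thumbs_count = count_code_points(thumbs)
```

Both inputs display as a single user-perceived character, yet each counts as two code points, which is why byte-level or code-point-level scanning cannot stand in for the segmentation rules in the spec.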
In Perl (OP's chosen language) you can use the Unicode::Util package. That's why I was pretty clear that you can use a different language or a different library. This seems to be a pretty uncharitable reading of my post. Use the right tool for the job.