This is most definitely not a solved problem, because graphemes (visual symbols) are a poor way to deal with unicode in the real world.
How do you think text editing controls work? Your cursor moves one grapheme cluster at a time, selections start and end at grapheme cluster boundaries, and pressing backspace once deletes the last grapheme cluster even if it took you several keystrokes to enter. Grapheme clusters are obviously useful and certainly not a poor way to deal with Unicode in the real world.
Sure, grapheme clusters are neither the most common way to talk about strings, nor are they the most useful one in all situations, but nobody claimed that. If you have to allocate storage, you of course use the size in bytes after encoding. If you translate between encodings, you may want to look at code points. The right tool for the job, and sometimes the right tool is grapheme clusters.
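To make the "right tool for the job" point concrete, here is a small Python sketch that measures the same string three ways. The helper `approx_grapheme_clusters` is a hypothetical name of mine, and it is only a rough approximation: it glues combining marks onto the preceding base character and does not implement the full Unicode segmentation rules (no emoji sequences, no Hangul, and so on).

```python
import unicodedata

def approx_grapheme_clusters(text):
    """Rough approximation (assumption, not UAX #29): attach each
    combining mark (category Mn, Mc, Me) to the preceding base character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Mc", "Me"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

# 'e' followed by U+0301 COMBINING ACUTE ACCENT, then 'a'
text = "e\u0301a"

size_in_bytes = len(text.encode("utf-8"))   # what you allocate storage for
code_points = len(text)                     # what you see when transcoding
clusters = approx_grapheme_clusters(text)   # what a cursor steps over

print(size_in_bytes, code_points, len(clusters))
```

All three numbers differ for this string, and which one you want depends on whether you are allocating, transcoding, or moving a cursor.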
There isn't any such thing as "characters" in code.
Sure, there is. Actually, characters exist only in code; they are not used in any field dealing with written language besides computing. A character is the smallest unit of text a computer system can address.
Backspace is typically not one grapheme at a time, though it is for emoji. For scripts such as Arabic, it typically deletes ḥarakāt when they are composed on top of a base character. For a bit more discussion of how I hope to handle this in xi-editor, as well as links to the logic in Android, see https://github.com/google/xi-editor/issues/159
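As a rough illustration of the difference, here is a Python sketch contrasting a backspace that removes one code point with one that removes a whole base-plus-marks cluster. Both function names are hypothetical, the cluster detection is a simplification based on combining-mark categories, and this is not the logic used by Android or xi-editor.

```python
import unicodedata

MARK_CATEGORIES = ("Mn", "Mc", "Me")

def backspace_code_point(text):
    """Delete only the last code point, e.g. a trailing vowel mark."""
    return text[:-1]

def backspace_cluster(text):
    """Delete the last base character together with any combining marks
    stacked on it (simplified; ignores emoji sequences and other rules)."""
    end = len(text)
    # Skip backwards over trailing combining marks.
    while end > 0 and unicodedata.category(text[end - 1]) in MARK_CATEGORIES:
        end -= 1
    # Then drop the base character itself, if there is one.
    return text[:max(end - 1, 0)]

# ARABIC LETTER BEH (U+0628) followed by ARABIC FATHA (U+064E)
beh_with_fatha = "\u0628\u064e"

after_code_point = backspace_code_point(beh_with_fatha)  # base letter survives
after_cluster = backspace_cluster(beh_with_fatha)        # everything is gone
```

With the first variant the user can fix a mistyped mark without retyping the letter, which is the behavior described above for Arabic.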
Clarification: it is for some emoji, e.g. backspace on a family emoji will eliminate family members one by one (on most browsers and platforms, afaict). But flag emoji will be deleted as a group. IIRC handling of multi-code-point profession emoji is inconsistent.
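For anyone wondering why the two cases can behave differently: a family emoji is several person emoji joined by ZERO WIDTH JOINER (U+200D), while a flag is a pair of regional indicator code points. The Python sketch below only mimics the observed behavior under that assumption; `backspace_zwj_sequence` is a hypothetical helper, not what any browser or platform actually runs.

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

def backspace_zwj_sequence(text):
    """Drop the last ZWJ-joined member (and its joiner), mimicking
    'family members disappear one by one'. Illustrative only."""
    members = text.split(ZWJ)
    return ZWJ.join(members[:-1])

# MAN + ZWJ + WOMAN + ZWJ + GIRL: five code points, one visible glyph
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467"
couple = backspace_zwj_sequence(family)  # MAN + ZWJ + WOMAN

# Regional indicators D + E: two code points, one flag. Removing just
# one would leave a lone indicator, so deleting the pair makes sense.
flag = "\U0001F1E9\U0001F1EA"
flag_code_points = len(flag)
```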