Iterating grapheme clusters is OK, as long as you know that iterating grapheme clusters is not the same as iterating unicode scalars (codepoints), which are the fundamental unit of textual parsing grammars.
This is something that really bugs me about how Swift changed its mind and made the String type a Collection of Characters (i.e. grapheme clusters). Originally they recognized this issue and required you to write `str.characters` to work with the grapheme clusters as a collection (and String itself wasn't a collection at all), but then in Swift 3 (I think) they changed course and said String is a collection after all. And the problem is now people work with Characters without even thinking about it when they really should be working with unicode scalars.
In my personal experience, I only ever actually want to work with grapheme clusters when I'm doing something relating to user text editing (for example, if the user hits delete with an empty selection, I want to delete the last grapheme cluster). Most of my string manipulation wants to operate on scalars instead.
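To make the difference concrete, here's a minimal Swift sketch using a decomposed é (e followed by U+0301):

```swift
// Minimal sketch: one grapheme cluster, two unicode scalars.
let s = "e\u{0301}"  // decomposed é: e + U+0301 COMBINING ACUTE ACCENT

print(s.count)                 // 1 -- String is a Collection of Characters
print(s.unicodeScalars.count)  // 2 -- the scalar view sees both codepoints

// Deleting "the last character" removes the whole cluster:
var t = s
t.removeLast()  // t is now "" -- both scalars are gone
```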
The rules for what you want to do on backspace are complex - you want to delete the grapheme cluster if it's an emoji or ideograph with variation selector, but if it's a combining mark, most of the time you want to just delete that. One place this is written down is [1].
Of course, this might sound like a nitpick, but it only confirms the actual point you were making: that treating text as a sequence of grapheme clusters is often, but not always, the right way to view the problem.
If you're talking about cursor motion when hitting an arrow key, then yeah, grapheme cluster.
macOS and iOS delete the entire grapheme cluster on backspace, not just the combining mark (which is to say, backspace with no selection is identical to shift-left to select the previous character and then hitting backspace).
Not sure which scripts you had in mind, but this is not true in general. If I type something like किमपि (“kimapi”) and hit backspace, it turns into किमप (“kimapa”). That is, the following sequence of codepoints:
U+0915 DEVANAGARI LETTER KA
U+093F DEVANAGARI VOWEL SIGN I
U+092E DEVANAGARI LETTER MA
U+092A DEVANAGARI LETTER PA
U+093F DEVANAGARI VOWEL SIGN I
made of three grapheme clusters (containing 2, 1, and 2 codepoints respectively), turns after a single backspace into the following sequence:
U+0915 DEVANAGARI LETTER KA
U+093F DEVANAGARI VOWEL SIGN I
U+092E DEVANAGARI LETTER MA
U+092A DEVANAGARI LETTER PA
This is what I expect/find intuitive, too, as a user. Similarly, अन्यच्च is made of 3 grapheme clusters, but you hit backspace 7 times to delete it (though there I'd have slightly preferred अन्यच्च→अन्यच्→अन्य→अन्→अ over the अन्यच्च→अन्यच्→अन्यच→अन्य→अन्→अन→अ that one actually sees; still, one can live with it).
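Incidentally, those counts are easy to check in Swift, whose `Character` type is an extended grapheme cluster; a quick sketch:

```swift
// Quick check of the counts above:
let word = "\u{0915}\u{093F}\u{092E}\u{092A}\u{093F}"  // किमपि

print(word.count)                 // 3 grapheme clusters: कि, म, पि
print(word.unicodeScalars.count)  // 5 codepoints
```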
Looks like you're right. I don't have experience with languages like this one. I was thinking more of things like é (e followed by U+301), or 🇦🇧 (which is two regional indicator symbols that don't map to any current flag), or a snippet of Z̛̺͉̤̭͈̙A̧̦͉̗̩̞͙LG͈͎͍̺̖̹̘O̵̫ which has tons of combining marks but each cluster is still deleted with a single backspace.
Interesting. The rules seem to be different on different systems. Deleting two RIS symbols (whether they map to a flag or not) seems right in any case. Some other systems (Android included) will take the accents off separately when they are decomposed (but not for precomposed accented characters). Also note macOS takes just the accent off for Arabic (tested on U+062F U+064D).
Per Wikipedia, "the smallest unit of a writing system of any given language" is a grapheme. Note that this has nothing to do with Unicode. It's just the nature of human text. English folk typically use the word "character" to refer to the same concept.
Unicode models this concept with grapheme clusters. Per that model, GCs should in principle be the fundamental tokenizing unit that feeds into general purpose text parsing software.
But pragmatics may determine otherwise. Just as some tokenizing tools/functions constrain themselves to ASCII bytes, but then break when processing non-ASCII, so too other tokenizing tools/functions constrain themselves to codepoints, but then break if their input contains multi-codepoint grapheme clusters, e.g. a huge quantity of the text written online in 2019.
The grapheme is the smallest semantic unit of human-readable text. It's not the smallest unit of textual formats; the unicode scalar is.
Code that parses text for human semantic meaning would want to use the grapheme cluster as the smallest unit, but that's a vanishingly small amount of the overall text parsing code. Any code that parses any kind of machine-readable format does not want to use grapheme clusters.
As a trivial example, if I have a line of simple CSV (simple as in no quoting or escapes), it should be obvious that the fields can contain anything except a comma. Except that's not true if you parse it using grapheme clusters, because all I have to do is start one of the fields with a combining mark, and now the CSV parser will skip over the comma and hand me back a single field containing the comma-separated data that belonged in two fields.
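Here's a hedged Swift sketch of that failure mode (the input string is invented for illustration):

```swift
// A field that starts with U+0301 COMBINING ACUTE ACCENT: the mark
// glues itself onto the preceding comma, forming one grapheme
// cluster ",\u{0301}" that no longer equals ",".
let line = "a,\u{0301}b,c"

// Splitting by Character (grapheme clusters) misses that comma:
let fieldsByCharacter = line.split(separator: ",")
print(fieldsByCharacter.count)  // 2 -- "a,\u{0301}b" and "c"

// Splitting the unicode scalar view sees every comma:
let fieldsByScalar = line.unicodeScalars.split(separator: ",")
print(fieldsByScalar.count)     // 3
```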
Or to be slightly more complex, let's say I as a user can control a single string field for a JSON blob that gets stored in a database, and you're using a JSON parser that parses using grapheme clusters. If I start my string field with a combining mark, it will serialize to JSON just fine, but when you go to retrieve it from your database later you'll discover that you can't decode the JSON, because you're not detecting the open quote surrounding my string value.
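And a miniature version of the JSON case, just counting quote characters by cluster vs. by scalar (toy input):

```swift
// The value string starts with U+0301, which merges with the
// opening quote into the single cluster "\"\u{0301}".
let json = "{\"k\":\"\u{0301}v\"}"

let quotesByCharacter = json.filter { $0 == "\"" }.count             // 3
let quotesByScalar = json.unicodeScalars.filter { $0 == "\"" }.count // 4
// A cluster-based scanner sees unbalanced quotes and fails to
// decode the document; a scalar-based one sees all four.
```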
Thanks. I think I now understand what your point was/is.
> The grapheme is the smallest semantic unit of human-readable text.
Fwiw, quoting Wikipedia: "An individual grapheme may or may not carry meaning".
> Any code that parses any kind of machine-readable format does not want to use grapheme clusters.
I agree that formats defined in terms of codepoints need to be tokenized and parsed in terms of codepoints.
And one wouldn't expect there to be (m)any formats defined in terms of GCs as the fundamental token unit, partly because of the problem of defining and implementing suitable behavior for dealing with accidentally or maliciously misplaced combining characters.
In my naive opinion, it seems like a good choice then for languages (the ones with first-class utf-8 support, anyway) to operate on scalars and leave graphemes / user-text-editing use cases to libraries. (This is meant as an extension to your comment, not in contradiction to it.)
I'm glad that Swift has first-class support for grapheme clusters. It's just very irritating that they made it the default way to interact with strings.
Why not just use bytes? Most text parsing operations of the sort I think you’re describing can be done on UTF-8 bytes just as well as on codepoints, faster and without sacrificing correctness.
Up until Swift 5, Swift's String type was backed by UTF-16 (except for all-ASCII native strings, which just stored ASCII). Even with Swift 5 it's sometimes backed by UTF-16 (namely, when it wraps an NSString bridged from Obj-C that holds non-ASCII characters, which can happen even in pure-Swift code because many String APIs are really just wrappers around Obj-C Foundation APIs) and sometimes backed by UTF-8.
In truly performance-sensitive code with Swift 5, I'll go ahead and use the UTF-8 view on the assumption that input strings are backed by UTF-8, and I'll even force a native UTF-8 backing if I'm doing enough processing that the potential string copy is outweighed by the savings during processing. But that's only worth dealing with if there's a clear benefit. In most cases it's simpler just to use the unicode scalar view, since that avoids having to map UTF-8 sub-scalar offsets into a UTF-16 backing store (unicode scalar offsets always lie on both UTF-8 and UTF-16 code unit boundaries).
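A sketch of that pattern using the contiguous-string APIs from SE-0247 (Swift 5.1+); the comma-counting loop is just a stand-in for real processing:

```swift
var input = "field1,field2,field3"  // in practice: a possibly UTF-16-backed String

// Force a native UTF-8 backing once, so the scan below doesn't
// transcode from a UTF-16 store on the fly.
if !input.isContiguousUTF8 {
    input.makeContiguousUTF8()
}

// Process the UTF-8 view directly:
var commas = 0
for byte in input.utf8 where byte == UInt8(ascii: ",") {
    commas += 1
}
print(commas)  // 2
```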
All that said, I would have been much happier if Swift could have been 100% UTF-8 from the get-go, which would drastically simplify a lot of this stuff. But the requirement for bridging to/from NSString makes that untenable as it would otherwise involve a lot of string copying every time you cross the Swift/Obj-C boundary.