Per Wikipedia, "the smallest unit of a writing system of any given language" is a grapheme. Note that this has nothing to do with Unicode; it's just the nature of human text. English speakers typically use the word "character" to refer to the same concept.

Unicode models this concept with grapheme clusters (GCs). Per that model, GCs should in principle be the fundamental tokenizing unit that feeds into general-purpose text-parsing software.

But pragmatics may determine otherwise. Just as some tokenizing tools/functions constrain themselves to ASCII bytes and then break when processing non-ASCII input, so too other tokenizing tools/functions constrain themselves to codepoints and then break when their input contains multi-codepoint graphemes, e.g. a huge quantity of the text written online in 2019.
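
A minimal sketch of that codepoint/grapheme-cluster mismatch, assuming Python 3 plus the third-party regex module (its \X pattern matches one extended grapheme cluster per UAX #29; the stdlib re module has no equivalent) and a reasonably recent Unicode database:

    import regex

    s = "caf\u00E9"             # "café" with a precomposed é
    t = "cafe\u0301"            # "café" as e + U+0301 COMBINING ACUTE ACCENT
    u = "\U0001F44D\U0001F3FD"  # thumbs-up + skin-tone modifier

    print(len(s), len(regex.findall(r"\X", s)))  # 4 codepoints, 4 grapheme clusters
    print(len(t), len(regex.findall(r"\X", t)))  # 5 codepoints, 4 grapheme clusters
    print(len(u), len(regex.findall(r"\X", u)))  # 2 codepoints, 1 grapheme cluster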




The grapheme is the smallest semantic unit of human-readable text. It's not the smallest unit of textual formats; that's the Unicode scalar value.

Code that parses text for human semantic meaning would want to use the grapheme cluster as the smallest unit, but that's a vanishingly small amount of the overall text parsing code. Any code that parses any kind of machine-readable format does not want to use grapheme clusters.

As a trivial example, if I have a line of simple CSV (simple as in no quoting or escapes), it should be obvious that the fields can contain anything except a comma. Except that's not true if you parse it using grapheme clusters, because all I have to do is start one of the fields with a combining mark, and now the CSV parser will skip over that comma and hand me back a single field containing the comma-separated data that belonged in two fields.
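
A quick sketch of that failure mode, again leaning on the third-party regex module's \X as a stand-in for a grapheme-cluster-based tokenizer (the field values here are made up):

    import regex

    line = "alice,\u0301bob,carol"  # second field starts with U+0301 COMBINING ACUTE ACCENT

    # Codepoint-based split sees both commas: three fields, as expected.
    print(line.split(","))

    # Grapheme-cluster-based split: "," + U+0301 fuse into a single cluster,
    # so that comma is never seen as a separator.
    fields, cur = [], []
    for g in regex.findall(r"\X", line):
        if g == ",":
            fields.append("".join(cur))
            cur = []
        else:
            cur.append(g)
    fields.append("".join(cur))
    print(fields)  # two fields, and the first one still contains a comma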

Or to be slightly more complex, let's say I as a user can control a single string field in a JSON blob that gets stored in a database, and you're using a JSON parser that parses using grapheme clusters. If I start my string field with a combining mark, it will serialize to JSON just fine, but when you go to retrieve it from your database later you'll discover that you can't decode the JSON, because the parser never detects the opening quote surrounding my string value.
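
And a sketch of the JSON variant: Python's json module is codepoint-based and round-trips the value fine, while the cluster counting below stands in for the hypothetical grapheme-cluster-based decoder that loses the opening quote:

    import json
    import regex

    # User-controlled string value starting with U+0301 COMBINING ACUTE ACCENT.
    blob = json.dumps({"name": "\u0301payload"}, ensure_ascii=False)
    print(json.loads(blob))  # a codepoint-based parser decodes it fine

    clusters = regex.findall(r"\X", blob)
    print(blob.count('"'))                       # 4 quote codepoints in the serialized JSON
    print(sum(1 for g in clusters if g == '"'))  # 3: the value's opening quote fused with U+0301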


Thanks. I think I now understand what your point was/is.

> The grapheme is the smallest semantic unit of human-readable text.

Fwiw, quoting Wikipedia: "An individual grapheme may or may not carry meaning".

> Any code that parses any kind of machine-readable format does not want to use grapheme clusters.

I agree that formats defined in terms of codepoints need to be tokenized and parsed in terms of codepoints.

And one wouldn't expect there to be (m)any formats defined in terms of GCs as the fundamental token unit, partly because of the problem of defining and implementing suitable behavior for dealing with accidentally or maliciously misplaced combining characters.



