Per Wikipedia, "the smallest unit of a writing system of any given language" is a grapheme. Note that this has nothing to do with Unicode; it's just the nature of human text. English speakers typically use the word "character" to refer to the same concept.

Unicode models this concept with grapheme clusters (GCs). Per that model, GCs should in principle be the fundamental tokenizing unit that feeds into general-purpose text-parsing software.

But pragmatics may determine otherwise. Just as some tokenizing tools/functions constrain themselves to ASCII bytes and then break when processing non-ASCII input, so too other tokenizing tools/functions constrain themselves to codepoints and then break when their input contains multi-codepoint graphemes, e.g. a huge quantity of the text written online in 2019.
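
A minimal sketch of that codepoint/grapheme-cluster mismatch, assuming Python 3 plus the third-party regex module (its \X pattern matches one extended grapheme cluster per UAX #29; the stdlib re module has no equivalent) and a reasonably recent Unicode database:

    import regex

    s = "caf\u00E9"             # "café" with a precomposed é
    t = "cafe\u0301"            # "café" as e + U+0301 COMBINING ACUTE ACCENT
    u = "\U0001F44D\U0001F3FD"  # thumbs-up + skin-tone modifier

    print(len(s), len(regex.findall(r"\X", s)))  # 4 codepoints, 4 grapheme clusters
    print(len(t), len(regex.findall(r"\X", t)))  # 5 codepoints, 4 grapheme clusters
    print(len(u), len(regex.findall(r"\X", u)))  # 2 codepoints, 1 grapheme cluster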




The grapheme is the smallest semantic unit of human-readable text. It's not the smallest unit of textual formats; that's the Unicode scalar value.

Code that parses text for human semantic meaning would want to use the grapheme cluster as the smallest unit, but that's a vanishingly small amount of the overall text parsing code. Any code that parses any kind of machine-readable format does not want to use grapheme clusters.

As a trivial example, if I have a line of simple CSV (simple as in no quoting or escapes), it should be obvious that the fields can contain anything except a comma. Except that's not true if you parse it using grapheme clusters, because all I have to do is start one of the fields with a combining mark, and now the CSV parser will skip over that comma and hand me back a single field containing the comma-separated data that belonged in two fields.
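
A quick sketch of that failure mode, again leaning on the third-party regex module's \X as a stand-in for a grapheme-cluster-based tokenizer (the field values here are made up):

    import regex

    line = "alice,\u0301bob,carol"  # second field starts with U+0301 COMBINING ACUTE ACCENT

    # Codepoint-based split sees both commas: three fields, as expected.
    print(line.split(","))

    # Grapheme-cluster-based split: "," + U+0301 fuse into a single cluster,
    # so that comma is never seen as a separator.
    fields, cur = [], []
    for g in regex.findall(r"\X", line):
        if g == ",":
            fields.append("".join(cur))
            cur = []
        else:
            cur.append(g)
    fields.append("".join(cur))
    print(fields)  # two fields, and the first one still contains a comma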

Or to be slightly more complex, let's say I as a user can control a single string field in a JSON blob that gets stored in a database, and you're using a JSON parser that parses using grapheme clusters. If I start my string field with a combining mark, it will serialize to JSON just fine, but when you go to retrieve it from your database later you'll discover that you can't decode the JSON, because the parser never detects the opening quote surrounding my string value.
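
And a sketch of the JSON variant: Python's json module is codepoint-based and round-trips the value fine, while the cluster counting below stands in for the hypothetical grapheme-cluster-based decoder that loses the opening quote:

    import json
    import regex

    # User-controlled string value starting with U+0301 COMBINING ACUTE ACCENT.
    blob = json.dumps({"name": "\u0301payload"}, ensure_ascii=False)
    print(json.loads(blob))  # a codepoint-based parser decodes it fine

    clusters = regex.findall(r"\X", blob)
    print(blob.count('"'))                       # 4 quote codepoints in the serialized JSON
    print(sum(1 for g in clusters if g == '"'))  # 3: the value's opening quote fused with U+0301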


Thanks. I think I now understand what your point was/is.

> The grapheme is the smallest semantic unit of human-readable text.

Fwiw, quoting Wikipedia: "An individual grapheme may or may not carry meaning".

> Any code that parses any kind of machine-readable format does not want to use grapheme clusters.

I agree that formats defined in terms of codepoints need to be tokenized and parsed in terms of codepoints.

And one wouldn't expect there to be (m)any formats defined in terms of GCs as the fundamental token unit, partly because of the problem of defining and implementing suitable behavior for dealing with accidentally or maliciously misplaced combining characters.



