
Iterating code points is OK, as long as you know that iterating code points is not the same as iterating grapheme clusters, a.k.a. user-perceived characters. You get away with it most of the time, but you should know you are not dealing with full Unicode and have a plan to deal with the exceptions. Unicode normalization does not solve it all.
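To make the gap concrete, here's a minimal Python sketch (the specific strings are just illustrative examples):

```python
import unicodedata

s = "e\u0301"        # 'e' followed by U+0301 COMBINING ACUTE ACCENT
print(len(s))        # 2 code points, but users see one character: é

# NFC normalization merges this particular pair into one code point...
nfc = unicodedata.normalize("NFC", s)
print(len(nfc))      # 1 (U+00E9 LATIN SMALL LETTER E WITH ACUTE)

# ...but normalization does not solve it in general: many clusters have
# no precomposed form, e.g. a regional-indicator flag pair.
flag = "\U0001F1FA\U0001F1F8"   # two code points, one grapheme cluster
print(len(unicodedata.normalize("NFC", flag)))  # still 2 code points
```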

Unfortunately almost all "Absolute minimum you must know about Unicode" articles don't cover the absolute minimum you have to know about Unicode.

Handling arbitrary well-formed UTF-8 with advanced string algorithms and data structures whose unit is a 'char' requires more than code points.




Iterating grapheme clusters is OK, as long as you know that iterating grapheme clusters is not the same as iterating unicode scalars (code points) aka the fundamental unit of textual parsing grammars.

This is something that really bugs me about how Swift changed its mind and made the String type a Collection of Characters (i.e. grapheme clusters). Originally they recognized this issue and required you to write `str.characters` to work with the grapheme clusters as a collection (and String itself wasn't a collection at all), but then in Swift 3 (I think) they changed course and said String is a collection after all. And the problem is now people work with Characters without even thinking about it when they really should be working with unicode scalars.

In my personal experience, I only ever actually want to work with grapheme clusters when I'm doing something relating to user text editing (for example, if the user hits delete with an empty selection, I want to delete the last grapheme cluster). Most of my string manipulation wants to operate on scalars instead.


The rules for what you want to do on backspace are complex - you want to delete the grapheme cluster if it's an emoji or ideograph with variation selector, but if it's a combining mark, most of the time you want to just delete that. One place this is written down is [1].

Of course, this might sound like a nitpick, but it only confirms the actual point you were making: that treating text as a sequence of grapheme clusters is often, but not always, the right way to view the problem.

If you're talking about cursor motion when hitting an arrow key, then yeah, grapheme cluster.

[1]: https://github.com/xi-editor/xi-editor/blob/master/rust/core...


macOS and iOS delete the entire grapheme cluster on backspace, not just the combining mark (which is to say, backspace with no selection is identical to shift-left to select the previous character and then hitting backspace).


Not sure what scripts you intended your comment about, but this is not true in general. If I type anything like किमपि (“kimapi”) and hit backspace, it turns into किमप (“kimapa”). That is, the following sequence of codepoints:

    ‎0915 DEVANAGARI LETTER KA
    ‎093F DEVANAGARI VOWEL SIGN I
    ‎092E DEVANAGARI LETTER MA
    ‎092A DEVANAGARI LETTER PA
    ‎093F DEVANAGARI VOWEL SIGN I
made of three grapheme clusters (containing 2, 1, and 2 codepoints respectively), turns after a single backspace into the following sequence:

    ‎0915 DEVANAGARI LETTER KA
    ‎093F DEVANAGARI VOWEL SIGN I
    ‎092E DEVANAGARI LETTER MA
    ‎092A DEVANAGARI LETTER PA
This is what I expect/find intuitive, too, as a user. Similarly अन्यच्च is made of 3 grapheme clusters but you hit backspace 7 times to delete it (though there I'd slightly have preferred अन्यच्च→अन्यच्→अन्य→अन्→अ instead of अन्यच्च→अन्यच्→अन्यच→अन्य→अन्→अन→अ that's seen, but one can live with this).
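For anyone who wants to inspect sequences like this themselves, a quick Python check (the string literal is chosen to match the codepoints listed above):

```python
import unicodedata

s = "\u0915\u093F\u092E\u092A\u093F"   # किमपि
for ch in s:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+0915 DEVANAGARI LETTER KA
# U+093F DEVANAGARI VOWEL SIGN I
# U+092E DEVANAGARI LETTER MA
# U+092A DEVANAGARI LETTER PA
# U+093F DEVANAGARI VOWEL SIGN I
```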


Looks like you're right. I don't have experience with languages like this one. I was thinking more of things like é (e followed by U+301), or 🇦🇧 (which is two regional indicator symbols that don't map to any current flag), or a snippet of Z̛̺͉̤̭͈̙A̧̦͉̗̩̞͙LG͈͎͍̺̖̹̘O̵̫ which has tons of combining marks but each cluster is still deleted with a single backspace.


Interesting. The rules seem to be different on different systems. Deleting two RIS symbols (whether they map to a flag or not) seems right in any case. Some other systems (Android included) will take the accents off separately when they are decomposed (but not for precomposed accented characters). Also note macOS takes just the accent off for Arabic (tested on U+062F U+064D).


Per wikipedia, "the smallest unit of a writing system of any given language" is a grapheme. Note that this has nothing to do with Unicode. It's just the nature of human text. English folk typically use the word "character" to refer to the same concept.

Unicode models this concept with grapheme clusters. Per that model, GCs should in principle be the fundamental tokenizing unit that feeds into general purpose text parsing software.

But pragmatics may determine otherwise. Just as some tokenizing tools/functions constrain themselves to ASCII bytes but then break when processing non-ASCII, so too other tokenizing tools/functions constrain themselves to codepoints but then break if their input contains multi-codepoint graphemes, e.g. a huge quantity of the text written online in 2019.


The grapheme is the smallest semantic unit of human-readable text. It's not the smallest unit of textual formats, the unicode scalar is.

Code that parses text for human semantic meaning would want to use the grapheme cluster as the smallest unit, but that's a vanishingly small amount of the overall text parsing code. Any code that parses any kind of machine-readable format does not want to use grapheme clusters.

As a trivial example, if I have a line of simple CSV (simple as in no quoting or escapes), it should be obvious that the fields can contain anything except a comma. Except that's not true if you parse it using grapheme clusters, because all I have to do is start one of the fields with a combining mark, and now the CSV parser will skip over the comma and hand me back a single field containing the comma-separated data that belonged in two fields.

Or to be slightly more complex, let's say I as a user can control a single string field for a JSON blob that gets stored in a database, and you're using a JSON parser that parses using grapheme clusters. If I start my string field with a combining mark, it will serialize to JSON just fine, but when you go to retrieve it from your database later you'll discover that you can't decode the JSON, because you're not detecting the open quote surrounding my string value.
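The CSV failure mode above can be sketched in a few lines of Python (the field values are hypothetical):

```python
import unicodedata

# The second field starts with U+0301 COMBINING ACUTE ACCENT.
line = "alpha,\u0301beta,gamma"

# Code-point-level splitting sees every comma: three fields.
fields = line.split(",")
print(len(fields))   # 3

# A grapheme-cluster segmenter would instead attach the combining mark
# to the *preceding* character, making ",\u0301" a single cluster -- a
# cluster-based parser never sees that comma as a field separator.
print(unicodedata.combining("\u0301"))   # 230: a combining mark
```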


Thanks. I think I now understand what your point was/is.

> The grapheme is the smallest semantic unit of human-readable text.

Fwiw, quoting wikipedia: "An individual grapheme may or may not carry meaning".

> Any code that parses any kind of machine-readable format does not want to use grapheme clusters.

I agree that formats defined in terms of codepoints need to be tokenized and parsed in terms of codepoints.

And one wouldn't expect there to be (m)any formats defined in terms of GCs as the fundamental token unit, partly because of the problem of defining and implementing suitable behavior for dealing with accidentally or maliciously misplaced combining characters.


In my naive opinion, it seems like a good choice then for languages (the ones with first-class utf-8 support, anyway) to operate on scalars and leave graphemes / user-text-editing use cases to libraries. (This is meant as an extension to your comment, not in contradiction to it.)


I'm glad that Swift has first-class support for grapheme clusters. It's just very irritating that they made it the default way to interact with strings.


Why not just use bytes? Most text parsing operations of the sort I think you’re describing can be done on UTF-8 bytes just as well as on codepoints, faster and without sacrificing correctness.
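This works because UTF-8 is self-synchronizing: every byte of a multi-byte sequence has its high bit set, so searching raw bytes for an ASCII delimiter can never produce a false match inside a multi-byte character. A small Python sketch (example strings are arbitrary):

```python
data = "naïve,café".encode("utf-8")

# 0x2C (',') can never occur inside a multi-byte UTF-8 sequence, because
# all lead and continuation bytes of such sequences are >= 0x80. So a
# byte-level search for the delimiter is safe.
idx = data.index(ord(","))
print(data[:idx].decode("utf-8"))       # naïve
print(data[idx + 1:].decode("utf-8"))   # café
```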


Up until Swift 5, Swift's String type was backed by UTF-16 (except for all-ASCII native strings, which just stored ASCII). Even with Swift 5, it's sometimes backed by UTF-16 (namely, when it contains an NSString bridged from Obj-C code that contains non-ASCII characters, which can happen even in pure-Swift code due to all of the String APIs that are really just wrappers around Obj-C Foundation APIs) and sometimes backed by UTF-8.

In truly performance-sensitive code with Swift 5 I will go ahead and use the UTF-8 view with the assumption that input strings are backed by UTF-8, and even force it to native UTF-8 if I'm doing enough processing that the potential string copy is outweighed by the savings during processing, but that's something that's only worth dealing with if there's a clear benefit to doing so. In most cases it's simpler just to use the unicode scalar view, as that doesn't have the potential for having to map UTF-8 sub-scalar offsets into a UTF-16 backing store (whereas unicode scalar offsets always lie on both UTF-8 and UTF-16 code unit boundaries).

All that said, I would have been much happier if Swift could have been 100% UTF-8 from the get-go, which would drastically simplify a lot of this stuff. But the requirement for bridging to/from NSString makes that untenable as it would otherwise involve a lot of string copying every time you cross the Swift/Obj-C boundary.


I found that even a suggestion to use "grapheme clusters" is misleading. People like to think that there is one kind of grapheme clusters, namely one specified in the UAX #29 [1], but that's just the default and the UAX is pretty much clear about it! Consider an `ij` digraph in Dutch [2] that should count as a single character as a motivating example. "Code points", or rather "Unicode scalar values" are formally defined and not changing; "grapheme clusters" are generally locale-dependent.

I think that we should ask what people want to do with "characters" instead:

- If you want an array of very short strings, use an array. Do not abuse strings. Recent languages don't even like it.

- If you want a string that is not too long when displayed, first and foremost try to resolve that on the actual display (say, with CSS). If that's really impossible, pick a font and actually measure. Make sure that you are using the actual size being printed---the bounding box is not linearly scaled when the font size changes.

- If that display is a terminal, you may also consider using the East Asian Width [3]. But keep in mind that the default width of "Ambiguous" characters greatly varies across locales (Asians tend to prefer dual-width ambiguous characters, for example).

- If you want a string that is not too long when stored, use the encoded byte size. Oh, do you want to make sure that certain languages are not disadvantaged? Do your research to determine the appropriate limit per language then, but I bet you would be much better with generous limits (cough cough Twitter cough).

- (While this is not counting characters, for completeness:) if you want to navigate a string with a cursor or a keyboard, then default grapheme clusters may actually fit the bill, as that might be the best possible! Someone will obviously complain though.

- If you have to, uh, really, really count characters, chances are that you can assume a particular script and/or language. Reject others and use the local convention. You may want to read about the Unicode Script Property [4].

[1] https://unicode.org/reports/tr29/

[2] https://en.wikipedia.org/wiki/IJ_(digraph)

[3] https://www.unicode.org/reports/tr11/

[4] https://www.unicode.org/reports/tr24/
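Two of the checks above (byte-size limits and East Asian Width) are easy to sketch with the Python standard library; the byte limit of 1000 is just a placeholder, not a recommendation:

```python
import unicodedata

# Limit stored strings by *encoded byte size*, not "character" count.
def fits_storage_limit(s: str, max_bytes: int = 1000) -> bool:
    return len(s.encode("utf-8")) <= max_bytes

print(len("héllo"), len("héllo".encode("utf-8")))   # 5 code points, 6 bytes

# East Asian Width (UAX #11): how many terminal cells a character needs.
print(unicodedata.east_asian_width("漢"))   # 'W'  (wide: two cells)
print(unicodedata.east_asian_width("A"))    # 'Na' (narrow: one cell)
print(unicodedata.east_asian_width("Ω"))    # 'A'  (ambiguous: locale-dependent)
```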


You are correct. The lesson is that code points are a low-level representation, and their semantics do not transfer across languages in Unicode. If you don't know the context, you can get characters that don't render correctly, or you may split a string into parts in the wrong places.

UTF-8 strings can be treated as drop-in replacements for ASCII or Latin-1. If you want to deal with full Unicode in all cases, you need the locale.

Exercise for the reader: try to write radix tree and rope data structures that work with every language in all cases with Unicode.

http://cldr.unicode.org/ is your friend.


If you want to navigate a string, you should stop "inside" compatibility ligatures (fi, U+FB01 being the canonical example).


That is why default grapheme clusters only concern boundaries aligned with "starters" or base characters (and thus are preserved after canonical normalizations). U+FB01 is only compatibility-decomposable and harder to deal with efficiently.


Do you know any "absolute minimum you must know about Unicode" articles that do go into enough depth?


One key thing to know is that encoding (UTF-8, UTF-16, UTF-32) is a completely separate problem from rendering text. I have had a couple of people say to me recently something along the lines of, "We don't need text shaping since UTF-8 takes care of it." That isn't remotely true. An encoding gets you a series of Unicode code points. To render those, the bidirectional (bidi) algorithm is applied first, and then the "runs" from the bidi algorithm are shaped. The text shaper uses OpenType tables within the font to convert these code points into a series of glyph indices with x/y offsets. The renderer then works entirely on glyphs, which might not even map back to a code point in the font.

The HarfBuzz manual touches on some of this: https://harfbuzz.github.io/why-do-i-need-a-shaping-engine.ht...


Unfortunately I don't. I started to learn Unicode, then realized how complicated it is to do right, and stopped because I realized that nobody really cares as long as it works almost all the time.

As Joel below demonstrates, you can get away with 29 languages by treating code points as characters and without knowing about grapheme clusters and other stuff.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

>When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That’s the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.


Not really relevant. That just demonstrates that displaying those languages works adequately; it doesn't show anything about other processing that your software might care about (e.g. sorting, searching, case conversion, keyboard input, selection and editing, etc.)


> As Joel below demonstrates, you can get away with 29 languages by treating code points as characters and without knowing about grapheme clusters and other stuff.

If you treat text as completely opaque it does work fine. Issues crop up when you want or need to manipulate said text, either to extract information or to modify it.



