UTF-8 String Indexing Strategies (nullprogram.com)
170 points by ingve on May 30, 2019 | 121 comments



Is there any real case where code point indexing is useful? It seems like all these attempts to restrict strings in such a way to accommodate code points is just introducing complexity with no gain.

UTF-8 was designed to be an encoding (of code points) on top of the "bytes" abstraction just as Unicode is designed to be an encoding (of human text) on top of the "code points" abstraction. I think it should be uncontroversial that there are very good reasons to at least handle arbitrary sequences of code points (eg, you want to be able to handle input from future versions of Unicode, and you don't know about the grapheme clustering of those code points), but I don't see a good reason not to handle arbitrary sequences of bytes.

The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16 (which exists for historical reasons), but this just seems like a matter of when the information is lost (is it during conversion from string to UTF-16, or from bytes to string), not if it is lost.

As far as I can tell with the Python story, for example, people decided to add special "Unicode" strings into Python 2, then presumably some code used the "Unicode" strings and some code used the "byte" strings, so this situation is obviously undesirable ... then in Python 3 they tried fixing it by changing which sort of string was the default. Why would it not have been better to just improve Unicode support for the existing strings instead of splitting the type into two and forcing everyone to decide whether their strings are for "bytes" or for "Unicode"?


> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16 (which exists for historical reasons), but this just seems like a matter of when the information is lost (is it during conversion from string to UTF-16, or from bytes to string), not if it is lost.

And you can even avoid losing that information at all with encodings like WTF-8: https://simonsapin.github.io/wtf-8/


Programmers are rarely interested in individual bytes, except sometimes if a string is abused as a byte array. In all other cases iterating or indexing over characters is the intention, and code points are the proper abstraction for characters, not bytes.

Also I might be wrong, but you can just look at the bytes to know how many bytes a UTF-8 character occupies, since a byte with value 0 to 127 represents a character on its own, while 128 to 255 indicates that the byte belongs to a multi-byte character.


Except that code points aren't the proper abstraction for "characters". Most people would think of <Family: Woman, Woman, Girl, Boy>[1] as one character, but it's really five code points; woman, zero width joiner, woman, zero width joiner, girl, zero width joiner, boy. If you tried doing an operation like reversing a string or removing the last character, and you treated a unicode code point as a "character", you would end up with the wrong result. If you just removed the last code point to implement a backspace, you would end up with a string which ends in a zero width joiner, which makes little sense; and when the user wants to insert, say, a girl emoji, that emoji will end up as a part of the family due to that trailing joiner, when the user expected it to be a separate emoji.

This applies to more than just emojis by the way; there are languages whose unicode representation is much more complicated than english or other languages with latin characters.

[1]: https://emojipedia.org/family-woman-woman-girl-boy/

EDIT: This comment originally used the actual emojis as examples, but hacker news just replaced every code point in the emoji with a space.
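
To make the backspace failure concrete, here is a minimal Python 3 sketch (the emoji is written with escapes since, as noted, HN strips the raw characters); Python strings index by code point, so the same failure applies:

    # The "family: woman, woman, girl, boy" emoji is 7 code points joined by ZWJs.
    family = "\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

    print(len(family))                    # 7 -- code points, not user-perceived characters
    backspaced = family[:-1]              # naive "delete last character"
    print(backspaced.endswith("\u200D"))  # True -- left with a dangling zero width joiner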


You don't even need emoji; eg U+63,U+300 (c̀) is one character but two code points, and U+1C4 (DŽ) is two characters, but one code point. There's also U+1F3,U+308 (dz̈), which is two characters in two code points, but segments incorrectly if you split on code points instead of characters.

It's ambiguous how to encode latin-small-a-with-ring-above (U+E5 vs U+61,U+30A). Decoding is also ambiguous (most infamously Han grapheme clusters), but I'm not fluent enough in any of the affected languages to have a ready example.

Also, that's seven code points, not five.
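
A quick Python sketch of both points, assuming only the standard unicodedata module; the two spellings of å are canonically equivalent but remain different code point sequences until you normalize:

    import unicodedata

    precomposed = "\u00E5"      # å as one code point
    decomposed = "a\u030A"      # 'a' followed by a combining ring above

    print(precomposed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
    print(len("c\u0300"), len("\u01C4"))   # 2 1 -- "characters" vs code points diverge both ways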


Your example is a little confusing because of the use of the word "characters". I think glyphs would be more clear (U+1C4 is two glyphs). Though it might not actually be 2 glyphs; it's dependent on how the font implements it.

At the end of the day, an OpenType font through substitutions can do far more crazy things than these "double glyph" examples. I once made a font that took a 3 letter acronym, substituted this for a hidden empty glyph in the font, then substituted this single hidden empty glyph into 20 glyphs. You were left with something like your U+1C4 in software where you could only highlight all 20 glyphs or none of them. And this was happening on text where all input code points were under 127. People often don't realize how much logic and complexity can be put into a font or how much the font is responsible for doing.


"Squiggles" is my preferred word here. There are a bunch of technical terms like Glyph and Codepoint and Grapheme, but I find squiggles are often what somebody wanted when they used something that works on "characters" and are disappointed with the results.

Advice elsewhere is correct. You almost certainly don't want anything in your high level language other than strings. No "array of characters", no slicing; all that stuff is probably a smell. They're like the API in many crypto libraries for doing ECB symmetric encryption. 99% of people using it would have been better off if the Stack Overflow answer they're about to read were explaining what they should be doing instead.


No, "characters" is the correct term; U+1C4 is two characters: latin-capital-d followed by latin-capital-z-with-caron (or whatever you want to call the little v thing). As you note, this means that non-buggy fonts will generally use two glyphs to render it, but that's a implementation detail; a font could process the entire word "the" as one glyph, or render "m" as dotless-i + right-half-of-n + right-half-of-n, but that wouldn't affect how many characters either string has.


Using characters is confusing because I don't know if you mean before or after shaping. U+1C4 is unquestionably a single Unicode code point. I've heard people call this 1 logical character. Other people might say how many characters it requires for encoding in UTF-8 or in UTF-16. After shaping, some people might say it is 1 or 2 "shaped characters". It's all horribly confusing. I find using the term code point more precise.


There's no shaping involved; I'm not talking about the implementation details of the rendering algorithm. There is a D, followed by a Ž. This only seems confusing because Unicode (and - to be fair - other, earlier character encodings) willfully misinterprets the term "character" for self-serving purposes.


Swift handles both cases well:

    var test = "test\u{1F469}"
    var tset = String(test.reversed())
    var tes = test.dropLast()
(The second line needs the extra ‘String’ to turn a sequence into a String; and yes, the names of the variables do not match their contents)


Doing it with just one emoji sort of misses the point...

One way in P6, combining the full family into one character:

    my \test = "test\c[Woman,ZWJ,Woman,ZWJ,Girl,ZWJ,Boy]";
    say test.chars; # 5
    say test;       # test   
    say flip test;  #    tset
    say test.chop;  # test
HN displays the 7 codepoint family as three spaces.

To see that P6 treats the family as one:

https://tio.run/##K0gtyjH7/18BCHIrFWJKUotLFGwV1EG0ukKdQmleZk...


Sorry. Tested with the family, but didn’t notice that I only took the single code point when making _something_ show up on HN. It also works for

    "test\u{1F469}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}"


A codepoint is the smallest unit of meaning in unicode. A byte is just a number that might (or might not) have meaning in a specific unicode encoding (also depending on what other bytes it's next to).

A codepoint is the smallest unit that has a graphical representation you can print on screen.

A codepoint is the smallest unit that allows APIs that are agnostic to encoding, just in terms of the semantic content of unicode. If you want to write any kind of algorithm in terms of the actual character meaning (the characters represented), you want a codepoint abstraction. Most unicode algorithms -- like for collation, normalization, regexp character classes -- are in terms of codepoints.

If you split a unicode string on codepoints, the results are always valid unicode strings. If you split a unicode string on bytes, they may not be.

Human written language is complicated. Unicode actually does a pretty amazing job of providing an abstraction for dealing with it, but it's still complicated. It's true that it would be a (common) misconception to think that a codepoint always represents "one block on the screen", a "user-perceived character", (a "grapheme cluster"). If you start really getting into it, you realize "a user-perceived character" is a more complex concept than you thought/would like; not because of unicode but because of the complexities of global written human language and what software wants to do with it. But most people who have tried writing internationalized text manipulation of any kind with an API that is only in terms of bytes -- will know that codepoints is definitely superior.

If you do need "user-perceived characters" aka "grapheme clusters" -- unicode has an algorithm for that, based on data for each codepoint in the unicode database. https://unicode.org/reports/tr29/ It can be locale-dependent (whereas codepoints are locale independent). And guess what, the algorithm is in terms of codepoints -- if you wanted to implement the algorithm, you would usually want an API based on a codepoint abstraction to start with.

The "grapheme cluster" abstraction is necessarily more expensive to deal with than the "codepoint" abstraction (which is itself necessarily more expensive than "bytes") -- "codepoint" is quite often the right balance. I suppose if computers were or got another couple of magnitudes faster, we might all want/demand more widespread implementation of "grapheme cluster" as the abstraction for many more things -- but it'd still be described and usually implemented in terms of the "codepoint" abstraction, and you'd still need the codepoint abstraction for many things, such as normalization. But yes, it would be nice if more platforms/libraries provided "grapheme cluster" abstraction too. But it turns out you can mostly get by with "codepoint". You can't really even get by with just bytes if you want to do any kind of text manipulation or analysis (such as regexp). And codepoint is the abstraction on which "grapheme cluster" is built, it's the lower level and simpler abstraction, so is the first step -- and some platforms have only barely gotten there. A "grapheme cluster" is made up of codepoints.

I suppose one could imagine some system that isn't unicode that doesn't use a "codepoint" abstraction but somehow only had "user-perceived characters"... but it would get pretty crazy, for a variety of reasons including but not limited to that "user-perceived character" can be locale-dependent. "codepoint" is a very good and useful abstraction, and is the primary building block of unicode, so it makes sense that unicode-aware platform APIs also use it as a fundamental unit. A codepoint is the unit on which you can look up metadata in the unicode database, for normalization, upper/lowercasing, character classes for regexps, collation (sort order), etc. Unicode is designed to let you do an awful lot with codepoints, in as performant a manner as unicode could figure out.
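
The validity point above is easy to demonstrate in Python, where slicing a str splits on code points while slicing the encoded bytes does not:

    s = "héllo"                  # 'é' (U+00E9) is two bytes in UTF-8
    b = s.encode("utf-8")

    print(s[:2])                 # 'hé' -- any code point slice is still valid Unicode
    try:
        b[:2].decode("utf-8")    # cuts U+00E9 in half
    except UnicodeDecodeError as err:
        print("byte slice is not valid UTF-8:", err)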


> If you split a unicode string on codepoints, the results are always valid unicode strings.

"One of the reasons why the Unicode Standard avoids the term “valid string”, is that it immediate begs the question, valid for what?"

Source: http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0034.h...

The only thing you get by splitting a sequence of codepoints at random is another sequence of codepoints. Because you can end up with codepoint sequences that map to different glyphs or end up being ignored when they wouldn't have been in their proper order, you can end up with nonsense. You can shuffle a sequence of ASCII characters and still end up with a sequence of ASCII characters. What good is that? I fail to see how it would be qualitatively different than splitting a UTF-8 string at arbitrary code points. The latter is supposed to induce an error, but the former doesn't necessarily. The Unicode specification is written in a way to degrade softly when manipulated or displayed by poorly written software or old software dealing with future sequences with unique semantics. But that's not the same thing as saying that any sequence of codepoints is valid. Rather it's more akin to undefined behavior in C, except without a license to unleash nasal daemons.


But code points are not an abstraction for characters. The character "á" can be written as two code points (U+61, U+301). If a character can be two code points, why can't it be three bytes?


Code points are not the proper abstraction, since a character can be composed from a variable number of code points.

Also, bytes from 0-127 in UTF-8 (MSB is 0) are ASCII characters. In multibyte code points, the MSBs of the first byte are 11 and of continuation bytes 10.


A code unit with value 0-127 always maps to a single code point, i.e. it is never part of a multibyte character. It also maps 1:1 to ASCII values. So, if you are looking for specific characters in the ASCII code set, it is perfectly fine to iterate one byte at a time on a UTF-8 string.
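
A quick Python demonstration of that, treating a bytes object as the raw UTF-8 buffer; a byte equal to ord(',') can only ever be a real comma, never the interior of some other character:

    data = "naïve, café, 字".encode("utf-8")

    fields = data.split(b",")    # continuation bytes are all >= 0x80, so this is safe
    print([f.decode("utf-8") for f in fields])   # ['naïve', ' café', ' 字']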


> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs

As other top-level comments and the article have mentioned, different systems still use different internal representations, but all modern ones commit to being able to represent Unicode code points, with no automatic normalization. To that end I suppose that in terms of reasoning about the consistency of data in different systems, it is better to use code points than the actual size, which is left to the implementation.

The possibly better alternative would be to use lengths in UTF-8, but that might seem arbitrary to some. Perhaps counting code points is useful in that it gives a lower bound on the length in any reasonable encoding.


> Perhaps counting code points is useful in that it gives a lower bound on the length in any reasonable encoding.

But "á" is two code points (U+61, U+301). If you're looking for some lower bound (whatever that means), shouldn't it be 1? I imagine if you're looking for something like information density, the count of UTF-8 code units would at least be somewhat more informative than the count of code points.

I guess the crux of this whole point is that a sequence of code points is arbitrary in the same way as a sequence of bytes; neither "code point" nor "byte" necessarily corresponds to something that a user would see as a unit in human text. So why are we not using the simpler abstraction?


> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16

Text is by definition losslessly convertible between UTFs. You only "lose" information if your source is not a correct UTF stream e.g. if you have lone surrogates you don't actually have UTF-16, you have a garbage pile of UTF-16 code units.

Now it may be useful to properly round-trip this garbage pile (e.g. you're dealing with filenames on Windows), but this should not be confused with data conversion between UTFs: your source simply is not UTF-16.


I'm not sure there's a clear definition of what valid "text" is, but surely classifying it as something like "a sequence of Unicode scalar values" (equivalent to "something that can be encoded in a UTF") is a bit arbitrary. Is a sequence of unassigned Unicode scalar values really "text"? Maybe "text" should not start with a combining character. Maybe it should be grammatically valid in some human language.

Again, unless the point of all of this is to cater for obsolete encodings for the rest of eternity, "sequence of code points" (or here "sequence of Unicode scalar values" [0]) seems just as arbitrary as "sequence of bytes".

[0] Probably worth mentioning that as far as I'm aware, these systems tend to use code points, not Unicode scalar values, so the strings are not guaranteed representable by UTFs anyway (Python allows "\udc80", and even started making use of such strings as a latent workaround to handle non-UTF-8 input some time after Python 3 was released [1])

[1] https://www.python.org/dev/peps/pep-0383/
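
For reference, this is roughly what the PEP 383 mechanism looks like from Python code: the surrogateescape error handler smuggles un-decodable bytes through as lone surrogates and restores them on the way back out.

    raw = b"caf\xe9"    # e.g. a Latin-1-era filename, not valid UTF-8

    s = raw.decode("utf-8", errors="surrogateescape")
    print(ascii(s))     # 'caf\udce9' -- the bad byte becomes a lone surrogate
    print(s.encode("utf-8", errors="surrogateescape") == raw)   # True: lossless round trip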


I've no idea what you're trying to say. All I'm saying is

> ensuring that text is losslessly convertible to other UTFs

is a truism, it can't not be true. You can always convert back and forth between UTF8 and UTF16 with no loss or alteration of data.


It's not a truism if "text" includes bytes that are not UTF-8. A program would probably naturally model a filename as being text, but if that means "sequence of Unicode scalar values", it's arguably incorrect on some obscure systems, such as Linux and git.


> It's not a truism if "text" includes bytes that are not UTF-8.

"other UTFs" assume it's in one UTF to start with. If it's in a random non-encoding, possibly no encoding at all, there's no point in attempting a conversion, you can't turn random garbage into non-garbage.

> A program would probably naturally model a filename as being text

That's incorrect modelling and a well known error, as discussed in other sub-threads.


> "other UTFs" assume it's in one UTF to start with

Maybe I should have been clearer. I just meant "other UTFs" as in "UTFs that are not UTF-8", ie, UTFs that exist primarily for historical reasons.

> That's incorrect modelling and a well known error, as discussed in other sub-threads.

It's only incorrect if you impose arbitrary meanings on what "text" is, thus leading to multiple string types that require internationalisation experts to use correctly.

An example of a modern language that seems to handle this stuff fine without weird distinctions between filenames and other strings is Go (and possibly Julia, based on discussion in other subthreads) .. which incidentally was largely designed by one of the designers of UTF-8.


> Maybe I should have been clearer. I just meant "other UTFs" as in "UTFs that are not UTF-8", ie, UTFs that exist primarily for historical reasons.

That makes no difference whatsoever. All UTFs are mappings of USVs to bytes.

> It's only incorrect if you impose arbitrary meanings on what "text" is, thus leading to multiple string types that require internationalisation experts to use correctly.

The entire point of a separate string-like type is making it clear what's what up-front. It doesn't require any sort of internationalisation, it just tells you that filenames are not strings. Because they're not.

> An example of a modern language that seems to handle this stuff fine without weird distinctions between filenames and other strings is Go (and possibly Julia, based on discussion in other subthreads) .. which incidentally was largely designed by one of the designers of UTF-8.

Go just throws up its hand and goes "fuck you, strings are random aggregates of garbage and our "string functions" will at best blow up and at worst corrupt the entire thing if you use them" (because they either assert or assume that strings are UTF8-encoded).

It only "handles this stuff fine" in the sense that it does not handle it at all, will not tell you that your code is going to break, and will provide no support whatsoever when it does.


> That makes no difference whatsoever. All UTFs are mappings of USVs to bytes.

As a last resort to try and clarify what I meant in my original post (requoted below):

> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16 (which exists for historical reasons)

What I mean here is that if converting to a UTF is important, then maybe restricting strings to code points or Unicode scalar values is justified. If textual data is stored in bytes that are conventionally UTF-8, there should be no need to do any conversion to a UTF, since ultimately the only UTF that is useful should be UTF-8. All you would be doing by "converting to a UTF" is losing information.

That was my last attempt. I'm sorry if you still can't understand it.


> One issue to consider is that strings typically feature random access indexing of code points

True but I find it's much rarer to actually need random access to arbitrary code points. Most of the time I either use strings as opaque "things" or I'm iterating through characters to find something interesting (e.g. parsing) where I can build my own index, if and when necessary.

I do agree with the article that if an abstraction is very leaky it's better to be upfront about that.


I think that accessing characters by index is _probably_ a code smell in most places, especially if that string may contain arbitrary UTF-8.


Accessing string content by arbitrary indices is probably an error. Accessing string content by indices you got from previous lookups is useful for a number of situations.


If you have an index from a previous lookup, that can be a byte index.


You're right to the degree that people probably aren't using P6. :)


It's usually wrong, but you have a large number of people who don't get regular expressions. It's hard to visualize an automaton crawling over a string and handing out matches.

The simpler mental model of splitting and splicing is easily grokable, so there's a lot of utility in supporting it.


Iterating code points is OK, as long as you know that iterating code points is not the same as iterating grapheme clusters aka user perceived characters. You get away with it most of the time, but you should know you are not dealing with full Unicode and have a plan to deal with exceptions. Unicode normalization does not solve it all.

Unfortunately almost all "Absolute minimum you must know about Unicode" articles don't cover the absolute minimum you have to know about Unicode.

Arbitrary well formed UTF-8 combined with advanced string algorithms and data structures where the unit is 'char' requires more than code points.


Iterating grapheme clusters is OK, as long as you know that iterating grapheme clusters is not the same as iterating unicode scalars (code points) aka the fundamental unit of textual parsing grammars.

This is something that really bugs me about how Swift changed its mind and made the String type a Collection of Characters (i.e. grapheme clusters). Originally they recognized this issue and required you to write `str.characters` to work with the grapheme clusters as a collection (and String itself wasn't a collection at all), but then in Swift 3 (I think) they changed course and said String is a collection after all. And the problem is now people work with Characters without even thinking about it when they really should be working with unicode scalars.

In my personal experience, I only ever actually want to work with grapheme clusters when I'm doing something relating to user text editing (for example, if the user hits delete with an empty selection, I want to delete the last grapheme cluster). Most of my string manipulation wants to operate on scalars instead.


The rules for what you want to do on backspace are complex - you want to delete the grapheme cluster if it's an emoji or ideograph with variation selector, but if it's a combining mark, most of the time you want to just delete that. One place this is written down is [1].

Of course, this might sound like a nitpick but only confirms the actual point you were making, that treating text as a sequence of grapheme clusters is often but not always the right way to view the problem.

If you're talking about cursor motion when hitting an arrow key, then yeah, grapheme cluster.

[1]: https://github.com/xi-editor/xi-editor/blob/master/rust/core...


macOS and iOS delete the entire grapheme cluster on backspace, not just the combining mark (which is to say, backspace with no selection is identical to shift-left to select the previous character and then hitting backspace).


Not sure which scripts your comment was intended for, but this is not true in general. If I type anything like किमपि (“kimapi”) and hit backspace, it turns into किमप (“kimapa”). That is, the following sequence of codepoints:

    ‎0915 DEVANAGARI LETTER KA
    ‎093F DEVANAGARI VOWEL SIGN I
    ‎092E DEVANAGARI LETTER MA
    ‎092A DEVANAGARI LETTER PA
    ‎093F DEVANAGARI VOWEL SIGN I
made of three grapheme clusters (containing 2, 1, and 2 codepoints respectively), turns after a single backspace into the following sequence:

    ‎0915 DEVANAGARI LETTER KA
    ‎093F DEVANAGARI VOWEL SIGN I
    ‎092E DEVANAGARI LETTER MA
    ‎092A DEVANAGARI LETTER PA
This is what I expect/find intuitive, too, as a user. Similarly अन्यच्च is made of 3 grapheme clusters but you hit backspace 7 times to delete it (though there I'd slightly have preferred अन्यच्च→अन्यच्→अन्य→अन्→अ instead of अन्यच्च→अन्यच्→अन्यच→अन्य→अन्→अन→अ that's seen, but one can live with this).


Looks like you're right. I don't have experience with languages like this one. I was thinking more of things like é (e followed by U+301), or 🇦🇧 (which is two regional indicator symbols that don't map to any current flag), or a snippet of Z̛̺͉̤̭͈̙A̧̦͉̗̩̞͙LG͈͎͍̺̖̹̘O̵̫ which has tons of combining marks but each cluster is still deleted with a single backspace.


Interesting. The rules seem to be different on different systems. Deleting two RIS symbols (whether they map to a flag or not) seems right in any case. Some other systems (Android included) will take the accents off separately when they are decomposed (but not for precomposed accented characters). Also note macOS takes just the accent off for Arabic (tested on U+062F U+064D).


Per wikipedia, "the smallest unit of a writing system of any given language" is a grapheme. Note that this has nothing to do with Unicode. It's just the nature of human text. English folk typically use the word "character" to refer to the same concept.

Unicode models this concept with grapheme clusters. Per that model, GCs should in principle be the fundamental tokenizing unit that feeds into general purpose text parsing software.

But pragmatics may determine otherwise. Just as some tokenizing tools/functions constrain themselves to ASCII bytes, but then break when processing non-ASCII, so too other tokenizing tools/functions constrain themselves to codepoints, but then break if their input contains graphemes that are multi-codepoint graphemes, eg a huge quantity of the text written online in 2019.


The grapheme is the smallest semantic unit of human-readable text. It's not the smallest unit of textual formats, the unicode scalar is.

Code that parses text for human semantic meaning would want to use the grapheme cluster as the smallest unit, but that's a vanishingly small amount of the overall text parsing code. Any code that parses any kind of machine-readable format does not want to use grapheme clusters.

As a trivial example, if I have a line of simple CSV (simple as in no quoting or escapes), it should be obvious that the fields can contain anything except a comma. Except that's not true if you parse it using grapheme clusters, because all I have to do is start one of the fields with a combining mark, and now the CSV parser will skip over the comma and hand me back a single field containing the comma-separated data that belonged in two fields.

Or to be slightly more complex, let's say I as a user can control a single string field for a JSON blob that gets stored in a database, and you're using a JSON parser that parses using grapheme clusters. If I start my string field with a combining mark, it will serialize to JSON just fine, but when you go to retrieve it from your database later you'll discover that you can't decode the JSON, because you're not detecting the open quote surrounding my string value.
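
A Python sketch of the CSV failure, using the third-party regex module's \X (an assumption here, standing in for a grapheme-cluster-based tokenizer, since the standard library has no grapheme segmentation):

    import regex   # third-party; \X matches one default grapheme cluster

    line = "evil,\u0301payload,x"   # second field starts with a combining acute accent

    # Code-point-level split: both commas are found and the fields come back intact.
    print(line.split(","))          # three fields, the second starting with U+0301

    # Grapheme-level walk: the combining mark attaches to the preceding comma,
    # so a tokenizer iterating clusters never sees a bare "," there.
    print(regex.findall(r"\X", line))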


Thanks. I think I now understand what your point was/is.

> The grapheme is the smallest semantic unit of human-readable text.

Fwiw, quoting wikipedia: "An individual grapheme may or may not carry meaning".

> Any code that parses any kind of machine-readable format does not want to use grapheme clusters.

I agree that formats defined in terms of codepoints need to be tokenized and parsed in terms of codepoints.

And one wouldn't expect there to be (m)any formats defined in terms of GCs as the fundamental token unit, partly because of the problem of defining and implementing suitable behavior for dealing with accidentally or maliciously misplaced combining characters.


In my naive opinion, it seems like a good choice then for languages (the ones with first-class utf-8 support, anyway) to operate on scalars and leave graphemes / user-text-editing-use-cases to libraries. (This is meant as an extension to your comment, not in contradiction to it).


I'm glad that Swift has first-class support for grapheme clusters. It's just very irritating that they made it the default way to interact with strings.


Why not just use bytes? Most text parsing operations of the sort I think you’re describing can be done on UTF-8 bytes just as well as on codepoints, faster and without sacrificing correctness.


Up until Swift 5, Swift's String type was backed by UTF-16 (except for all-ASCII native strings, which just stored ASCII). Even with Swift 5, it's sometimes backed by UTF-16 (namely, when it contains an NSString bridged from Obj-C code that contains non-ASCII characters, which can happen even in pure-Swift code due to all of the String APIs that are really just wrappers around Obj-C Foundation APIs) and sometimes backed by UTF-8.

In truly performance-sensitive code with Swift 5 I will go ahead and use the UTF-8 view with the assumption that input strings are backed by UTF-8, and even force it to native UTF-8 if I'm doing enough processing that the potential string copy is outweighed by the savings during processing, but that's something that's only worth dealing with if there's a clear benefit to doing so. In most cases it's simpler just to use the unicode scalar view, as that doesn't have the potential for having to map UTF-8 sub-scalar offsets into a UTF-16 backing store (whereas unicode scalar offsets always lie on both UTF-8 and UTF-16 code unit boundaries).

All that said, I would have been much happier if Swift could have been 100% UTF-8 from the get-go, which would drastically simplify a lot of this stuff. But the requirement for bridging to/from NSString makes that untenable as it would otherwise involve a lot of string copying every time you cross the Swift/Obj-C boundary.


I found that even a suggestion to use "grapheme clusters" is misleading. People like to think that there is one kind of grapheme clusters, namely one specified in the UAX #29 [1], but that's just the default and the UAX is pretty much clear about it! Consider an `ij` digraph in Dutch [2] that should count as a single character as a motivating example. "Code points", or rather "Unicode scalar values" are formally defined and not changing; "grapheme clusters" are generally locale-dependent.

I think that we should ask what people want to do with "characters" instead:

- If you want an array of very short strings, use an array. Do not abuse strings. Recent languages don't even like it.

- If you want a string that is not too long when displayed, first and foremost try to resolve that on the actual display (say, with CSS). If that's really impossible, pick a font and actually measure. Make sure that you are using the actual size being printed---the bounding box is not linearly scaled when the font size changes.

- If that display is a terminal, you may also consider using the East Asian Width [3]. But keep in mind that the default width of "Ambiguous" characters greatly varies across locales (Asians tend to prefer dual-width ambiguous characters, for example). A rough sketch of this follows the links below.

- If you want a string that is not too long when stored, use the encoded byte size. Oh, do you want to make sure that certain languages are not disadvantaged? Do your research to determine the appropriate limit per language then, but I bet you would be much better with generous limits (cough cough Twitter cough).

- (While this is not counting characters, for completeness,) if you want to navigate a string with a cursor or a keyboard, then default grapheme clusters may actually fit the bill, as that might be the best possible! Someone will obviously complain though.

- If you have to, uh, really, really count characters, the chance is that you can assume a particular script and/or language. Reject others and use the local convention. You may want to read about the Unicode Script Property [4].

[1] https://unicode.org/reports/tr29/

[2] https://en.wikipedia.org/wiki/IJ_(digraph)

[3] https://www.unicode.org/reports/tr11/

[4] https://www.unicode.org/reports/tr24/
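
For the terminal case, a rough sketch of the East Asian Width approach using only the standard unicodedata module (it deliberately ignores combining marks, control characters, and the locale question about "Ambiguous" width noted above):

    import unicodedata

    def columns(s, ambiguous_wide=False):
        # Count "W" (wide) and "F" (fullwidth) characters as two columns each.
        wide = {"W", "F"} | ({"A"} if ambiguous_wide else set())
        return sum(2 if unicodedata.east_asian_width(ch) in wide else 1 for ch in s)

    print(columns("hello"))   # 5
    print(columns("漢字"))     # 4 -- each ideograph takes two columns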


You are correct. The lesson is that code points are a low-level representation and their semantics do not transfer across languages in Unicode. If you don't know the context you can get characters that don't render correctly, or you may split a string into parts in the wrong places.

UTF-8 strings can be treated as drop-in replacements for ASCII or Latin-1. If you want to deal with full Unicode in all cases, you need the locale.

Exercise for the reader: Try to write radix tree data and rope data structure that works with every language in all cases with Unicode.

http://cldr.unicode.org/ is your friend.


If you want to navigate a string, you should stop "inside" compatibility ligatures (fi, U+FB01 being the canonical example).


That is why default grapheme clusters only concern boundaries aligned with "starters" or base characters (and thus are preserved after canonical normalizations). U+FB01 is only compatibly decomposable and harder to deal with efficiently.


Do you know any "absolute minimum you must know about Unicode" articles that do go into enough depth?


One key thing to know is that encoding (UTF-8, UTF-16, UTF-32) is a completely separate problem from rendering text. I have had a couple people say to me recently something along the lines of, "We don't need text shaping since UTF-8 takes care of it." That isn't remotely true. An encoding gets you a series of Unicode code points. To render these, the code points must have the bidirectional algorithm (bidi) applied, and the "runs" from the bidi algorithm are then shaped. The text shaper uses OpenType tables within the font to convert these code points into a series of glyph indices with x/y offsets. The renderer then works entirely on glyphs, which might not even map back to a code point in the font.

The HarfBuzz manual touches on some of this: https://harfbuzz.github.io/why-do-i-need-a-shaping-engine.ht...


Unfortunately I don't. I started to learn Unicode, then realized how complicated it is to do right and stopped because I realized that nobody really cares if it works almost all the time.

As Joel below demonstrates, you can get away with 29 languages by treating code points as characters and without knowing about grapheme clusters and other stuff.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...

>When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That’s the way all 29 language versions of Joel on Software are encoded and I have not yet heard a single person who has had any trouble viewing them.


Not really relevant. That just demonstrates that displaying those languages works adequately; it doesn't show anything about other processing that your software might care about (e.g. sorting, searching, case conversion, keyboard input, selection and editing, etc.)


> As Joel below demonstrates, you can get away with 29 languages by treating code points as characters and without knowing about grapheme clusters and other stuff.

If you treat text as completely opaque it does work fine. Issues crop up when you want or need to manipulate said text, either to extract information or to modify it.


The article does not mention Python, other than to reference CPython's "Flexible String Representation". However, it's interesting that alternative Python implementations have decided against that model and indeed use UTF-8 strings internally.

MicroPython saves memory by simply making indexing into its strings O(n) [1], while PyPy's UTF-8 strings have "an optional extra index data structure to make indexing O(1)" [2].

For compatibility, of course, Python implementations have to provide indexing of code points - it would be interesting to examine the pros & cons of the different string representations. I wonder if new high-level languages would be better off using one of these representations, or taking the Go/Julia approach of only indexing bytes.

[1] https://github.com/micropython/micropython/blob/a4f1d82757b8...

[2] https://twitter.com/pypyproject/status/1095971192513708032
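
A toy sketch of what such an "extra index" can look like over UTF-8 bytes (hypothetical code, not MicroPython's or PyPy's actual implementation): record the byte offset of every k-th code point, then a lookup scans at most k-1 code points forward from the nearest recorded mark.

    class IndexedUTF8:
        def __init__(self, s, k=64):
            self.data = s.encode("utf-8")
            self.k = k
            # A byte starts a code point exactly when it is not a 10xxxxxx continuation byte.
            starts = (off for off, b in enumerate(self.data) if b & 0xC0 != 0x80)
            self.marks = [off for i, off in enumerate(starts) if i % k == 0]

        def __getitem__(self, i):
            off = self.marks[i // self.k]
            for _ in range(i % self.k):          # walk at most k-1 code points forward
                off += 1
                while off < len(self.data) and self.data[off] & 0xC0 == 0x80:
                    off += 1
            end = off + 1
            while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
                end += 1
            return self.data[off:end].decode("utf-8")

    s = IndexedUTF8("naïve café", k=4)
    print(s[2], s[9])   # ï é

The memory cost is one stored offset per k code points; a lookup touches at most k-1 code points instead of scanning from the start of the string.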


"इंडेक्स" का क्या अर्थ है? [1]

Including the quote marks, spaces, and question mark, that's 18 characters. This isn't just about text editing, far from it. For a lot of string processing, indexing into the underlying codepoints is even less interesting than indexing into the underlying bytes.

[1] https://translate.google.com/#view=home&op=translate&sl=hi&t...


I am not a linguist, but as a native speaker, shouldn't they be considered 15 characters? क्स, क्या and र्थ each form individual conjunct consonants. Counting them as two would then beget the question as to why डे is not considered two characters too, seeing as it is formed by combining ड and ए, much like क्स is formed by combining क् and स.


If you say they should be considered 15 characters then software and devs should support optionally indexing and counting them as 15 characters. This is the most important point.

And, as a corollary, software devs should aspire to have and know about string functions in software that recognize that the text string I used is 15 characters long in contexts where that's the right way to view it. Furthermore, those functions should asap be as easily available for use as they are today for recognizing that the text 'What does "index" mean?' is 23 characters long.

This notion of software and devs properly indexing and counting characters was the ultimate point of my comment, as I will elaborate below. I hope that you will reply to confirm you understand the gist of what follows; that would make my day and leave this exchange on HN to hopefully shine light where it's sorely needed. :)

----

The OP title is "UTF-8 String Indexing Strategies". I could write that this begs the question What does "index" mean? Unfortunately it seems it still doesn't beg the question -- in 2019 -- for most western devs.

Last century devs generally assumed the index unit was bytes. So they created programming languages whose string type assumed indexing in bytes and functions and libraries that did the same. Nowadays they're starting to assume "codepoints", which is an equally broken assumption. (Codepoints are a Unicode notion and they're great for what they're great for. But being "characters" is, in the general case, something they're terrible for.)

Both these western devs and the OP are effectively ignoring the possibility that "इंडेक्स" का क्या अर्थ है? could be considered to be 15 characters (or 18). They're ignoring you, the half of the planet that are in a similar boat, and the whole of the planet that's coming together, sharing text like we are here.

----

bakery2k demonstrated the problem. They wrote:

> MicroPython ... indexing ... O(n) ... PyPy's ... O(1)

Neither of these deals with indexing characters, as one might expect based on an ordinary human's understanding of the word "characters". Instead they're myopically focused on indexing bytes and codepoints.

This goes hand-in-hand with Python's length function returning 26 for the text "इंडेक्स" का क्या अर्थ है?. It's counting codepoints, not characters, which is close to useless for that text.[1]

But you wouldn't have any clue about that from bakery2k's comment and it looks like bakery2k has no awareness of this:

> I wonder if new high-level languages would be better off using one of these [byte and codepoint] representations, or taking the Go/Julia approach of only indexing bytes.

Imo that's shockingly retrogressive given the lack of discussion of characters.

----

Chances are good that if you try to select the text I wrote one character at a time you will find that you can cursor across 18 units.

Why/how does software do this? It relies on part of the Unicode standard for indexing that builds on the concept of "what a user thinks of as a character".[2]

This mechanism allows the string to be indexed/counted as N characters, where N varies according to the definition of "character". Software is supposed to choose the definition with appropriate adherence to the Unicode standard, which includes customizing it as necessary to produce practical results. And, as I noted, most good modern software dealing with cursoring/editing text gets it right per the Unicode standard.

My guess is that the Unicode standard by default has software consider क्स to be 2 characters because the consonant is composed of क् and स placed visually side by side, whereas it has डे considered 1 character because it's composed of ड and ए somehow overlapping visually. (That's a pure guess. Please let me know if it sounds crazy. :))

For some other use cases, like a native speaker just reading text abstractly, what's a character changes. You say the text I wrote is 15 characters; therefore software should be able to index and count it as 15 characters.

I hope that all makes sense. Thank you for your comment, reading my reply, and TIA for any reply. :)

[1] https://tio.run/##K6gsycjPM/7/v6AoM69EIyc1T0Nd6cGS9gdLmh4sWf...

[2] https://unicode.org/glossary/#grapheme


Sorry for the late reply, I don't use HN much. No idea if you'll actually notice this, does HN even have a "Reply Notification" feature?

Regarding what you wrote, I agree pretty much. As I said, I am not an expert in this field, so I am not aware of the most cutting edge stuff out there. But even the few languages I know and have seen are so different from each other (some more than others) that it seems unlikely that a single "theory of everything" would suffice for text, especially in the way we process text presently.

Perhaps there is some way to abstract out the differences, but I don't really see how. After all, characters are where the differences only begin. Start thinking about words or sentences and no single route seems viable for the way we do string processing today.

You probably expected a more substantial comment, but I don't really know enough of this field to make one.

Regarding क्स and डे, the difference between them is that the former is a combination of two consonants (pronounced "ks") while the latter is formed by a consonant and a vowel ("de"). However, looking at the visual representation is wrong, since डा (consonant+vowel) would also look like two characters. If you copy these into a text field and try to erase them through backspace or delete, you should see how it all works (assuming the text field functions correctly).

But again, these confusions only exist because Devanagari allows simple characters to form compound characters. That is obviously completely different than how Roman script works, which is probably completely different than various pictographic languages. So, how to reconcile the differences (except by hiring native speakers of every language out there)? I wish I knew, but currently I don't.


It's sad and odd that Rust and (probably especially) Swift are missing from the article.


Why? Are there interesting technical differences in the way those languages do things compared to the other examples given?

The author obviously can't cover all languages and strategies in a short article can they?


Yes: Swift groups by grapheme clusters, and Rust makes it difficult to do byte indexing.


> Rust makes it difficult to do byte indexing.

Not sure how. If you want to get a specific byte, just convert to a bytes slice (that's free) and index that. And you can slice strings (using byte-indexed indices), but your boundaries have to fall on codepoint boundaries. The only thing that's difficult is getting a codepoint at a specific index (byte or otherwise).


> just convert to a bytes slice (that's free) and index that

Byte indexing of strings.

If you explicitly convert your string to bytes, yeah then naturally it's easy to byte index.


> Not sure how. If you want to get a specific byte, just convert to a slice (that's free) and index that.

But then it is not automatic to cast that slice as a string.


If you want a single byte and `s` is a `&str`, then `s.as_bytes()[i]` returns a `u8` in `s` at index `i`. If the index `i` is out of bounds, then it panics, but no other UTF-8 checking is performed.

You do not need to do this if you're slicing. For example, if you know that `i..j` indexes a valid UTF-8 subslice of `s`, then `&s[i..j]` returns a subslice of `s` with type `&str`.

The only reason to subslice `s.as_bytes()` is if you want the raw bytes which may or may not be valid UTF-8. And in this case, it is a good thing that it is not automatic to convert that back to a `&str` since it may not be valid UTF-8.


> it is a good thing that it is not automatic to convert that back to a `&str` since it may not be valid UTF-8.

My comment was unclear in meaning, but the aim was to point out exactly this.


”So, Emacs pretends it has constant time access into its UTF-8 text data, but it’s only faking it with some simple optimizations. This usually works out just fine.”

Usually, except when you’re writing in, and searching for Chinese, Greek, Hindi, Korean, Russian, Turkish, etc, text, like 50+% of the world’s population? It seems Emacs is made for programmers, who predominantly type and search for ascii text.


according to the article, it is independent of the language, and depends only on the number of strings you are iterating across simultaneously.


That sounds about right. You don't pick up vim for a shopping list. You pick it up because you're a programmer.


Some people use Emacs just for Org. That's a lot closer to a shopping list than to programming. And programmers sometimes write text in natural language.


I would guess that the parent comment's point is still true: Emacs (and Vim) are far more commonly used for programming and other work, probably ASCII heavy, than for natural language text editing.

I'd be willing to bet that for both Emacs and Vim, 90%+ characters by volume are ASCII. I wouldn't make a similar bet for Microsoft Word.


Which doesn't mean you can't keep your shopping list in it.

Or be using it to build a shopping list tool - including test data.


Thanks for the article. I’m curious what the author thinks of string indexing in Rust. It’s also explicit, so I guess you would like it as well.


AFAIK about Julia, Rust and Julia handle strings similarly, i.e., strings are represented as UTF-8 internally, and they are required to be valid UTF-8.

I've also built strings in Rust that are only conventionally UTF-8, similar to Go. It's still an experiment though: https://docs.rs/bstr --- It turns out that conventionally UTF-8 strings can be quite useful in a lot of cases, since the real world often provides data without any guaranteed encoding (e.g., the contents of files).


Julia strings are not required to be valid Unicode since 1.0; they can hold arbitrary data. Moreover you can round trip arbitrary data from a file through strings, through chars, then back to disk and you will get an identical file regardless of its content. The principle is this: a program should never error because of broken data, only because of programmer error.


Oh interesting! TIL. I think that probably means that this is UB then: https://github.com/JuliaLang/julia/blob/d8ff21c69c118e8801e8... --- You can't enable NO_UTF_CHECK in PCRE if you're going to pass data that isn't valid UTF-8.

N.B. As of PCRE 10.33, you can enable the PCRE2_JIT_INVALID_UTF option for JIT matching instead.

It looks like this is also coming to the standard interpreter as well: https://lists.exim.org/lurker/message/20190524.173112.0d226a...


Yeah the PCRE situation is a bit unfortunate. Avoiding crashes on invalid data would be the minimum and hopefully PCRE does that officially soon. To really make things work well, we would have to patch PCRE to handle what Julia considers invalid characters to be, which is doable, but it may be better to just reimplement regex functionality in Julia, which is non-trivial, but that way we naturally get correct treatment of invalid data, JIT, and support for other string types.


> Yeah the PCRE situation is a bit unfortunate.

Indeed. In all likelihood, it's a CVE waiting to happen.

> we would have to patch PCRE to handle what Julia considers invalid characters to be

Sorry, did you see my links in the previous comment? This is already available in the JIT engine for PCRE 10.33, and appears to be making its way into the standard interpreter as well. So long as both Julia and PCRE implement UTF-8 correctly, both should be on the same page with respect to invalid UTF-8 byte sequences.

> but it may be better to just reimplement regex functionality in Julia, which is non-trivial, but that way we naturally get correct treatment of invalid data, JIT, and support for other string types.

Yup, this is what I did for Rust, which can work on both completely valid UTF-8 and arbitrary byte sequences. But it is a ton of work. I'd get as much mileage out of PCRE2 as I could.


From my personal experience I think Rust's string system is hard to beat at the moment. It's pretty darn good from a usability point of view and it also found a nice solution to work with UCS-2 Windows APIs by providing an OsStr type.


I’m glad that Rust strings aren’t indexable by integer, but I think that making them indexable by range (of UTF-8 code unit offsets) was an error. `foo[0..10]` should have been `foo.slice(0..10)` or similar instead.


It’s a bit of a footgun indeed but it’s quite handy in combination with the char index iterator.


Sure, you do want to be able to index by code unit range, but it shouldn’t have been with the Index trait.


There's no reason this couldn't be added now, though, right?


The only annoyance is the occasional unwrap, when something is provably impossible, but the type system can't detect it.


That’s what `unwrap` is for, though.


Could you store the string in multibyte form, and then keep a skip list (or other data structure) to get indexing in O(log n)?


The xi rope science blog post series is, I think, the definitive answer to UTF-8 "indexing".


Isn't it just best to do NFKC (or similar) normalization on input for indexing?


It's a good article, but it would be nice if he'd also covered the Python 3 C API anti-pattern: forcing strings to be utf-8. This means that you have latent bugs in your code. Notably when trying to treat all filenames as strings, sooner or later your code will explode when it meets someone's filesystem which has ancient Latin1 filenames. Also when dealing with unfiltered user input.

(And to head off replies - yes I understand you can "just do X" where "X" is some complicated thing to avoid the bug if you remember to do "X" beforehand)


I don't really consider that an anti-pattern. Sure, you can get away with just blitting them to the terminal and hoping that they display properly, but sooner or later you're going to have to decode such byte strings anyway.

The real anti-pattern is conflating byte strings and character strings in the first place. We got away with it for decades, but in a UTF-8 world it just isn't possible.


I don't see why not? Go has a string type that contains arbitrary bytes, interpreted as UTF-8. This seems to work as well as anything else. If there are non-codepoints and it matters, you just have to deal with it (for example by printing an escape sequence or Unicode replacement character).

https://blog.golang.org/strings


Because arbitrary bytes cannot be interpreted as UTF-8. I guess this kind of thing is tolerated by Go users because anyone who values a proper type system uses a language with generics.


How do you fix a file that has errors in it if the standard library of the language you're using won't even let you read it?


If you're fixing bytes then you load bytes and fix them.

You won't, though, fix bytes by loading characters and then trying … to fix the bytes … the characters encode to. Just doesn't make sense.

We were able to get away with stuff for a long time because bytes were characters and characters were bytes and we could think sloppily and not break anything. But with Unicode they really are different things, and we need to be tidier in our thinking.


Seems like you're just reasserting it doesn't make sense, without giving a reason. But it does make sense in Go.


> But it does make sense in Go.

No, Go doesn't work that way. You asked, 'How do you fix a file that has errors in it if the standard library of the language you're using won't even let you read it?' In Go, you don't read files as strings, but rather as bytes (proof: https://golang.org/pkg/os/#Open, which returns a File which implements Read: https://golang.org/pkg/os/#File.Read).

You would do the same thing in Python: open the file in binary mode, and then iterate over the bytes it yields.

Now, the one thing that would be annoying in Go is fixing a broken filename. I'd have to think a bit to figure that out.


You can cast between byte arrays and strings in Go. The difference is that strings are immutable (so it does a copy).


> You can cast between byte arrays and strings in Go.

Yes, you can. But, in the specific case you mentioned, no competent programmer would cast the bytes of an invalidly-encoded file to a string, then iterate through the runes of the string. That wouldn't even begin to make sense!

I really don't understand what you're trying to argue here.


Although it only works for smallish files, that seems fairly useful for getting as much info as you can out of a corrupt but mostly UTF-8 file?

Any runes that aren't valid will come back as the replacement character. And you can count newlines and print the location of the error(s). You also have the index of the error.
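
The same salvage operation in Python terms (a sketch only; the invalid bytes themselves are not preserved this way):

    data = b"line 1\nbad byte \xff here\nline 3\n"

    text = data.decode("utf-8", errors="replace")
    print(text.count("\ufffd"))     # 1 -- each invalid sequence became U+FFFD
    print(text.splitlines()[1])     # bad byte <U+FFFD> here -- line positions survive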


The problem is not with forcing strings to be utf-8, the problem is treating filenames as strings.

Filenames are opaque blobs that can be lossily converted to strings for display if you know or can guess at the encoding.


Opaque, except for '\0', '/', and (to some extent) '.'.


Even those details are platform-specific though. If you want to be truly portable, you can't even assume that paths are byte arrays.

On windows, the path separator is '\' and paths are arrays of 16-bit integers.


Windows is tricky. You can't have certain names like "con" (or "con.txt", "con.png", etc) and some symbols aren't allowed either, like *, ?, etc. Also names can't end with a dot.

Other than some explicit exclusions, any wchar is valid whether or not it's valid unicode. After all, NTFS and Windows date back to the times of UCS-2, when 16 bits was enough for any character™.

EDIT: Though I should hasten to add that it's a very strong convention that all paths be UTF-16 encoded. So much so that many official docs assert this to be true even though it technically isn't.


NTFS doesn't care if you have a file called "con", e.g. in PowerShell you can do:

    New-Item -ItemType File -Path "\\?\d:\con"
and get "D:\con", where you can't create it directly as "D:\con". It's the Win32 API which intercepts "con" for backwards compatibility, because it was a meaningful name in MS-DOS. But it's fine as a filesystem path.

There's other fun Windows/NTFS Path things here: https://news.ycombinator.com/item?id=17307023 and Google Project Zero's deep dive into Win32 and NT path handling: https://googleprojectzero.blogspot.com/2016/02/the-definitiv...


> So much so that many official docs assert this to be true even though it technically isn't.

Do you have any links for that? I've been working with winapi recently and have had a hell of a time getting some clear concrete statements about exactly what encoding (if any) is used in file paths.


https://docs.microsoft.com/en-us/windows/desktop/FileIO/nami...

> the file system treats path and file names as an opaque sequence of WCHARs.

In essence I think you should use UTF-16 encoded strings when creating file paths. However, when reading them you can't assume any encoding (aside from the special characters mentioned in that article). For accessing the filesystem, just treat paths as an opaque blob of data. When displaying a name to the user, assume UTF-16 encoding but handle any decoding errors (e.g. by using replacement characters where necessary).


Oh, I meant, did you have any links from official docs that said UTF-16 was used?

Your advice is fine, but when the rest of the world is UTF-8 (including the regex engine), things become quite a bit trickier!


Oh I see. UTF-16 is the preferred encoding for all new applications: https://docs.microsoft.com/en-us/windows/desktop/intl/unicod...

Basically, in Windows land, unicode means UTF-16 unless code pages are mentioned https://docs.microsoft.com/en-us/windows/desktop/intl/code-p...


On Windows the path separator is U+005c, it's only a backslash in most codepages, but not all: https://devblogs.microsoft.com/oldnewthing/20051014-20/?p=33... which links to a dead link; copy here http://archives.miloush.net/michkap/archive/2005/09/17/46994...

That doesn't change just because Unicode renders individual codepages obsolete, it's now special-cased into Windows that Japanese and Korean situations display U+005c as a currency symbol instead of a backslash.

There's also [System.IO.Path]::AltDirectorySeparatorChar which is `/` because Windows is often fine with / as a path separator as well.


When you say 'explode' what do you mean? I can see rendering being a problem, but then if someone decided to use cuneiform in their file names I'd guess many people would have problems rendering that. Surely as long as there's internal consistency???

Now I could see a mechanism where your UTF-8 code could 'explode' Latin-1-only software.


> When you say 'explode' what do you mean?

You will get runtime errors or data loss/corruption.

> then if someone decided to use cuneiform in their file names I'd guess many people would have problems rendering that. Surely as long as there's internal consistency???

That's not the issue. The issue is that neither UNIX nor Windows file names are guaranteed to be valid unicode (I think OSX's are):

* UNIX filenames are semi-arbitrary bags of bytes, there is no guarantee whatsoever those bags will be utf8-compatible in any way.

* Windows file names are semi-arbitrary bags of UTF16 code units, meaning they can contain unpaired surrogates, meaning they can't be decoded to unicode and thus can't be transcoded to UTF8.

Which means the conversion to unicode will either be partial (it will error out) or they will be lossy (the data will not round-trip).

Either way, it'll cause intractable issues for the developer who will either have filesystem APIs blowing up in their face with little to no way of handling it, or the data they return will not necessarily be usable down the line.


On the topic of unpaired surrogates, that a problem WTF-8 (https://simonsapin.github.io/wtf-8/) is intended to help solve.

The spec was created for Servo/Rust, but it's a sane general internal representation that should let people interact with platform APIs in a lossless manner.


> On the topic of unpaired surrogates, that a problem WTF-8 (https://simonsapin.github.io/wtf-8/) is intended to help solve.

Yes. And it does so just fine. But you probably don't want your core string type to be that, so it's used as part of the "third way" where filenames are not strings, so that Windows filenames are relatively cheaply convertible to strings: by transcoding to wtf8 upfront, converting from filenames to strings is just UTF8 validation; and converting from UTF8 to filename is free. And likewise for "byte array" unix filenames.


"UNIX filenames are semi-arbitrary bags of bytes"

That's what I'm thinking (most experience is with Linux). So it isn't as if π is represented internally as 'π'; it is just a bag of bytes, so trying to make more sense of it than that is in a sense wrong.

Edit: I guess I'm assuming the seemingly obvious step of checking for valid input? I mean if you get a bag of bytes and start trying to do utf-8 things on it, without checking for errors.... Is that what we're talking about here?
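
For what it's worth, a sketch of that check-and-round-trip step in Python (assuming a typical Linux setup, where the filesystem encoding is UTF-8 with the surrogateescape error handler):

    import os

    raw = b"caf\xe9.txt"                     # a filename that is not valid UTF-8

    try:
        raw.decode("utf-8")                  # explicit check: is it really UTF-8?
    except UnicodeDecodeError:
        print("not UTF-8, keep it as bytes")

    name = os.fsdecode(raw)                  # or: decode so the bad byte stays recoverable
    print(os.fsencode(name) == raw)          # True -- safe to hand back to the OS unchanged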


> yes I understand you can "just do X" where "X" is some complicated thing to avoid the bug if you remember to do "X" beforehand

Either you get these bugs when working with unicode, or you get them when working with strings that are not unicode-compatible. It is an implementation trade-off, and unlike you and the author, I prefer the python way.


It's like how you can't use perl6 syntax for bencoded data (a length-prefixed/T(L)V format used in .torrent files), because the format contains some raw binary unsigned integers that make it invalid utf-8 in practice.


Are you sure UTF8-C8[1] doesn't cover this use case?

[1] https://docs.perl6.org/language/unicode#UTF8-C8





