The article does not mention Python, other than to reference CPython's "Flexible String Representation". However, it's interesting that alternative Python implementations have decided against that model and indeed use UTF-8 strings internally.
MicroPython saves memory by simply making indexing into its strings O(n) [1], while PyPy's UTF-8 strings have "an optional extra index data structure to make indexing O(1)" [2].
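Conceptually, the MicroPython approach amounts to something like this minimal Python sketch (my illustration, not its actual code; codepoint_at is a made-up helper):

    # A minimal sketch of O(n) indexing over UTF-8 bytes.
    def codepoint_at(data: bytes, index: int) -> str:
        """Return the index-th code point of UTF-8 `data` by scanning from byte 0."""
        pos = 0
        for _ in range(index):
            pos += 1
            # Skip continuation bytes (0b10xxxxxx) to reach the next lead byte.
            while pos < len(data) and data[pos] & 0xC0 == 0x80:
                pos += 1
        end = pos + 1
        while end < len(data) and data[end] & 0xC0 == 0x80:
            end += 1
        return data[pos:end].decode("utf-8")

    print(codepoint_at("इंडेक्स".encode("utf-8"), 2))  # 'ड', found by walking 6 bytes

PyPy's optional index structure can be thought of as caching the byte offset of every k-th code point, so a lookup scans at most k code points instead of starting from byte 0 every time.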
For compatibility, of course, Python implementations have to provide indexing of code points - it would be interesting to examine the pros & cons of the different string representations. I wonder if new high-level languages would be better off using one of these representations, or taking the Go/Julia approach of only indexing bytes.
Including the quote marks, spaces, and question mark, that's 18 characters. This isn't just about text editing -- far from it. For a lot of string processing, indexing into the underlying codepoints is even less interesting than indexing into the underlying bytes.
I am not a linguist, but as a native speaker, shouldn't they be considered 15 characters? क्स, क्या and र्थ each form individual conjunct consonants. Counting them as two would then raise the question of why डे is not considered two characters too, seeing as it is formed by combining ड and ए, much like क्स is formed by combining क् and स.
If you say they should be considered 15 characters, then software and devs should support optionally indexing and counting them as 15 characters. This is the most important point.
And, as a corollary, devs should aspire to have, and know about, string functions that recognize that the text string I used is 15 characters long in contexts where that's the right way to view it. Furthermore, those functions should become as easily available as the ones that today recognize that the text 'What does "index" mean?' is 23 characters long.
This notion of software and devs properly indexing and counting characters was the ultimate point of my comment, as I will elaborate below. I hope that you will reply to confirm you understand the gist of what follows; that would make my day and let this exchange on HN shine light where it's sorely needed. :)
----
The OP title is "UTF-8 String Indexing Strategies". I could write that this begs the question: What does "index" mean? Unfortunately, it seems it still doesn't beg that question -- in 2019 -- for most western devs.
Last century, devs generally assumed the index unit was bytes. So they created programming languages whose string types assumed indexing in bytes, plus functions and libraries that did the same. Nowadays they're starting to assume "codepoints", which is an equally broken assumption. (Codepoints are a Unicode notion, and they're great for what they're great for. But being "characters" is, in the general case, something they're terrible for.)
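Here's a small Python illustration of why both assumptions are broken (my own example, not from the OP):

    # The same text has a different "length" depending on what you index.
    s = "café"
    print(len(s.encode("utf-8")))   # 5 -- bytes ('é' is 2 bytes in UTF-8)
    print(len(s))                   # 4 -- code points
    s2 = "cafe\u0301"               # 'e' + COMBINING ACUTE ACCENT
    print(len(s2))                  # 5 code points, yet 4 user-perceived characters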
Both these western devs and the OP are effectively ignoring the possibility that "इंडेक्स" का क्या अर्थ है? could be considered 15 characters (or 18). They're ignoring you, the half of the planet that's in a similar boat, and the whole of the planet that's coming together, sharing text like we are here.
Neither byte indexing nor codepoint indexing deals with characters as one might expect based on an ordinary human's understanding of the word "characters". Instead, devs remain myopically focused on those two units.
This goes hand-in-hand with Python's len function returning 26 for the text "इंडेक्स" का क्या अर्थ है?. It's counting codepoints, not characters, which is close to useless for that text.[1]
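For the curious, here's the counting in Python. Grapheme counting isn't in the stdlib; this sketch assumes the third-party regex package, whose \X pattern matches an extended grapheme cluster:

    import regex  # third-party: pip install regex

    s = '"इंडेक्स" का क्या अर्थ है?'
    print(len(s.encode("utf-8")))        # 64 -- bytes
    print(len(s))                        # 26 -- code points, what len() counts
    print(len(regex.findall(r"\X", s)))  # grapheme clusters: 18 under the rules
                                         # current in 2019 (Unicode 15.1 later
                                         # made conjuncts like क्स single
                                         # clusters, which would give 15)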
But you wouldn't have any clue about that from bakery2k's comment, and it looks like bakery2k has no awareness of this:
> I wonder if new high-level languages would be better off using one of these [byte and codepoint] representations, or taking the Go/Julia approach of only indexing bytes.
Imo that's shockingly retrogressive given the lack of discussion of characters.
----
Chances are good that, if you try to select the text I wrote one character at a time, you will find you can cursor across 18 units.
Why/how does software do this? It relies on the part of the Unicode standard -- grapheme cluster segmentation, UAX #29 -- that builds on the concept of "what a user thinks of as a character".[2]
This mechanism allows the string to be indexed/counted as N characters, where N varies according to the definition of "character". Software is supposed to choose the definition with appropriate adherence to the Unicode standard, which includes customizing it as necessary to produce practical results. And, as I noted, most good modern software dealing with cursoring/editing text gets it right per the Unicode standard.
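You can watch the default segmentation do its thing (same third-party regex package as above; output shown for the rules current in 2019):

    import regex

    for word in ("क्स", "डे"):
        print([list(g) for g in regex.findall(r"\X", word)])
    # क्स -> [['क', '्'], ['स']]   two clusters: KA + VIRAMA, then SA
    # डे  -> [['ड', 'े']]          one cluster: DDA + VOWEL SIGN E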
My guess is that the Unicode standard by default has software consider क्स to be 2 characters because the conjunct is composed of क् and स placed visually side by side, whereas it considers डे to be 1 character because it's composed of ड and ए somehow overlapping visually. (That's a pure guess. Please let me know if it sounds crazy. :))
For some other use cases, like a native speaker just reading text abstractly, what counts as a character changes. You say the text I wrote is 15 characters; therefore software should be able to index and count it as 15 characters.
I hope that all makes sense. Thank you for your comment, reading my reply, and TIA for any reply. :)
Sorry for the late reply; I don't use HN much. No idea if you'll actually notice this -- does HN even have a "reply notification" feature?
Regarding what you wrote, I pretty much agree. As I said, I am not an expert in this field, so I am not aware of the most cutting-edge stuff out there. But even the few languages I know and have seen are so different from each other (some more than others) that it seems unlikely that a single "theory of everything" would suffice for text, especially in the way we process text presently.
Perhaps there is some way to abstract out the differences, but I don't really see how. After all, characters are only where the differences begin. Start thinking about words or sentences, and no single route seems viable for the way we do string processing today.
You probably expected a more substantial comment, but I don't really know enough of this field to make one.
Regarding क्स and डे, the difference between them is that the former is a combination of two consonants (pronounced "ks") while the latter is formed by a consonant and a vowel ("de"). However, looking at the visual representation is wrong, since डा (consonant+vowel) would also look like two characters. If you copy these into a text field and try to erase them through backspace or delete, you should see how it all works (assuming the text field functions correctly).
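If it helps, a stdlib-only way to see the structure you're describing is to ask Python's unicodedata for the code point names (names abbreviated in the comments; they all begin DEVANAGARI):

    import unicodedata

    for text in ("क्स", "डे", "डा"):
        print(text, [unicodedata.name(c) for c in text])
    # क्स -> LETTER KA, SIGN VIRAMA, LETTER SA  (consonant + virama + consonant)
    # डे  -> LETTER DDA, VOWEL SIGN E           (consonant + dependent vowel sign)
    # डा  -> LETTER DDA, VOWEL SIGN AA          (vowel sign rendered side-by-side)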
But again, these confusions only exist because Devanagari allows simple characters to form compound characters. That is obviously completely different from how the Roman script works, which is probably completely different from various pictographic scripts. So, how to reconcile the differences (except by hiring native speakers of every language out there)? I wish I knew, but currently I don't.
[1] https://github.com/micropython/micropython/blob/a4f1d82757b8...
[2] https://twitter.com/pypyproject/status/1095971192513708032