Is there any real case where code point indexing is useful? It seems like all these attempts to restrict strings in order to accommodate code points are just introducing complexity with no gain.

UTF-8 was designed to be an encoding (of code points) on top of the "bytes" abstraction just as Unicode is designed to be an encoding (of human text) on top of the "code points" abstraction. I think it should be uncontroversial that there are very good reasons to at least handle arbitrary sequences of code points (eg, you want to be able to handle input from future versions of Unicode, and you don't know about the grapheme clustering of those code points), but I don't see a good reason not to handle arbitrary sequences of bytes.

The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16 (which exists for historical reasons), but this just seems like a matter of when the information is lost (is it during conversion from string to UTF-16, or from bytes to string), not if it is lost.

As far as I can tell with the Python story, for example, people decided to add special "Unicode" strings into Python 2, then presumably some code used the "Unicode" strings and some code used the "byte" strings, so this situation is obviously undesirable... then in Python 3 they tried fixing it by replacing which sort of string was the default. Why would it not have been better to just improve Unicode support for the existing strings instead of splitting the type into two and forcing everyone to decide whether their strings are for "bytes" or for "Unicode"?




> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16 (which exists for historical reasons), but this just seems like a matter of when the information is lost (is it during conversion from string to UTF-16, or from bytes to string), not if it is lost.

And you can even avoid losing that information at all with encodings like WTF-8: https://simonsapin.github.io/wtf-8/


Programmers are rarely interested in individual bytes, except sometimes if a string is abused as a byte array. In all other cases iterating or indexing over characters is the intention, and code points are the proper abstraction for characters, not bytes.

Also I might be wrong, but you can just look at the bytes to know how many bytes a UTF-8 character takes, since a first byte with a value from 0 to 127 represents a complete character on its own, while 128 to 255 indicates that the byte is part of a multi-byte sequence.


Except that code points aren't the proper abstraction for "characters". Most people would think of <Family: Woman, Woman, Girl, Boy>[1] as one character, but it's really five code points: woman, zero width joiner, woman, zero width joiner, girl, zero width joiner, boy. If you tried doing an operation like reversing a string or removing the last character, and you treated a Unicode code point as a "character", you would end up with the wrong result. If you just removed the last code point to implement a backspace, you would end up with a string which ends in a zero width joiner, which makes little sense; and when the user wants to insert, say, a girl emoji, that emoji will end up as part of the family due to that trailing joiner, when the user expected it to be a separate emoji.
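
To make that concrete, here's a rough sketch in Python (which indexes strings by code point; the emoji is written with escapes since HN strips it):

    family = "\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
    s = "test" + family
    print(len(s))      # 11 code points, though a user sees 5 "characters"
    print(s[::-1])     # reversing by code point scrambles the ZWJ sequence
    print(s[:-1])      # "backspacing" one code point leaves a trailing ZWJ
    print(s[:-1] + "\U0001F467")  # an appended girl emoji now joins the family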

This applies to more than just emoji, by the way; there are languages whose Unicode representation is much more complicated than English or other languages written with Latin characters.

[1]: https://emojipedia.org/family-woman-woman-girl-boy/

EDIT: This comment originally used the actual emojis as examples, but hacker news just replaced every code point in the emoji with a space.


You don't even need emoji; eg U+63,U+300 (c̀) is one character but two code points, and U+1C4 (DŽ) is two characters, but one code point. There's also U+1F3,U+308 (dz̈), which is two characters in two code points, but segments incorrectly if you split on code points instead of characters.
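
A quick way to check those in Python (len and indexing count code points):

    s = "\u01F3\u0308"          # dz̈: two characters encoded as two code points
    print(s[0], ascii(s[1]))    # the code point split gives 'dz' plus a bare U+0308, not d + z̈
    print(len("\u0063\u0300"))  # 2 code points for the single character c̀
    print(len("\u01C4"))        # 1 code point for the two characters DŽ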

It's ambiguous how to encode latin-small-a-with-ring-above (U+E5 vs U+61,U+30A). Decoding is also ambiguous (most infamously Han grapheme clusters), but I'm not fluent enough in any of the affected languages to have a ready example.

Also, that's seven code points, not five.


Your example is a little confusing because of the use of the word "characters". I think glyphs would be more clear (U+1C4 is two glyphs). Though it might not actually be 2 glyphs; it's dependent on how the font implements it.

At the end of the day, an OpenType font through substitutions can do far more crazy things than these "double glyph" examples. I once made a font that took a 3 letter acronym, substituted this for a hidden empty glyph in the font, then substituted this single hidden empty glyph into 20 glyphs. You were left with something like your U+1C4 in software where you could only highlight all 20 glyphs or none of them. And this was happening on text where all input code points were under 127. People often don't realize how much logic and complexity can be put into a font or how much the font is responsible for doing.


"Squiggles" is my preferred word here. There are a bunch of technical terms like Glyph and Codepoint and Grapheme, but I find squiggles are often what somebody wanted when they used something that works on "characters" and are disappointed with the results.

Advice elsewhere is correct. You almost certainly don't want anything in your high level language other than strings. No "array of characters", no slicing; all that stuff is probably a smell. They're like the API in many crypto libraries for doing ECB symmetric encryption: 99% of people using it would be better off if the Stack Overflow answer they're about to read explained what they should be doing instead.


No, "characters" is the correct term; U+1C4 is two characters: latin-capital-d followed by latin-capital-z-with-caron (or whatever you want to call the little v thing). As you note, this means that non-buggy fonts will generally use two glyphs to render it, but that's a implementation detail; a font could process the entire word "the" as one glyph, or render "m" as dotless-i + right-half-of-n + right-half-of-n, but that wouldn't affect how many characters either string has.


Using characters is confusing because I don't know if you mean before or after shaping. U+1C4 is unquestionably a single Unicode code point. I've heard people call this 1 logical character. Other people might say how many characters it requires for encoding in UTF-8 or in UTF-16. After shaping, some people might say it is 1 or 2 "shaped characters". It's all horribly confusing. I find using the term code point more precise.


There's no shaping involved; I'm not talking about the implementation details of the rendering algorithm. There is a D, followed by a Ž. This only seems confusing because Unicode (and - to be fair - other, earlier character encodings) willfully misinterprets the term "character" for self-serving purposes.


Swift handles both cases well:

    var test = "test\u{1F469}"
    var tset = String(test.reversed())
    var tes = test.dropLast()
(The second line needs the extra ‘String’ to turn a sequence into a String; and yes, the names of the variables do not match their content)


Doing it with just one emoji sort of misses the point...

One way in P6, combining the full family into one character:

    my \test = "test\c[Woman,ZWJ,Woman,ZWJ,Girl,ZWJ,Boy]";
    say test.chars; # 5
    say test;       # test   
    say flip test;  #    tset
    say test.chop;  # test
HN displays the 7 codepoint family as three spaces.

To see that P6 treats the family as one:

https://tio.run/##K0gtyjH7/18BCHIrFWJKUotLFGwV1EG0ukKdQmleZk...


Sorry. Tested with the family, but didn’t notice that I only took the single code point when making _something_ show up on HN. It also works for

  "test\u{1F469}\u{200d}\u{1f469}\u{200d}\u{1f467}\u{200d}\u{1f466}"


A codepoint is the smallest unit of meaning in unicode. A byte is just a number that might (or might not) have meaning in a specific unicode encoding (also depending on what other bytes it's next to).

A codepoint is the smallest unit that has a graphical representation you can print on screen.

A codepoint is the smallest unit that allows APIs that are agnostic to encoding, just in terms of the semantic content of unicode. If you want to write any kind of algorithm in terms of the actual character meaning (characters represented), you want a codepoint abstraction. Most unicode algorithms -- like for collation, normalization, regexp character classes -- are in terms of codepoints.

If you split a unicode string on codepoints, the results are always valid unicode strings. If you split a unicode string on bytes, they may not be.

Human written language is complicated. Unicode actually does a pretty amazing job of providing an abstraction for dealing with it, but it's still complicated. It's true that it is a (common) misconception to think that a codepoint always represents "one block on the screen", a "user-perceived character" (a "grapheme cluster"). If you start really getting into it, you realize "a user-perceived character" is a more complex concept than you thought/would like; not because of unicode but because of the complexities of global written human language and what software wants to do with it. But most people who have tried writing internationalized text manipulation of any kind with an API that is only in terms of bytes will know that codepoints are definitely superior.

If you do need "user-perceived characters" aka "grapheme clusters" -- unicode has an algorithm for that, based on data for each codepoint in the unicode database. https://unicode.org/reports/tr29/ It can be locale-dependent (whereas codepoints are locale independent). And guess what, the algorithm is in terms of codepoints -- if you wanted to implement the algorithm, you would usually want an API based on a codepoint abstraction to start with.
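
For example, the third-party regex module (not the stdlib re) implements something like that algorithm as \X; a rough sketch, assuming `pip install regex` and a reasonably recent Unicode database:

    import regex  # third-party; the stdlib 're' has no \X

    family = "\U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
    s = "test" + family
    clusters = regex.findall(r"\X", s)   # extended grapheme clusters per UAX #29
    print(len(clusters))                 # 5 with a recent Unicode database: t, e, s, t, family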

The "grapheme cluster" abstraction is necessarily more expensive to deal with than the "codepoint" abstraction (which is itself necessarily more expensive than "bytes") -- "codepoint" is quite often the right balance. I suppose if computers were or got another couple of magnitudes faster, we might all want/demand more widespread implementation of "grapheme cluster" as the abstraction for many more things -- but it'd still be described and usually implemented in terms of the "codepoint" abstraction, and you'd still need the codepoint abstraction for many things, such as normalization. But yes, it would be nice if more platforms/libraries provided "grapheme cluster" abstraction too. But it turns out you can mostly get by with "codepoint". You can't really even get by with just bytes if you want to do any kind of text manipulation or analysis (such as regexp). And codepoint is the abstraction on which "grapheme cluster" is built, it's the lower level and simpler abstraction, so is the first step -- and some platforms have only barely gotten there. A "grapheme cluster" is made up of codepoints.

I suppose one could imagine some system that isn't unicode that doesn't use a "codepoint" abstraction but somehow only had "user-perceived characters"... but it would get pretty crazy, for a variety of reasons including but not limited to that "user-perceived character" can be locale-dependent. "codepoint" is a very good and useful abstraction, and is the primary building block of unicode, so it makes sense that unicode-aware platform APIs also use it as a fundamental unit. A codepoint is the unit on which you can look up metadata in the unicode database, for normalization, upper/lowercasing, character classes for regexps, collation (sort order), etc. Unicode is designed to let you do an awful lot with codepoints, in as performant a manner as unicode could figure out.
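
In Python, for instance, that per-codepoint metadata is what the stdlib unicodedata module exposes; a small sketch:

    import unicodedata

    cp = "\u0301"                    # COMBINING ACUTE ACCENT
    print(unicodedata.name(cp))      # name from the Unicode database
    print(unicodedata.category(cp))  # 'Mn' (Mark, nonspacing)
    # Normalization is likewise defined over code point sequences:
    print(ascii(unicodedata.normalize("NFC", "a\u0301")))  # '\xe1' (precomposed)
    print(ascii(unicodedata.normalize("NFD", "\u00E1")))   # 'a\u0301' (decomposed)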


> If you split a unicode string on codepoints, the results are always valid unicode strings.

"One of the reasons why the Unicode Standard avoids the term “valid string”, is that it immediate begs the question, valid for what?"

Source: http://www.unicode.org/mail-arch/unicode-ml/y2013-m01/0034.h...

The only thing you get by splitting a sequence of codepoints at random is another sequence of codepoints. Because you can end up with codepoint sequences that map to different glyphs, or that get ignored when they wouldn't have been in their proper order, the result can be nonsense. You can shuffle a sequence of ASCII characters and still end up with a sequence of ASCII characters. What good is that? I fail to see how it would be qualitatively different from splitting a UTF-8 string at arbitrary byte offsets. The latter is supposed to induce an error, but the former doesn't necessarily. The Unicode specification is written in a way that degrades softly when manipulated or displayed by poorly written software, or by old software dealing with future sequences with unique semantics. But that's not the same thing as saying that any sequence of codepoints is valid. Rather, it's more akin to undefined behavior in C, except without a license to unleash nasal daemons.


But code points are not an abstraction for characters. The character "á" can be written as two code points (U+61, U+301). If a character can be two code points, why can't it be three bytes?


Code points are not the proper abstraction, since a character can be composed from a variable number of code points.

Also, bytes from 0-127 in UTF-8 (MSB is 0) are ASCII characters. In multibyte code points, the MSBs of the first byte are 11 and of continuation bytes 10.
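
A small Python loop makes those bit patterns visible (the characters are just arbitrary picks):

    for ch in "A\u00E9\u20AC\U0001F469":   # A, é, €, woman emoji
        print([f"{b:08b}" for b in ch.encode("utf-8")])
    # ['01000001']                                        ASCII, MSB 0
    # ['11000011', '10101001']                            110..... + 10......
    # ['11100010', '10000010', '10101100']                1110.... + 2 x 10......
    # ['11110000', '10011111', '10010001', '10101001']    11110... + 3 x 10......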


A code unit with a value of 0-127 always maps to a single code point, i.e. it is never part of a multi-byte character. It also maps 1:1 to ASCII values. So, if you are looking for specific characters in the ASCII code set, it is perfectly fine to iterate one byte at a time over a UTF-8 string.
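
Which is why, for example, byte-wise splitting on an ASCII delimiter is safe; a minimal Python sketch:

    data = "p\u00E4th/t\u00F6/f\u00EFle\U0001F469".encode("utf-8")
    parts = [p.decode("utf-8") for p in data.split(b"/")]
    print(parts)   # ['päth', 'tö', 'fïle👩'] -- the 0x2F byte never occurs
                   # inside a multi-byte sequence, so nothing gets cut in half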


> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs

As other top-level comments and the article have mentioned, different systems still use different internal representations, but all modern ones commit to being able to represent Unicode code points, with no automatic normalization. To that end I suppose that in terms of reasoning about the consistency of data in different systems, it is better to use code points than the actual size, which is left to the implementation.

The possibly better alternative would be to use lengths in UTF-8, but that might seem arbitrary to some. Perhaps counting code points is useful in that it gives a lower bound on the length in any reasonable encoding.


> Perhaps counting code points is useful in that it gives a lower bound on the length in any reasonable encoding.

But "á" is two code points (U+61, U+301). If you're looking for some lower bound (whatever that means), shouldn't it be 1? I imagine if you're looking for something like information density, the count of UTF-8 code units would at least be somewhat more informative than the count of code points.

I guess the crux of this whole point is that a sequence of code points is arbitrary in the same way as a sequence of bytes; neither "code point" nor "byte" necessarily corresponds to something that a user would see as a unit in human text. So why are we not using the simpler abstraction?


> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16

Text is by definition losslessly convertible between UTFs. You only "lose" information if your source is not a correct UTF stream; e.g. if you have lone surrogates, you don't actually have UTF-16, you have a garbage pile of UTF-16 code units.

Now it may be useful to properly round-trip this garbage pile (e.g. you're dealing with filenames on Windows), but this should not be confused with data conversion between UTFs: your source simply is not UTF-16.
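
This is easy to sanity-check in Python: well-formed text round-trips between UTFs exactly, while a lone surrogate refuses to encode at all (a sketch):

    s = "Gr\u00FC\u00DFe \U0001F469\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"
    assert s.encode("utf-16").decode("utf-16") == s   # lossless round trip
    assert s.encode("utf-8").decode("utf-8") == s
    try:
        "\udc80".encode("utf-8")                      # lone surrogate: not text
    except UnicodeEncodeError as e:
        print("not encodable:", e)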


I'm not sure there's a clear definition of what valid "text" is, but surely classifying it as something like "a sequence of Unicode scalar values" (equivalent to "something that can be encoded in a UTF") is a bit arbitrary. Is a sequence of unassigned Unicode scalar values really "text"? Maybe "text" should not start with a combining character. Maybe it should be grammatically valid in some human language.

Again, unless the point of all of this is to cater for obsolete encodings for the rest of eternity, "sequence of code points" (or here "sequence of Unicode scalar values" [0]) seems just as arbitrary as "sequence of bytes".

[0] Probably worth mentioning that as far as I'm aware, these systems tend to use code points, not Unicode scalar values, so the strings are not guaranteed representable by UTFs anyway (Python allows "\udc80", and even started making use of such strings as a latent workaround to handle non-UTF-8 input some time after Python 3 was released [1])

[1] https://www.python.org/dev/peps/pep-0383/
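
For reference, the PEP 383 mechanism is the "surrogateescape" error handler, which smuggles undecodable bytes through as lone surrogates so they round-trip; a small sketch:

    raw = b"caf\xe9"                                    # latin-1 bytes, not valid UTF-8
    s = raw.decode("utf-8", errors="surrogateescape")   # the 0xE9 byte becomes U+DCE9
    print(ascii(s))                                     # 'caf\udce9'
    assert s.encode("utf-8", errors="surrogateescape") == raw   # original bytes back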


I've no idea what you're trying to say. All I'm saying is

> ensuring that text is losslessly convertible to other UTFs

is a truism, it can't not be true. You can always convert back and forth between UTF8 and UTF16 with no loss or alteration of data.


It's not a truism if "text" includes bytes that are not UTF-8. A program would probably naturally model a filename as being text, but if that means "sequence of Unicode scalar values", it's arguably incorrect on some obscure systems, such as Linux and git.


> It's not a truism if "text" includes bytes that are not UTF-8.

"other UTFs" assume it's in one UTF to start with. If it's in a random non-encoding, possibly no encoding at all, there's no point in attempting a conversion, you can't turn random garbage into non-garbage.

> A program would probably naturally model a filename as being text

That's incorrect modelling and a well known error, as discussed in other sub-threads.


> "other UTFs" assume it's in one UTF to start with

Maybe I should have been clearer. I just meant "other UTFs" as in "UTFs that are not UTF-8", ie, UTFs that exist primarily for historical reasons.

> That's incorrect modelling and a well known error, as discussed in other sub-threads.

It's only incorrect if you impose arbitrary meanings on what "text" is, thus leading to multiple string types that require internationalisation experts to use correctly.

An example of a modern language that seems to handle this stuff fine without weird distinctions between filenames and other strings is Go (and possibly Julia, based on discussion in other subthreads), which incidentally was largely designed by one of the designers of UTF-8.


> Maybe I should have been clearer. I just meant "other UTFs" as in "UTFs that are not UTF-8", ie, UTFs that exist primarily for historical reasons.

That makes no difference whatsoever. All UTFs are mappings of USVs to bytes.

> It's only incorrect if you impose arbitrary meanings on what "text" is, thus leading to multiple string types that require internationalisation experts to use correctly.

The entire point of a separate string-like type is making it clear what's what up front. It doesn't require any sort of internationalisation, it just tells you that filenames are not strings. Because they're not.

> An example of a modern language that seems to handle this stuff fine without weird distinctions between filenames and other strings is Go (and possibly Julia, based on discussion in other subthreads) .. which incidentally was largely designed by one of the designers of UTF-8.

Go just throws up its hands and goes "fuck you, strings are random aggregates of garbage and our "string functions" will at best blow up and at worst corrupt the entire thing if you use them" (because they either assert or assume that strings are UTF8-encoded).

It only "handles this stuff fine" in the sense that it does not handle it at all, will not tell you that your code is going to break, and will provide no support whatsoever when it does.


> That makes no difference whatsoever. All UTFs are mappings of USVs to bytes.

As a last resort to try and clarify what I meant in my original post (requoted below):

> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16 (which exists for historical reasons)

What I mean here is that if converting to a UTF is important, then maybe restricting strings to code points or Unicode scalar values is justified. If textual data is stored in bytes that are conventionally UTF-8, there should be no need to do any conversion to a UTF, since ultimately the only UTF that is useful should be UTF-8. All you would be doing by "converting to a UTF" is losing information.

That was my last attempt. I'm sorry if you still can't understand it.



