> But any implementation will at least logically start with a sequence of bytes,...

netvl · on March 10, 2017

> I've never seen people using UTF-8 deal with a code unit stage. They parse directly from bytes to code points.

Well, that's probably because in UTF-8 the code unit is a byte :)

Quoting https://en.wikipedia.org/wiki/UTF-8:

> The encoding is variable-length and uses 8-bit code units.

By definition, a code unit is a bit sequence of a fixed size from which code points are formed. In UTF-8 you form code points using 8-bit bytes, so in UTF-8 the code unit is a byte. In UTF-16 it is a sequence of two bytes. In UTF-32 it is a sequence of four bytes.
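To make the three unit sizes concrete, here is a minimal Python sketch. The helper name `code_units` is made up for illustration; it only slices the encoded bytes into fixed-size chunks and is not how any real decoder is structured.

```python
def code_units(text: str, encoding: str, unit_size: int) -> list[int]:
    """Split the encoded form of `text` into fixed-size code units (hypothetical helper)."""
    data = text.encode(encoding)
    # Big-endian encodings are used below so each chunk reads as one integer.
    return [
        int.from_bytes(data[i:i + unit_size], "big")
        for i in range(0, len(data), unit_size)
    ]


ch = "\U0001F600"  # a single code point outside the BMP

utf8_units = code_units(ch, "utf-8", 1)       # four 8-bit code units
utf16_units = code_units(ch, "utf-16-be", 2)  # two 16-bit code units (a surrogate pair)
utf32_units = code_units(ch, "utf-32-be", 4)  # one 32-bit code unit

print([hex(u) for u in utf8_units])
print([hex(u) for u in utf16_units])
print([hex(u) for u in utf32_units])
```

Same code point in every case; only the number and width of the code units change.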

Dylan16807 · on March 10, 2017

I said as much in my first comment, yes. I'm not sure if I'm missing something in your comment?

Code units may 'exist' in all three through the fiat of their definition, but only in UTF-16 do they have a visible function and require you to process an additional layer.
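A rough sketch of that extra UTF-16 layer, assuming the bytes have already been grouped into 16-bit code units. `decode_utf16_units` is a hypothetical name, and passing unpaired surrogates through unchanged is a simplification chosen for this sketch; a strict decoder would reject them.

```python
def decode_utf16_units(units: list[int]) -> list[int]:
    """Turn 16-bit code units into code points, combining surrogate pairs."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            low = units[i + 1]
            # Each surrogate carries 10 bits of the offset above U+10000.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate passed through as-is
            # (an assumption of this sketch, not strict UTF-16 behavior).
            code_points.append(unit)
            i += 1
    return code_points


print([hex(cp) for cp in decode_utf16_units([0x0041, 0xD83D, 0xDE00])])
```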

burntsushi · on March 10, 2017

> I thought that was an invalid code point.

Surrogate codepoints are indeed valid codepoints. It's just that valid UTF-8 is not allowed to encode surrogate codepoints, so the space of codepoints supported by UTF-8 is actually a subset of all Unicode codepoints. This subset is known as the set of Unicode scalar values. ("All codepoints except for surrogates.")
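A small Python sketch of that distinction. The helper names `is_scalar_value` and `can_encode_utf8` are invented for illustration; the second one just relies on CPython's strict UTF-8 codec refusing surrogates.

```python
def is_scalar_value(cp: int) -> bool:
    """True for any code point that is not a surrogate (U+D800..U+DFFF)."""
    return 0 <= cp <= 0x10FFFF and not (0xD800 <= cp <= 0xDFFF)


def can_encode_utf8(cp: int) -> bool:
    """Ask Python's strict UTF-8 encoder whether it accepts this code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


for cp in (0x41, 0xD800, 0xDFFF, 0x1F600):
    print(hex(cp), is_scalar_value(cp), can_encode_utf8(cp))
```

Note that `chr(0xD800)` itself succeeds: Python strings hold code points, and the surrogate is only rejected at the encoding step.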

Dylan16807 · on March 10, 2017

Those points cannot be validly encoded in any format. I suppose you can argue that they are valid-but-unusable in an abstract sense, since the Unicode standard does not actually label any code points as valid/invalid, but if you were going to label any code points as invalid then those would be in the group.

danbruc · on March 10, 2017

You are certainly correct that it is common to not pay too much attention to what things are called in the specification, especially if you want to create a fast implementation. Logically you will still go through all the layers even if you operate on only one physical representation.

My admittedly quite limited experience with Unicode is from trying to exploit implementation bugs. And with that focus it is quite natural to pay close attention to the different layers in the specification in order to see where optimized implementations might miss edge cases.

And I am generally a big fan of staying close to the word of standards. If it does not cause unacceptable performance issues, I would always prefer to stick with the terminology of the standard, even if it means that there will be transformations that are just the identity.

The distinction between code points and scalar values will, for example, become relevant if you implement Unicode metadata. There you can query metadata about surrogate code points even though a parser should never produce those code points.
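As one illustration of this, Python's standard `unicodedata` module will answer property queries for a surrogate code point even though a strict UTF-8 decoder never yields one. This is only a sketch of the idea using that module, not a statement about how any particular metadata implementation is built.

```python
import unicodedata

surrogate = chr(0xD800)  # a lone surrogate code point, not a scalar value

# The general category lookup works for surrogates: they are category "Cs".
category = unicodedata.category(surrogate)
print(category)

# A strict UTF-8 decoder refuses the byte sequence that would map to U+D800.
try:
    b"\xed\xa0\x80".decode("utf-8")
    decoded = True
except UnicodeDecodeError:
    decoded = False
print(decoded)
```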