> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs
As other top-level comments and the article have mentioned, different systems still use different internal representations, but all modern ones commit to being able to represent Unicode code points, with no automatic normalization. So for reasoning about whether data stays consistent across systems, I suppose counting code points is better than counting the actual storage size, which depends on the encoding each implementation picks.
A possibly better alternative would be to use lengths in UTF-8, but that might seem arbitrary to some. Perhaps counting code points is useful in that it gives a lower bound on the length in any reasonable encoding.
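To make the lower-bound idea concrete, here is a rough Python sketch. The helper name `unit_counts` and the sample strings are my own, not from the article, and it assumes "length" means the number of code units in UTF-8, UTF-16, and UTF-32:

```python
def unit_counts(text: str) -> dict:
    """Count code points and code units of `text` in three encodings."""
    return {
        "code_points": len(text),                     # Python 3 str length is in code points
        "utf8": len(text.encode("utf-8")),            # 1 byte per code unit
        "utf16": len(text.encode("utf-16-le")) // 2,  # 2 bytes per code unit
        "utf32": len(text.encode("utf-32-le")) // 4,  # 4 bytes per code unit
    }

# Illustrative samples: ASCII, decomposed "á", a non-BMP emoji.
samples = ["abc", "a\u0301", "\U0001F600"]

for sample in samples:
    counts = unit_counts(sample)
    # The code point count never exceeds the code unit count in these encodings.
    assert all(counts["code_points"] <= counts[name] for name in ("utf8", "utf16", "utf32"))
    print(repr(sample), counts)
```

The bound is tight only for UTF-32, where every code point is exactly one code unit; in UTF-8 and UTF-16 it can be well below the real length.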
> Perhaps counting code points is useful in that it gives a lower bound on the length in any reasonable encoding.
But "á" can be two code points (U+0061, U+0301) when written in decomposed form. If you're looking for some lower bound (whatever that means), shouldn't it be 1? I imagine that if you're after something like information density, the count of UTF-8 code units would be at least somewhat more informative than the count of code points.
I guess the crux of this whole point is that a sequence of code points is arbitrary in the same way as a sequence of bytes; neither "code point" nor "byte" necessarily corresponds to something that a user would see as a unit in human text. So why are we not using the simpler abstraction?
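For what it's worth, here is a quick Python sketch of the "á" case. The `counts` helper is just something I made up for illustration; `unicodedata.normalize` is from the standard library:

```python
import unicodedata

decomposed = "a\u0301"                                   # U+0061 + U+0301
precomposed = unicodedata.normalize("NFC", decomposed)   # composes to U+00E1

def counts(text):
    """Return (code points, UTF-8 bytes) for `text`."""
    return len(text), len(text.encode("utf-8"))

for label, text in (("decomposed", decomposed), ("precomposed", precomposed)):
    code_points, utf8_bytes = counts(text)
    print(f"{label}: {code_points} code point(s), {utf8_bytes} UTF-8 byte(s)")
```

Both numbers change under normalization even though a reader sees the same character either way. Getting "1" for both forms would take grapheme cluster segmentation, which neither count provides and which Python's standard library does not offer directly.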