
> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16

Text is by definition losslessly convertible between UTFs. You only "lose" information if your source is not a correct UTF stream, e.g. if you have lone surrogates you don't actually have UTF-16, you have a garbage pile of UTF-16 code units.
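A minimal Python 3 sketch of the distinction (the error text is CPython's):

    >>> "\ud800".encode("utf-8")   # a lone surrogate is a code point, not a scalar value
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
    >>> "abc".encode("utf-16").decode("utf-16")   # well-formed text converts losslessly
    'abc'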

Now it may be useful to properly round-trip this garbage pile (e.g. when you're dealing with filenames on Windows), but that should not be confused with data conversion between UTFs: your source simply is not UTF-16.




I'm not sure there's a clear definition of what valid "text" is, but surely classifying it as something like "a sequence of Unicode scalar values" (equivalent to "something that can be encoded in a UTF") is a bit arbitrary. Is a sequence of unassigned Unicode scalar values really "text"? Maybe "text" should not start with a combining character. Maybe it should be grammatically valid in some human language.

Again, unless the point of all of this is to cater for obsolete encodings for the rest of eternity, "sequence of code points" (or here "sequence of Unicode scalar values" [0]) seems just as arbitrary as "sequence of bytes".

[0] Probably worth mentioning that, as far as I'm aware, these systems tend to use code points, not Unicode scalar values, so the strings are not guaranteed to be representable in a UTF anyway (Python allows "\udc80", and even started making use of such strings as a workaround for non-UTF-8 input some time after Python 3 was released [1])

[1] https://www.python.org/dev/peps/pep-0383/
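To illustrate [1], a sketch assuming CPython's behaviour: the surrogateescape handler smuggles each undecodable byte through as a lone surrogate, producing exactly the kind of string [0] describes:

    >>> b"\x80".decode("utf-8", "surrogateescape")    # bad byte becomes a lone surrogate
    '\udc80'
    >>> "\udc80".encode("utf-8", "surrogateescape")   # and round-trips back to the byte
    b'\x80'
    >>> "\udc80".encode("utf-8")                      # but no UTF can encode it directly
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed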


I've no idea what you're trying to say. All I'm saying is

> ensuring that text is losslessly convertible to other UTFs

is a truism; it can't not be true. You can always convert back and forth between UTF-8 and UTF-16 with no loss or alteration of data.
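A throwaway Python check of the claim (any pair of UTFs would do):

    >>> s = "a\u00e9\u6f22\U0001f389"   # ASCII, Latin, CJK, emoji
    >>> s.encode("utf-8").decode("utf-8") == s.encode("utf-16").decode("utf-16") == s
    True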


It's not a truism if "text" includes bytes that are not UTF-8. A program would probably naturally model a filename as being text, but if that means "sequence of Unicode scalar values", it's arguably incorrect on some obscure systems, such as Linux and git.
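For instance, on a Linux box with a UTF-8 locale (a hypothetical session, but this is standard CPython behaviour):

    >>> import os
    >>> open(b"\xff.txt", "w").close()   # a perfectly legal Linux filename...
    >>> os.listdir(b".")                 # ...that is not valid UTF-8
    [b'\xff.txt']
    >>> os.listdir(".")                  # PEP 383 smuggles it through as a lone surrogate
    ['\udcff.txt']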


> It's not a truism if "text" includes bytes that are not UTF-8.

"other UTFs" assume it's in one UTF to start with. If it's in a random non-encoding, possibly no encoding at all, there's no point in attempting a conversion, you can't turn random garbage into non-garbage.

> A program would probably naturally model a filename as being text

That's incorrect modelling and a well-known error, as discussed in other sub-threads.


> "other UTFs" assume it's in one UTF to start with

Maybe I should have been clearer. I just meant "other UTFs" as in "UTFs that are not UTF-8", i.e., UTFs that exist primarily for historical reasons.

> That's incorrect modelling and a well-known error, as discussed in other sub-threads.

It's only incorrect if you impose arbitrary meanings on what "text" is, thus leading to multiple string types that require internationalisation experts to use correctly.

An example of a modern language that seems to handle this stuff fine without weird distinctions between filenames and other strings is Go (and possibly Julia, based on discussion in other subthreads)... which incidentally was largely designed by one of the designers of UTF-8.


> Maybe I should have been clearer. I just meant "other UTFs" as in "UTFs that are not UTF-8", i.e., UTFs that exist primarily for historical reasons.

That makes no difference whatsoever. All UTFs are mappings of USVs to bytes.

> It's only incorrect if you impose arbitrary meanings on what "text" is, thus leading to multiple string types that require internationalisation experts to use correctly.

The entire point of a separate string-like type is making it clear what's what up-front. It doesn't require any sort of internationalisation expertise, it just tells you that filenames are not strings. Because they're not.

> An example of a modern language that seems to handle this stuff fine without weird distinctions between filenames and other strings is Go (and possibly Julia, based on discussion in other subthreads) .. which incidentally was largely designed by one of the designers of UTF-8.

Go just throws up its hands and goes "fuck you, strings are random aggregates of garbage and our 'string functions' will at best blow up and at worst corrupt the entire thing if you use them" (because they either assert or assume that strings are UTF-8-encoded).

It only "handles this stuff fine" in the sense that it does not handle it at all, will not tell you that your code is going to break, and will provide no support whatsoever when it does.


> That makes no difference whatsoever. All UTFs are mappings of USVs to bytes.

As a last resort to try and clarify what I meant in my original post (requoted below):

> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16 (which exists for historical reasons)

What I mean here is that if converting to a UTF is important, then maybe restricting strings to code points or Unicode scalar values is justified. If textual data is stored in bytes that are conventionally UTF-8, there should be no need to do any conversion to a UTF, since ultimately the only UTF that is useful should be UTF-8. All you would be doing by "converting to a UTF" is losing information.
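A Python sketch of that loss, assuming the usual errors="replace" fallback when forcing such bytes into a UTF:

    >>> b"\xffabc".decode("utf-8", "replace")   # "converting to a UTF"...
    '\ufffdabc'
    >>> '\ufffdabc'.encode("utf-8")             # ...loses the original 0xff byte for good
    b'\xef\xbf\xbdabc'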

That was my last attempt. I'm sorry if you still can't understand it.



