
> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16

Text is by definition losslessly convertible between UTFs. You only "lose" information if your source is not a correct UTF stream, e.g. if you have lone surrogates you don't actually have UTF-16, you have a garbage pile of UTF-16 code units.
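A minimal Python 3 sketch of the distinction (the error text is CPython's):

    >>> "\ud800".encode("utf-8")   # a lone surrogate is a code point, not a scalar value
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
    >>> "abc".encode("utf-16").decode("utf-16")   # well-formed text converts losslessly
    'abc'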

Now it may be useful to properly round-trip this garbage pile (e.g. when you're dealing with filenames on Windows), but that should not be confused with data conversion between UTFs: your source simply is not UTF-16.




I'm not sure there's a clear definition of what valid "text" is, but surely classifying it as something like "a sequence of Unicode scalar values" (equivalent to "something that can be encoded in a UTF") is a bit arbitrary. Is a sequence of unassigned Unicode scalar values really "text"? Maybe "text" should not start with a combining character. Maybe it should be grammatically valid in some human language.

Again, unless the point of all of this is to cater for obsolete encodings for the rest of eternity, "sequence of code points" (or here "sequence of Unicode scalar values" [0]) seems just as arbitrary as "sequence of bytes".

[0] Probably worth mentioning that, as far as I'm aware, these systems tend to use code points, not Unicode scalar values, so the strings are not guaranteed to be representable in a UTF anyway (Python allows "\udc80", and even started making use of such strings as a workaround for non-UTF-8 input some time after Python 3 was released [1])

[1] https://www.python.org/dev/peps/pep-0383/
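To illustrate [1], a sketch assuming CPython's behaviour: the surrogateescape handler smuggles each undecodable byte through as a lone surrogate, producing exactly the kind of string [0] describes:

    >>> b"\x80".decode("utf-8", "surrogateescape")    # bad byte becomes a lone surrogate
    '\udc80'
    >>> "\udc80".encode("utf-8", "surrogateescape")   # and round-trips back to the byte
    b'\x80'
    >>> "\udc80".encode("utf-8")                      # but no UTF can encode it directly
    Traceback (most recent call last):
      ...
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udc80' in position 0: surrogates not allowed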


I've no idea what you're trying to say. All I'm saying is

> ensuring that text is losslessly convertible to other UTFs

is a truism; it can't not be true. You can always convert back and forth between UTF-8 and UTF-16 with no loss or alteration of data.
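A throwaway Python check of the claim (any pair of UTFs would do):

    >>> s = "a\u00e9\u6f22\U0001f389"   # ASCII, Latin, CJK, emoji
    >>> s.encode("utf-8").decode("utf-8") == s.encode("utf-16").decode("utf-16") == s
    True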


It's not a truism if "text" includes bytes that are not UTF-8. A program would probably naturally model a filename as being text, but if that means "sequence of Unicode scalar values", it's arguably incorrect on some obscure systems, such as Linux and git.
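For instance, on a Linux box with a UTF-8 locale (a hypothetical session, but this is standard CPython behaviour):

    >>> import os
    >>> open(b"\xff.txt", "w").close()   # a perfectly legal Linux filename...
    >>> os.listdir(b".")                 # ...that is not valid UTF-8
    [b'\xff.txt']
    >>> os.listdir(".")                  # PEP 383 smuggles it through as a lone surrogate
    ['\udcff.txt']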


> It's not a truism if "text" includes bytes that are not UTF-8.

"other UTFs" assume it's in one UTF to start with. If it's in a random non-encoding, possibly no encoding at all, there's no point in attempting a conversion, you can't turn random garbage into non-garbage.

> A program would probably naturally model a filename as being text

That's incorrect modelling and a well-known error, as discussed in other sub-threads.


> "other UTFs" assume it's in one UTF to start with

Maybe I should have been clearer. I just meant "other UTFs" as in "UTFs that are not UTF-8", i.e., UTFs that exist primarily for historical reasons.

> That's incorrect modelling and a well-known error, as discussed in other sub-threads.

It's only incorrect if you impose arbitrary meanings on what "text" is, thus leading to multiple string types that require internationalisation experts to use correctly.

An example of a modern language that seems to handle this stuff fine without weird distinctions between filenames and other strings is Go (and possibly Julia, based on discussion in other subthreads)... which incidentally was largely designed by one of the designers of UTF-8.


> Maybe I should have been clearer. I just meant "other UTFs" as in "UTFs that are not UTF-8", i.e., UTFs that exist primarily for historical reasons.

That makes no difference whatsoever. All UTFs are mappings of USVs to bytes.

> It's only incorrect if you impose arbitrary meanings on what "text" is, thus leading to multiple string types that require internationalisation experts to use correctly.

The entire point of a separate string-like type is making it clear what's what up-front. It doesn't require any sort of internationalisation expertise, it just tells you that filenames are not strings. Because they're not.

> An example of a modern language that seems to handle this stuff fine without weird distinctions between filenames and other strings is Go (and possibly Julia, based on discussion in other subthreads) .. which incidentally was largely designed by one of the designers of UTF-8.

Go just throws up its hands and goes "fuck you, strings are random aggregates of garbage and our 'string functions' will at best blow up and at worst corrupt the entire thing if you use them" (because they either assert or assume that strings are UTF-8-encoded).

It only "handles this stuff fine" in the sense that it does not handle it at all, will not tell you that your code is going to break, and will provide no support whatsoever when it does.


> That makes no difference whatsoever. All UTFs are mappings of USVs to bytes.

As a last resort to try and clarify what I meant in my original post (requoted below):

> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16 (which exists for historical reasons)

What I mean here is that if converting to a UTF is important, then maybe restricting strings to code points or Unicode scalar values is justified. If textual data is stored in bytes that are conventionally UTF-8, there should be no need to do any conversion to a UTF, since ultimately the only UTF that is useful should be UTF-8. All you would be doing by "converting to a UTF" is losing information.
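A Python sketch of that loss, assuming the usual errors="replace" fallback when forcing such bytes into a UTF:

    >>> b"\xffabc".decode("utf-8", "replace")   # "converting to a UTF"...
    '\ufffdabc'
    >>> '\ufffdabc'.encode("utf-8")             # ...loses the original 0xff byte for good
    b'\xef\xbf\xbdabc'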

That was my last attempt. I'm sorry if you still can't understand it.



