
> Maybe I should have been clearer. I just meant "other UTFs" as in "UTFs that are not UTF-8", i.e., UTFs that exist primarily for historical reasons.

That makes no difference whatsoever. All UTFs are mappings of USVs to bytes.
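
To make that concrete, here is a rough Python sketch (Python is just my choice for illustration, and the sample character is arbitrary). It encodes one scalar value with three UTFs, then tries a lone surrogate, which is a code point but not a scalar value:

```python
# One Unicode scalar value (USV), three UTFs: same value, different bytes.
ch = "\u20ac"  # EURO SIGN, U+20AC (arbitrary sample character)

encodings = ["utf-8", "utf-16-le", "utf-32-le"]
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    # Each encoding round-trips back to the same scalar value.
    print(name, data.hex(), data.decode(name) == ch)

# A lone surrogate is a code point but NOT a scalar value,
# so Python's strict UTF codecs all refuse to encode it.
lone_surrogate = "\ud800"
rejected = []
for name in encodings:
    try:
        lone_surrogate.encode(name)
    except UnicodeEncodeError:
        rejected.append(name)

print(rejected)
```

All three codecs round-trip U+20AC and all three reject U+D800, which is the sense in which the choice of UTF makes no difference to what can be represented.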

> It's only incorrect if you impose arbitrary meanings on what "text" is, thus leading to multiple string types that require internationalisation experts to use correctly.

The entire point of a separate string-like type is making it clear what's what up front. It doesn't require any sort of internationalisation expertise; it just tells you that filenames are not strings. Because they're not.
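
A quick sketch of why, again in Python and with a made-up filename (`report-\xff.txt` is hypothetical; on most Unix filesystems a name is just bytes, so something like it can exist). The `surrogateescape` error handler is the real mechanism Python uses to smuggle such names through `str`:

```python
# Hypothetical filename as raw bytes; 0xFF can never appear in valid UTF-8.
raw_name = b"report-\xff.txt"

# Strict decoding refuses: this name is simply not text.
try:
    raw_name.decode("utf-8")
    is_text = True
except UnicodeDecodeError:
    is_text = False

# Python's workaround: map each undecodable byte to a lone surrogate
# (0xFF becomes U+DCFF) so the original bytes can be recovered later.
smuggled = raw_name.decode("utf-8", errors="surrogateescape")
round_trip = smuggled.encode("utf-8", errors="surrogateescape")

# The price: the resulting str is no longer encodable as real UTF-8.
try:
    smuggled.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

print(is_text, round_trip == raw_name, encodable)
```

The bytes survive the round trip, but only because the "string" in the middle is not valid Unicode text. A separate type just makes that visible instead of hiding it.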

> An example of a modern language that seems to handle this stuff fine without weird distinctions between filenames and other strings is Go (and possibly Julia, based on discussion in other subthreads) .. which incidentally was largely designed by one of the designers of UTF-8.

Go just throws up its hands and goes "fuck you, strings are random aggregates of garbage and our 'string functions' will at best blow up and at worst corrupt the entire thing if you use them" (because they either assert or assume that strings are UTF-8-encoded).

It only "handles this stuff fine" in the sense that it does not handle it at all, will not tell you that your code is going to break, and will provide no support whatsoever when it does.




> That makes no difference whatsoever. All UTFs are mappings of USVs to bytes.

As a last resort to try and clarify what I meant in my original post (requoted below):

> The only reason I can see is in ensuring that text is losslessly convertible to other UTFs, particularly UTF-16 (which exists for historical reasons)

What I mean here is that if converting to a UTF is important, then maybe restricting strings to code points or Unicode scalar values is justified. If textual data is stored in bytes that are conventionally UTF-8, there should be no need to do any conversion to a UTF, since ultimately the only UTF that is useful should be UTF-8. All you would be doing by "converting to a UTF" is losing information.
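
A small Python sketch of the information loss I have in mind (the sample bytes are made up, and `errors="replace"` stands in for any conversion that insists on producing valid Unicode):

```python
# Bytes that are "conventionally UTF-8": valid text plus one stray byte.
original = b"caf\xc3\xa9 \xff"

# Forcing the data into valid Unicode replaces the stray byte with U+FFFD.
as_text = original.decode("utf-8", errors="replace")

# Re-encoding yields well-formed UTF-8, but not the bytes we started with.
back = as_text.encode("utf-8")

print(as_text == "caf\u00e9 \ufffd", back == original)
```

The valid part survives, but the stray `0xFF` comes back as the three-byte encoding of U+FFFD, so the original bytes cannot be recovered from the converted text.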

That was my last attempt. I'm sorry if you still can't understand it.



