When you say 'explode' what do you mean? I can see rendering being a problem, bu...

masklinn · on May 30, 2019

> When you say 'explode' what do you mean?

You will get runtime errors or data loss/corruption.

> then if someone decided to use cuneiform in their file names I'd guess many people would have problems rendering that. Surely as long as there's internal consistency???

That's not the issue. The issue is that neither UNIX nor Windows file names are guaranteed to be valid unicode (I think OSX's are):

* UNIX filenames are semi-arbitrary bags of bytes, there is no guarantee whatsoever those bags will be utf8-compatible in any way.

* Windows file names are semi-arbitrary bags of UTF16 code units, meaning they can contain unpaired surrogates, meaning they can't be decoded to unicode and thus can't be transcoded to UTF8.

Which means the conversion to unicode will either be partial (it will error out) or they will be lossy (the data will not round-trip).

Either way, it'll cause intractable issues for the developer who will either have filesystem APIs blowing up in their face with little to no way of handling it, or the data they return will not necessarily be usable down the line.

andrewaylett · on May 30, 2019

On the topic of unpaired surrogates, that a problem WTF-8 (https://simonsapin.github.io/wtf-8/) is intended to help solve.

The spec was created for Servo/Rust, but it's a sane general internal representation that should let people interact with platform APIs in a lossless manner.

masklinn · on May 30, 2019

> On the topic of unpaired surrogates, that a problem WTF-8 (https://simonsapin.github.io/wtf-8/) is intended to help solve.

Yes. And it does so just fine. But you probably don't want your core string type to be that, so it's used as part of the "third way" where filenames are not strings, so that windows filename are relatively cheaply convertible to strings: by transcoding to wtf8 upfront, converting from filenames to strings is just UTF8 validation; and converting from UTF8 to filename is free. And likewise for "byte array" unix filenames.

benj111 · on May 30, 2019

"UNIX filenames are semi-arbitrary bags of bytes"

That's what I'm thinking (most experience is with Linux). So it isn't as if π is represented internally as 'π', it is just a bag of bytes, so trying to make more sense of it than that, is in a sense wrong.

Edit: I guess I'm assuming the seemingly obvious step of checking for valid input? I mean if you get a bag of bytes and start trying to do utf-8 things on it, without checking for errors.... Is that what were talking about here?