Because arbitrary bytes cannot be interpreted as UTF-8. I guess this kind of thi...

skybrian · on May 30, 2019

How do you fix a file that has errors in it if the standard library of the language you're using won't even let you read it?

eadmund · on May 31, 2019

If you're fixing bytes then you load bytes and fix them.

You won't, though, fix bytes by loading characters and then trying … to fix the bytes … the characters encode to. Just doesn't make sense.

We were able to get away with stuff for a long time because bytes were characters and characters were bytes and we could think sloppily and not break anything. But with Unicode they really are different things, and we need to be tidier in our thinking.

skybrian · on May 31, 2019

Seems like you're just reasserting it doesn't make sense, without giving a reason. But it does make sense in Go.

eadmund · on May 31, 2019

> But it does make sense in Go.

No, Go doesn't work that way. You asked, 'How do you fix a file that has errors in it if the standard library of the language you're using won't even let you read it?' In Go, you don't read files as strings, but rather as bytes (proof: https://golang.org/pkg/os/#Open, which returns a File which implements Read: https://golang.org/pkg/os/#File.Read).

You would do the same thing in Python: open the file in binary mode, and then iterate over the bytes it yields.
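A minimal Go sketch of the "load bytes and fix them" approach, for illustration only. The helper name `repair` and the repair policy (replace each invalid run with U+FFFD) are assumptions, not something from this thread; a real fix depends on what the file was supposed to contain. It also assumes a newer toolchain than this thread's date: `bytes.ToValidUTF8` needs Go 1.13+, and `os.ReadFile` / `os.CreateTemp` need Go 1.16+.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"
	"unicode/utf8"
)

// repair is a hypothetical helper: it returns data unchanged when it is
// already valid UTF-8, and otherwise replaces each run of invalid bytes
// with the replacement character U+FFFD.
func repair(data []byte) []byte {
	if utf8.Valid(data) {
		return data
	}
	return bytes.ToValidUTF8(data, []byte("\uFFFD"))
}

func main() {
	// Write a small corrupt file so the example is self-contained.
	// 0xE9 is "é" in Latin-1, but on its own it is not valid UTF-8.
	f, err := os.CreateTemp("", "corrupt-*.txt")
	if err != nil {
		log.Fatal(err)
	}
	defer os.Remove(f.Name())
	if _, err := f.Write([]byte("caf\xe9 au lait\n")); err != nil {
		log.Fatal(err)
	}
	if err := f.Close(); err != nil {
		log.Fatal(err)
	}

	// os.ReadFile returns a []byte; nothing is decoded at this point,
	// so invalid UTF-8 does not prevent reading the file.
	data, err := os.ReadFile(f.Name())
	if err != nil {
		log.Fatal(err)
	}

	fixed := repair(data)
	fmt.Printf("valid before: %v, after: %v\n", utf8.Valid(data), utf8.Valid(fixed))
}
```

The point is only that the read step never touches the encoding; the decision about what counts as "fixed" happens afterwards, on the bytes.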

Now, the one thing that would be annoying in Go is fixing a broken filename. I'd have to think a bit to figure that out.

skybrian · on May 31, 2019

You can cast between byte arrays and strings in Go. The difference is that strings are immutable (so it does a copy).
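A small sketch of what that conversion does, assuming nothing beyond the language itself (the helper name `toStringCopy` is made up for the example). Converting a `[]byte` to a `string` copies the bytes and does not validate or alter them, so invalid UTF-8 survives the conversion:

```go
package main

import "fmt"

// toStringCopy is a hypothetical wrapper around the built-in conversion.
// The resulting string holds its own copy of the bytes.
func toStringCopy(b []byte) string {
	return string(b)
}

func main() {
	b := []byte("h\xffllo") // 0xFF is never valid in UTF-8
	s := toStringCopy(b)

	// Mutating the slice afterwards does not change the string.
	b[1] = 'e'

	fmt.Printf("%q %q\n", s, b)
	fmt.Println(len(s)) // length is counted in bytes, not runes
}
```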

eadmund · on May 31, 2019

> You can cast between byte arrays and strings in Go.

Yes, you can. But, in the specific case you mentioned, no competent programmer would cast the bytes of an invalidly-encoded file to a string, then iterate through the runes of the string. That wouldn't even begin to make sense!

I really don't understand what you're trying to argue here.

skybrian · on May 31, 2019

Although it only works for smallish files, that seems fairly useful for getting as much info as you can out of a corrupt but mostly UTF-8 file?

Any bytes that aren't valid UTF-8 will come back as the replacement character (U+FFFD) when you range over the string. The loop also gives you the byte index of each error, so you can count newlines and print the location of the error(s).
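Roughly what that could look like, as a sketch rather than a finished tool. The names `BadByte` and `findInvalid` are invented for the example. One detail worth assuming explicitly: a file can legitimately contain U+FFFD, so the sketch re-decodes at each hit and treats only a width of 1 as a real decoding error.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// BadByte records where an invalid UTF-8 sequence starts.
type BadByte struct {
	Offset int // byte index into the input
	Line   int // 1-based line number
}

// findInvalid ranges over s rune by rune and collects the positions
// at which decoding failed.
func findInvalid(s string) []BadByte {
	var bad []BadByte
	line := 1
	for i, r := range s {
		if r == '\n' {
			line++
		}
		if r == utf8.RuneError {
			// A genuine U+FFFD in the input decodes with width 3;
			// an invalid byte decodes as (RuneError, 1).
			if _, size := utf8.DecodeRuneInString(s[i:]); size == 1 {
				bad = append(bad, BadByte{Offset: i, Line: line})
			}
		}
	}
	return bad
}

func main() {
	// In practice the bytes would come from reading the whole file,
	// which is why this only suits smallish files.
	data := []byte("ok\nbad \xff byte\nfine \xc3\xa9\n")
	for _, b := range findInvalid(string(data)) {
		fmt.Printf("invalid UTF-8 at byte %d (line %d)\n", b.Offset, b.Line)
	}
}
```

For the sample input this should report a single error at byte 7 on line 2; the `\xc3\xa9` on the last line is a valid two-byte encoding of "é" and is not flagged.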