
It's a good article, but it would be nice if he'd also covered the Python 3 C API anti-pattern: forcing strings to be UTF-8. This leaves latent bugs in your code: notably, if you treat all filenames as strings, sooner or later your code will explode when it meets someone's filesystem that still has ancient Latin-1 filenames. The same goes for unfiltered user input.

(And to head off replies - yes I understand you can "just do X" where "X" is some complicated thing to avoid the bug if you remember to do "X" beforehand)




I don't really consider that an anti-pattern. Sure, you can get away with just blitting them to the terminal and hoping that they display properly, but sooner or later you're going to have to decode such byte strings anyway.

The real anti-pattern is conflating byte strings and character strings in the first place. We got away with it for decades, but in a UTF-8 world it just isn't possible.


I don't see why not? Go has a string type that contains arbitrary bytes, interpreted as UTF-8. This seems to work as well as anything else. If there are bytes that don't form valid code points and it matters, you just have to deal with it (for example by printing an escape sequence or the Unicode replacement character).

https://blog.golang.org/strings
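
For illustration, here's a minimal sketch of how that plays out (the string literal with the stray 0xFF byte is made up):

    package main

    import "fmt"

    func main() {
        // A Go string is just an immutable byte sequence; nothing stops
        // it from holding bytes that aren't valid UTF-8 (0xFF here).
        s := "caf\xffe"

        // Ranging over a string decodes UTF-8 as it goes: the invalid
        // byte comes back as U+FFFD (utf8.RuneError), one byte wide.
        for i, r := range s {
            fmt.Printf("byte %d: %q\n", i, r)
        }

        // %q escapes the raw bytes instead, so nothing is lost on display.
        fmt.Printf("%q\n", s) // "caf\xffe"
    }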


Because arbitrary bytes cannot be interpreted as UTF-8. I guess this kind of thing is tolerated by Go users because anyone who values a proper type system uses a language with generics.


How do you fix a file that has errors in it if the standard library of the language you're using won't even let you read it?


If you're fixing bytes then you load bytes and fix them.

You won't, though, fix bytes by loading characters and then trying … to fix the bytes … the characters encode to. Just doesn't make sense.

We were able to get away with stuff for a long time because bytes were characters and characters were bytes and we could think sloppily and not break anything. But with Unicode they really are different things, and we need to be tidier in our thinking.


Seems like you're just reasserting it doesn't make sense, without giving a reason. But it does make sense in Go.


> But it does make sense in Go.

No, Go doesn't work that way. You asked, 'How do you fix a file that has errors in it if the standard library of the language you're using won't even let you read it?' In Go, you don't read a file as a string, but rather as bytes (proof: https://golang.org/pkg/os/#Open, which returns a File which implements Read: https://golang.org/pkg/os/#File.Read).

You would do the same thing in Python: open the file in binary mode, and then iterate over the bytes it yields.

Now, the one thing that would be annoying in Go is fixing a broken filename. I'd have to think a bit to figure that out.
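
Something like this, as a rough sketch (the filename and the particular byte being "fixed" are made up for illustration):

    package main

    import (
        "fmt"
        "os"
    )

    func main() {
        // os.ReadFile returns raw bytes; no decoding happens, so a file
        // with broken UTF-8 (or none at all) reads back exactly as stored.
        data, err := os.ReadFile("input.bin") // hypothetical file
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }

        // Fix the bytes as bytes, e.g. turn a stray 0xA0 (a Latin-1
        // non-breaking space) into a plain space, leaving the rest alone.
        for i, b := range data {
            if b == 0xA0 {
                data[i] = ' '
            }
        }
        if err := os.WriteFile("input.bin", data, 0o644); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    }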


You can cast between byte arrays and strings in Go. The difference is that strings are immutable (so it does a copy).
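
(Strictly speaking, Go calls these conversions rather than casts, and the copy happens in both directions.) A quick illustration:

    package main

    import "fmt"

    func main() {
        b := []byte("hello") // string -> []byte conversion copies the bytes
        s := string(b)       // []byte -> string conversion copies them back

        // The copies are independent: mutating the slice can't change
        // the immutable string made from it.
        b[0] = 'H'
        fmt.Println(s, string(b)) // hello Hello
    }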


> You can cast between byte arrays and strings in Go.

Yes, you can. But, in the specific case you mentioned, no competent programmer would cast the bytes of an invalidly-encoded file to a string, then iterate through the runes of the string. That wouldn't even begin to make sense!

I really don't understand what you're trying to argue here.


Although it only works for smallish files, that seems fairly useful for getting as much info as you can out of a corrupt but mostly UTF-8 file?

Any bytes that don't decode will come back as the replacement character. And you can count newlines and print the location of the error(s). You also have the byte index of the error.
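
Roughly like this, say (the sample input is made up; the size == 1 check is what distinguishes a genuine encoding error from a literal U+FFFD, which decodes with size 3):

    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        data := "first line\nsecond l\xffine\n" // made-up mostly-UTF-8 input

        line, lastNL := 1, -1
        for i := 0; i < len(data); {
            r, size := utf8.DecodeRuneInString(data[i:])
            if r == utf8.RuneError && size == 1 {
                fmt.Printf("bad byte 0x%02x at offset %d (line %d, col %d)\n",
                    data[i], i, line, i-lastNL)
            }
            if r == '\n' {
                line++
                lastNL = i
            }
            i += size
        }
    }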


The problem is not forcing strings to be UTF-8; the problem is treating filenames as strings.

Filenames are opaque blobs that can be lossily converted to strings for display if you know or can guess at the encoding.
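
In Go terms, for example, that lossy display-only conversion is nearly a one-liner (the raw bytes here are a made-up Latin-1 name):

    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        // Hypothetical raw filename bytes: "caf" plus 0xE9, a Latin-1 é.
        raw := []byte{'c', 'a', 'f', 0xE9}

        // For display only: anything that isn't valid UTF-8 becomes
        // U+FFFD. The original bytes remain the source of truth for
        // actual filesystem calls.
        display := strings.ToValidUTF8(string(raw), "\uFFFD")
        fmt.Println(display) // caf�
    }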


Opaque, except for '\0', '/', and (to some extent) '.'.


Even those details are platform-specific though. If you want to be truly portable, you can't even assume that paths are byte arrays.

On Windows, the path separator is '\' and paths are arrays of 16-bit integers.


Windows is tricky. You can't have certain names like "con" (or "con.txt", "con.png", etc) and some symbols aren't allowed either, like *, ?, etc. Also names can't end with a dot.

Other than some explicit exclusions, any wchar is valid whether or not it's valid Unicode. After all, NTFS and Windows date back to the times of UCS-2, when 16 bits was enough for any character™.

EDIT: Though I should hasten to add that it's a very strong convention that all paths be UTF-16 encoded. So much so that many official docs assert this to be true even though it technically isn't.


NTFS doesn't care if you have a file called "con", e.g. in PowerShell you can do:

    New-Item -ItemType File -Path "\\?\d:\con"
and get "D:\con", where you can't create it directly as "D:\con". It's the Win32 API which intercepts "con" for backwards compatibility, because it was a meaningful name in MS-DOS. But it's fine as a filesystem path.

There's other fun Windows/NTFS Path things here: https://news.ycombinator.com/item?id=17307023 and Google Project Zero's deep dive into Win32 and NT path handling: https://googleprojectzero.blogspot.com/2016/02/the-definitiv...


> So much so that many official docs assert this to be true even though it technically isn't.

Do you have any links for that? I've been working with winapi recently and have had a hell of a time getting some clear concrete statements about exactly what encoding (if any) is used in file paths.


https://docs.microsoft.com/en-us/windows/desktop/FileIO/nami...

> the file system treats path and file names as an opaque sequence of WCHARs.

In essence, I think you should use UTF-16 encoded strings when creating file paths. However, when reading them you can't assume any encoding (aside from the special characters mentioned in that article). For accessing the filesystem, just treat paths as an opaque blob of data. When displaying a name to the user, assume UTF-16 encoding but handle any decoding errors (e.g. by using replacement characters where necessary).
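
For what it's worth, Go's unicode/utf16 package implements exactly that fallback; a small sketch (the ill-formed name is made up):

    package main

    import (
        "fmt"
        "unicode/utf16"
    )

    func main() {
        // Hypothetical NTFS name: 'a', an unpaired high surrogate, 'b'.
        units := []uint16{'a', 0xD800, 'b'}

        // utf16.Decode never fails; the unpaired surrogate simply comes
        // back as U+FFFD, i.e. the replacement-character strategy.
        fmt.Println(string(utf16.Decode(units))) // a�b
    }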


Oh, I meant, did you have any links from official docs that said UTF-16 was used?

Your advice is fine, but when the rest of the world is UTF-8 (including the regex engine), things become quite a bit trickier!


Oh I see. UTF-16 is the preferred encoding for all new applications: https://docs.microsoft.com/en-us/windows/desktop/intl/unicod...

Basically, in Windows land, "Unicode" means UTF-16 unless code pages are mentioned: https://docs.microsoft.com/en-us/windows/desktop/intl/code-p...


On Windows the path separator is U+005C; it's only a backslash in most code pages, but not all: https://devblogs.microsoft.com/oldnewthing/20051014-20/?p=33... which itself links to a dead link; copy here: http://archives.miloush.net/michkap/archive/2005/09/17/46994...

That doesn't change just because Unicode renders individual code pages obsolete; it's now special-cased into Windows that Japanese and Korean locales display U+005C as a currency symbol instead of a backslash.

There's also [System.IO.Path]::AltDirectorySeparatorChar which is `/` because Windows is often fine with / as a path separator as well.


When you say 'explode' what do you mean? I can see rendering being a problem, but then if someone decided to use cuneiform in their file names I'd guess many people would have problems rendering that. Surely as long as there's internal consistency???

Now, I could see a mechanism where your UTF-8 code could 'explode' Latin-1-only software.


> When you say 'explode' what do you mean?

You will get runtime errors or data loss/corruption.

> then if someone decided to use cuneiform in their file names I'd guess many people would have problems rendering that. Surely as long as there's internal consistency???

That's not the issue. The issue is that neither UNIX nor Windows file names are guaranteed to be valid Unicode (I think OSX's are):

* UNIX filenames are semi-arbitrary bags of bytes; there is no guarantee whatsoever that those bags will be UTF-8-compatible in any way.

* Windows file names are semi-arbitrary bags of UTF-16 code units, meaning they can contain unpaired surrogates, which means they can't be decoded to Unicode and thus can't be transcoded to UTF-8.

Which means the conversion to Unicode will either be partial (it will error out) or lossy (the data will not round-trip).

Either way, it'll cause intractable issues for the developer, who will either have filesystem APIs blowing up in their face with little to no way of handling it, or get data back that will not necessarily be usable down the line.
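
A tiny illustration of the failed round-trip, using Go's unicode/utf16 (the one-element "name" is a made-up stand-in for a real Windows filename):

    package main

    import (
        "fmt"
        "unicode/utf16"
    )

    func main() {
        name := []uint16{0xD800}                 // an unpaired high surrogate
        back := utf16.Encode(utf16.Decode(name)) // decode to runes, re-encode

        // Lossy: 0xd800 comes back as 0xfffd, so "back" no longer
        // refers to the same file.
        fmt.Printf("%#x -> %#x\n", name[0], back[0])
    }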


On the topic of unpaired surrogates, that's a problem WTF-8 (https://simonsapin.github.io/wtf-8/) is intended to help solve.

The spec was created for Servo/Rust, but it's a sane general internal representation that should let people interact with platform APIs in a lossless manner.
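
The core idea is small enough to sketch: generalized UTF-8 that encodes unpaired surrogates instead of replacing them. A rough, unofficial Go illustration of the encoding side (encodeWTF8 is a made-up helper, not a library function):

    package main

    import (
        "fmt"
        "unicode/utf16"
        "unicode/utf8"
    )

    // encodeWTF8 converts potentially ill-formed UTF-16 to WTF-8. Valid
    // surrogate pairs become ordinary UTF-8; unpaired surrogates are
    // encoded directly as three-byte sequences instead of being replaced.
    func encodeWTF8(units []uint16) []byte {
        var out []byte
        for i := 0; i < len(units); i++ {
            u := rune(units[i])
            switch {
            case u >= 0xD800 && u <= 0xDBFF && i+1 < len(units) &&
                units[i+1] >= 0xDC00 && units[i+1] <= 0xDFFF:
                // A valid pair: decode to a supplementary code point.
                out = utf8.AppendRune(out, utf16.DecodeRune(u, rune(units[i+1])))
                i++
            case u >= 0xD800 && u <= 0xDFFF:
                // Unpaired surrogate: generalized UTF-8 three-byte form.
                out = append(out,
                    byte(0xE0|u>>12),
                    byte(0x80|(u>>6&0x3F)),
                    byte(0x80|(u&0x3F)))
            default:
                out = utf8.AppendRune(out, u)
            }
        }
        return out
    }

    func main() {
        fmt.Printf("% x\n", encodeWTF8([]uint16{'a', 0xD800, 'b'})) // 61 ed a0 80 62
    }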


> On the topic of unpaired surrogates, that's a problem WTF-8 (https://simonsapin.github.io/wtf-8/) is intended to help solve.

Yes. And it does so just fine. But you probably don't want your core string type to be that, so it's used as part of the "third way" where filenames are not strings, and Windows filenames are relatively cheaply convertible to strings: by transcoding to WTF-8 upfront, converting from filenames to strings is just UTF-8 validation, and converting from UTF-8 to filenames is free. And likewise for "byte array" UNIX filenames.


"UNIX filenames are semi-arbitrary bags of bytes"

That's what I'm thinking (most of my experience is with Linux). So it isn't as if π is represented internally as 'π'; it is just a bag of bytes, so trying to make more sense of it than that is, in a sense, wrong.

Edit: I guess I'm assuming the seemingly obvious step of checking for valid input? I mean, if you get a bag of bytes and start trying to do UTF-8 things on it without checking for errors... Is that what we're talking about here?


> yes I understand you can "just do X" where "X" is some complicated thing to avoid the bug if you remember to do "X" beforehand

Either you get these bugs when working with Unicode, or you get them when working with strings that are not Unicode-compatible. It is an implementation trade-off, and unlike you and the author, I prefer the Python way.


It's like how you can't use Perl 6 syntax for bencoded data (a length-prefixed/T(L)V format used in .torrent files), because the format contains some raw binary unsigned integers that make it invalid UTF-8 in practice.


Are you sure UTF8-C8[1] doesn't cover this use case?

[1] https://docs.perl6.org/language/unicode#UTF8-C8



