Keep in mind that filesystem paths aren't strings. On Linux, they are raw bytes without any fixed encoding (though usually UTF-8 in UTF-8-based locales), and on Windows, they are sequences of 16-bit code units that are expected to be UTF-16 but are not validated.

Rust's OsStr is my favorite approach so far. It stores Linux's raw bytes as-is, and stores Windows's possibly-valid UTF-16 as WTF-8. This makes path management "just work", with the ability to operate normally on invalid UTF-8 or UTF-16 paths, and zero-copy conversion from UTF-8/ASCII strings to OsStr (though converting OsStr into UTF-16 requires parsing). (Qt's QString-based file dialogs on Linux fail to convert invalid UTF-8 paths like those in https://github.com/petrosagg/wtfiles into QString, causing Qt-based apps to open/save the wrong paths.)
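To make that concrete, here is a minimal Unix-only sketch of what I mean. The helper names (os_str_from_raw, os_str_from_utf8, join_raw) and the example filename are made up for illustration; only the std calls are real. It assumes a Unix target, since OsStrExt::from_bytes is not available on Windows.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait
use std::path::Path;

/// Hypothetical helper: borrow raw bytes as an OsStr.
/// No copy and no UTF-8 validation happens here (Unix only).
fn os_str_from_raw(bytes: &[u8]) -> &OsStr {
    OsStr::from_bytes(bytes)
}

/// Hypothetical helper: borrow a UTF-8 string as an OsStr.
/// This is a zero-copy view of the same bytes.
fn os_str_from_utf8(s: &str) -> &OsStr {
    OsStr::new(s)
}

/// Hypothetical helper: join a directory with a possibly non-UTF-8
/// file name and return the raw bytes of the resulting path.
fn join_raw(dir: &str, name: &[u8]) -> Vec<u8> {
    Path::new(dir)
        .join(OsStr::from_bytes(name))
        .as_os_str()
        .as_bytes()
        .to_vec()
}
```

With a name like b"file\xFFname.txt", to_str() on the OsStr returns None, but joining it onto a directory still works and the invalid byte is preserved untouched.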

However, there are difficulties in printing an OsStr. For example, a file dialog that shows filenames as raw bytes can't show non-ASCII characters in a human-readable form, and a file dialog that shows filenames as Unicode strings can't handle invalid Unicode filenames. GTK3 file dialogs show filenames as Unicode strings, and when they encounter files with invalid Unicode names, they instead display "file�name.txt (invalid encoding)".
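A rough Rust approximation of that display behaviour, as a sketch rather than anything GTK actually does internally: display_name is a hypothetical helper, the "(invalid encoding)" suffix just imitates the label quoted above, and the code assumes a Unix target.

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait

/// Hypothetical helper: produce a human-readable label for a raw file name.
/// Valid UTF-8 is shown as-is; anything else is shown lossily, with each
/// invalid sequence replaced by U+FFFD, plus a marker suffix.
fn display_name(raw: &[u8]) -> String {
    let name = OsStr::from_bytes(raw);
    match name.to_str() {
        Some(valid) => valid.to_owned(),
        None => format!("{} (invalid encoding)", name.to_string_lossy()),
    }
}
```

The important part is that the lossy string is for display only. The original OsStr still has to be kept around for actually opening the file, because the replacement character can't be mapped back to the original bytes.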

Worse yet, how should a file dialog allow users to rename files? If it's based around byte arrays, the user can't enter Unicode characters directly, and if it's based around Unicode (or a locale-specific text encoding), it can't display existing files with invalid Unicode/etc. in the name (probably not an issue if it allows the user to rename to a valid name), nor allow users to enter invalid Unicode (which is not an issue IMO).
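For the "rename to a valid name" case, a Unicode-based dialog can still work if it keeps the old path as bytes and only takes the new name as text. A sketch under that assumption (rename_target is a hypothetical helper, Unix only):

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt; // Unix-only extension trait
use std::path::{Path, PathBuf};

/// Hypothetical helper: compute the destination path for a rename.
/// The old path is kept as raw bytes (it may be invalid UTF-8), while
/// the new file name comes from a text field and is always valid Unicode.
fn rename_target(old_raw: &[u8], new_name: &str) -> PathBuf {
    Path::new(OsStr::from_bytes(old_raw)).with_file_name(new_name)
}
```

The result can then be passed to std::fs::rename together with the original byte-based path, so the dialog never needs to round-trip the invalid name through a string.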




The difficulty of printing OsStr has nothing to do with Rust; it's really just the difficulty of printing Linux file names in any context, given that you don't know what the non-UTF-8 bytes mean.


It's also a good rationale for why filenames should be valid Unicode - because they're data displayed to the end user.


MacOS filesystem is utf-8 strings with valid Unicode glyphs that are pinned to a certain revision of Unicode.


> MacOS filesystem is utf-8 strings

APFS is utf8, HFS+ is utf16.

> with valid Unicode glyphs

That doesn’t really mean anything.

Apple’s filesystems do guarantee that paths are valid in whatever encoding they use, but this has nothing to do with glyphs.

APFS also does not perform any normalisation, while HFS+ uses a custom variant of NFD. While HFS+’s normalisation has its issues and critics, APFS’s lack of normalisation is probably worse: https://eclecticlight.co/2017/04/06/apfs-is-currently-unusab...



