
This is exactly why I hate the way Python3 handles Unicode.

EVERY language should _try_ to handle Unicode such that if a data sequence was valid before, it remains valid after. NONE should ever FORCE validation, since sometimes, like in the article's case, the correct answer is GIGO. Just pass it through and hope it continues to work. Sometimes the real error is the attempt to enforce validation.



How is this Python's fault? It's not like the `docker-compose` file would have worked any better if it silently replaced one of the volumes with an inaccessible file. Instead, you'd just get a failure from the Windows filesystem API when you tried to access or create a file at "C:\\Users\\Miko�aj\\AppData\\Local\\JetBrains\\Rider2021.2\\log\\DebuggerWorker\\\", right?


Not sure about Windows, but on a real operating system that's:

  $ ln -sT Mikołaj/ Miko�aj
That certainly isn't good, but it is

> would have worked any better


You think Windows doesn't have symlinks? Snobbish and wrong, nice combo.


You think "I'm not sure" means "I think it doesn't"? Pot, kettle; I'm sure you'll hit it off nicely.


Python 3 usually handles this correctly, and I'm a little bit confused about what exactly is going on in the article.

For UNIX path names (and other OS data like environment variables), Python uses the "surrogateescape" error handling method, which does exactly what you ask. Any byte sequence can be converted to a string. If it decodes as valid UTF-8, it will do that. If it hits a byte that does not decode as valid UTF-8 (necessarily a byte >= 128), it will map it to code points U+DC80 through U+DCFF. These are in a reserved range of code points ("surrogates", which make it possible to represent code points > 0xFFFF in UTF-16), and they can't show up in actual Unicode text (i.e., there is no UTF-8 encoding of them, strictly speaking, and if you applied the UTF-8 encoding algorithm to a code point in the U+D800 to U+DFFF range, you would get bytes that aren't valid UTF-8).

On the way out, this is reversed. So you get the results you expect if your filenames are in UTF-8, but since UNIX has no requirement that filenames are indeed UTF-8 (the only constraint is they can't contain NUL or ASCII-forward-slash), the bytes are preserved in a funky-looking format in Python and you get the exact same output on the other end.
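A quick sketch of that round trip, using the article's byte 0xb3 in an illustrative filename:

```python
# A byte sequence that is not valid UTF-8 (0xb3 is a bare continuation-range byte).
raw = b"Miko\xb3aj"

# Decoding with surrogateescape maps the bad byte to a lone surrogate, U+DC00 + 0xb3.
s = raw.decode("utf-8", errors="surrogateescape")
assert s == "Miko\udcb3aj"

# Encoding with the same handler reverses the mapping, so the bytes survive unchanged.
assert s.encode("utf-8", errors="surrogateescape") == raw
```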

See https://www.python.org/dev/peps/pep-0383/ for more on what's going on. The tl;dr for users of Python is that if you want to interact with, say, subprocess output as mostly-normal strings (instead of bytes) but you want to be robust to non-UTF-8 bytes, you should do something like

    subprocess.check_output(["some", "command"], errors="surrogateescape")
You don't need to do this for APIs that directly interact with pathnames, because they do it already. You just need to do it for things like subprocess output and file contents that Python doesn't know you want to handle in this way.

...

On Windows, however, path names must be valid Unicode and are stored in UTF-16. So the idea of a "ł" that doesn't decode properly shouldn't even happen! Mikołaj's home directory ought to be a very boring (and valid) 004d 0069 006b 006f 0142 0061 006a on disk.

Windows doesn't enforce that file paths are valid UTF-16 though (specifically, the surrogate code points are only supposed to show up in a certain way, but nothing enforces that and you can have random surrogates on disk), and hence Rust, which internally represents all strings in UTF-8, has a solution ("WTF-8") that's basically the inverse of surrogateescape - it uses extrapolated-UTF-8-encoding-of-surrogates to handle unpaired surrogates. http://simonsapin.github.io/wtf-8/ But it seems very odd to me that the directory C:\Users\Mikołaj would actually contain any of those, and if it doesn't, I would expect it to very easily turn into a Python Unicode string.
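Python itself exposes a closely related mechanism, the "surrogatepass" error handler, which produces exactly that extrapolated-UTF-8 encoding of a lone surrogate (WTF-8 proper additionally requires that *paired* surrogates be encoded as the supplementary character, so this isn't the full spec, just the lone-surrogate case):

```python
# An unpaired surrogate; strict UTF-8 encoding of this would raise UnicodeEncodeError.
lone = "\ud800"

# surrogatepass applies the UTF-8 encoding algorithm to it anyway.
b = lone.encode("utf-8", errors="surrogatepass")
assert b == b"\xed\xa0\x80"  # the "extrapolated" UTF-8 bytes for U+D800

# And it round-trips back to the same lone surrogate.
assert b.decode("utf-8", errors="surrogatepass") == lone
```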

Maybe this is from a Python version before https://www.python.org/dev/peps/pep-0529/ , which is claimed to "fail to round-trip characters outside of the user's active code page"? Maybe this is from a Python version after that change and it's wrong?


The incorrect docker-compose file was generated by Java (Jetbrains) but consumed by Python (docker-compose). The GP comment was complaining about Python's strict Unicode consumption, not Java's invalid Unicode generation.


The Docker compose file is YAML. My reading of YAML's standard is that it must be in one of the Unicode encodings, and the smell I get from the article is that it is probably in windows-1250 (the CP Windows would use for Polish; Mikołaj is a Polish name, 0xb3, the octet in the error, is the Windows-1250 encoding of "ł"); thus, it isn't valid YAML.
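That guess is easy to check in a Python REPL (the byte string here is my reconstruction of the name from the article's error):

```python
# 0xb3 is U+0142 (LATIN SMALL LETTER L WITH STROKE) in windows-1250,
# which yields exactly the expected Polish name.
assert b"Miko\xb3aj".decode("windows-1250") == "Mikołaj"
```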

I'm not sure what sane behavior Python could have here besides erroring.

> EVERY language should _try_ to handle Unicode such that if a data sequence were valid before it remains valid after.

This sequence was never valid, and never will be.

> in the article's case, the correct answer is GIGO. Just pass it through and hope it continues to work.

Dear God, no; emit a diagnostic and abort. Countless decades of existing code have shown time and again that "plow forward with some hot garbage" is not a good idea. But that ignores that this isn't how any of this works anyway; the YAML parser is going to want to emit strings, which the incoming data isn't.


> This sequence was never valid

Neither were four-byte UTF-8 characters at some point.

> and never will be.

We shall see.


Oh, I see. But if it was UTF-8 it would have worked... I guess the problem is that JetBrains is generating the file in (e.g.) Windows-1252, and Python needs to be told that?

Does it work if you set the environment variable PYTHONENCODING to cp1252?

(I suppose I should either contact the author, or try it myself...)


JetBrains is generating an invalid YAML file; YAML must be in a Unicode encoding such as UTF-8. If they were using a decent YAML library, it would have crashed at that point and firmly pointed the finger at the real bug: reading raw bytes from the environment or a .properties file parser and assuming they are valid UTF-8.

And this is why you always validate your data when you slurp it in. Otherwise you pass crap down several layers, where it crashes or mostly works, with the potential for security holes or catastrophic behavior, and it's a pain in the arse to track down since the actual bug is nowhere near where you are looking.


I'd expect a decent YAML library to have functions taking UTF-8 and not wasting time verifying that the data passed is actually UTF-8 in release builds.


You generally don't verify on output, because you verified on input (especially with languages where text strings are Unicode, or UTF-8 byte strings, like Python 3 or Rust). And even where checking would make sense, skipping it is a premature optimization: for expected YAML use cases I doubt it would make a measurable difference in runtime. The library has to inspect the strings in any case to correctly quote things and to deal with indentation if there are newlines.


Normally I'd agree, a windows-1252 misencode should be one's default guess when mojibake is afoot. Unfortunately, the errant byte in the error is 0xb3, which is "³" in windows-1252.

If you Google, "Mikołaj",

> Mikołaj is the Polish cognate of given name Nicholas

Then Google, "windows character encoding polish"

> Windows-1250 - Wikipedia

And 0xb3 is "ł" in that encoding.¹

> Does it work if you set the environment variable PYTHONENCODING (sic) to cp1252?

I don't know if setting PYTHONIOENCODING would work here; I don't think it should affect this. Really, fixing the YAML file is the fix. (And fixing the thing that generated it.)

¹it is queries like this that really make me love the search engines of today. This would have been hell in the days of Alta Vista.



