with open('some latin-1 file', 'rb') as f: text = f.read().decode('latin-1') with...

ianamartin · on Dec 11, 2016

Allow me to rephrase.

I do that operation on a file I get from an API. I know for a fact that the encoding I'm receiving is latin-1.

I run exactly that operation on the file that you wrote out in code.

When I try to read that file back in as UTF-8, I get encoding errors. That does not make for "super good." That makes me want to scream.

I do not have this problem when I use Python 2.7.x.

ddevault · on Dec 11, 2016

You're definitely making a mistake somewhere, because I just tested it for myself and it worked perfectly fine. I made a latin-1 file, applied the above code with it, and got a correct utf-8 file out. Are you reading the final file back as latin-1? You have to read it as utf-8 of course.
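For anyone who wants to reproduce that check, here is a minimal sketch of the round trip. The file names, the sample text, and the `transcode_latin1_to_utf8` helper are made up for illustration, and the write half is an assumption about what the truncated snippet above does, not a copy of it.

```python
import os
import tempfile


def transcode_latin1_to_utf8(src_path, dst_path):
    """Hypothetical helper: read latin-1 bytes, write the same text as UTF-8."""
    with open(src_path, 'rb') as f:
        text = f.read().decode('latin-1')   # bytes -> str
    with open(dst_path, 'wb') as f:
        f.write(text.encode('utf-8'))       # str -> bytes


# Build a small latin-1 file to test with (sample text is arbitrary).
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, 'latin1.txt')
dst = os.path.join(tmpdir, 'utf8.txt')
original = 'café, naïve, señor'
with open(src, 'wb') as f:
    f.write(original.encode('latin-1'))

transcode_latin1_to_utf8(src, dst)

# Reading the result back as UTF-8 recovers the original text.
with open(dst, encoding='utf-8') as f:
    round_tripped = f.read()

# Reading it back as latin-1 does not raise, but yields mojibake.
with open(dst, encoding='latin-1') as f:
    misread = f.read()
```

If `round_tripped` matches `original` on your machine but your real file still fails, the input is probably not latin-1 everywhere, or something else is rewriting the file in between.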

ianamartin · on Dec 10, 2016

And what happens when you don't get a choice about what strings you are digesting?

ddevault · on Dec 10, 2016

I don't follow. What do you mean?

To be perfectly clear: bytes (b'') is not a string. Again: bytes is NOT a string. It is an array of octets, aka bytes, aka unsigned 8 bit integers. NOT characters. NOT a string.

If you are dealing with bytes that are encoded representations of a string, then you have to know what encoding they use to decode them and treat them as strings.
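A small sketch of the distinction in Python 3, using an arbitrary one-character sample:

```python
# One character, encoded as UTF-8, becomes two bytes.
data = 'é'.encode('utf-8')

first_byte = data[0]         # indexing bytes yields an int, not a character
byte_values = list(data)     # the raw octets as unsigned 8-bit integers

# Decoding with the right encoding gives back the one-character string.
text = data.decode('utf-8')

# Decoding with the wrong encoding still "works", but produces different text.
wrong = data.decode('latin-1')
```

Here `byte_values` is `[195, 169]`, `text` is `'é'`, and `wrong` is `'Ã©'`: the same bytes, read under two different assumptions about the encoding.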

minus7 · on Dec 10, 2016

I'm not sure what you mean. If you don't know what the encoding of the input file is, you have a problem. As far as I know there are libraries that guess the encoding, but it cannot be determined with complete accuracy.
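To illustrate why guessing is unreliable, here is a rough sketch that does not use any detection library. It just tries a list of candidate encodings in order; the `guess_decode` name and the candidate list are assumptions for the example.

```python
def guess_decode(data, candidates=('utf-8', 'latin-1')):
    """Hypothetical helper: return (text, encoding) for the first candidate
    that decodes without error. This is a guess, not a detection."""
    for enc in candidates:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError('none of the candidate encodings fit')


# Valid UTF-8 input is picked up by the first candidate.
text_a, enc_a = guess_decode('é'.encode('utf-8'))

# A lone 0xE9 byte is not valid UTF-8, so the guess falls through to latin-1.
text_b, enc_b = guess_decode(b'\xe9')

# The euro sign in Windows-1252 is the single byte 0x80. latin-1 accepts any
# byte, so the guess "succeeds" but returns a control character instead of '€'.
text_c, enc_c = guess_decode('€'.encode('cp1252'))
```

The last case is the point: latin-1 never raises, so a fallback like this always returns something, and nothing in the bytes themselves says whether that something is right.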