
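    # Read the raw bytes and decode them using the known source encoding.
    with open('some latin-1 file', 'rb') as f:
      text = f.read().decode('latin-1')
    # Encode the text as UTF-8 and write the bytes back out.
    with open('some utf8 file', 'wb') as f:
      f.write(text.encode('utf-8'))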
Python 3's string encoding support is super good. I've said it before and I'll say it again: if you use bytes as a string you are Doing It Wrong.

If you use bytes as a string you are Doing It Wrong.

If you use bytes as a string you are Doing It Wrong.




Allow me to rephrase.

I do that operation on a file I get from an API. I know for a fact that the encoding I'm receiving is latin-1.

I run exactly that operation on the file that you wrote out in code.

When I try to read that file back in as UTF-8, I get encoding errors. That does not make for "super good." That makes me want to scream.

I do not have this problem when I use Python 2.7.x


You're definitely making a mistake somewhere, because I just tested it for myself and it worked perfectly fine. I made a latin-1 file, applied the above code with it, and got a correct utf-8 file out. Are you reading the final file back as latin-1? You have to read it as utf-8 of course.
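
For anyone who wants to reproduce that kind of check, here is a minimal sketch of the round trip. The sample string and the file names are made up for illustration, and the files live in a temporary directory; this is not the exact test described above, just one way it could be done.

```python
import os
import tempfile

# Hypothetical sample text; every character in it exists in latin-1.
sample = "café, naïve, señor"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file names, chosen only for this example.
    latin1_path = os.path.join(tmp, "in-latin1.txt")
    utf8_path = os.path.join(tmp, "out-utf8.txt")

    # Create the latin-1 input file.
    with open(latin1_path, "wb") as f:
        f.write(sample.encode("latin-1"))

    # Same conversion as the snippet at the top of the thread.
    with open(latin1_path, "rb") as f:
        text = f.read().decode("latin-1")
    with open(utf8_path, "wb") as f:
        f.write(text.encode("utf-8"))

    # Read the converted file back as UTF-8, not latin-1.
    with open(utf8_path, "rb") as f:
        round_tripped = f.read().decode("utf-8")

# True if the text survived the conversion unchanged.
print(round_tripped == sample)
```

If the last step decodes with a different encoding than the one used to write the file, the comparison will fail or the decode will raise, which is the usual source of this kind of error.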


And what happens when you don't get a choice about what strings you are digesting?


I don't follow. What do you mean?

To be perfectly clear: bytes (b'') is not a string. Again: bytes is NOT a string. It is an array of octets, aka bytes, aka unsigned 8 bit integers. NOT characters. NOT a string.

If you are dealing with bytes that are encoded representations of a string, then you have to know what encoding they use to decode them and treat them as strings.
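
A small sketch of the difference, using a made-up one-character sample. Indexing a bytes object gives an integer, the byte count and the character count differ, and decoding with the wrong encoding does not necessarily raise; it can silently produce garbage.

```python
# The str 'é' encoded as UTF-8 becomes two octets: 0xC3 0xA9.
data = "é".encode("utf-8")

first = data[0]                 # indexing bytes yields an int, not a character
length_bytes = len(data)        # number of octets

decoded = data.decode("utf-8")  # correct encoding: back to the original str
length_text = len(decoded)      # number of characters

# Wrong encoding: latin-1 accepts any byte, so this does not raise,
# it just maps each octet to its own character (mojibake).
wrong = data.decode("latin-1")

print(first, length_bytes, length_text, decoded, wrong)
```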


I'm not sure what you mean. If you don't know the encoding of the input file, you have a problem. As far as I know there are libraries that guess the encoding, but it cannot be determined with complete accuracy.
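
To illustrate why guessing is only a heuristic, here is a crude stdlib-only sketch. The helper name `decode_best_effort` and the candidate list are assumptions made up for this example, not any library's API. It tries strict decoders in order and falls back to latin-1, which accepts every byte sequence and therefore never fails, but can easily be wrong.

```python
def decode_best_effort(data, candidates=("utf-8", "cp1252")):
    """Return (text, encoding_used). Hypothetical helper, not a library API."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    # latin-1 maps every byte 0-255 to a code point, so this never raises.
    # The result is only meaningful if the data really was latin-1.
    return data.decode("latin-1"), "latin-1"


# Valid UTF-8 is accepted by the first candidate.
utf8_result = decode_best_effort("naïve".encode("utf-8"))

# latin-1 bytes are not valid UTF-8 here, so the second candidate wins.
# The text comes out right, but the reported encoding is only a guess.
latin1_result = decode_best_effort("naïve".encode("latin-1"))

print(utf8_result, latin1_result)
```

Note that the second call reports cp1252 even though the bytes were written as latin-1. The two encodings agree on this input, so the guess happens to produce the right text, but nothing in the bytes themselves says which one was intended.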



