Codec Confusion in Python

loqi · on Aug 11, 2012

That's the most unfortunate Python 3 change I've seen. I use byte codecs like hex, zlib, and base64 quite a bit more than text codecs. In Python 2, a programmer with forward-compatible habits can write

  from __future__ import unicode_literals
  from io import open

with the understanding that migration to Python 3 will remove that boilerplate. But taking a similar approach for byte codecs requires knowledge and reference of the right module name (instead of the encoding name) and the names of the corresponding encode and decode functions (instead of just encode and decode). So we've got

  .encode('base64') -> import base64; base64.b64encode()
  .encode('zlib')   -> import zlib; zlib.compress()
  .encode('hex')    -> ?

and unlike the text boilerplate, it's a permanent uglification. I don't know of an idiomatic replacement for the last one off the top of my head. Hopefully it's something nicer and more symmetrical than

  ''.join(map('{:02x}'.format, foo)).encode('ascii')

mahmoudimus · on Aug 11, 2012

>> .encode('base64') -> import base64; base64.b64encode()

Just a nitpick on this. This is actually a Python "gotcha". You'll notice that the .encode('base64') method actually is a base-64 Content-Transfer-Encoding[1] which enforces a limit on the length of the line to 76 characters. Here's an example demonstrating the difference:

    import base64
    eighty_chars = ("X" * 80)
    assert '\n' in eighty_chars.encode('base64').strip()
    assert '\n' not in base64.b64encode(eighty_chars)

[1] http://tools.ietf.org/html/rfc3548#2.1

jlarocco · on Aug 11, 2012

I don't agree with that. IMO removing the zlib, hex, and base64 encodings was a good thing.

While an argument can be made that they're technically "encoding", they're really outside the scope of the problem the encode and decode methods were meant to solve.

jrockway · on Aug 11, 2012

Explicit is better than implicit. If you want to call base64.b64encode on a piece of data, do that.

the_mitsuhiko · on Aug 11, 2012

Just that codecs supported incremental operations and base64.b64encode did not. Handling HTTP transfer encoding in python 2 was a matter of two lines and worked on arbitrary stream data. In Python 3 that's now ~50 lines of code with different behavior for each transfer encoding and not all of them support stream processing or have the same interface.

DrJosiah · on Aug 11, 2012

It's less convenient, it's a different API for every type of transformation, and the change has made code demonstrably worse.

Further, the use of encode/decode is explicit. The only thing that was implicit was the automatic transformation of string -> unicode when people mistakenly used unicode codecs on string objects, or the reverse for string codecs on unicode objects. The proper answer to both of these is to just not do automatic type conversion... which is what was done in Python 3.

So actually, had we left all of the codec machinery intact, those codec errors described by Armin wouldn't ever occur again! Instead, you'd get a TypeError caused by passing the wrong type of object to the underlying encoder/decoder.

jrockway · on Aug 11, 2012

The API is nice, but it is difficult to maintain. To get encoders/decoders into the string class in the first place, you have to maintain a global registry. (I suppose you could pass them all to the constructor of the string object, but nobody's going to do that.) The global codec registry leads to naming conflicts. If you import a module that globally adds a "foo" encoder, then you import another module that globally adds a "foo" encoder, now what? Both modules break because of their dependency on the global name "foo". Because of the details of the codec.register implementation, you can't even catch the conflict at registration time and refuse to load the second module, you simply have to wait until your program returns subtly-incorrect results.

Compare this to the scheme where you import codec modules explicitly and just call their functions. Your imports are lexically-scoped, and if you happen to need two encoders that use the same name, you can just alias one of them at import time. This strategy can't introduce unexpected errors as your program grows larger, because the side effects are constrained to one module. It either works now and will always work, or doesn't work and fails quickly while you are developing.

Ultimately, people use Python because they want a bit of discipline in their lightweight language. This isn't Javascript or PHP, after all :)

DrJosiah · on Aug 17, 2012

Codec registration has never been an issue. Let me repeat that with some emphasis, because it's an important point. Codec registration has NEVER BEEN AN ISSUE.

And global registries are not inherently a bad thing. If you were to say "I don't want a json/pickle/messagepack decoder built into the codecs module by default", I would agree - because it's not a string/unicide <-> string/unicode transformation. But it wouldn't bother me for someone to add that support in their stuff because data.decode('json') is terribly convenient. Arguably better than peppering your code with the following (or loading it in a shared space, or injecting it into __builtins__, ...)

try: import simplejson as json

except ImportError: import json

But ultimately we are adults. If you would prefer to import a cluster of modules just to convert your strings to hex or compress your string with zlib, you are free to do so. It's just unfortunate that due to misunderstandings about the fundamental underlying problem (TypeErrors), functionality was removed.

sp332 · on Aug 11, 2012

There should be one (and preferably only one) obvious way to do it. *.encode(x) was pretty obvious. The new stuff is not.

jrockway · on Aug 11, 2012

Ultimately, adding random methods to classes introduces many subtle side effects. (See my reply to a sibling comment above.)

There's noting intrinsically obvious about making the string class responsible for encoding and decoding, other than the fact that help("") mentions the existence of that method. Most other useful utilities that operate on strings are separate classes or modules; re, for example.

DrJosiah · on Aug 11, 2012

The hex stuff is binascii.hexlify().

But I agree with you and with Armin; removing string -> string and unicode -> unicode encodings and decodings were a mistake. I said as much when the discussions about Python 3 and codecs were going on ~5 years ago.

csense · on Aug 11, 2012

I've glanced at internationalization API's at various times over the years, and I've never understood them.

You have encodings, Unicode, ASCII, UTF-8, ISO 9660, Latin-1, code pages, UTF-16, byte order masks, gettext macros, po files, ... the terminology and model of the problem domain are extremely complex and difficult to understand.

Every time I've dealt with internationalization it's been in the context of it causing strange problems and issues.

For example, one time I downloaded some tarball (I forget what it was) that had a few bytes of binary garbage at the beginning of every file. After some research I found out that it's called the BOM and has something to do with international text, and I ended up having to WRITE A SCRIPT WHICH GOES THROUGH AND DELETES THE FIRST FEW BYTES OF EVERY FILE IN A TREE in order to use the tarball's contents.

Another time, I downloaded some Java source which contained the author's name in comments. The author was German and his name contains an "o" with two dots over it. That was the only non-ASCII character in the files. Eclipse and command-line javac WOULD NOT PROCESS THE FILE and I ended up removing his name from all comments; after that it compiled without a hitch. This was the official Oracle (then Sun) javac. A fricking SOURCE TO BYTECODE COMPILER SHOULD NOT DEPEND ON YOUR SYSTEM'S NATIONALITY SETTINGS -- OR ANY LOCAL SYSTEM SETTINGS! -- TO DO ITS JOB. But it does.

Whenever you debootstrap a new Debian / Ubuntu system, using apt-get causes complaints about using the C locale until you do some magic incantation called "generating locales." Exactly what has to be generated and why the generated files can't either be included with binaries and other generated files, or auto-generated during the installation of the distro, defies explanation.

Playing Japanese import games sometimes requires you to do strange things to your Windows installation.

And of course internationalization issues are often cited as one of the things holding back many Web frameworks and other libraries from porting from Python 2 to Python 3; and of course a lack of library support has been the major showstopper for Python 3 for years now.

My advice to startups: Don't worry about non-English markets until your VC funding and/or revenue is substantial enough to support at least one full-time developer to work on the issue. A working technical understanding of internationalization is going to be a huge sink of development resources and intellectual bandwidth, which you probably can't afford while bootstrapping.

guns · on Aug 11, 2012

> You have encodings, Unicode, ASCII, UTF-8, ISO 9660, Latin-1, code pages, UTF-16, byte order masks, gettext macros, po files, ... the terminology and model of the problem domain are extremely complex and difficult to understand.

In the beginning, there was ASCII [1]. It was a simple encoding that mapped a byte stream to standard American letters, numerals, punctuation marks, as well as some common non-printing control codes.

ASCII only used the lower 7 bits of the 8-bit byte, reserving the upper 128 positions for any non-American characters needed for national encodings.

And indeed, many dozens of national character encodings appeared that used ASCII for its lower 128 positions and implemented their own character table in the upper 128. One very popular encoding was Latin-1 [2]. This became the standard encoding in much Western software because it adequately handled the most widely used Western languages.

One major problem with these 8-bit national encodings is that the upper 128 codes are mutually exclusive with one another. Confusingly, they almost all shared the base 128 ASCII codes, so programmers and users began to equate "plain text" and "sane encoding" with 7-bit ASCII, since one could effectively communicate universally by restricting the characters used to those in the printable ASCII table.

As it became clear that the proliferation of 8-bit encodings was untenable, there emerged Unicode. Unicode is not an encoding, but a standard that provides a table of universal code points, along with some recommendations about how to combine and display certain code points. [3]

Unicode is implemented in the modern day by UTF-8, UTF-16, and UTF-32, which are primarily distinguished, as you might guess, by the base size of the code unit.

UTF-32 is a simple encoding that simply maps every Unicode code point to a 32-bit sequence. This is simple to parse, but potentially very wasteful, so is rarely used.

UTF-16 uses 16-bit code units, and is able to simply translate the most commonly used portion of Unicode, the Basic Multilingual Plane. For code points above U+ffff, a scheme is used to span the code point along two code units. This encoding is used frequently in Windows, and in Java.

UTF-8 is a variable length Unicode encoding like UTF-16, but defaults to a small one-byte code unit and has a famously elegant algorithm, so it appeals strongly to miserly Unix hackers.

> For example, one time I downloaded some tarball (I forget what it was) that had a few bytes of binary garbage at the beginning of every file. After some research I found out that it's called the BOM and has something to do with international text, and I ended up having to WRITE A SCRIPT WHICH GOES THROUGH AND DELETES THE FIRST FEW BYTES OF EVERY FILE IN A TREE in order to use the tarball's contents.

The Byte Order Mark is clunky solution to the fundamental problem of divining the character encoding of an arbitrary byte stream. It's great if all your tools transparently support it, but annoying if not. However, some sort of convention or metadata is necessary to correctly encode your data. Python, Ruby, and other scripting languages have begun to coalesce around the magic encoding comment for source files (i.e. `# encoding: utf-8` as the first or second line).

Most everybody falls back to ASCII if no encoding is specified, or cannot be inferred from the stream itself. The better fallback is UTF-8, because breakages like yours are less likely to occur, which is why it is encouraged as the default system encoding in most cases.

> Another time, I downloaded some Java source which contained the author's name in comments. The author was German and his name contains an "o" with two dots over it. That was the only non-ASCII character in the files. Eclipse and command-line javac WOULD NOT PROCESS THE FILE and I ended up removing his name from all comments; after that it compiled without a hitch. This was the official Oracle (then Sun) javac. A fricking SOURCE TO BYTECODE COMPILER SHOULD NOT DEPEND ON YOUR SYSTEM'S NATIONALITY SETTINGS -- OR ANY LOCAL SYSTEM SETTINGS! -- TO DO ITS JOB. But it does.

These tools likely have ways of setting the encoding without inheriting from the environment, but they do fall back on the environment as a simple convention.

The trouble is that there is no reason any longer to assume that all text _must_ be 7-bit ASCII. Unix and programming languages are evolving to handle this new multilingual digital world. The only obstacle that really remains are programmers, so I think it's fair to spend a little time learning the basics of the subject.

[1]: There were other antediluvian encodings (like EBCDIC)

[2]: a.k.a. ISO-8859-1. Windows used a slightly modified version of this and called it Windows-1252 in order to complicate matters

[3]: The actual display of composite glyphs is left the implementor. For instance, both ready-made composite glyphs like é are provided, as well as a "non-spacing" acute accent mark

masklinn · on Aug 11, 2012

And you didn't even touch on fixed and variable-width asian character sets, like Shift-JIS (variable 1 or 2 bytes) or Big5 + extensions (ETEN or CP950, fixed 2 bytes)

derleth · on Aug 11, 2012

> UTF-8 is a variable length Unicode encoding like UTF-16, but defaults to a small one-byte code unit and has a famously elegant algorithm

... that is 100% compatible with ASCII, and, therefore, is the only Unicode encoding scheme you can safely use in filenames and to send text through applications that don't know about Unicode at all.

The primary reason is that in UTF-32 and UTF-16, the byte 0x00 may appear in the encoding of characters other than '\0'. This is not something pre-Unicode applications can deal with, because in ASCII the byte 0x00 always means '\0', which was always used for string termination.

Therefore, any application that processes ASCII and leaves non-ASCII alone (common in the real world) is instantly compatible with UTF-8.

Here's a fascinating discussion about character encodings in filenames, including comments from Linus Torvalds and Theodore Ts'o:

http://yarchive.net/comp/linux/utf8.html

The upshot is, Linux, like most if not all Unix-like OSes, speaks bytestreams, and doesn't try to interpret characters. The only rules you need, therefore, are that filenames can't contain the byte 0x2f ('/' in ASCII) except as a path separator and can't contain the byte 0x00 at all. Therefore, you need a character encoding that will not use the bytes 0x2f or 0x00 to represent characters other than '/' and '\0', respectively. The only Unicode encoding that meets that constraint is UTF-8.

Also: The elegant algorithm means that it's statistically highly, highly improbable that any text that can be interpreted as valid UTF-8 isn't UTF-8. Given that it's entirely valid to treat ASCII as UTF-8, defaulting to UTF-8 is, as you say, a good thing to do.

gitarr · on Aug 11, 2012

This just got much clearer in Python3, all string literals are now Unicode by default, making it much easier to code internationalized programs in Python.

There is an "Explicit Unicode Literal" (u"string"), to make it easier for library authors to run their libs in one codebase on Python 2 and Python 3. (In Python 3 "string" is the same as u"string")

Codecs are in their own modules, like for example base64, where they belong.

The confusions seems to be with Python 2, Python 3 has fixed it, so move on.