Let's talk about WSGI

ajross · on Aug 10, 2009

FTA:

The WSGI spec impresses upon its readers (or upon this reader, at least) the overwhelming desire for everybody to just quiet down and use ISO-8859-1 instead of whatever character set is actually convenient.

Is this really true? What a disaster. If you're going to pick/encourage a single encoding, how can that choice not be UTF-8?

andyn · on Aug 10, 2009

Looking at PEP 333, the only reference to ISO-8859-1 is:

Note also that strings passed to start_response() as a status or as response headers must follow RFC 2616 with respect to encoding. That is, they must either be ISO-8859-1 characters, or use RFC 2047 MIME encoding.

start_response is what sets the HTTP status and headers. So I suspect that's a requirement of the HTTP spec for headers (correct me if I'm wrong).

I don't believe WSGI cares about the character set you use for the HTTP message body.
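To make the header/body split concrete, here is a minimal sketch of a WSGI app. It assumes Python 3, where the body is bytes and the header strings are native str (a convention that postdates this thread). The names app and call_app are made up for illustration; only the start_response(status, headers) call shape comes from the spec.

```python
def app(environ, start_response):
    # Body: the application picks the encoding; WSGI only sees bytes.
    body = "na\u00efve caf\u00e9 \u2615".encode("utf-8")

    # Status and headers: kept to characters that fit in ISO-8859-1.
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]


def call_app(wsgi_app):
    """Hypothetical test driver: call the app once and capture what it sends."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    chunks = wsgi_app({"REQUEST_METHOD": "GET"}, start_response)
    return captured, b"".join(chunks)


captured, raw_body = call_app(app)
```

The body here contains a character (U+2615) that ISO-8859-1 cannot represent, and nothing in the call to start_response is affected by that.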

ubernostrum · on Aug 10, 2009

Well, that and all the places where it goes on about how all "strings" it mentions must be bytestrings (even on platforms which don't have bytestrings), must only contain code points no higher than 0xff, must be str, precisely str and neither any other type of string nor any subtype... and all of it to try to pretend that Unicode doesn't exist.

Yeah, HTTP is a byte-based protocol, and yeah, headers have to be ISO-8859-1 or MIME-encoded. But that doesn't mean that the particular bytes HTTP uses have to leak up into what are supposed to be high-level applications. There's no earthly reason why -- with every Python implementation moving to native Unicode strings -- WSGI should still have this attitude.
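For what "ISO-8859-1 or MIME-encoded" could look like from the application side, here is a rough sketch, assuming Python 3 and the standard library's email.header module. header_safe is a hypothetical helper, not anything WSGI defines, and it ignores the line folding that Header.encode() may apply to long values.

```python
from email.header import Header, decode_header


def header_safe(value):
    """Return value unchanged if it fits in ISO-8859-1, else RFC 2047-encode it."""
    try:
        value.encode("iso-8859-1")
        return value
    except UnicodeEncodeError:
        # Falls back to an RFC 2047 encoded-word using UTF-8.
        return Header(value, "utf-8").encode()


original = "\u65e5\u672c\u8a9e"       # not representable in ISO-8859-1
encoded = header_safe(original)       # an ASCII-only encoded-word

# Decoding the encoded-word recovers the original text.
raw, charset = decode_header(encoded)[0]
round_tripped = raw.decode(charset)
```

A framework sitting on top of WSGI could do this kind of thing once, so the application above it only ever handles Unicode strings.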

ajross · on Aug 10, 2009

OK, I'll byte: what Python platforms lack bytestrings? Is there a PDP-10 port I missed somewhere? :)

ubernostrum · on Aug 10, 2009

CPython 2.x and Python implementations built directly on it have bytestrings. PyPy does, as far as I know. And Unladen Swallow is a CPython fork, so it does.

But everything else either has strings which are natively Unicode (and hence not byte-based in the sense we want here) or runs on platforms where the underlying string abstraction is natively Unicode. This includes Jython, IronPython, Python 3.x, etc.
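A small sketch of what that looks like on a natively Unicode platform, assuming Python 3. The function names are made up for illustration. The point is that ISO-8859-1 maps every byte value to the code point with the same number, so a Unicode string restricted to code points up to 0xff can stand in for a bytestring.

```python
def bytes_as_native_str(raw):
    # Each byte 0x00-0xFF becomes the code point with the same value.
    return raw.decode("iso-8859-1")


def native_str_as_bytes(text):
    # Raises UnicodeEncodeError if any code point is above 0xFF.
    return text.encode("iso-8859-1")


def fits_in_bytestring(text):
    # True when the Unicode string could pass as a "bytestring".
    return all(ord(ch) <= 0xFF for ch in text)


all_bytes = bytes(range(256))
as_text = bytes_as_native_str(all_bytes)      # 256 characters, one per byte
back_again = native_str_as_bytes(as_text)     # lossless round trip
```

Anything outside that range, such as the euro sign (U+20AC), has no single-byte form here and fails the check.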

WSGI would like very much for those platforms to do something else, because it has to expend a bit of verbiage to basically say "shame on you for having these dangerous Unicode strings, don't you dare try to take advantage of them!"