The Python devs didn’t want to make huge changes because they were worried Python 3 would end up taking forever like Perl 6. Instead they went to the other extreme and broke everyone’s code for trivial reasons and minimal benefit, which meant no-one wanted to upgrade.
Even the main driver for Python 3, the bytes-Unicode split, has unfortunately turned out to be sub-optimal. Python essentially bet on UTF-32 (with space-saving optimisations), while everyone else has chosen UTF-8.
> Python essentially bet on UTF-32 (with space-saving optimisations)
How so? Python3 strings are unicode and all the encoding/decoding functions default to utf-8. In practice this means all the python I write is utf-8 compatible unicode and I don't ever have to think about it.
UTF-32 allows for constant time character accesses, which means that mystr[i] isn't O(n). Most other languages can only provide constant time access for code units.
UTF-32 allows for constant time access to code points. Neither UTF-8 nor UTF-16 can do the same (there are 2 to the power of 20 valid code points, though not all are in use).
While most characters might be encodable as a single code point, Python does not normalize strings, so there is no guarantee that even relatively normal characters are actually stored as single code points.
Internally Python holds a string as an array of uint32. A utf-8 representation is created on demand from it (and cached). So pansa2 is basically correct [^1].
IMO, while this may not be optimal, it's far better than the more arcane choice made by other systems. For example, due to reasons only Microsoft can understand, Windows is stuck with UTF-16.
[1] Actually it's more intelligent. For example, Python automatically uses uint8 instead of uint32 for ASCII strings.
There is no caching of a "utf-8 representation". You may check for example:
>>> x = '日本語'*100000000
>>> import time
>>> t = time.time(); y = x.encode(); time.time() - t # takes nontrivial time
>>> t = time.time(); y = x.encode(); time.time() - t # not cached; not any faster
Generally, the only reason this would happen implicitly is for I/O; actual operations on the string operate directly on the internal representation.
Python uses either 8, 16 or 32 bits per character according to the maximum code point found in the string; uint8 is thus used for all strings representable in Latin-1, not just "ASCII". (It does have other optimizations for ASCII strings.)
The reason for Windows being stuck with UTF-16 is quite easy to understand: backwards compatibility. Those APIs were introduced before there supplementary Unicode planes, such that "UTF-16" could be equated with UCS-2; then the surrogate-pair logic was bolted on top of that. Basically the same thing that happened in Java.
> There is no caching of a "utf-8 representation".
No there certainly is. This is documented in the official API documentation:
UTF-8 representation is created on demand and cached in the Unicode object.
https://docs.python.org/3/c-api/unicode.html#unicode-objects
In particular, Python's Unicode object (PyUnicodeObject) contains a field named utf8. This field is populated when PyUnicode_AsUTF8AndSize() is first called and reused thereafter. You can check the exact code I'm talking about here:
The C API may provide for it, but I'm not seeing a way to access that from Python. This sort of thing is provided for people writing C extensions who need to interface to other C code.
(And the code search seems to be broken; it can't find me the definition of `unicode_fill_utf8` although I'm sure it's obvious enough.)
> all the encoding/decoding functions default to utf-8
Languages that use UTF-8 natively don't need those functions at all. And the ones in Python aren't trivial - see, for example, `surrogateescape`.
As the sibling comment says, the only benefit of all this encoding/decoding is that it allows strings to support constant-time indexing of code points, which isn't something that's commonly needed.
> Python essentially bet on UTF-32 (with space-saving optimisations), while everyone else has chosen UTF-8.
It did nothing of the sort. UTF-8 is the default source file encoding and has been the target for many APIs. It likely would have been the default for all I/O stuff if we lived in a world where Windows had functioning Unicode in the terminal the whole time and didn't base all its internal APIs on UTF-16.
I assume you're referring to the internal representation of strings. Describing it as "UTF-32 with space-saving optimizations" is missing the point, and also a contradiction in terms. Yes, it is a system that uses the same number of bytes per character within a given string (and chooses that width according to the string contents). This makes random access possible. Doing anything else would have broken historical expectations about string slicing. There are good arguments that one shouldn't write code like that anyway, but it's hard to identify anything "sub-optimal" about the result except that strings like "I'm learning 日本語" use more memory than they might be able to get away with. (But there are other strings, like "ℍℯℓ℗", that can use a 2-byte width while the UTF-8 encoding would add 3 bytes per character.)
It's wrong. Python3 eliminated mountains of annoying bugs that happened all over the code base because of mixing of unicode strings and byte strings. Python2 was an absolute mess.
reply