gjvc · 2026-03-21T09:46:54 1774086414

making it nice and fast; hello. it was nice and fast on windows 2000 and they've been fucking it up ever since.

gjvc · 2026-03-21T02:17:57 1774059477

ended with windows 2000

gjvc · 2026-03-20T03:13:44 1773976424

useless

gjvc · 2026-03-20T03:13:35 1773976415

useless

gjvc · 2026-03-20T03:13:27 1773976407

useless

gjvc · 2026-03-17T22:57:30 1773788250

yes. it was not a massive shift. it was barely worth the effort.

pansa2 · 2026-03-17T23:19:38 1773789578

The Python devs didn’t want to make huge changes because they were worried Python 3 would end up taking forever like Perl 6. Instead they went to the other extreme and broke everyone’s code for trivial reasons and minimal benefit, which meant no-one wanted to upgrade.

Even the main driver for Python 3, the bytes-Unicode split, has unfortunately turned out to be sub-optimal. Python essentially bet on UTF-32 (with space-saving optimisations), while everyone else has chosen UTF-8.

diziet_sma · 2026-03-18T00:32:49 1773793969

> Python essentially bet on UTF-32 (with space-saving optimisations)

How so? Python3 strings are unicode and all the encoding/decoding functions default to utf-8. In practice this means all the python I write is utf-8 compatible unicode and I don't ever have to think about it.

sheept · 2026-03-18T00:54:00 1773795240

UTF-32 allows for constant time character accesses, which means that mystr[i] isn't O(n). Most other languages can only provide constant time access for code units.
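
To make the code point versus code unit distinction concrete, here is a minimal sketch. The sample string is my own choice, not from the thread. It counts one string three ways: as code points (what Python indexes), as UTF-16 code units, and as UTF-8 code units.

```python
# One character outside the Basic Multilingual Plane, between two ASCII letters.
s = "a\U0001F600b"

code_points = len(s)                            # Python str indexes code points
utf16_units = len(s.encode("utf-16-le")) // 2   # 2-byte code units (surrogate pair counts twice)
utf8_units = len(s.encode("utf-8"))             # 1-byte code units

print(code_points, utf16_units, utf8_units)
print(s[1] == "\U0001F600")   # indexing returns the whole code point
```

In a language that indexes UTF-16 or UTF-8 code units, position 1 would land on half of a surrogate pair or on the first byte of a four-byte sequence.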

msl · 2026-03-18T08:06:49 1773821209

UTF-32 allows for constant time access to code points. Neither UTF-8 nor UTF-16 can do the same (the code space has 1,114,112 code points, a little over 2 to the power of 20, though not all are in use).

While most characters might be encodable as a single code point, Python does not normalize strings, so there is no guarantee that even relatively normal characters are actually stored as single code points.

Try this in Python:

  s = "a\u0308"
  print(s)
  print(s[0])

You will see:

  ä
  a
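
As a follow-up sketch, the standard `unicodedata` module can normalize explicitly. It is not applied automatically. This assumes NFC is the form you want; other forms behave differently.

```python
import unicodedata

s = "a\u0308"                           # 'a' followed by COMBINING DIAERESIS
nfc = unicodedata.normalize("NFC", s)   # compose into a single code point where possible

print(len(s), len(nfc))
print(nfc == "\u00e4")                  # precomposed LATIN SMALL LETTER A WITH DIAERESIS
```

After normalization `nfc[0]` is the whole character, but only because a precomposed form happens to exist for this sequence.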

cloudbonsai · 2026-03-18T03:43:51 1773805431

Internally Python holds a string as an array of uint32. A utf-8 representation is created on demand from it (and cached). So pansa2 is basically correct [1].

IMO, while this may not be optimal, it's far better than the more arcane choice made by other systems. For example, due to reasons only Microsoft can understand, Windows is stuck with UTF-16.

[1] Actually it's more intelligent. For example, Python automatically uses uint8 instead of uint32 for ASCII strings.

zahlman · 2026-03-18T07:19:43 1773818383

There is no caching of a "utf-8 representation". You may check for example:

  >>> x = '日本語'*100000000
  >>> import time
  >>> t = time.time(); y = x.encode(); time.time() - t # takes nontrivial time
  >>> t = time.time(); y = x.encode(); time.time() - t # not cached; not any faster

Generally, the only reason this would happen implicitly is for I/O; actual operations on the string operate directly on the internal representation.

Python uses either 8, 16 or 32 bits per character according to the maximum code point found in the string; uint8 is thus used for all strings representable in Latin-1, not just "ASCII". (It does have other optimizations for ASCII strings.)
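
The per-string width can be observed from Python itself with a rough sketch like the one below. It assumes CPython 3.3 or later, since `sys.getsizeof` reports implementation-specific sizes. `bytes_per_char` is a helper name made up for this illustration.

```python
import sys

def bytes_per_char(ch: str) -> float:
    """Estimate storage per character by comparing two lengths of the same string.

    Subtracting the two sizes cancels the fixed object header (CPython-specific).
    """
    return (sys.getsizeof(ch * 2000) - sys.getsizeof(ch * 1000)) / 1000

# ASCII, Latin-1, a BMP character, and a supplementary-plane character.
for ch in ("a", "\u00e9", "\u210d", "\U0001F600"):
    print(f"U+{ord(ch):04X}: {bytes_per_char(ch)} bytes per character")
```

On such a build this should report 1, 1, 2 and 4 bytes respectively, matching the Latin-1 / 16-bit / 32-bit split described above.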

The reason for Windows being stuck with UTF-16 is quite easy to understand: backwards compatibility. Those APIs were introduced before there were supplementary Unicode planes, such that "UTF-16" could be equated with UCS-2; then the surrogate-pair logic was bolted on top of that. Basically the same thing that happened in Java.

cloudbonsai · 2026-03-18T09:56:17 1773827777

> There is no caching of a "utf-8 representation".

No there certainly is. This is documented in the official API documentation:

    UTF-8 representation is created on demand and cached in the Unicode object.

    https://docs.python.org/3/c-api/unicode.html#unicode-objects

In particular, Python's Unicode object (PyUnicodeObject) contains a field named utf8. This field is populated when PyUnicode_AsUTF8AndSize() is first called and reused thereafter. You can check the exact code I'm talking about here:

https://github.com/python/cpython/blob/main/Objects/unicodeo...

Is it clear enough?

zahlman · 2026-03-18T19:26:35 1773861995

The C API may provide for it, but I'm not seeing a way to access that from Python. This sort of thing is provided for people writing C extensions who need to interface to other C code.

(And the code search seems to be broken; it can't find me the definition of `unicode_fill_utf8` although I'm sure it's obvious enough.)

nslsm · 2026-03-18T04:41:03 1773808863

Read first paragraph here https://devblogs.microsoft.com/oldnewthing/20190830-00/?p=10...

pansa2 · 2026-03-18T01:07:26 1773796046

> all the encoding/decoding functions default to utf-8

Languages that use UTF-8 natively don't need those functions at all. And the ones in Python aren't trivial - see, for example, `surrogateescape`.

As the sibling comment says, the only benefit of all this encoding/decoding is that it allows strings to support constant-time indexing of code points, which isn't something that's commonly needed.

laurencerowe · 2026-03-18T01:34:55 1773797695

They absolutely do, because random byte strings are not valid UTF-8. Safe Rust requires validating bytes when converting to strings because of this.
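
The same validation point can be sketched in Python rather than Rust. The sample bytes are my own choice. Strict decoding rejects bytes that are not valid UTF-8, and the `surrogateescape` error handler mentioned above lets them round-trip anyway.

```python
raw = b"caf\xe9"   # Latin-1 encoded bytes; 0xE9 alone is not valid UTF-8

try:
    raw.decode("utf-8")            # strict decoding validates the input
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# surrogateescape maps each undecodable byte to a lone surrogate (U+DC80..U+DCFF)...
escaped = raw.decode("utf-8", errors="surrogateescape")
# ...and encoding with the same handler restores the original bytes.
round_trip = escaped.encode("utf-8", errors="surrogateescape")

print(strict_ok, repr(escaped), round_trip == raw)
```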

zahlman · 2026-03-18T07:11:32 1773817892

> Python essentially bet on UTF-32 (with space-saving optimisations), while everyone else has chosen UTF-8.

It did nothing of the sort. UTF-8 is the default source file encoding and has been the target for many APIs. It likely would have been the default for all I/O stuff if we lived in a world where Windows had functioning Unicode in the terminal the whole time and didn't base all its internal APIs on UTF-16.

I assume you're referring to the internal representation of strings. Describing it as "UTF-32 with space-saving optimizations" is missing the point, and also a contradiction in terms. Yes, it is a system that uses the same number of bytes per character within a given string (and chooses that width according to the string contents). This makes random access possible. Doing anything else would have broken historical expectations about string slicing. There are good arguments that one shouldn't write code like that anyway, but it's hard to identify anything "sub-optimal" about the result except that strings like "I'm learning 日本語" use more memory than they might be able to get away with. (But there are other strings, like "ℍℯℓ℗", that can use a 2-byte width while the UTF-8 encoding would need 3 bytes per character.)
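
For the two example strings, the trade-off can be put in numbers with a small sketch. It counts only character payload and ignores object headers. `payload_sizes` is a hypothetical helper, and the 2-byte figure assumes every character fits in the Basic Multilingual Plane, which holds for both strings.

```python
def payload_sizes(s: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, byte count at a fixed 2 bytes per character)."""
    assert max(map(ord, s)) < 0x10000, "2-byte width only applies to BMP-only strings"
    return len(s.encode("utf-8")), 2 * len(s)

for s in ("I'm learning 日本語", "ℍℯℓ℗"):
    utf8, fixed2 = payload_sizes(s)
    print(f"{s!r}: utf-8={utf8} bytes, fixed 2-byte width={fixed2} bytes")
```

The first string is smaller as UTF-8 (22 bytes against 32) and the second is smaller at a fixed 2-byte width (8 bytes against 12).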

rjh29 · 2026-03-18T00:16:58 1773793018

Ironically Perl 5 managed to do the bytes-Unicode split with a feature gate, no giant major version change.

gjvc · 2026-03-18T05:19:18 1773811158

this must be right, i'm getting downvoted

boxed · 2026-03-18T06:21:34 1773814894

It's wrong. Python3 eliminated mountains of annoying bugs that happened all over the code base because of mixing of unicode strings and byte strings. Python2 was an absolute mess.

zahlman · 2026-03-18T07:20:22 1773818422

Please don't do this.

gjvc · 2026-03-15T03:48:25 1773546505

would have been better without Cook's signature

gjvc · 2026-03-15T03:48:07 1773546487

would have been better without Cook's signature

gjvc · 2026-03-15T03:46:15 1773546375

hoped it would be Steve Crocker; was disappointed.

gjvc · 2026-03-06T18:48:30 1772822910

as long as you get your correction in that's all that matters. no respect for the dead have you.

parl_match · 2026-03-06T19:26:47 1772825207

maybe it's just me, but it shows more respect when people are getting the date right