I don't understand this comment. The hash method changed under different release...

lqdc13 · on Nov 18, 2015

Ok, try running this in a program like so:

    d = {chr(i):i for i in range(65,91)}
    print(d)

Do it with python2 and python3. You'll see that the output in python3 changes every time.

If someone was relying on consistent ordering, they're going to have a bug.

Python2's ordering is deterministic[0]

[0] https://docs.python.org/2/library/stdtypes.html#dict.items

dalke · on Nov 19, 2015

Okay, we're talking about the same thing. As the documentation points out:

> If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond.

If you restart Python, you have broken the correspondence.
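
To make the quoted guarantee concrete, here is a minimal sketch using the dictionary from your example. It only checks correspondence within a single process, which is all that passage promises:

```python
# Same dictionary as in the example upthread.
d = {chr(i): i for i in range(65, 91)}

# Take the three views with no modification to d in between.
keys = list(d.keys())
values = list(d.values())
items = list(d.items())

# The documented guarantee: the views line up pairwise.
pairs_match = list(zip(keys, values)) == items
print(pairs_match)
```

Nothing in this check compares the order against a previous run of the interpreter.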

You'll note that either the documentation is incomplete or your interpretation is incorrect, as the most recent versions of 2.6 and 2.7 will use a randomized hash table when the -R flag is enabled:

    % ~/Python-2.7.10/python.exe -R x.py
    {'M': 77, 'L': 76, 'O': 79, 'N': 78, 'I': 73, 'H': 72,
     'K': 75, 'J': 74, 'E': 69, 'D': 68, 'G': 71, 'F': 70,
     'A': 65, 'C': 67, 'B': 66, 'Y': 89, 'X': 88, 'Z': 90,
     'U': 85, 'T': 84, 'W': 87, 'V': 86, 'Q': 81, 'P': 80,
     'S': 83, 'R': 82}
    % ~/Python-2.7.10/python.exe -R x.py
    {'Z': 90, 'Y': 89, 'X': 88, 'W': 87, 'V': 86, 'U': 85,
     'T': 84, 'S': 83, 'R': 82, 'Q': 81, 'P': 80, 'O': 79,
     'N': 78, 'M': 77, 'L': 76, 'K': 75, 'J': 74, 'I': 73,
     'H': 72, 'G': 71, 'F': 70, 'E': 69, 'D': 68, 'C': 67,
     'B': 66, 'A': 65}

I can totally understand how people expect an invariant order. As I pointed out, our regression code broke in the 2.x series because we relied on consistent ordering, and CPython never made that promise. But what I quoted above is the only guarantee about dictionary order. Everything else is an implementation accident.

Nor is it the only such implementation-specific behavior that people sometimes depend on.

  >>> for c in "This is a test":
  ...   if c is "i": print "Got one!"
  ... 
  Got one!
  Got one!

That's under CPython, where single-character strings with ord(c) < 256 use an intern table. PyPy doesn't print anything because it doesn't use that mechanism.

Note that 'is' testing is also faster:

    % python -mtimeit -s 's="testing 1, 2, 3."*1000' 'sum(1 for c in s if c is "t")'
    1000 loops, best of 3: 893 usec per loop
    % python -mtimeit -s 's="testing 1, 2, 3."*1000' 'sum(1 for c in s if c == "t")'
    1000 loops, best of 3: 1.01 msec per loop

This extra 10% is sometimes attractive.
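
The portable spelling is `==`, which compares values and so doesn't depend on any intern table. A minimal sketch (the helper name `count_char` is made up for illustration):

```python
def count_char(text, target):
    """Count occurrences of a one-character string using value equality."""
    # '==' compares contents, so the result is the same whether or not
    # the interpreter happens to share objects for short strings.
    return sum(1 for c in text if c == target)


print(count_char("This is a test", "i"))
```

Swapping `==` for `is` in that function would make the count depend on whether the implementation reuses string objects, which is exactly the accident described above.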

lqdc13 · on Nov 19, 2015

Sure, but why did you enable the -R flag?

We are talking about the default behavior of by far the most commonly used implementation.

I'm not saying someone should have relied on the specific ordering or that the code that does rely on it is a great way of doing things.

CPython2 did make that promise -

    CPython implementation detail: Keys and values are listed in an arbitrary order which is non-random, varies across Python implementations, and depends on the dictionary’s history of insertions and deletions.

In other words, the order should be the same no matter how many times you restart the application.

It is a subtle source of bugs since it always worked in each specific version of CPython2 without the -R flag.

dalke · on Nov 19, 2015

"why did you enable the -R flag?"

Because either the documentation means to include -R in the description, in which case your interpretation of the documentation is incorrect, or the documentation is incomplete because it doesn't describe a valid CPython 2.x run-time. Either way, it indicates that the difference isn't, strictly speaking, a Python2/3 issue.

"an arbitrary order"

Where does it say that the arbitrary order must be consistent across multiple invocations? Quoting from https://docs.python.org/2/using/cmdline.html#cmdoption-R :

> Changing hash values affects the order in which keys are retrieved from a dict. Although Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds), enough real-world code implicitly relies on this non-guaranteed behavior that the randomization is disabled by default.

I totally understand your point. I remember the debates about how this would break code. But it's there to mitigate algorithmic complexity attacks against an ever-increasing attack surface. This was the best solution they came up with, along with a migration path to the new default.
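
If you want to see the effect without hunting for a particular build, here is a rough sketch that runs a child interpreter under fixed hash seeds. It assumes an interpreter that honors the PYTHONHASHSEED environment variable (documented on the same page as -R); the names `SNIPPET` and `run_with_seed` are mine. It iterates a set rather than a dict so that the printed order depends only on the string hashes:

```python
import os
import subprocess
import sys

# Code run in the child process: iteration order of a set of strings.
SNIPPET = "print(list(set(chr(i) for i in range(65, 91))))"


def run_with_seed(seed):
    """Run SNIPPET in a fresh interpreter with a fixed hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.check_output([sys.executable, "-c", SNIPPET], env=env)
    return out.decode().strip()


if __name__ == "__main__":
    # The same seed reproduces the same order; other seeds may not.
    for seed in (1, 1, 2):
        print(seed, run_with_seed(seed))
```

Pinning the seed this way is one option for regression tests that compare printed output, though sorting the keys before printing avoids the dependency altogether.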