Efficiently Generating Python Hash Collisions

svat · on July 24, 2019

Although this article is about collisions in the hash function applied to strings, for numeric values Python uses a hash() function that is not only easy to generate collisions for but actually to invert (and find all inverses). I learned about this when writing this answer:

• https://stackoverflow.com/questions/56227419/why-does-python...

and details and code on how to invert are here:

• https://stackoverflow.com/a/56248241/4958

(Needless to say, this is quite straightforward and trivial compared to collisions on strings; nevertheless it may be of some interest to someone.)

ErikCorry · on July 23, 2019

Why is the running time of the attacked python process exponential? I would expect quadratic.

6gvONxR4sf7o · on July 23, 2019

It's probably using the colloquial version of the word (which I hate). "Exponentially" = "hugely, but I want to sound technical." Or the still frustrating but more acceptable "exponentially" = "by orders of magnitude."

sansnomme · on July 24, 2019

Maybe OP should use P and NP instead.

dataflow · on July 24, 2019

Was wondering the same thing. My guess is he just meant 1000 -> 100 -> 10 -> 1 -> exponential.

rurban · on July 24, 2019

It's actually linear. It's searching a key in a linked list O(n), instead of constant access O(1). Still the amplification factor is big enough.

josefx · on July 24, 2019

Creating the dict performs n * O(n) searches in total, so wouldn't it be O(n^2) ?

dataflow · on July 24, 2019

Depends on if "it" refers to the creation of the dict or to lookups thereafter...

josefx · on July 24, 2019

I mean the attack as a whole. I think having a linear lookup alone wouldn't be as problematic if it didn't turn the creation of the dictionary itself quadratic. The attack wouldn't scale near as well if it ran in O(n).

petters · on July 23, 2019

I think that the author is mistaken or careless when making that particular statement.

jshowa3 · on July 23, 2019

Wow. This is an awesome article. Very well put together and explained step by step with pictures.

Wish all articles were like this.

punnerud · on July 23, 2019

How is it fixed in Python 3.3+?

mcintyre1994 · on July 23, 2019

I think it's this:

> Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

> This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

> Changing hash values affects the iteration order of sets. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

> See also PYTHONHASHSEED.

--

From https://docs.python.org/3/reference/datamodel.html#object.__... (linked from https://docs.python.org/3/whatsnew/3.3.html)

H8crilA · on July 24, 2019

Yup, don't ever rely on any random hash function, even if widely used, to be repeatable between processes. I've seen a deployment of a small caching system that used the Python's built in hash function. The author of the code was wondering why does he get a ~zero hit rate (processes were short lived in that setup).

Related: iterating over hash maps is a very common way to get nondeterministic output. I've removed a lot of inconsistency bugs by fixing the hash map iteration order. Usually by just using the sorted map, since the performance of a randomly chosen piece of code in a large system is almost irrelevant.

osandov · on July 24, 2019

It was actually fixed in Python 3.4 with the switch to use SipHash, as the randomization introduced in Python 3.3 was still susceptible to some attacks. See https://www.python.org/dev/peps/pep-0456/

rurban · on July 24, 2019

Still not secure if you can read out the seed. Brute forcing SipHash collisions with the known seed is also just a matter of seconds to minutes. http://perl11.org/blog/seed.html

A proper security fix can only be to fix the collision resolution, never using a slower hash function. It's also 10x faster then.

minitech · on July 24, 2019

> Still not secure if you can read out the seed.

Which you do as an attacker by… asking politely? Or is it easy to leak the seed by accident?

In other news, there’s a relatively low-cost attack on AES when you know the key.

rurban · on July 25, 2019

With dynamic languages like python it's trivial. Calculate the position and peek it.

With static languages it's still easy when you got enough information: source code, timing info and ordering (e. g JSON). With a proper SAT solver doable.

Leaking by accident is e.g even more trivial in perl, just set a magic ENV var. But peeking the fixed offset is easiest.

It's all just security theatre. There's no real use-case for something so slow as Siphash.

minitech · on July 25, 2019

Do you often give DoS threats arbitrary code execution or the ability to set environment variables? Because there’s no need to bother with hash collisions to DoS yourself.

  while True: pass

As for finding the seed with a SAT solver: no, not doable. Not from actual hashes, let alone timing and order hints at actual hashes.

tromp · on July 24, 2019

Finding (cycles of) collisions in siphash is used as a memory-hard Proof-of-Work:

https://github.com/tromp/cuckoo

rurban · on July 25, 2019

Only for hash tables which use all bits. (e.g. prime sized).

Almost nobody does. For normal power-of-2 hash tables finding collisions in the lowest bits leading to a successful amplification is trivial by brute-force. Since Siphash is so slow it can last up to 4 minutes, for smaller tables still < 5s.

brian_spiering · on July 23, 2019

"hash randomization is enabled by default" https://docs.python.org/3/whatsnew/3.3.html