Hacker News new | past | comments | ask | show | jobs | submit login
Efficiently Generating Python Hash Collisions (leeholmes.com)
101 points by ssully on July 23, 2019 | hide | past | favorite | 22 comments



Although this article is about collisions in the hash function applied to strings, for numeric values Python uses a hash() function that is not only easy to generate collisions for but actually to invert (and find all inverses). I learned about this when writing this answer:

https://stackoverflow.com/questions/56227419/why-does-python...

and details and code on how to invert are here:

https://stackoverflow.com/a/56248241/4958

(Needless to say, this is quite straightforward and trivial compared to collisions on strings; nevertheless it may be of some interest to someone.)


Why is the running time of the attacked python process exponential? I would expect quadratic.


It's probably using the colloquial version of the word (which I hate). "Exponentially" = "hugely, but I want to sound technical." Or the still frustrating but more acceptable "exponentially" = "by orders of magnitude."


Maybe OP should use P and NP instead.


Was wondering the same thing. My guess is he just meant 1000 -> 100 -> 10 -> 1 -> exponential.


It's actually linear. It's searching a key in a linked list O(n), instead of constant access O(1). Still the amplification factor is big enough.


Creating the dict performs n * O(n) searches in total, so wouldn't it be O(n^2) ?


Depends on if "it" refers to the creation of the dict or to lookups thereafter...


I mean the attack as a whole. I think having a linear lookup alone wouldn't be as problematic if it didn't turn the creation of the dictionary itself quadratic. The attack wouldn't scale near as well if it ran in O(n).


I think that the author is mistaken or careless when making that particular statement.


Wow. This is an awesome article. Very well put together and explained step by step with pictures.

Wish all articles were like this.


How is it fixed in Python 3.3+?


I think it's this:

> Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

> This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

> Changing hash values affects the iteration order of sets. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

> See also PYTHONHASHSEED.

--

From https://docs.python.org/3/reference/datamodel.html#object.__... (linked from https://docs.python.org/3/whatsnew/3.3.html)


Yup, don't ever rely on any random hash function, even if widely used, to be repeatable between processes. I've seen a deployment of a small caching system that used the Python's built in hash function. The author of the code was wondering why does he get a ~zero hit rate (processes were short lived in that setup).

Related: iterating over hash maps is a very common way to get nondeterministic output. I've removed a lot of inconsistency bugs by fixing the hash map iteration order. Usually by just using the sorted map, since the performance of a randomly chosen piece of code in a large system is almost irrelevant.


It was actually fixed in Python 3.4 with the switch to use SipHash, as the randomization introduced in Python 3.3 was still susceptible to some attacks. See https://www.python.org/dev/peps/pep-0456/


Still not secure if you can read out the seed. Brute forcing SipHash collisions with the known seed is also just a matter of seconds to minutes. http://perl11.org/blog/seed.html

A proper security fix can only be to fix the collision resolution, never using a slower hash function. It's also 10x faster then.


> Still not secure if you can read out the seed.

Which you do as an attacker by… asking politely? Or is it easy to leak the seed by accident?

In other news, there’s a relatively low-cost attack on AES when you know the key.


With dynamic languages like python it's trivial. Calculate the position and peek it.

With static languages it's still easy when you got enough information: source code, timing info and ordering (e. g JSON). With a proper SAT solver doable.

Leaking by accident is e.g even more trivial in perl, just set a magic ENV var. But peeking the fixed offset is easiest.

It's all just security theatre. There's no real use-case for something so slow as Siphash.


Do you often give DoS threats arbitrary code execution or the ability to set environment variables? Because there’s no need to bother with hash collisions to DoS yourself.

  while True: pass
As for finding the seed with a SAT solver: no, not doable. Not from actual hashes, let alone timing and order hints at actual hashes.


Finding (cycles of) collisions in siphash is used as a memory-hard Proof-of-Work:

https://github.com/tromp/cuckoo


Only for hash tables which use all bits. (e.g. prime sized).

Almost nobody does. For normal power-of-2 hash tables finding collisions in the lowest bits leading to a successful amplification is trivial by brute-force. Since Siphash is so slow it can last up to 4 minutes, for smaller tables still < 5s.


"hash randomization is enabled by default" https://docs.python.org/3/whatsnew/3.3.html




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: