Although this article is about collisions in the hash function applied to strings, for numeric values Python uses a
hash() function that is not only easy to generate collisions for but actually to invert (and find all inverses). I learned about this when writing this answer:
It's probably using the colloquial version of the word (which I hate). "Exponentially" = "hugely, but I want to sound technical." Or the still frustrating but more acceptable "exponentially" = "by orders of magnitude."
I mean the attack as a whole. I think having a linear lookup alone wouldn't be as problematic if it didn't turn the creation of the dictionary itself quadratic. The attack wouldn't scale near as well if it ran in O(n).
> Note: By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.
> This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.
> Changing hash values affects the iteration order of sets. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).
Yup, don't ever rely on any random hash function, even if widely used, to be repeatable between processes. I've seen a deployment of a small caching system that used the Python's built in hash function. The author of the code was wondering why does he get a ~zero hit rate (processes were short lived in that setup).
Related: iterating over hash maps is a very common way to get nondeterministic output. I've removed a lot of inconsistency bugs by fixing the hash map iteration order. Usually by just using the sorted map, since the performance of a randomly chosen piece of code in a large system is almost irrelevant.
It was actually fixed in Python 3.4 with the switch to use SipHash, as the randomization introduced in Python 3.3 was still susceptible to some attacks. See https://www.python.org/dev/peps/pep-0456/
Still not secure if you can read out the seed. Brute forcing SipHash collisions with the known seed is also just a matter of seconds to minutes.
http://perl11.org/blog/seed.html
A proper security fix can only be to fix the collision resolution, never using a slower hash function. It's also 10x faster then.
With dynamic languages like python it's trivial. Calculate the position and peek it.
With static languages it's still easy when you got enough information: source code, timing info and ordering (e. g JSON). With a proper SAT solver doable.
Leaking by accident is e.g even more trivial in perl, just set a magic ENV var. But peeking the fixed offset is easiest.
It's all just security theatre. There's no real use-case for something so slow as Siphash.
Do you often give DoS threats arbitrary code execution or the ability to set environment variables? Because there’s no need to bother with hash collisions to DoS yourself.
while True: pass
As for finding the seed with a SAT solver: no, not doable. Not from actual hashes, let alone timing and order hints at actual hashes.
Only for hash tables which use all bits. (e.g. prime sized).
Almost nobody does. For normal power-of-2 hash tables finding collisions in the lowest bits leading to a successful amplification is trivial by brute-force. Since Siphash is so slow it can last up to 4 minutes, for smaller tables still < 5s.
• https://stackoverflow.com/questions/56227419/why-does-python...
and details and code on how to invert are here:
• https://stackoverflow.com/a/56248241/4958
(Needless to say, this is quite straightforward and trivial compared to collisions on strings; nevertheless it may be of some interest to someone.)