I personally find it amazing that everything we have ever made (every photograph, every book, every movie) and everything we will ever make can be given a unique 256-bit fingerprint with a negligible chance of collision before the heat death of the universe.
An interesting comparison: 2^256 ≈ 1e77, which is about 0.001x the estimated number of atoms in the observable universe (~1e80). So with 256 bits we almost have enough combinations to assign one to every particle _there is_. Considering we've made far fewer works than there are particles, you could probably get quite far on collision resistance just by assigning fingerprints randomly. Another way to look at it: UUIDs are only 128 bits (the square root of that many combinations, about 1e38), and they're still pretty collision resistant. The other amazing part is that we can use a deterministic hash function to pick the fingerprint _based on the content_ in a way that preserves collision resistance.
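For anyone who wants to check the arithmetic, a quick Python sanity check (the ~1e80 atom count is the usual rough estimate for the observable universe, not a precise figure):

    # Rough comparison of 256-bit and 128-bit spaces to the estimated atom count.
    atoms_in_universe = 1e80                 # commonly cited rough estimate

    print(f"2^256 ≈ {2**256:.2e}")           # ≈ 1.16e+77
    print(f"2^128 ≈ {2**128:.2e}")           # ≈ 3.40e+38
    print(f"fraction of atoms coverable: {2**256 / atoms_in_universe:.4f}")  # ≈ 0.0012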
For a similar thought experiment, consider the space of all 1920x1080 RGB images, which necessarily contains every frame of every film ever made or that ever _could_ be made, screenshots of future stock prices (including all the wrong ones), and pictures of your great-great-great-grandchildren (or not). Then consider that randomly sampling that space almost always gives you white noise, which suggests our 'content' is only a vanishingly small fraction of the total space. :)
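A rough sense of how big that space is, counting raw 24-bit-per-pixel bit patterns only (no compression or perceptual arguments, just counting):

    # Number of distinct 1920x1080 images at 24 bits per pixel.
    import math

    bits = 1920 * 1080 * 24                  # 49,766,400 bits per image
    digits = bits * math.log10(2)            # decimal digits in 2**bits
    print(f"distinct images ≈ 10^{digits:,.0f}")   # ≈ 10^14,981,179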
My ISP hands out IPv6 /48 subnets. The upside of this is that every customer of theirs gets 2^80 addresses, enough to give every grain of sand on the planet a couple hundred thousand each.
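A quick check of that figure, assuming the commonly quoted (and obviously rough) ~7.5e18 estimate for the number of grains of sand on Earth:

    # A /48 prefix leaves 128 - 48 = 80 bits of address space per customer.
    addresses = 2 ** 80                      # ≈ 1.2e24 addresses
    grains_of_sand = 7.5e18                  # rough, commonly quoted estimate
    print(f"{addresses:.2e} addresses ≈ {addresses / grains_of_sand:,.0f} per grain of sand")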
Yes. Even with 128 bits (UUIDs), I like that the policy is simply to treat each one as universally unique, without needing to check any system for collisions.
As Raymond Chen put it, if you're worrying about UUID collisions, you also need to worry about far more probable things, such as random memory bit flips.
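A rough sketch of why treating random (v4) UUIDs as unique is sound, using the standard birthday-bound approximation (a v4 UUID has 122 random bits; the other 6 are fixed version/variant bits):

    import math
    import uuid

    print(uuid.uuid4())                      # random UUID, different every run

    # Birthday-bound collision probability for n random 122-bit values.
    n = 10**12                               # say, a trillion UUIDs
    p = -math.expm1(-n * (n - 1) / (2 * 2**122))
    print(f"P(any collision among {n:.0e} UUIDs) ≈ {p:.1e}")   # ≈ 9e-14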
I still frequently see MD5 and SHA1 being used "because the output is smaller than other hash functions and we only need a unique identifier". There's also a common assumption that this implies those functions are faster.
While a 128-bit output is indeed perfectly fine for most applications, MD5 and SHA1, in addition to being affected by practical attacks, are slow compared to SHA256, BLAKE2, etc. Most importantly, it's perfectly fine to truncate the output of a cryptographic hash function to any length. So if you need a 128-bit hash, just use SHA256 and truncate the output to 128 bits. This is faster than MD5 and more secure (truncation even blunts length extension attacks).
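A minimal sketch of that approach in Python (hashlib here, but any SHA256 implementation works the same way; the function name is just for illustration):

    import hashlib

    def fingerprint128(data: bytes) -> bytes:
        """128-bit identifier: SHA-256 truncated to the first 16 bytes."""
        return hashlib.sha256(data).digest()[:16]

    print(fingerprint128(b"hello world").hex())

If you'd rather have a native 128-bit output, hashlib.blake2b(data, digest_size=16) gives you one directly and is also very fast.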
I wonder if you could construct a hash collision for high-PageRank sites in the Google (or Bing) index. You would need to know what hash algorithm Google uses to store URLs. This assumes they hash the URLs for their indexing, which surely they do. MD5 and SHA1 existed when Google was founded, but hash collisions weren't a big concern until later, IIRC. You'd want a fast algorithm, because you have to hash every URL you encounter on every page, and that adds up quickly.
The conventional maximum length of a URL is 2048 characters, but I wouldn't be surprised if there are plenty of longer, non-compliant URLs in the wild. If you were limited to 2048 characters and a valid URL format, I suspect it would be hard, if not impossible, to build a URL with the same MD5 as an arbitrary high-ranking URL like "https://nytimes.com/". But what if you just wanted to piggyback on the PageRank of any mid-to-high-rank site? Is there a URL in the top million ranked URLs you could create an MD5 collision with?
I doubt Google would use a URL hash as strong and as slow as MD5. Maybe Page and Brin weren't even thinking about cryptographic hashes, just a good mixing function.
Google has developed several non-cryptographic hash functions. CityHash, FarmHash, and HighwayHash come to mind. BigQuery provides FarmHash next to the MD5 and SHA-family of cryptographic hash functions. Who knows, maybe they use FarmHash today to index page URLs.
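For flavor, here's a toy non-cryptographic 64-bit hash (FNV-1a). It is emphatically not what Google uses (CityHash/FarmHash are much better mixers), just a compact illustration of the "fast mixing function over a URL" idea:

    # FNV-1a, 64-bit: a few integer ops per byte, no cryptographic guarantees.
    def fnv1a_64(data: bytes) -> int:
        h = 0xcbf29ce484222325                     # FNV offset basis
        for byte in data:
            h ^= byte
            h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF   # FNV prime, mod 2^64
        return h

    print(hex(fnv1a_64(b"https://nytimes.com/")))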
I don't remember the details, but I think there was a legend at some point about two different search queries hashing to the same 64-bit integer and causing problems.
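That wouldn't be too surprising at scale: by the birthday bound, a 64-bit hash has an even chance of at least one collision somewhere around five billion distinct inputs:

    import math

    # Inputs needed for ~50% chance of a collision with a 64-bit hash.
    print(f"{math.sqrt(2 * math.log(2) * 2**64):.2e}")   # ≈ 5.06e+09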
> Is there a URL in the top million ranked URLs you could create an MD5 collision with?
This is answered in the article: no.
"Q: Is it possible to make a file get an arbitrary MD2/MD4/MD5/MD6/SHA1/SHA2/SHA3, or the same hash as another file? A: No."
I'm not sure page ranking works that way.
But you could get people to share www.nytimes.com/2011/05/02/world/asia/osama-bin-laden-is-killed.html?garbage=blahblah and have it collide with your own URL, crappyproduct.com?rubbish=meh. That's a chosen-prefix collision, where you control the junk appended to both URLs, which is actually practical for MD5, unlike the second-preimage attack ruled out above.