Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Great write-up! We did something very similar when trying to find duplicate product images for a consumer review site we were working on. Our implementation desaturated the image, broke it into a fixed number of tiles, and generated a histogram of each tile's values. Then we put a threshold on the histogram values, and had each value in the tile represent a bit. Combine the bits, and we had a hash to store in the DB. Our hashes were a bit larger, so images within a certain hamming distance were flagged, rather than just looking for exact hash matches. It took quite a bit of tuning for us to get good results, but it seemed to work pretty well. Do you see many false positives with such a small processed image size (the 9x8 one, I mean)?


In practice we are using an image size of 17x16 which will result in a hash size of 256 bits and currently it seems to work pretty well. I ran the algorithm through the whole dataset (about 330.000+ icons) and I would say that from all the duplicate matches about 1% where false positives.

Also, we will be integrating this into the reviewing process for an iconset, where we also do a manual quality check, showing possible matches to something currently uploaded so skimming over one or two false positives isn't such a big deal and we where more interested in the speed of the algorithm.


That's pretty impressive performance given the hash size and speed. Thanks for sharing!




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: