I am not a crypto expert, but I thought that the idea was to produce a new more ...

tribby · on May 1, 2018

this had me confused at first too, but I think the author's point is that if the initial data comes in a predictable form (e.g. an IP address that is x.x.x.x where x is 0-254, email addresses that are mostly short and ends in "gmail.com," etc), salts don't really save the hash from being brute forced, they just save the hash from being brute forced with a rainbow table. the author's post isn't about passwords, per se, but how the kind of datapoints we often hash for the sake of anonymization are really only pseudo-anonymous, or at least a lot weaker than people might expect for a string of x length.

that said, bcrypt, PBKDF2, and other time/work-based hashing solutions are still very good options for this.

bglusman · on May 1, 2018

For the specific use case in question, what I've been doing for years is not just hashing the data, but hashing an internal secret AND the data. The secret isn't stored in the database anywhere (usually an env var but could be a secret in vault or other outside config), so our hashes are deterministic (and don't need a seperate salt for each one), but our hashes will never cooincide with another system's hashes. I didn't see this mentioned in article but didn't read thoroughly, I thought this was a pretty good compromise but curious for other perspectives/forget if I read this technique somewhere or just made it up as a reasonably good safeguard.

GordonS · on May 1, 2018

> hashing an internal secret AND the data

This is essentially how HMAC works (HMAC is mentioned in the article). It's generally considered safer to use a 'real' HMAC algorithm instead of rolling your own.

aflagatopamoon · on May 1, 2018

This is called a pepper:

https://en.wikipedia.org/wiki/Pepper_(cryptography)

If you read the article above, you'll see that you still need a salt, since users with very simple passwords will have the same hash: crack one, and you can crack the others for free.

bglusman · on May 11, 2018

Yeah I should have been clear this wasn't for passwords at all, ever, this was/is only for other kinds of PII

rossy · on May 1, 2018

This is addressed in the article under "What if we make the hash slow?" You might be tempted to use salted hashes, but apparently this only works for protecting data that is supposed to be unpredictable, like passwords, and it's not too much of a setback if the data is easily predictable, like if it's a username or email (or presumably, an IP address or a timestamp.)

> Even with something like bcrypt at reasonable work factors, a database of 100,000 anonymous users would take less than a day on a single cpu core to test every bcrypted entry for the string “knisbet” and unmask my secret data.

your-nanny · on May 1, 2018

Would someone explain to me why this is true (or not).

kevin_nisbet · on May 1, 2018

OP Here.

With my understanding of bcrypt, it's an algorithm that's designed to be slow (and the implementation providing guarantee's to resist attempts to significantly speed it up), and the slowness is tuneable through a work factor. So usually you would target something like 200-500ms. Long enough to be slow if you have to make billions of guesses, but still fast enough that when someone enters their correct password they're not sitting around waiting for you're login to complete.

If my database is something like:

  salt1 + alex = aaaaaaa
  salt2 + ben = bbbbbbbb
  salt3 + kevin = ccccccc

* with 100,000 entries

This means if I want to find user = kevin in this database, I take the salt + username, and test if it produces a match in the database. Once I get to salt3 + kevin, and produce cccccc, and see that cccccc is in the database, I've now unmasked that user.

At 500ms, a single CPU can test 172,800 entries per day, which could easily scan the database. If it's millions, or tens of millions of users, you do need more resources, and if you want to unmask every user it will take some time, but it becomes plausible for a moderately sophisticated adversary with moderate resources.

The algorithms like bcrypt, scrypt etc are great for passwords, where depending on the password used, you need to make billions of guesses. However, if you reduce the problem space down to millions or thousands of guesses, because you already know the username, the email address, or the date of birth because this is not secret information but public information, these algorithms being slow helps significantly, but not enough to work alone.

your-nanny · on May 1, 2018

I see, so assuming the salt is known (which I guess is a reasonable assumption in a large fraction of cases).

kevin_nisbet · on May 1, 2018

Yes, strictly speaking a salt is not a secret, and would generally be stored with the data you are salting.

If you change the semantics and make the salt a secret that is stored separately, it does make this difficult to attack, but the advice I was given is it would be better to use hmac, which is already designed to work this way based on storing a secret.

jonnycomputer · on May 1, 2018

Okay, that's what I thought. Under my current use-case, I think what I'm doing is quite adequate. But, your post is very relevant to what I think I'm going to need to do at some point soon, so thank you!

busterarm · on May 1, 2018

Engineering happened somewhere along the way. Something was too slow, or couldn't be finished that hour/day/sprint and compromises were made. Then those compromises were shipped.