I am not a crypto expert, but I thought that the idea was to produce a new more or less random salt for EACH password, store the salt with the hashed password, hashing using an expensive algorithm.
Yes the hacker steals the salt with the hash, but now has to go to the trouble of brute forcing that ONE password with its UNIQUE (or almost unique) salt.
In other words, the hacker can crack it, but the process is so expensive for ONE password that cracking an entire database of passwords is a nightmare.
Of course, the hacker just focuses on the most privileged accounts I guess, but the idea is to make the hackers life as unpleasant as possible, and to catch the hacker while they are coming back in. Am I missing the point? I do see that if the hacker wants one password, they can do with effort even with unique salts.
this had me confused at first too, but I think the author's point is that if the initial data comes in a predictable form (e.g. an IP address that is x.x.x.x where x is 0-254, email addresses that are mostly short and ends in "gmail.com," etc), salts don't really save the hash from being brute forced, they just save the hash from being brute forced with a rainbow table. the author's post isn't about passwords, per se, but how the kind of datapoints we often hash for the sake of anonymization are really only pseudo-anonymous, or at least a lot weaker than people might expect for a string of x length.
that said, bcrypt, PBKDF2, and other time/work-based hashing solutions are still very good options for this.
For the specific use case in question, what I've been doing for years is not just hashing the data, but hashing an internal secret AND the data. The secret isn't stored in the database anywhere (usually an env var but could be a secret in vault or other outside config), so our hashes are deterministic (and don't need a seperate salt for each one), but our hashes will never cooincide with another system's hashes. I didn't see this mentioned in article but didn't read thoroughly, I thought this was a pretty good compromise but curious for other perspectives/forget if I read this technique somewhere or just made it up as a reasonably good safeguard.
This is essentially how HMAC works (HMAC is mentioned in the article). It's generally considered safer to use a 'real' HMAC algorithm instead of rolling your own.
If you read the article above, you'll see that you still need a salt, since users with very simple passwords will have the same hash: crack one, and you can crack the others for free.
This is addressed in the article under "What if we make the hash slow?" You might be tempted to use salted hashes, but apparently this only works for protecting data that is supposed to be unpredictable, like passwords, and it's not too much of a setback if the data is easily predictable, like if it's a username or email (or presumably, an IP address or a timestamp.)
> Even with something like bcrypt at reasonable work factors, a database of 100,000 anonymous users would take less than a day on a single cpu core to test every bcrypted entry for the string “knisbet” and unmask my secret data.
With my understanding of bcrypt, it's an algorithm that's designed to be slow (and the implementation providing guarantee's to resist attempts to significantly speed it up), and the slowness is tuneable through a work factor. So usually you would target something like 200-500ms. Long enough to be slow if you have to make billions of guesses, but still fast enough that when someone enters their correct password they're not sitting around waiting for you're login to complete.
If my database is something like:
salt1 + alex = aaaaaaa
salt2 + ben = bbbbbbbb
salt3 + kevin = ccccccc
* with 100,000 entries
This means if I want to find user = kevin in this database, I take the salt + username, and test if it produces a match in the database. Once I get to salt3 + kevin, and produce cccccc, and see that cccccc is in the database, I've now unmasked that user.
At 500ms, a single CPU can test 172,800 entries per day, which could easily scan the database. If it's millions, or tens of millions of users, you do need more resources, and if you want to unmask every user it will take some time, but it becomes plausible for a moderately sophisticated adversary with moderate resources.
The algorithms like bcrypt, scrypt etc are great for passwords, where depending on the password used, you need to make billions of guesses. However, if you reduce the problem space down to millions or thousands of guesses, because you already know the username, the email address, or the date of birth because this is not secret information but public information, these algorithms being slow helps significantly, but not enough to work alone.
Yes, strictly speaking a salt is not a secret, and would generally be stored with the data you are salting.
If you change the semantics and make the salt a secret that is stored separately, it does make this difficult to attack, but the advice I was given is it would be better to use hmac, which is already designed to work this way based on storing a secret.
Okay, that's what I thought. Under my current use-case, I think what I'm doing is quite adequate. But, your post is very relevant to what I think I'm going to need to do at some point soon, so thank you!
Engineering happened somewhere along the way. Something was too slow, or couldn't be finished that hour/day/sprint and compromises were made. Then those compromises were shipped.