The False Allure of Hashing for Anonymization (gravitational.com)
200 points by twakefield on April 30, 2018 | 101 comments



I saw a case a few years ago where the management of a company I knew were worried that the sales team were covering their mistakes and blaming them on the (external) dev team's code. They asked me to take a look into it one morning.

At first glance there didn't seem to be a lot to go on. There was no auditing in the application itself, so I focused on the nginx logs. It's amazing how clear a picture you can create from IP addresses, user agent strings and accessed URLs.

Within an hour I could say with a high degree of certainty that the story was something like:

    Sales rep makes mistake with record on Friday afternoon
    Monday morning - at home, late for work
    Receives call from another rep re mistake
    Logs in via mobile device to see the issue
    Logs in via desktop to fix broken record
    Arrives at work 1.5 hours later
    Claims dev team had broken the record for the weekend
There's a lot of information lurking in log files (let alone insecure DBs), and that's just the tip of the iceberg of what's stored these days. I dread to think how much personal information is held in some of the bigger CRM apps.

Quite frankly, I'm glad there's now a push to start thinking about this stuff from the outset.


I think the bigger problem was the culture that company had in place that would lead people to do that.


Sometimes it's just what the people bring with them, even if the company has "good" culture (whatever that means).

And even in companies with the best culture, I would expect such things to happen if the cost of a mistake is comparable to a person's yearly salary or above that.


Absolutely! Then again, no more broken than a number of other places I’ve worked over the years. I could write a book on that subject :-)


In digital security there is the concept of "defense in depth": no one product, feature, approach or safeguard is going to magically protect you from attacks. What's required are multiple overlapping layers of protection that work together to create a more protected whole.

We're seeing more of this with privacy and user data. The author very correctly points out some issues with hashing and "pure" anonymization. It's more correctly considered "pseudonymization" (which is a recommended GDPR technique [1]).

All of which is to say _it's still an improvement over nothing_ and when layered with other techniques can help protect user privacy.

1 - https://blog.varonis.com/gdpr-requirements-list-in-plain-eng...


Defense in depth is a lesson other industries have learned - that's why airliners are incredibly safe these days. It's not safe because parts don't fail - they do fail, as the recent engine compressor failure showed. But the airliner is designed to withstand those failures, the pilot is trained to deal with them, and the process is designed to prevent them from happening again.

Notably the Fukushima Nuke plant and Deepwater Horizon disasters did not have defense in depth. One failure each had a zipper effect.

(Of course, defense in depth is a concept from the military, look how medieval castles are constructed for a very visible implementation of it.)


Author Here.

Using crypto hashes to anonymize data is one of those mistakes I've seen several times, and I wanted to draw some attention to the issue so that hopefully we can all learn from it.

Let me know if you have any questions.


So one of the issues here is using an externally visible ID (or a transformation of such) as an internal ID. Why not create a random int64 at account creation time which is invisibly linked to the public username (eg, email address). So now you've got a proper join key, you can restrict access to the map, and it's easy to delete the map entry when the user unsubscribes.
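
Something like this, as a rough Python sketch (the names are made up, not from any real system):

  import secrets

  # mapping from public identifier (email) to random internal ID;
  # access to this table is restricted, and logs/analytics only ever see the ID
  id_map = {}

  def create_account(email):
      internal_id = secrets.randbits(64)   # random join key, unrelated to the email
      id_map[email] = internal_id
      return internal_id

  def forget_user(email):
      # deleting the mapping row severs the link between stored events and the person
      id_map.pop(email, None)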

(There can still be good reasons to apply one-way hashing to the random internal UUID, as well: for example, to provide different levels of logs access to different internal users. People who make dashboards get hashed ids, and people who debug logging get raw ids.)

The problem of entropy allowing individual user identification even with all IDs scrubbed is still very real, though, and non-trivial to undertake. One can start by wrapping the query engine with a service which checks that a certain minimum number of people are covered by a given query before returning the results. Or apply differential privacy-type transformations to the output...
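
For the query-wrapper idea, a minimal sketch (the threshold and the run_query callable are assumptions for illustration):

  MIN_COHORT = 50   # arbitrary threshold, purely illustrative

  def guarded_query(run_query, sql):
      # run_query is whatever function executes the SQL and returns rows as dicts;
      # the wrapper refuses to return results that describe too few distinct people
      rows = run_query(sql)
      distinct_users = len({row["user_id"] for row in rows})
      if distinct_users < MIN_COHORT:
          raise PermissionError(f"result covers only {distinct_users} users")
      return rows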


This is along the lines of where I was going with the alternative approach, I just simplified it for brevity. :)

In the case of Teleport, I think this is a bit more difficult to achieve because we don't necessarily have our own account database. Our common commercial use case is integration with an identity provider through SAML/OIDC, which I'm not sure would consistently offer a random ID per account to use.

While there are many ways we could generate and store the username <-> random ID mappings, this adds a certain amount of complexity to get right in a distributed system.

If building a system from scratch with end to end control, I do prefer the random identifier approach.


Then the user emails you to ask what personal data of theirs you have on the server. Now you don't have a connection, so you can't find it, but you still have it. GDPR non-compliance.


It's your mapping, so you can easily gather up everything with the given marker and hand it back to them. You only throw away the key (and delete attached data) if the user deletes their account (and maybe after some additional time elapses, in case they change their mind or were hacked); it's the same process as GDPR per-user encryption key deletion.


If you throw away the key you still have the data, just encrypted. There is no guarantee that in 5 years the user data couldn't be easily decrypted.


But there's no reason to believe that will be possible either. By that same reasoning it might be possible 'in 5 years' to recover the erased (and overwritten) data from the storage device, so you never can delete anything.

If you use something such as AES 256, which is approved for use to encrypt 'top secret' information by the NSA, and through some miracle it turns out that we can easily decrypt such data in 5 years, then I'm pretty sure you can argue in court that you were following best practices and had no reasonable way of predicting this encryption disaster.


'Key' here refers to the key in the mapping from external to internal userID. The whole point is that (as mentioned in a sibling comment) choosing an internal user ID uniformly at random is equivalent to a one-time pad; it's guaranteed non-decryptable, unless you invent a time machine...


Isn't there a distinction here, though? While they might result in a similar outcome, deletion is different from de-identification.


Well, the NSA slurps all Internet traffic, so by that definition, no encrypted communication is possible.


If you can't connect it to the user in any way, it's no longer personal information. Expect the data protection agency to compliment you.


> If you can't connect it to the user in any way, it's no longer personal information

Just because you can't connect it doesn't mean nobody else can.


http://www.privacy-regulation.eu/en/r26.htm

... account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.

The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable. ...

It's sufficient if one can't reasonably reconnect the data back to the user. It doesn't need to be NSA-proof.


It doesn't say that information cannot be _reasonably_ reconnected, but that you shouldn't be able to reconnect it at all.

I don't know how you have drawn that it shouldn't be NSA-proof from this text if it literally says "in such a manner that the data subject is not or no longer identifiable."


It's in the original link; I may have limited the quote too much:

... To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used ...


Thanks for the quote. Wow. I wonder if this odd definition doesn't render "unidentifiable" to mean "almost certainly identifiable by someone, with a current technique" - since, given enough techniques, most of them will be statistically unusual. I admit it's a start, but mangling semantics that baldly gives me the willies.

The parallel history of cryptography is little more than a history of overconfidence re what counters were thought to be likely, and not. Do we really need to recapitulate that?


For all practical purposes, a secure, one-way cryptographic hash is irreversible.


I'm thinking of a number between 1 and 100.

It's bcrypt hash is: '$2b$15$qUxzZ5ZF55lMuqiH9GMjQOHkNyee86qd2Vh2kQyF5P3U6JZJx9AEC'

I bet nobody could ever reverse this secure cryptographic hash to figure out what it could be... ;)


I think you need to address the examples in the article in order to assert this.


Isn't this essentially the uuid() function that many databases support natively (even the black sheep MySQL)?


Have you looked into VRFs (verifiable random functions)?

This is essentially a public-key-crypto version of HMACs. You have a piece of data and a blinding factor; both are committed into a single point on an elliptic curve, and the blinding factor is not exposed as-is (as it would be in the case of a salt for hashed content), but is instead committed into a public key (a point).

The resulting point preserves the algebraic structure, which allows you to sign proofs about it without leaking the blinding factor. You can even sign a proof to a "designated verifier", which makes the proof useful only to a specific party; anyone else wouldn't be able to trust it, as it could have been forged by that party.


No I haven't, this sounds very intriguing though, I'll have to do some research.


Technically, a substitution lookup table (like the one proposed at the end) is analogous to one-time-pad 'encryption'. In this case the 'pad' is only used within a single (extended time domain) context, and is presumably selectively exposed and used only in contexts where intercepting the context already reveals that data anyway.

Additional security could be added by making a session-unique identifier (not based on user, chronological, or external context data) and only having the master lookup table for user to sessions in an elevated security environment.


Yeah, it sounds a lot like deterministic encryption. Which is OK if your input data is essentially uniform random.

But if it's not, and your adversary knows the distribution of the input data, then the protection level is pretty close to zero.


I think this is the best advice:

“And most importantly, we only collect enough data to fulfill our stated purpose. The fewer data points that we collect, the less opportunity that someone can correlate the data.”

The smaller the domain, the less anonymization works to conceal which user id did an action. However, if you think about it, identity is far more than user id. That is why real anonymity means not storing any unnecessary information from other domains. We have a technique where we use iframes to display a person’s name, friends etc. back to them based on user ids, but the enclosing domain knows only the user ids and their connections.


Please elaborate on whether hashing is a good pseudonymization strategy in the context of GDPR guidelines.


I thought the blog post made it pretty clear that it isn't.


The blog post wasn't about

> pseudonymization strategy in the context of GDPR guidelines.

i.e. Doing the minimal work to meet the guidelines. Effectiveness is incidental to that goal. Maybe I misunderstood the GP tho.


This is a question I've thought about recently, as I'm going to be working with the kind of data that may have damaging personal repercussions if identified with you, but that is good for society as a whole to be tracking, and that tracking doesn't have to be personally identifiable. Something like: it could be bad for me if it was revealed to my insurance company that I drove more than 5000 miles a year on a motorcycle, but beneficial for society as a whole to understand accident rates for high-mileage motorcycle drivers. Do you have any thoughts/resources on how one could go about creating a privacy environment where users could input how many miles they drove, and where we have reporting that analyzes the information they put in? My first thought had been hashing primary keys, but as you point out in your article, that obviously isn't the best answer.


Differential privacy and other formalized systems are a good choice, but if you never need to give the data back or present it as such to the customer/inputter, you can get heuristic Pretty Good Anonymization if you understand the structure of your problem and how you're going to use it.

For example, taking your motor vehicle trip scenario, off the top of my head the things that can ID you are, in order:

  Driver's License
  Name
  Vehicle License Plate
  Time, Location of trip
  Trip Distance
  Location of driver residence
  Location of driver workplace
If you had a database of these things, you could apply some of the strategies in the article, and a few others to ensure no collisions.

  Driver's License: Ditch it, 
  hash it with private key or have a lookup table
  somewhere. I'd favor ditching it.
  Name: Same as DL number
  Vehicle License Plate: Same as DL number
For the above 3, you really may only need a few variables that are less constrained (gender, approximate age, type of vehicle), so you could just compute out to those and store only that result.

  Time, Location of trip: Fudge these +- random time, or +- random distance from start/finish. 
  Careful not to have it be a dumb random circle, Strava does this, given enough public rides I'm sure people 
  could figure out where I live. (maybe do this as function of population density?)
  Trip Distance: Fudge +- random distance
  Location of driver residence: Fudge to begin with, probably ditch if possible
  Location of driver workplace: Ditto
The point is: think about what you need from the dataset and deliberately mess it up so that you'd have to have the original to piece it together. Often, you don't need the exact input data, but something within a random delta of it, so just keep the stuff within a random delta.
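
A rough sketch of that kind of fudging in Python (the delta sizes are arbitrary assumptions, not recommendations):

  import random

  def fuzz_trip(start_ts, lat, lon, distance_km):
      # shift the start time by up to +/- 30 minutes
      ts = start_ts + random.uniform(-1800, 1800)
      # jitter the start point by up to roughly a kilometre in each direction
      # (about 0.01 degrees; a real system should scale this with population density)
      lat += random.uniform(-0.01, 0.01)
      lon += random.uniform(-0.01, 0.01)
      # fudge the trip distance by up to +/- 5%
      distance_km *= random.uniform(0.95, 1.05)
      return ts, lat, lon, distance_km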


But if I do need the original data back - say, the driver needs to produce an expense report with the hours - what would you do in that case? I have thoughts, but I'm trying to bounce them off someone else.


If you need to provide the data back to the customer, then maybe the right answer is to follow the same standards as financial institutions and health companies do. In practice, that comes down to ensuring that no individual has access to the underlying data without extreme monitoring of how that data moves around and is used. This is a rather large burden though, so I can understand if that's too much for your use case.

Things we do:

  - Rotate passwords used to access networks/servers regularly
  - 2FA all the things
  - Only provide permissions to what a user needs
  - Limit it to just the time a user needs it
  - Logging+security scanning across the backend infrastructure
  - Tight monitoring of devices used to access network for patch level
  - Keep front-end networking infrastructure redundant and patched
  - Multiple levels of auth (vpn pw, vpn 2FA, then public/private key for each server, then 2FA for each server, etc.)
You can only do so much but you can make it so that it's harder to compromise the crown jewels.


That makes sense. The data set is going to be in the health area, and I'm less concerned about processes for the individuals in the organization having access (like what you've suggested) and more thinking about how to structure the data so we as an organization can't access it. Dealing with infectious disease, where there is personal benefit to not letting someone outside the care side know that you have a disease, but societal benefit to tracking trends, outbreaks, or hygiene around the disease. And figuring out how to structure the system so that if we were to sell, say, there wouldn't be this trove of information on who has what diseases, just who was a customer.

Thanks for your thoughts!


Store the deltas + the identifying information somewhere else as a lookup table and use a random ID to join to it. Keep the PII database secured, offline, or whatever makes you feel best, and then if anyone needs direct correlation back to the end user, it is done through a different process that ensures higher access controls/auditing, etc.


SHA256 pretty much ensures that you have a unique hash for every value - and that's a feature you don't want for anonymization. So why not simply take the first few bytes of a SHA256, a small enough set to ensure that collisions not only might happen but will happen? I mean, that's a required feature to ensure anonymization, not just pseudonymization - if you can select a whole trail of events for ID #123 and be sure that these represent all the events for some (unknown) real user, then that by itself means that those events aren't anonymous, they're pseudonymous.

You can tweak the hash length so that whatever statistics you run on the hashed data are meaningful (though not exact) despite the collisions, but a dictionary attack of plausible usernames returns an overwhelming number of false positives.
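
For example, a sketch with an arbitrarily chosen truncation length:

  import hashlib

  def lossy_id(username, prefix_bytes=2):
      # 2 bytes = 65,536 buckets, so with enough users collisions are guaranteed
      # and a dictionary-attack "hit" proves very little on its own
      digest = hashlib.sha256(username.encode()).digest()
      return digest[:prefix_bytes].hex()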


I'm not sure you could make the data statistically meaningful and still have too many false positives to deanonymize an ID. I think you're basically suggesting randomly grouping the IDs so they average X real IDs per grouped ID. At least if you just did it randomly instead of by hashing, there would be no danger of a dictionary attack.


The expectation is that a brute force attack would try orders of magnitude more IDs than you actually have. It means that if a random ID is 90% likely to have a unique hash and 10% likely to map to one of your real IDs, then your real data won't have that many collisions; however, if someone does a brute force check of (for example) a million email addresses, they'll get 100,000 positive responses, the vast majority of which will be false positives.


That's a reasonable point but doesn't explain why you're using hashes instead of random groupings in the first place.


The idea that data is a corporate asset has to die. Data is a corporate liability.


I agree that data is a liability; however, even my plumber's truck can kill someone and can be considered a liability. Regardless, the truck is not something he can do business without. I agree that companies should fear the data they retain much more than they do today.


To be fair, companies gather a lot more data than they need to do business these days.


Why?



Like a million cogs in a warehouse, it costs money to store customer data. Storage costs can quickly overwhelm the value of the stored product. These costs can be lower for improper storage - not wrapping cogs can leave them to rust, and not encrypting data can lead to it being stolen. Financial documents should reflect the risks of storing data, to communicate this liability to shareholders and others.

"We are storing PII (Personally Identifiable Information) on 100 million Americans. A data leak could lead to significant material damages, from settlements to an impaired public reputation."


Differential privacy seems like a pretty good approach to this problem. https://machinelearning.apple.com/2017/12/06/learning-with-p...


K-anonymity is the other thing I've come across:

https://www.privitar.com/listing/k-anonymity-an-introduction


Differential privacy is basically a buzzword. Don't believe the hype.


It seems to me that diffpriv is a nascent area of research that has not yet been bastardized by the business community. The complete opposite of a buzzword.


I am not a crypto expert, but I thought that the idea was to produce a new, more or less random salt for EACH password, store the salt with the hashed password, and hash using an expensive algorithm. Yes, the hacker steals the salt with the hash, but now has to go to the trouble of brute forcing that ONE password with its UNIQUE (or almost unique) salt. In other words, the hacker can crack it, but the process is so expensive for ONE password that cracking an entire database of passwords is a nightmare. Of course, the hacker just focuses on the most privileged accounts I guess, but the idea is to make the hacker's life as unpleasant as possible, and to catch the hacker while they are coming back in. Am I missing the point? I do see that if the hacker wants one password, they can do it with effort even with unique salts.


This had me confused at first too, but I think the author's point is that if the initial data comes in a predictable form (e.g. an IP address that is x.x.x.x where x is 0-255, email addresses that are mostly short and end in "gmail.com," etc.), salts don't really save the hash from being brute forced; they just save the hash from being brute forced with a rainbow table. The author's post isn't about passwords, per se, but about how the kinds of datapoints we often hash for the sake of anonymization are really only pseudo-anonymous, or at least a lot weaker than people might expect for a string of that length.

That said, bcrypt, PBKDF2, and other time/work-based hashing solutions are still very good options for this.


For the specific use case in question, what I've been doing for years is not just hashing the data, but hashing an internal secret AND the data. The secret isn't stored in the database anywhere (usually an env var, but it could be a secret in Vault or other outside config), so our hashes are deterministic (and don't need a separate salt for each one), but will never coincide with another system's hashes. I didn't see this mentioned in the article (though I didn't read it thoroughly). I thought this was a pretty good compromise, but I'm curious for other perspectives; I forget if I read this technique somewhere or just made it up as a reasonably good safeguard.
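
A minimal sketch of that pattern using a real HMAC rather than ad-hoc concatenation (the env var name is made up):

  import hashlib
  import hmac
  import os

  # the secret lives outside the database; "PII_HASH_SECRET" is illustrative
  SECRET = os.environ["PII_HASH_SECRET"].encode()

  def pseudonymize(value):
      # deterministic within this deployment, but useless without the secret,
      # and never matches another system's plain SHA-256 of the same value
      return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()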


> hashing an internal secret AND the data

This is essentially how HMAC works (HMAC is mentioned in the article). It's generally considered safer to use a 'real' HMAC algorithm instead of rolling your own.


This is called a pepper:

https://en.wikipedia.org/wiki/Pepper_(cryptography)

If you read the article above, you'll see that you still need a salt, since users with very simple passwords will have the same hash: crack one, and you can crack the others for free.


Yeah I should have been clear this wasn't for passwords at all, ever, this was/is only for other kinds of PII


This is addressed in the article under "What if we make the hash slow?" You might be tempted to use salted hashes, but apparently this only works for protecting data that is supposed to be unpredictable, like passwords, and it's not too much of a setback if the data is easily predictable, like a username or email (or presumably, an IP address or a timestamp).

> Even with something like bcrypt at reasonable work factors, a database of 100,000 anonymous users would take less than a day on a single cpu core to test every bcrypted entry for the string “knisbet” and unmask my secret data.


Would someone explain to me why this is true (or not)?


OP Here.

With my understanding of bcrypt, it's an algorithm that's designed to be slow (with the implementation providing guarantees to resist attempts to significantly speed it up), and the slowness is tunable through a work factor. So usually you would target something like 200-500ms. Long enough to be slow if you have to make billions of guesses, but still fast enough that when someone enters their correct password they're not sitting around waiting for your login to complete.

If my database is something like:

  salt1 + alex = aaaaaaa
  salt2 + ben = bbbbbbbb
  salt3 + kevin = ccccccc
* with 100,000 entries

This means if I want to find user = kevin in this database, I take the salt + username and test if it produces a match in the database. Once I get to salt3 + kevin, produce ccccccc, and see that ccccccc is in the database, I've now unmasked that user.

At 500ms, a single CPU can test 172,800 entries per day, which could easily scan the database. If it's millions, or tens of millions of users, you do need more resources, and if you want to unmask every user it will take some time, but it becomes plausible for a moderately sophisticated adversary with moderate resources.

Algorithms like bcrypt, scrypt, etc. are great for passwords, where depending on the password used you need to make billions of guesses. However, if you reduce the problem space down to millions or thousands of guesses (because the username, email address, or date of birth is public rather than secret information), the slowness of these algorithms helps significantly, but not enough to work alone.
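
A sketch of that scan, assuming the leaked table is a list of (row_id, bcrypt_hash) pairs; at ~500ms per check, testing one username against 100,000 rows is roughly 14 CPU-hours:

  import bcrypt

  def unmask(leaked_rows, guess):
      # each stored bcrypt hash embeds its own salt, so one slow checkpw call
      # per row is all it takes to test a known username against the table
      return [row_id for row_id, stored_hash in leaked_rows
              if bcrypt.checkpw(guess.encode(), stored_hash)]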


I see, so assuming the salt is known (which I guess is a reasonable assumption in a large fraction of cases).


Yes, strictly speaking a salt is not a secret, and would generally be stored with the data you are salting.

If you change the semantics and make the salt a secret that is stored separately, it does make this difficult to attack, but the advice I was given is that it would be better to use HMAC, which is already designed to work this way based on storing a secret.


Okay, that's what I thought. Under my current use-case, I think what I'm doing is quite adequate. But, your post is very relevant to what I think I'm going to need to do at some point soon, so thank you!


Engineering happened somewhere along the way. Something was too slow, or couldn't be finished that hour/day/sprint and compromises were made. Then those compromises were shipped.


I'm surprised that there was no mention of a salt used in a secure server to generate the hashes and act as an oracle. Adding pepper at the customer site already seemed like a good idea. Of course this is still hard and requires diligence for those who care about their customers and data security.


How would you prove that said server is secure?

A good rule of thumb with these things is to assume that if there's any sort of indirect link between some person and that server (even if it involves multiple hops across security boundaries - e.g. a web request invoking a backend service querying a database that accesses the hash from a stored procedure), it can potentially be compromised. You never know when another Meltdown happens, and what it'll look like.


The trouble is when we're holding on to the original data because we want the option to process it in new ways later on. The fundamental problem is that data correlates facts. Thus - as the article rightly points out - if you know some of the facts you can reconstruct identities.

I find the distinction between information and exformation revealing: Information is the bits we gleaned from the data, exformation is the bits we discarded while reducing the data. The efficacy of an information processing system is in how much it discards while extracting the information we need. The expensive operation is not the recording but the forgetting.

If you want to protect data from being stolen, distill it as soon as possible into the information you need. And destroy the rest. It comes down to the value of being able to re-run the analysis versus the effort to guard the data.


As someone who's had to work through the implications of GDPR lately, I think the future of user data is that you can't keep the option to "process it in new ways" later. Permissions are becoming opt-in instead of opt-out.


You probably can, but you need to be upfront about what you're collecting and the context that is being stored with it.

It MAY be more ethically permissible to degrade the context and preserve only the most valuable and least personally identifying data. (Such as saving only the actual search query and a local timestamp, but filtering out anything related to a recognized name that isn't famous)


"Anonymization" in the sense of transforming a dataset so that it's still useful but doesn't significantly reduce the privacy of the people it describes, is usually impossible, or at least beyond the state of the art. People start out with just a few tens of bits of anonymity and bits are everywhere.

You probably have a better chance of creating your own secure block cipher than of achieving this goal. In a similar way, your inability to see what's wrong with your scheme is not evidence that it works.

I don't like to be negative, and I'm all for continued research, but at this point the conservative thing to do with data that you need to "anonymize" is delete it.


Agreed. The more alarming angle to consider is that the more a particular dataset describes somebody: 1) the more valuable it is in the context of surveillance and advertising, 2) the more work good-faith actors should put into anonymising it, and most importantly, 3) the easier it is to de-anonymise through correlation with other sets.

~~People just aren't the unique snowflakes our mothers told us we are.~~ Most people for example can be uniquely (and easily) identified with just a DOB, first name, and suburb.

Edit: maybe the problem is actually that we are too unique :)


The author makes a good point: anonymizing data is hard. Unfortunately they don't mention differential privacy, a promising area of research that can help us solve these problems.

https://en.wikipedia.org/wiki/Differential_privacy


Where I work we've been debating about this a lot. I work with log data from CDNs, so user IP addresses get ingested. We use that information and correlate it with geoip services to determine stuff like the ISP being used.

This is so we can evaluate CDN performance and also see how well ISPs are doing in serving content to the user. So it's essentially asking questions about network performance rather than at a macro level of individual users.

As far as IPs are concerned we don't care much after that, other than maybe the odd "how many unique IP addresses were served today" type queries.

We've talked about doing the secret/salt that is rotated periodically, but to be safe you would definitely need to ensure previous salts are destroyed, and not even let people view them or access them when they are live.


Wouldn't storing the first three octets of an IP address be enough for this kind of analysis? Or use the whois database and reduce the data to the first IP address of the network?
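
A sketch of the first option for IPv4, using Python's ipaddress module:

  import ipaddress

  def truncate_ipv4(ip):
      # keep only the /24 network, e.g. 203.0.113.57 -> 203.0.113.0/24
      return str(ipaddress.ip_network(ip + "/24", strict=False))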


I personally think just storing the autonomous system the IP originates from and never writing the IPs to disk at all would be advisable if the goal is purely which ISPs are delivering how many bytes to end users. Another benefit is the AS to IP mapping database is small enough to fit in memory without issue.


That's probably insufficient for the usecase. A single AS can advertise many different routes for different IP blocks that have dramatic geographic differences.


How are you going to ask user for consent to process their IP this way?


Consent is not the only basis for legally processing data. There is not enough information in the above comment to determine which basis this company has determined their processing falls under.


When addressing the solution of adding data (a salt), I find the author's counter-argument unconvincing:

   Don’t get me wrong, this does make it significantly harder 
   to attack a leaked database to unmask every user, but the 
   resources required to do so or target specific users are 
   within the reach of many adversaries.
I don't see how it's more feasible to reverse hash(known_user+salt) than it is to dereference hash(salt), and even state-level actors can't do anything but attempt to brute-force hash(salt). IOW, without more behind the author's assertion, I don't buy that adding more data to the data you want to protect is insufficient protection, even against known targets.


The link to Cryptographic Right Answers is really helpful, and the kind of article it would be nice to make the general "go-to" for those of us who know a bit, but not enough to do it ourselves!

What I didn't like was the continual reference to AWS as if it is the only provider available, without qualifying whether it is specifically an AWS product that solves the problem or whether it is an example of using a cloud service to transfer the risk. There are many alternatives to AWS load balancers and Key Management systems, so the advice is tainted sigh


Why not a two-step process, where you (A) generate a hash from fixed user details and (B) use that hash to access a lookup-table for the final UUID? This combines some strengths of both systems:

1. Outsiders can't determine an arbitrary UUID, even if they know the original user-details.

2. You can easily destroy a relationship (to limit correlation or to comply with laws like GDPR) by erasing the corresponding row in the lookup table.

3. Insiders can't directly go backwards from UUID to real-name, due to the hashing step. They would need to generate hashes for all the users, and hope that matches still exist in the lookup table.
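
A minimal sketch of the two-step lookup (names and structure are illustrative only):

  import hashlib
  import uuid

  lookup = {}   # hash of user details -> random UUID; one deletable row per user

  def uuid_for(user_details):
      key = hashlib.sha256(user_details.encode()).hexdigest()
      if key not in lookup:
          lookup[key] = str(uuid.uuid4())   # the UUID carries no trace of the details
      return lookup[key]

  def forget(user_details):
      # erasing the row leaves orphaned UUIDs that can't be walked back to a person
      lookup.pop(hashlib.sha256(user_details.encode()).hexdigest(), None)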


Surprising that no mention is made of rainbow tables or lookup tables. If you hash something that can easily be looked up in a table, it's obviously not anonymous.

Passwords are stored as salted hashes for these obvious reasons...


The article explains very well how salted hashes don't help against username lookups.


In the case of salts, the article admits "Don’t get me wrong, this does make it significantly harder to attack a leaked database to unmask every user..."

So salts definitely do help. And if you choose your salt well (e.g. global fixed/rotating plus local/temporal) you significantly increase your protection compared to not using a salt at all.


Really good article. One of those things that are beyond obvious to those of us close to this field, but not at all obvious to the general software dev (who also mightn't know the difference between a good cryptographic hash and a good password hash).

What would make this article great is general ideas on what is a good way to anonymize data. I'm surprised that info is missing, actually.

What would make it world class great is discussion about GDPR ramifications, keeping in mind that one need not necessarily be perfect for GDPR, even if you're FB/Google.


Thanks for the feedback.

I was trying to avoid the general ideas on what is a good way to anonymize data, because I don't think there are general rules that apply, and I'm not in a position to give authoritative advice on this. The more I dug in, the more I realized this is probably one of the hardest technical problems that exists right now, and there isn't yet a right answer that works (like use scrypt for passwords).

As for GDPR, I think digging into this in more detail would be a great follow up.


Everyone is going to have different requirements, so yeah, it's hard to claim there is a general solution. But an idea or two can be thrown out there, like an anonymizer microservice that only remembers the mapping for a limited time period. Even stating explicitly that it's a hard problem, and a very, very hard problem if you want perfection, would be a worthwhile addition. As it stands, the article doesn't convey the difficulty of addressing the problem.


If our goal is true anonymisation, that is, even the host cannot know who the data belongs to, why are we hashing data at all, and not completely removing it? Replace the PII (name, email address, phone, etc.) with a fixed number of *'s. There's no reversing or guessing that.

If we are wanting information to be readable by some people in some circumstances, that's not anonymisation: that's data protection and an entirely different problem.


I think another problem is that we even call any of that "anonymization". If you replace "foobar" with "1", you haven't anonymized anything. At best, you have pseudonymized your data. Whether you use hashing or a secret mapping function, as long as identity within your dataset is preserved, what you are generating are pseudonyms.


> The way we’ve chosen to anonymize the data is by generating HMAC

You can also truncate the hash after the HMAC to mix the data of different users. It would still be useful for aggregate analytics, abuse protection, rate limiting, etc., but if each user shares an identifier with many others it would be harder to unmask them and make correlations.
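
Something like this sketch, where the 3-byte truncation is an arbitrary choice:

  import hashlib
  import hmac

  def bucketed_id(secret_key, user_id, keep_bytes=3):
      tag = hmac.new(secret_key, user_id.encode(), hashlib.sha256).digest()
      # keeping only a few bytes deliberately crams many users into each bucket:
      # still fine for rate limiting and aggregate counts, harder to unmask anyone
      return tag[:keep_bytes].hex()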



See also, why hashing is not a good way to discover shared contacts, and a better way: https://signal.org/blog/contact-discovery/


Any thoughts on using a UUID instead of a username hash?


A hash is (theoretically) an anonymous function.

How do you anonymously map your UUID to the anonymised signifier? Such that you cannot back it out yourself?

The properties that ensure this make the UUID useless.


tl;dr preimage resistance is only as strong as sizeof(input domain), which is probably small if you're trying to anonymize something.


If you don’t require deterministic hashes (and deterministic hashes are bad for anonymization anyway) just hash data+randomBytes(16) (obviously, don't save randomBytes(16) anywhere). There you are, nobody can bruteforce your hashes.

Even better, just replace your data with H(randomBytes(16)). Or a random UUID.


Umm what good is the string of random bytes if you don’t store it anywhere? The point of a one-way function is that its output is verifiable given the input.


Either your comment is nonsensical or you left a detail out. Per your comment, you may as well not record the data at all.



