Differential privacy and other formalized systems are a good choice, but if you never need to give the data back or present it as such to the customer/inputter, you can get heuristic Pretty Good Anonymization if you understand the structure of your problem and how you're going to use the data.
For example, taking your motor vehicle trips example, off the top of my head the things that can ID you are, roughly in order:
Driver's License
Name
Vehicle License Plate
Time, Location of trip
Trip Distance
Location of driver residence
Location of driver workplace
If you had a database of these things, you could apply some of the strategies in the article, and a few others, to ensure no collisions:
Driver's License: Ditch it, hash it with a secret key, or keep a lookup table somewhere. I'd favor ditching it.
Name: Same as DL number
Vehicle License Plate: Same as DL number
For the above 3, you may really only need a few less-constrained variables: gender, approximate age, type of vehicle. So you could just compute out to those and store only that result.
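A minimal sketch of that idea (Python; the field names are made up and the HMAC key is assumed to live somewhere outside the dataset): keyed-hash the identifier if you need a pseudonym at all, and compute out to the coarse attributes you actually need.

```python
import hmac
import hashlib

# Assumption: the key lives in a vault/HSM, not in source control or next to the data.
SECRET_KEY = b"replace-with-a-key-stored-elsewhere"

def pseudonymize(value: str) -> str:
    """Keyed hash (HMAC-SHA256): without the key, nobody can brute-force a
    license plate or DL number back out, unlike a plain unsalted hash."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()

def coarsen_driver(record: dict) -> dict:
    """Drop the direct identifiers; keep only loosely constrained attributes."""
    return {
        # Keep a keyed pseudonym only if you must re-link trips later; otherwise ditch it.
        "driver_pseudonym": pseudonymize(record["drivers_license"]),
        "gender": record["gender"],
        "age_bucket": (record["age"] // 10) * 10,  # e.g. 34 -> 30
        "vehicle_type": record["vehicle_type"],    # "sedan", not the plate
    }
```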
Time, Location of trip: Fudge these by +- a random time, or +- a random distance from the start/finish. Careful not to make it a dumb random circle; Strava does this, and given enough public rides I'm sure people could figure out where I live. (Maybe scale the fudge as a function of population density?)
Trip Distance: Fudge +- random distance
Location of driver residence: Fudge to begin with, probably ditch if possible
Location of driver workplace: Ditto
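To make the "fudge within a random delta" idea concrete, here's a rough sketch for the time/location fields above (Python; population_density_at() is a hypothetical helper standing in for whatever density data you have, and the scaling constants are made up):

```python
import math
import random
from datetime import datetime, timedelta

def fudge_time(t: datetime, max_minutes: float = 30) -> datetime:
    """Shift a timestamp by a random amount in [-max_minutes, +max_minutes]."""
    return t + timedelta(minutes=random.uniform(-max_minutes, max_minutes))

def fudge_location(lat: float, lon: float, max_offset_km: float) -> tuple:
    """Move a point a random distance (up to max_offset_km) in a random direction.
    Uses the rough ~111 km per degree of latitude approximation, fine for fuzzing."""
    bearing = math.radians(random.uniform(0, 360))
    dist_km = random.uniform(0, max_offset_km)
    dlat = (dist_km / 111.0) * math.cos(bearing)
    dlon = (dist_km / (111.0 * math.cos(math.radians(lat)))) * math.sin(bearing)
    return lat + dlat, lon + dlon

def fudge_location_by_density(lat: float, lon: float) -> tuple:
    """Avoid the dumb-random-circle problem: scale the offset by how dense the
    area is, so a rural start point gets a much bigger fudge than a city block."""
    density = population_density_at(lat, lon)            # people per km^2 -- hypothetical helper
    max_offset_km = max(0.5, 50.0 / max(density, 1.0))   # denser -> smaller offset (made-up scaling)
    return fudge_location(lat, lon, max_offset_km)
```

Trip distance gets the same treatment: add a random +- delta to it.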
The point is: think about what you need from the dataset and deliberately mess it up, so that you'd need the original to piece it back together. Often you don't need the exact input data, just something within a random delta of it, so only keep the stuff within a random delta.
But what if I do need the original data back, say the driver needs to produce an expense report with the hours? What would you do in that case? I have thoughts, but I'm trying to bounce them off someone else.
If you need to provide the data back to the customer, then maybe the right answer is to follow the same standards as financial institutions and health companies do. In practice, that comes down to ensuring that no individual has access to the underlying data without extreme monitoring of how that data moves around and is used. This is a rather large burden though, so I can understand if that's too much for your use case.
Things we do:
- Rotate passwords used to access networks/servers regularly
- 2FA all the things
- Only provide permissions to what a user needs
- Limit it to just the time a user needs it
- Logging+security scanning across the backend infrastructure
- Tight monitoring of the patch level of devices used to access the network
- Keep front-end networking infrastructure redundant and patched
- Multiple levels of auth (vpn pw, vpn 2FA, then public/private key for each server, then 2FA for each server, etc.)
You can only do so much, but you can make it harder to compromise the crown jewels.
That makes sense. The data set is going to be in the health area, and I'm less concerned about processes for the individuals in the organization having access (like what you've suggested) and more about how to structure the data so that we as an organization can't access it. We're dealing with infectious disease, where there is a personal benefit to not letting anyone outside the care side know that you have a disease, but a societal benefit to tracking trends, outbreaks, or hygiene around the disease. And I'm figuring out how to structure the system so that if we were to sell, say, there wouldn't be this trove of information on who has what diseases, just who was a customer.
Store the deltas + the identifying information somewhere else as a lookup table and use a random ID to join to it. Keep the PII database secured, offline, or whatever makes you feel best, and then if anyone needs direct correlation back to the end user, it's done through a different process that ensures higher access controls, auditing, etc.
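A minimal sketch of that split, assuming a simple Python pipeline (the field names and the uuid-based link ID are just illustrative):

```python
import uuid

def split_record(record: dict, pii_fields: set) -> tuple:
    """Split a raw record into (analytics_row, pii_row), joined only by a random ID.
    The pii_row (plus any fudge deltas you applied) goes into the locked-down or
    offline store; the analytics_row goes into the working dataset."""
    link_id = str(uuid.uuid4())  # random, carries no meaning on its own
    analytics_row, pii_row = {"link_id": link_id}, {"link_id": link_id}
    for key, value in record.items():
        (pii_row if key in pii_fields else analytics_row)[key] = value
    return analytics_row, pii_row

# Example (made-up fields):
analytics, pii = split_record(
    {"name": "Jane Doe", "drivers_license": "D1234567",
     "trip_distance_km": 12.4, "vehicle_type": "sedan"},
    pii_fields={"name", "drivers_license"},
)
# analytics -> {"link_id": ..., "trip_distance_km": 12.4, "vehicle_type": "sedan"}
# pii       -> {"link_id": ..., "name": "Jane Doe", "drivers_license": "D1234567"}
```

Re-linking then means querying the PII store by link_id, which is exactly the step you can wrap in the heavier access controls and auditing.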