Using NHS data for scientific research – without creating a privacy nightmare

jjgreen · 2024-04-18T12:40:54 1713444054

How many people living within 10 metres of [the latitude, longitude of your address] have [embarrassing condition]?

bloodearnest · 2024-04-18T15:10:28 1713453028

FWIW, the highest precision of location available is the UK's MSOA[1] regions, which are more like your whole neighbourhood than your street.

And whilst that level of data is available, it is not allowed to be released from the OpenSAFELY system at that level; it must be aggregated first.

[1] https://geoportal.statistics.gov.uk/search?collection=Datase...

edit: added a missing word

jjgreen · 2024-04-18T19:14:42 1713467682

How many Caucasian males born between 12/01/97 and 12/01/97 in MSOA region 12523 with <some not too common condition I know you have> have <embarrassing condition>? [1]

My point is that aggregated data becomes an indicator function on an individual if you allow arbitrary queries; then you can extract all information on that individual by exhaustion:

For each <embarrassing condition> run [1]
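A minimal sketch of that exhaustion loop, assuming a system that answers arbitrary count queries with no output checking at all (which is not how OpenSAFELY is described elsewhere in this thread). Every name and record here is invented for illustration: `Record`, `count`, `extract`, the condition labels and the toy data are hypothetical, and only the MSOA number comes from the query above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # All fields and values are made up for this illustration.
    ethnicity: str
    sex: str
    dob: str  # ISO date
    msoa: str
    conditions: frozenset


def count(records, **filters):
    """Aggregate query: how many records match every filter?"""
    def matches(r):
        for key, want in filters.items():
            if key == "has":  # conditions the record must include
                if not set(want) <= r.conditions:
                    return False
            elif getattr(r, key) != want:
                return False
        return True
    return sum(1 for r in records if matches(r))


RECORDS = [
    Record("A", "M", "1997-01-12", "12523", frozenset({"asthma", "condition_x"})),
    Record("A", "M", "1990-05-03", "12523", frozenset({"asthma"})),
    Record("B", "F", "1997-01-12", "12523", frozenset({"condition_y"})),
]


def extract(records, target, known, candidates):
    """Run query [1] once per candidate condition against one individual."""
    # The trick only works if the filters isolate exactly one person.
    if count(records, has=[known], **target) != 1:
        return None
    # Each count is now 0 or 1: an indicator function on that person.
    return {c for c in candidates
            if count(records, has=[known, c], **target) == 1}


# Quasi-identifiers the attacker already knows about the target.
target = dict(ethnicity="A", sex="M", dob="1997-01-12", msoa="12523")
leaked = extract(RECORDS, target, "asthma",
                 ["condition_x", "condition_y", "condition_z"])
print(leaked)
```

Note that no single answer here is anything but an aggregate count; the leak comes from being able to choose the filters freely and repeat the query.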

bloodearnest · 2024-04-22T10:02:41 1713780161

Indeed it does. We are well aware of this, and write about it extensively.

The leader of the OpenSAFELY project produced the Goldacre review for the UK Government, which discusses this risk in detail. Here's a quote from the executive summary:

  Knowing the approximate date range in which someone had a medical intervention, their 
  approximate age, and their approximate location is often enough to re-identify someone in a 
  pseudonymised dataset, and then – illegally – to see everything else in their record. Women 
  face particular concerns: knowing someone’s approximate age, approximate location, and the 
  approximate time at which they had children can also often be enough to make a confident 
  unique match; this is the kind of information that will be known by someone at the school 
  gate, or a colleague. This is not to say that health data users are untrustworthy: but the 
  system must be resilient to untrustworthy users; and it is well documented that other large 
  administrative national datasets are sometimes misused.[1]

That's why there are multiple layers of security within the OpenSAFELY system. Other layers include: multiple authentications required from separate orgs to view outputs; various limits on the outputs you can view; a full public audit trail of all code run and all outputs viewed; and separation between the systems used to run code and to view outputs (so compromise of one system doesn't affect the other). And finally, you cannot release an output from a secure backend until it has been checked for potential disclosivity by two ONS[2]-trained reviewers.

Like all computer systems, you cannot remove all risk from the system. But you can reduce it significantly, and make all activity public and auditable. And it's a damn sight better than current practice for medical data research, tbh. We try as much as possible to provide data-minimisation, in that you only have the minimal set of data you need to answer your question. Whereas, in the UK at least, most of the time, researchers just get given access to everything.

[1] https://www.gov.uk/government/publications/better-broader-sa...

[2] Office for National Statistics, the UK equivalent of the US Census Bureau

EDIT: fixed a reference

jjgreen · 2024-04-22T19:52:14 1713815534

I'd agree with your

Like all computer systems, you cannot remove all risk from the system. But you can reduce it significantly ...

so that we can both agree that the article's statement

It lets research scientists dig into 58m NHS health records – while seemingly doing the impossible and completely preserving patient privacy in the process.

is so much tosh. That being the case, I do not consent to my medical records being used in this way, not that my consent will be sought. "Do no harm" my arse.