In 2016, the Australian government released, for research purposes, an extract of public health insurance data, comprising the 30-year billing history of ten percent of the population, with medical providers and patients purportedly de-identified. Melbourne University researcher Dr Vanessa Teague and her colleagues famously and quickly found that many of the providers could be readily re-identified. The dataset was withdrawn, though not before many hundreds of copies had been downloaded from the government website.

The government’s responses to the re-identification work were emphatic but sadly not positive. For one thing, legislation was drafted to criminalise the re-identification of ostensibly ‘anonymised’ data, which would frustrate work such as Teague’s regardless of its probative value to ongoing privacy engineering (the bill has yet to be passed). For another, the Department of Health insisted that no patient information had been compromised. That was then.

It seems less ironic than inevitable that the patients’ anonymity was, in fact, not to be taken as read. In follow-up work released today, Teague, with Dr Chris Culnane and Dr Ben Rubinstein, has shown how patients in that data release may indeed be re-identified.

The ability to re-identify patients from this sort of Open Data release is frankly catastrophic. The release of imperfectly de-identified healthcare data poses real dangers to patients with socially difficult conditions. This is surely well understood. What we now need to contend with is the question of whether Open Data practices like this deliver benefits that justify the privacy risks. That’s going to be a tricky debate, for the belief in data science is bordering on religious.

It beggars belief that any government official would promise "anonymity" any more. These promises just cannot be kept. 

Re-identification has become a professional sport. Researchers are constantly finding artful new ways to triangulate individuals’ identities, drawing on diverse public information, ranging from genealogical databases to social media photos. But it seems that no matter how many times privacy advocates warn against these dangers, the Open Data juggernaut just rolls on. Concerns are often dismissed as academic, or as trivial compared with the supposed fruits of research conducted on census data, Medicare records and the like.
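To make the mechanics concrete, here is a minimal, hypothetical sketch of a linkage attack on synthetic data (it is not the researchers’ method, nor the actual data discussed above): the ‘de-identified’ unit records still carry quasi-identifiers such as year of birth, state and dates of service, and an attacker joins them against outside knowledge; a unique match amounts to a re-identification.

```python
# Toy linkage attack on synthetic data (illustrative only; names, fields
# and values are invented for this sketch).
from collections import defaultdict

# "De-identified" unit records: no names, but quasi-identifiers remain.
released = [
    {"id": "rec-001", "birth_year": 1957, "state": "VIC", "service_dates": {"2014-03-02", "2015-07-19"}},
    {"id": "rec-002", "birth_year": 1957, "state": "VIC", "service_dates": {"2014-03-02"}},
    {"id": "rec-003", "birth_year": 1983, "state": "NSW", "service_dates": {"2016-01-11"}},
]

# Auxiliary knowledge an attacker might hold, e.g. from social media
# ("my operation on 19 July 2015").
auxiliary = [
    {"name": "Alice Example", "birth_year": 1957, "state": "VIC", "known_dates": {"2015-07-19"}},
]

def link(released_records, auxiliary_records):
    """Match auxiliary records to released records on shared quasi-identifiers."""
    matches = defaultdict(list)
    for aux in auxiliary_records:
        for rec in released_records:
            if (rec["birth_year"] == aux["birth_year"]
                    and rec["state"] == aux["state"]
                    and aux["known_dates"] <= rec["service_dates"]):  # known dates all present
                matches[aux["name"]].append(rec["id"])
    return matches

for name, ids in link(released, auxiliary).items():
    if len(ids) == 1:
        print(f"{name} re-identified as {ids[0]}")  # unique match => re-identification
    else:
        print(f"{name}: {len(ids)} candidate records")
```

The point is that nothing in the released records needs to contain a name; uniqueness across a handful of fields, combined with background knowledge, is enough.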

In "Health Data in an Open World (PDF)" Teague et al warn (not for the first time) that "there is a misconception that [protecting the privacy of individuals in these datasets] is either a solved problem, or an easy problem to solve” (p2). They go on to stress “there is no good solution for publishing sensitive unit-record level data that protects privacy without substantially degrading the usefulness of the data" (p3). 

What is the cost-benefit of the research done on these data releases? Statisticians and data scientists say their work informs government policy, but is that really true? Let’s face it: "evidence-based policy" has become quite a joke in Western democracies. There are umpteen really big public-interest issues where science and evidence are not influencing policy settings at all. So I am afraid statisticians need to be more modest about the practical importance of their findings when they mount bland “balance” arguments that the benefits outweigh the risks to privacy.

If there is a balance to be struck, then the standard way to make the calculation is a Privacy Impact Assessment (PIA). A PIA can formally assess the risk of “de-identified” data being re-identified and, where that risk is real, recommend additional, layered safeguards for privacy.

So where are all the PIAs?  

Open Data is almost a religion. Where is the evidence that evidence-based policy making really works?  

I was a scientist and I remain a whole-hearted supporter of publicly funded research. But science must be done with honest appraisal of the risks. It is high time for government officials to revisit their pat assertions of privacy and security. If the public loses confidence in the health system's privacy protection, then some people with socially problematic conditions might simply withdraw from treatment, or hold back vital details when they engage with healthcare providers. In turn, that would clearly damage the purported value of the data being collected and shared. 

Big Data-driven research on massive public data sets just seems a little too easy to me. We need to discuss alternatives to Open Data bulk releases. One option is to confine extracted research data to secure virtual data rooms, and grant access only to a small number of specially authorised researchers, closely monitored and audited, under legally enforceable terms and conditions.

There are compromises we all need to make in research on human beings.  Let’s be scientific about science-based policy.  Let’s rigorously test our faith in Open Data, and let’s please stop taking “de-identification” for granted.  It’s really something of a magic spell.