If anonymity is important, what is the legal basis for defending it?
Cynics have been asking "is privacy dead?" for at least 40 years. Certainly information technology and ubiquitous connectivity have made it nearly impossible to hide, and so anonymity is critically ill. But privacy is not the same thing as secrecy; privacy is a state where those who know us respect the knowledge they have about us. Privacy generally doesn't require us to hide from anyone; rather, it requires restraint on the part of those who hold Personal Data about us.
The typical public response to data breaches, government surveillance and intrusions like facial recognition on social media is vociferous. People energetically assert their rights not to be tracked online, and not to have information about them exploited behind their backs. These reactions show that the idea of privacy is alive and well.
The end of anonymity, perhaps
Against a backdrop of spying revelations and excesses by social media companies, especially in regard to facial recognition, there have been recent calls for a "new jurisprudence of anonymity"; see Yale law professor Jed Rubenfeld writing in the Washington Post of 13 Jan 2014. I wonder if there is another way to crack this nut, because any new jurisprudence is going to take a very long time to develop.
Instead, I suggest we leverage the way most international privacy law and privacy experience - going back decades - is technology neutral with regard to the method of collection. In some jurisdictions like Australia, the term "collection" is not even defined in data privacy regulations. Instead, the law just uses the plain English sense of the word when it frames principles like Collection Limitation: basically, you are not allowed to collect (by any means) Personal Data without a good and express reason. It means that if Personal Data gets into a data system, the operator of the system is accountable under privacy law for that data. It does not matter how it got there.
NOTE: "Personal Data" is now the dominant term in privacy and data protection law worldwide, with a pretty consistent definition, namely any information that can reasonably be identified with (that is, associated or linked with) a natural person. Personal Data does not need to be uniquely identifying. And, recognising the potential for identification -- which might be done some time in the future through such means as data linkages -- a piece of data can be categorised as Personal Data before it is identified. Thus most experts agree that IP addresses, MAC addresses and photographs should be regarded as Personal Data. The term Personally Identifiable Information (PII) is common in the U.S. and has typically had a much tighter technical definition; for example, under HIPAA rules, specific items such as zip code are enumerated as "PII". It is best to avoid the term "PII" in general data privacy discussions.
This technology neutral view of Personal Data collection has satisfying ramifications for all the people who intuit that Big Data has got too "creepy". We can argue that if a named record is produced afresh by a Big Data process (especially if that record is produced without the named person being aware of it, and from raw data that was originally collected for some other purpose) then that record has logically been collected. Whether Personal Data is collected directly or indirectly, or is in fact created by an obscure process, privacy law is largely agnostic.
Prof Rubenfeld wrote in his article, "the NSA program isn't really about gathering data. It's about mining data. All the data are there already, digitally stored and collected by telecom giants, just waiting." [italics in original]
Logically, if the output of data mining is identifiable (and especially if the raw data input was previously anonymous) then the data mining operation is collecting Personal Data, albeit indirectly, untouched by human hands. As such, the data miners should be accountable for their newly minted Personal Data just as though they had gathered it directly from the persons concerned.
To reinforce the point, I wonder if we should have a special name for how Personal Data is created by Big Data processes, including re-identification? I suggest the name "Algorithmic Collection".
For now, I don't want to go into the rights and wrongs of NSA surveillance. I just want to show a new way to frame the privacy questions in surveillance and Big Data, making use of existing jurisprudence. If I am right and the NSA is in effect collecting fresh Personal Data out of raw data as it goes about its data mining, then we may already have a legal framework for understanding what's going on, within which we can objectively analyse the rights and wrongs. The NSA might still be justified in mining data, and there might be no actual technical breach of information privacy law, if for instance the NSA enjoys a law enforcement exemption. These are important questions that need to be debated elsewhere (see my recent blog on our preparedness to actually have such a debate).
But Collection is not limited everywhere
There is an important legal-technical question in all this: Is the collection of Personal Data actually regulated? In the USA there is no general restriction against collecting Personal Data; there is no broad data protection law, and in any case, some versions of the US Fair Information Practice Principles (FIPPs) don't even feature Collection Limitation. So there may be few regulations in the USA that would carry my argument. Nevertheless, surely we can use international jurisprudence in Collection Limitation instead of creating new American jurisprudence around anonymity?
So I'd like to put the following questions to Jed Rubenfeld:
- Do technology neutral Collection Principles in theory provide a way to bring de-anonymised data within the scope of data privacy laws, and thus address people's concerns with Big Data?
- How might international jurisprudence around Collection Limitation translate to the American environment?
- Does this way of looking at the problem create new impetus for Collection Limitation to be introduced into American privacy principles?
Appendix: "Applying Information Privacy Norms to Re-Identification"
In 2013 I presented some of these ideas to an online symposium at the Harvard Law School Petrie-Flom Center, on the Law, Ethics & Science of Re-identification Demonstrations. What follows is an extract from that presentation, in which I spell out carefully the argument -- which was not obvious to some at the time -- that when genetics researchers combine different data sets to demonstrate re-identification of donated genomic material, they are in effect collecting patient Personal Data. I argue that this type of collection should be subject to ethics committee approval just as if the researchers were collecting the identities from the patients directly.
... I am aware of two distinct re-identification demonstrations that have raised awareness of the issues recently. In the first, Yaniv Erlich [at MIT's Whitehead Institute] used what I understand are new statistical techniques to re-identify a number of subjects who had donated genetic material anonymously to the 1000 Genomes project. He did this by correlating genes in the published anonymous samples with genes in named samples available from genealogical databases. The 1000 Genomes consent form reassured participants that re-identification would be "very hard". In the second notable demo, Latanya Sweeney re-identified volunteers in the Personal Genome Project using her previously published method of using a few demographic values (such as date of birth, sex and postal code) extracted from the otherwise anonymous records.
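By way of illustration, the mechanics of Sweeney's demographic linkage can be sketched as a simple join on quasi-identifiers. Everything below -- the names, records and field values -- is fabricated for the sketch; the real demonstration linked published genome-project profiles to public named registries such as voter rolls. The point of the sketch is the argument above: the output of the join is a freshly named record, i.e. newly collected Personal Data.

```python
# Toy re-identification by linking on the quasi-identifier triple
# (date of birth, sex, postal code). All data here is invented.

# "Anonymous" records: no names, but quasi-identifiers remain.
anonymous_records = [
    {"dob": "1961-07-28", "sex": "F", "zip": "02138", "sample_id": "S-17"},
    {"dob": "1975-03-02", "sex": "M", "zip": "90210", "sample_id": "S-42"},
]

# A public, named register sharing the same demographic fields.
public_register = [
    {"name": "Alice Example", "dob": "1961-07-28", "sex": "F", "zip": "02138"},
    {"name": "Bob Example",   "dob": "1975-03-02", "sex": "M", "zip": "90210"},
]

def reidentify(anon, register):
    """Attach names to anonymous records via (dob, sex, zip)."""
    index = {(p["dob"], p["sex"], p["zip"]): p["name"] for p in register}
    hits = []
    for rec in anon:
        key = (rec["dob"], rec["sex"], rec["zip"])
        if key in index:
            # This named record did not exist before the join ran:
            # in Collection Limitation terms, it has just been collected.
            hits.append({**rec, "name": index[key]})
    return hits

for match in reidentify(anonymous_records, public_register):
    print(match["sample_id"], "->", match["name"])
```

In a toy example every record matches; in practice the triple is merely highly distinguishing, which is why so little demographic residue is enough to defeat nominal anonymity.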
A great deal of the debate around these cases has focused on the consent forms and the research subjects' expectations of anonymity. These are important matters for sure, yet for me the ethical issue in de-anonymisation demonstrations is more about the obligations of third parties doing the identification who had nothing to do with the original informed consent arrangements. The act of recording a person's name against erstwhile anonymous data represents a collection of personal information. The implications for genomic data re-identification are clear.
Let's consider Subject S who donates her DNA, ostensibly anonymously, to a Researcher R1, under some consent arrangement which concedes there is a possibility that S will be re-identified. And indeed, some time later, an independent researcher R2 does identify S and links her to the DNA sample. The fact is that R2 has collected personal information about S. If R2 has no relationship with S, then S has not consented to this new collection of her personal information.
Even if the consent form signed at the time of the original collection includes a disclaimer that absolute anonymity cannot be guaranteed, re-identifying the DNA sample later represents a new collection, one that has been undertaken without any consent. Given that S has no knowledge of R2, there can be no implied consent in her original understanding with R1, even if absolute anonymity was disclaimed.
Naturally the re-identification demonstrations have served a purpose. It is undoubtedly important that the limits of anonymity be properly understood, and the work of Erlich and Sweeney contributes to that. Nevertheless, these demonstrations were undertaken without the knowledge, much less the consent, of the individuals concerned. I contend that bioinformaticians using clever techniques to attach names to anonymous samples need ethics approval, just as they would if they were taking fresh samples from the people concerned.