The power of Data Privacy law

Large language models (LLMs) and generative AI are developing at ever-increasing rates, alarming many commentators because it is so hard now to tell fact from fiction.  Deep fakes were a central issue in the recent Hollywood writers’ strike, with many creators and actors anxious to protect their personal identities against the possibility of being replaced by synthetic likenesses.

Naturally there are calls for new regulations. We should also be looking at how AI comes under the principles-based privacy regulations we already have.

There is a large and growing body of international principles-based data privacy law. These laws are based on the idea of personal data, which is broadly defined to mean essentially any information which is associated or may be associated with an identifiable natural person. Data privacy laws such as the GDPR (not to mention 162 national statutes) operate to restrain the collection, use and disclosure of personal data.

Generally speaking, these laws are technology neutral; they are blind to the manner in which personal data is collected. And they apply to essentially any data processing scenario.  

So, this means that the outputs of AIs, when personally identifiable, are within the scope of data privacy laws in most places around the world. If personal data comes to be in a database by any means whatsoever then it may be deemed to have been collected.

Thus, data privacy laws apply to personal data generated by algorithms, untouched by human hands.


Time and time again the privacy implications of automated personal information flows seem to take technologists by surprise:

  • In 2011 German privacy regulators found that Facebook’s photo tag suggestions feature violated the law and called on the company to cease facial recognition and delete its biometric data. Facebook took the prudent approach of shutting down its facial recognition usage worldwide, and subsequently took many years to get it going again.  
  • The counter-intuitive Right to be Forgotten (RTBF) first emerged in the 2014 “Google Spain” case heard by the European Court of Justice.  The case was not actually about “forgetting” anything but related specifically to de-indexing web search results. The narrow scope serves to highlight that personal data generated by algorithms (for that’s what search results are) is covered by privacy law. In my view, search results are not simple replicas of objective facts found in the public domain; they are the computed outcomes of complex Big Data processes.

While technologists may presume (or hope) that synthetic personal data escapes privacy laws, the general public would expect there to be limits on how information about them is generated behind their backs by computers. In many ways, judgements produced by algorithms raise more concerns than traditional human judgements.

What’s next?

The legal reality is straightforward. If an information system comes to hold personal data, by any means, then the organisation in charge of that system has collected personal data and is subject to data privacy laws.

As we have seen, analytics and Big Data processes have been brought to heel by data privacy laws.

Artificial Intelligence may be next.

Responsibility for Simulated Humans

Large language models are enabling radically realistic simulations of humans and interpersonal situations, with exciting applications in social science, behaviour change modelling, human resources, healthcare and so on.

As with many modern neural networks, the behaviour of the systems themselves can be unpredictable. A recent study by researchers at Stanford and Google revealed “simulacra” (that is, robotic software agents) built on ChatGPT spontaneously exchanging personal information with each other, without being scripted by the software’s authors.

That is, the robots were gossiping, behind the backs, as it were, of the humans who developed them.

If this level of apparent autonomy is surprising, then bear in mind widespread reporting that nobody knows exactly how Deep Neural Networks work.

Bill Gates calls AI the most powerful technology seen in decades.  Given how important it is, can society accept that AI leaders and entrepreneurs can’t tell us what’s going on under the hood? 

Ignorance is No Excuse

Well-established privacy law shows that AI’s leaders might have to take more interest in their creations’ inner workings. Regulators might not find it acceptable that AI operators can’t necessarily tell how personal data arises in their systems. By the same token, operators cannot even be sure what personal data is being generated internally and retained.

If a large language model generates personal data, then the people running the model are in principle accountable for it under data privacy rules. And it may not matter to regulators if the personal data is distributed through an impenetrable neural network of parameters and weights buried in hidden layers.

Privacy law requires that any personal data created and held by an LLM must be collected for a clear purpose, the collection must be proportionate to that purpose, and it must be transparent.  Personal data created in an LLM must not be used or disclosed for unrelated purposes (and in Europe, the individuals concerned have further rights in some cases to have the data erased).

I am not a lawyer, but I don’t believe that the owner of a deep learning system that holds personal data can excuse themselves from technology-neutral privacy law just because they don’t know exactly how the data got there. Nor can they get around the right to erasure by appealing to the weird and wonderful ways that knowledge is encoded in neural networks.

If an AI’s operator cannot comply with data privacy law, then a worst-case scenario could see an activist data protection authority finding the technology to be unsafe, and ruling that it be shut down until such time as personal data flows can be fully and properly accounted for.