When OpenAI released GPT-3 in July 2020, it offered a glimpse of the data used to train the large language model. Millions of pages scraped from the web, Reddit posts, books, and more are used to create the generative text system, according to a technical paper. Scooped up in this data is some of the personal information you share about yourself online. This data is now getting OpenAI into trouble.
On March 31, Italy’s data regulator issued a temporary emergency decision demanding OpenAI stop using the personal information of millions of Italians that’s included in its training data. According to the regulator, Garante per la Protezione dei Dati Personali, OpenAI doesn’t have the legal right to use people’s personal information in ChatGPT. In response, OpenAI has stopped people in Italy from accessing its chatbot while it provides responses to the officials, who are investigating further.
The action is the first taken against ChatGPT by a Western regulator and highlights privacy tensions around the creation of giant generative AI models, which are often trained on vast swathes of internet data. Just as artists and media companies have complained that generative AI developers have used their work without permission, the data regulator is now saying the same for people’s personal information.
Similar decisions could follow all across Europe. In the days since Italy announced its probe, data regulators in France, Germany, and Ireland have contacted the Garante to ask for more information on its findings. “If the business model has just been to scrape the internet for whatever you could find, then there might be a really significant issue here,” says Tobias Judin, the head of international at Norway’s data protection authority, which is monitoring developments. Judin adds that if a model is built on data that may be unlawfully collected, it raises questions about whether anyone can use the tools legally.
Italy’s blow to OpenAI also comes as scrutiny of large AI models is steadily increasing. On March 29, tech leaders called for a pause on the development of systems like ChatGPT, fearing their future implications. Judin says the Italian decision highlights more immediate concerns. “Essentially, we’re seeing that AI development to date could potentially have a massive shortcoming,” Judin says.
The Italian Job
Europe’s GDPR rules, which cover the way organizations collect, store, and use people’s personal data, protect the data of more than 400 million people across the continent. This personal data can be anything from a person’s name to their IP address: if it can be used to identify someone, it can count as their personal information. Unlike the patchwork of state-level privacy rules in the United States, GDPR’s protections apply even if people’s information is freely available online. In short: Just because someone’s information is public doesn’t mean you can vacuum it up and do anything you want with it.
Italy’s Garante believes ChatGPT has four problems under GDPR: OpenAI doesn’t have age controls to stop people under the age of 13 from using the text generation system; it can provide information about people that isn’t accurate; and people haven’t been told their data was collected. Perhaps most importantly, its fourth argument claims there is “no legal basis” for collecting people’s personal information in the massive swells of data used to train ChatGPT.
“The Italians have called their bluff,” says Lilian Edwards, a professor of law, innovation, and society at Newcastle University in the UK. “It did seem pretty evident in the EU that this was a breach of data protection law.”
Broadly speaking, for a company to collect and use people’s information under GDPR, it must rely on one of six legal justifications, ranging from someone giving their permission to the information being required as part of a contract. Edwards says that in this instance, there are essentially two options: getting people’s consent, which OpenAI didn’t do, or arguing it has “legitimate interests” to use people’s data, which she says is “very hard” to do. The Garante tells WIRED it believes this defense is “inadequate.”
GPT-4’s technical paper includes a section on privacy, which says its training data may include “publicly available personal information” drawn from a number of sources. The paper says OpenAI takes steps to protect people’s privacy, including “fine-tuning” models to stop people asking for personal information and removing people’s information from training data “where feasible.”
“How to collect data lawfully for training data sets for use in everything from just regular algorithms to some really sophisticated AI is a critical issue that needs to be solved now, as we’re kind of on the tipping point for this sort of technology taking over,” says Jessica Lee, a partner at law firm Loeb and Loeb.
The action from the Italian regulator, which is also taking on the Replika chatbot, has the potential to be the first of many cases examining OpenAI’s data practices. GDPR allows companies with a base in Europe to nominate one country that will deal with all of their complaints; Ireland deals with Google, Twitter, and Meta, for instance. However, OpenAI doesn’t have a base in Europe, meaning that under GDPR, every individual country can open complaints against it.
OpenAI isn’t alone. Many of the issues raised by the Italian regulator are likely to cut to the core of all development of machine learning and generative AI systems, experts say. The EU is developing AI regulations, but so far there has been comparatively little action taken against the development of machine learning systems when it comes to privacy.
“There is this rot at the very foundations of the building blocks of this technology—and I think that’s going to be very hard to cure,” says Elizabeth Renieris, senior research associate at Oxford’s Institute for Ethics in AI and author on data practices. She points out that many data sets used for training machine learning systems have existed for years, and it is likely there were few privacy considerations when they were being put together.
“There’s this layering and this complex supply chain of how that data ultimately makes its way into something like GPT-4,” Renieris says. “There’s never really been any type of data protection by design or default.” In 2022, the creators of one widely used image database, which has helped train AI models for a decade, suggested images of people’s faces should be blurred in the data set.
In Europe and California, privacy rules give people the ability to request that information be deleted or corrected if it is inaccurate. But deleting something from an AI system that is inaccurate or that someone doesn’t want there may not be straightforward—especially if the origins of the data are unclear. Both Renieris and Edwards question whether GDPR will be able to do anything about this in the long term, including upholding people’s rights. “There is no clue as to how you do that with these very large language models,” says Edwards from Newcastle University. “They don’t have provision for it.”
So far, there has been at least one relevant instance, when the company formerly known as Weight Watchers was ordered by the US Federal Trade Commission to delete algorithms created from data it didn’t have permission to use. But with increased scrutiny, such orders could become more common. “Depending, obviously, on the technical infrastructure, it may be difficult to fully clear your model of all of the personal data that was used to train it,” says Judin, from Norway’s data regulator. “If the model was then trained by unlawfully collected personal data, it would mean that you would essentially perhaps not be able to use your model.”