Data by default: How AI radically changes the data privacy landscape

Article
October 2024

IN BRIEF

  • Protecting the rights that individuals have over what information is collected, where it is stored, who can use it, and for what purpose has always been difficult. And AI makes it vastly more complex.
  • There is no good way of making an AI application “forget” or “unlearn” what it has unlawfully learned. And the more time passes without corrective action, the harder and costlier these cases become.
  • Business leaders need to stay informed in a rapidly evolving landscape with no comprehensive federal regulation or global set of standards. Eventually, companies will be asked to create data privacy scorecards, so they should meticulously document all practices and procedures.

“Data is a precious thing and will last longer than the systems themselves.”

- Tim Berners-Lee, inventor of the World Wide Web


ABOUT

Mauro F. Guillén
Vice Dean, Wharton School
University of Pennsylvania

Mauro F. Guillén is Professor of Management and Vice Dean at The Wharton School of the University of Pennsylvania. An expert on global market trends, Guillén combines his training as a sociologist and as a business economist in his native Spain to identify and quantify opportunities at the intersection of demographic, economic, and technological developments. He is the author of 2030: How Today’s Biggest Trends Will Collide and Reshape the Future of Everything, and The Perennials: The Megatrends Creating a Post-Generational Society.

It’s been almost twenty years since Berners-Lee uttered those words, and they are truer, and perhaps even a little more ominous, today than they were then. The advent of artificial intelligence (AI) makes data even more valuable, and thus raises the issue of data privacy and data ownership to a new level of importance, complexity and controversy.

AI can be fed people’s private data from different sources — not just online and offline databases but also content that users upload to the Web, sensor data from the Internet of Things, and even the digital footprint users leave behind as they use their digital devices. Protecting the rights that individuals have over what information is collected, where it is stored, who can use it, and for what purpose has always been difficult. And in the future, AI will make that exceedingly complex.

Data and digital footprints

We entered the era of “data collection by default” some time ago, argue Stanford University’s Jennifer King and Caroline Meinhardt in a recent comprehensive analysis. There are two potential ways to address this issue. The first, mostly unworkable at this point, is to move from an opt-out to an opt-in system. The difficulty there is not just regulating and ensuring compliance, but also deciding what to do about information collected in the past. And companies can still encourage users to opt in through special offers and other incentives, and then use the data for purposes that were not anticipated.

A second solution is to develop applications that prevent third parties from collecting activity data in the first place, such as the opt-out option offered when we download an app to our smartphones. But this applies only to activity data, not to the data the user supplies while searching or transacting once the app is installed.

Moreover, efforts at controlling information at the point of collection are undermined by Web crawlers and Web scrapers, which can automatically locate, classify, download and extract vast amounts of data, images and other material from the internet writ large. In principle, they can only access public, readily viewable material. But in practice, Web crawlers can jump paywalls by disguising themselves as human users and can draw on pirated content stored somewhere other than its original location.
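To make the mechanics concrete, the sketch below (in Python, with a hypothetical site and page, not any vendor’s actual crawler) shows the gatekeeping decision a “polite” crawler makes before downloading anything: it consults the site’s robots.txt file, the voluntary signal publishers can use to tell specific crawlers, including AI-training bots such as OpenAI’s documented GPTBot, to stay away. The key point is that the check is purely advisory; a scraper that announces itself under a browser-like name simply sidesteps it.

```python
# Minimal sketch of a "polite" crawler's gatekeeping step.
# The site and page URLs are hypothetical; the robots.txt check
# itself is the standard, voluntary opt-out mechanism.
import urllib.robotparser
import urllib.request

SITE = "https://example.com"   # hypothetical publisher
PAGE = SITE + "/article.html"  # hypothetical page to harvest
BOT = "GPTBot"                 # OpenAI's documented AI-training crawler

# Fetch and parse the publisher's robots.txt.
robots = urllib.robotparser.RobotFileParser()
robots.set_url(SITE + "/robots.txt")
robots.read()

if robots.can_fetch(BOT, PAGE):
    # No opt-out recorded for this bot: download the page for the corpus.
    with urllib.request.urlopen(PAGE) as response:
        html = response.read()
    print(f"Collected {len(html)} bytes from {PAGE}")
else:
    # The publisher opted out; a compliant crawler stops here. Nothing
    # technically prevents a scraper from identifying itself as an
    # ordinary browser and fetching the page anyway.
    print(f"{BOT} is disallowed from {PAGE}")
```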

In addition, data often get misplaced, breached, leaked or otherwise mishandled, making them an easy target for AI Web crawlers. Thus, the issue goes well beyond the traditional approaches of offering assurances about confidentiality or non-disclosure and establishing opt-in or opt-out mechanisms.

The AI data supply chain

Given the difficulties involved in addressing privacy issues at the point of data collection, options at a later stage of the data supply chain must be considered. The broadest measure would be to ask companies to disclose basic information about the data they feed into AI, indicating their sources, scope and scale. This would enable, for example, checks for copyright infringement.

[Image: facial recognition software]

However, no such requirement or regulation exists, and very few companies do it voluntarily. Some companies now offer users an opt-out option so that their data and images are not used for AI training. Amazon’s AWS, Google’s Gemini, and OpenAI offer such options, but they are often cumbersome to activate and not totally foolproof.

The supply chain ends with outputs, which in the case of AI include applications and predictions. Individuals need to be protected if their data are unwittingly disclosed at that point. At the societal level, the thorny problem with using training data from the web is that, even if all permissions and legal requirements were met, there is the issue of “bias in, bias out”: data on the web are not representative of society or of the world. Some users, companies, and countries are more prone than others to uploading material or leaving behind a digital footprint. Such a biased body of data then becomes the raw material for AI applications.

AI never forgets…

The truth is that individuals, unfortunately, have very few options at their disposal to prevent the misuse or unauthorized use of their data. It is also exceedingly difficult to compel companies to delete data at the user’s request, even when the law mandates it. More alarmingly, there is no good way of making an AI application “forget” or “unlearn” what it has unlawfully learned. And the more time passes without corrective action, the harder and costlier these cases become.

As is often the case with emerging technologies, regulation is lagging. And, not surprisingly, there is much debate as to the amount of regulatory oversight that is necessary, warranted or desirable. Adding to the complexity, digital data are global while regulation is local.

According to the United Nations, 137 out of 194 countries have passed data protection and privacy legislation with varying levels of safeguards. The Web is a global medium, but it is subject to a mosaic of regulations at the supranational (the European Union), national and subnational levels (e.g., state by state in the United States). Most importantly, regulations aimed at the Web or AI sometimes collide with those in other areas, such as national security. The European Union has complained about American intelligence agencies’ use of the private data of EU citizens and residents without their approval. The issue is complicated by the fact that American digital platforms, large and small, routinely send user data to the U.S. The U.S.-EU Data Privacy Framework, signed in July 2023, regulates the circumstances under which the U.S. can gather such information and how European citizens can appeal.

Unintended consequences

From the standpoint of the companies managing digital platforms, the regulatory context could not be more complex. They need to comply not only with the regulations of the country where they are based but also with the laws of the countries in which they collect data from their users. In addition, cross-border data flows may themselves be regulated. This represents a major obstacle for startups aiming for international growth, while giving a built-in advantage to established companies that have the resources either to comply or to deal with the potential litigation if they do not.

[Image: privacy regulation]

The future of personal data protection and privacy remains uncertain. And yet, companies need to make operational decisions today that may be legally questionable in the future. Companies, especially those engaged in large-scale AI efforts, will continue to amass data and to use it to advance their goals, even at the risk of being found non-compliant.

Another unintended consequence stems from applying the same new regulations both to tech companies whose core business involves collecting and manipulating data, especially those engaged in AI, and to companies in other industries, which gather and process data in support of selling other products or services. On the one hand, the concern is that complying with regulations designed to prevent the worst potential harms might constrain the ability of this second group to compete. On the other hand, many companies whose core business is not AI are also developing, or at least using, AI applications. Thus, the default for politicians and regulators is to make all companies comply.

Requirements, standards and scorecards

It is not yet clear whether different jurisdictions around the world will treat all companies the same way, or will impose less onerous requirements on small firms and on data that are not deemed “sensitive” (sensitive data include, but are not limited to, financial, health, biometric and genetic information), as in the proposed American Privacy Rights Act of 2024.

Board directors and business leaders need to stay hyper-informed in a rapidly evolving landscape. There are many proposals on the table in terms of legislative initiatives, but no comprehensive federal regulation in the U.S. yet, let alone a global set of standards other than the decades-old principles of information minimization and information specificity.

Eventually, companies will be asked to create data privacy scorecards, so they should keep track of, and meticulously document, all practices and procedures. In the meantime, they need to exercise sound data privacy practices to avoid bad publicity, public-relations problems and a loss of customer trust over data mismanagement and hacking.
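What that documentation might look like in practice: the sketch below is a hypothetical inventory entry, loosely inspired by the records of processing activities that the EU’s GDPR already requires. The field names and the sample entry are illustrative assumptions, not any mandated schema, but they suggest the raw material from which a future data privacy scorecard could be assembled.

```python
# Hypothetical sketch of a data-processing inventory entry, the raw
# material a "data privacy scorecard" could be built from. Loosely
# inspired by GDPR records of processing; not a mandated schema.
from dataclasses import dataclass, field

@dataclass
class ProcessingRecord:
    dataset: str                 # what data is held
    source: str                  # where it was collected
    purpose: str                 # why it is processed
    legal_basis: str             # consent, contract, legitimate interest...
    sensitive: bool              # financial, health, biometric, genetic...
    retention_days: int          # how long it is kept
    used_for_ai_training: bool   # flagged so opt-outs can be honored
    recipients: list[str] = field(default_factory=list)  # who receives it

# Illustrative entry a retailer might log:
record = ProcessingRecord(
    dataset="customer purchase history",
    source="online checkout transactions",
    purpose="product recommendations",
    legal_basis="consent",
    sensitive=False,
    retention_days=730,
    used_for_ai_training=True,
    recipients=["internal analytics team"],
)
print(record)
```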
