With the market for artificial intelligence (AI) projected to reach 190 billion dollars by 2025, AI is quickly becoming a cornerstone technology of the global economy. This growth is driven in large part by the democratization of computing power made possible by cloud computing and the increasing availability of the open data on which AI systems are trained. While these developments come at a time of heightened concern about consumer privacy, a range of new technologies and data governance processes can ensure that open data and AI need not come at the expense of confidentiality. Developers can now leverage tools — referred to as “privacy-enhancing technologies” — that minimize the exposure of personal data while enabling it to be used in a range of socially beneficial ways.
As we outline in our Open Data Agenda, privacy-enhancing technologies present exciting opportunities to increase the availability of government data and unleash a new wave of private sector data sharing that can fuel innovation without compromising on security or confidentiality. We highlight below three such technologies — federated learning, differential privacy, and homomorphic encryption — that are opening doors to high-value collaboration while preserving the privacy of the underlying data.
Federated Learning
Federated learning is a machine-learning technique that enables researchers to train an algorithm on multiple, decentralized data sets. Whereas traditional machine-learning techniques require developers to amass a large pool of training data (i.e., copying data and moving it to a centralized server), federated learning works by bringing the algorithm to the original data source. By leveraging federated learning, developers need not gain direct access to or create copies of data to train an AI model; the training process can instead take place on the servers where the data already resides. Because the data remains protected at its source, the institution that maintains the data can continue to enforce use restrictions that preserve its privacy and security. Federated learning thus breaks down silos that previously inhibited socially beneficial collaboration because of privacy concerns.
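To make the concept concrete, the minimal sketch below illustrates federated averaging on a toy linear-regression task, assuming two hypothetical institutions, a simple gradient-based local update, and illustrative parameters. It is a simplified sketch of the general idea, not the protocol used by any particular platform; real deployments add safeguards such as secure aggregation on top of this basic loop.

```python
# A minimal sketch of federated averaging on a toy linear-regression task.
# Two hypothetical "institutions" each hold private data; only model weights
# ever travel to the coordinating server, never the raw records.
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])          # ground truth used only to simulate data

def make_local_data(n):
    """Simulate one institution's private dataset (never shared)."""
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

clients = [make_local_data(100), make_local_data(150)]

def local_update(w, X, y, lr=0.1, steps=5):
    """Run a few gradient steps on local data; return only the new weights."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

# Server loop: broadcast the global model, collect local updates, and
# average them, weighted by each institution's dataset size.
w_global = np.zeros(2)
for _ in range(20):
    updates = [local_update(w_global, X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    w_global = np.average(updates, axis=0, weights=sizes)

print(w_global)   # converges toward [2.0, -1.0] without pooling any raw data
```

The key point is visible in the server loop: the coordinating server only ever sees model weights, while each institution's records stay where they are.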
Federated learning is already proving impactful in the healthcare field, where confidentiality concerns are paramount but where data-enabled innovation has the potential to save lives. For example, Intel is partnering with the University of Pennsylvania’s Perelman School of Medicine to create a world-class federated learning platform that can spot brain tumors in cancer patients. That system will be trained on medical data from Penn Medicine and 29 other institutions across the world. Because the training is federated, the model can learn from each institution’s data without exposing any individual patient’s medical record.
Federated learning can also be useful in industrial applications, helping to improve the software that runs complex equipment such as jet engines or autonomous vehicles. When algorithms are trained on data collected from equipment owned by many different companies, each customer benefits from insights about how that equipment performs in the real world, helping its own systems run more efficiently while sensitive business data stays private. BSA members such as Siemens are conducting further research on the benefits of federated learning techniques for industrial applications that will improve efficiency on the factory floor and out in the field.
Differential Privacy
Sharing data about individuals without inadvertently disclosing information that can be used to identify data subjects is notoriously complex, particularly when dealing with large data sets. To address this problem, data is usually scrubbed or anonymized, meaning that direct or indirect identifiers are removed to make individuals harder to identify. However, efforts to anonymize data can often undermine its overall utility. In addition, anonymization techniques are not always foolproof; they can be vulnerable to “re-identification” attacks that link an individual to data even after personally identifiable information has been scrubbed. For instance, in one famous example, researchers demonstrated that they could identify the individuals associated with an “anonymized” database of movie reviews simply by cross-referencing it with other publicly available databases.
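The toy sketch below illustrates how such a linkage attack works. The records and the quasi-identifiers used for matching (ZIP code and birth date) are fabricated for illustration; it shows the general kind of cross-referencing involved, not the researchers' actual method.

```python
# A toy illustration of a "re-identification" (linkage) attack. Both datasets
# below are fabricated; the quasi-identifiers are hypothetical.
anonymized_reviews = [
    {"zip": "02139", "birth_date": "1975-07-22", "review": "2 stars"},
    {"zip": "98101", "birth_date": "1988-01-03", "review": "5 stars"},
]
public_records = [
    {"name": "J. Doe", "zip": "02139", "birth_date": "1975-07-22"},
    {"name": "A. Roe", "zip": "98101", "birth_date": "1988-01-03"},
]

# Cross-reference the two datasets on the shared quasi-identifiers: even with
# names removed, a combination of ordinary attributes can single someone out.
for record in anonymized_reviews:
    for person in public_records:
        if (person["zip"], person["birth_date"]) == (record["zip"], record["birth_date"]):
            print(f'{person["name"]} wrote: {record["review"]}')
```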
Differential privacy addresses the risks associated with traditional anonymization techniques, making it possible to share large, aggregated datasets without the risk of re-identification. Differential privacy works by adding “statistical noise” to large datasets in a manner that preserves the broader trends and statistical accuracy of the compiled data while masking the contribution of any individual data subject. The technique is often compared to the static you might hear on a radio: you cannot tell exactly which parts of the output are music and which are static, but you can still get the gist of the song. The result is data that remains useful and actionable while providing a provable guarantee of privacy.
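The minimal sketch below shows the core mechanism, assuming a hypothetical counting query and an illustrative privacy parameter (epsilon); production systems build far more elaborate machinery on this same principle.

```python
# A minimal sketch of differential privacy via the Laplace mechanism.
# The dataset, query, and epsilon below are hypothetical, for illustration only.
import numpy as np

rng = np.random.default_rng(42)

# Simulated private data: 1 = individual has some attribute, 0 = does not.
private_data = rng.integers(0, 2, size=10_000)

def noisy_count(data, epsilon):
    """Answer a counting query with Laplace noise scaled to sensitivity 1."""
    return data.sum() + rng.laplace(loc=0.0, scale=1.0 / epsilon)

exact = int(private_data.sum())
released = noisy_count(private_data, epsilon=0.5)

# The released figure tracks the true aggregate closely (typically off by only
# a few counts out of 10,000) while masking any single individual's presence.
print(exact, round(released))
```

Smaller values of epsilon add more noise and thus stronger privacy; larger values trade some of that protection for greater accuracy.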
[Video explainer on differential privacy. Source: minutephysics and the U.S. Census Bureau]
The United States Census Bureau is using differential privacy to help fulfill its mission as the nation’s “leading provider of quality data about its people and economy” while adhering to a legal restriction that prohibits it from publishing data that can be linked to any individual person. Following the completion of the 2010 census, the Bureau recognized that the techniques it had been using to “de-identify” published data were quickly becoming vulnerable to “re-identification” efforts enabled by new analytic techniques and a proliferation of third-party data sources. To mitigate these risks, the Census Bureau turned to differential privacy as part of the 2020 Census, allowing it to continue releasing high-quality data while preserving the confidentiality of personally identifiable information.
Because differential privacy makes it possible to distribute sensitive data sets to users without exposing information that can be linked back to any individual, the technique is also particularly useful in research contexts. Microsoft has created a platform, in partnership with Harvard University’s Institute for Quantitative Social Science, that will help researchers across fields share their data sets without compromising privacy protections. By connecting people with diverse data sets, the project can enable better analysis of topics ranging from refugee resettlement to climate change and beyond. The platform will also serve as the basis for the Cascadia Data Discovery Institute, where differential privacy will drive data sharing and collaboration around biomedical cancer research in the Northwest.
Homomorphic Encryption
Encryption is in many ways the backbone of the modern digital economy, ensuring that online communications remain private, payment transactions are secure, and electronic records cannot be tampered with. Encryption works by mathematically transforming data into undecipherable text that can be read only by authorized people with the right decryption key. Ordinarily, data is encrypted when it is “at rest” (i.e., being stored) and when it is “in transit” (i.e., being transferred between users). However, to make use of encrypted data, users have typically had to decrypt it first in order to perform computations on it. The period in which data is decrypted creates a window of opportunity for bad actors to misappropriate the data and use it in ways that might undermine confidentiality. Homomorphic encryption builds on the benefits of encryption and helps close this window of vulnerability by allowing data to be processed while it remains fully encrypted.
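As a simple illustration, the sketch below implements a toy version of the Paillier cryptosystem, a well-known additively homomorphic scheme: two values are encrypted, their ciphertexts are combined, and the decrypted result equals the sum of the plaintexts, even though the computation happened entirely on encrypted data. The primes and messages are hypothetical and far too small for real-world security.

```python
# A toy version of the Paillier cryptosystem, an additively homomorphic
# scheme. The primes and messages are illustrative only and offer no
# real security; they simply keep the arithmetic readable.
import math
import random

# Key generation (illustrative only).
p, q = 293, 433                      # tiny primes; never use in practice
n, n_sq = p * q, (p * q) ** 2
g = n + 1                            # standard generator choice
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # lcm(p-1, q-1)

def L(x):
    return (x - 1) // n

mu = pow(L(pow(g, lam, n_sq)), -1, n)   # private key component

def encrypt(m):
    """Encrypt an integer 0 <= m < n under the public key (n, g)."""
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c):
    return (L(pow(c, lam, n_sq)) * mu) % n

# Homomorphic property: multiplying ciphertexts adds the plaintexts, so a
# third party can compute a sum without ever decrypting the inputs.
c1, c2 = encrypt(4200), encrypt(1337)
encrypted_sum = (c1 * c2) % n_sq
print(decrypt(encrypted_sum))        # 5537, computed entirely on ciphertexts
```

Fully homomorphic schemes generalize this idea to support both addition and multiplication, which is what makes training AI models on encrypted data feasible.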
Homomorphic encryption holds significant promise for industries that produce large volumes of sensitive data, such as the healthcare and financial services sectors. In pioneering new research, IBM has demonstrated that homomorphically encrypted data can be used to train AI systems without exposing any of the underlying data to human eyes. IBM is now leveraging the technique in a new partnership with Brazilian bank Banco Bradesco SA to create AI models that can learn from sensitive financial records while they remain encrypted. The ability to train AI systems on encrypted data means that banks and other heavily regulated entities will be able to outsource the development of prediction models trained on data that previously could not be shared externally due to security concerns.
Looking Ahead
As businesses and government agencies alike think about how they can do more with data, privacy-enhancing technologies offer an important pathway for increasing collaboration around shared data resources in a manner that aligns with the public’s expectation of privacy. Around the world, there is an increasing awareness that government data is a vastly underutilized resource. To help people across industries and communities take advantage of the benefits of AI, governments should explore how privacy-enhancing technologies like these can be leveraged to make more government data available as part of broader open data strategies.