Commentary
10/11/2019
07:00 AM
Catherine Nelson, Senior Data Scientist, Concur Labs

How to Secure Data Privacy While Growing Machine Learning

There are ways to increase user privacy while still preserving the accuracy of machine learning systems. Here are three practical options for CIOs.

Image: Duncanandison - stock.adobe.com

Data privacy is a huge topic right now for any company using personal data, and recent legislative activity, including the possibility of a new federal privacy law, has brought it to the forefront. Consumer concerns are also growing, with IBM reporting that 78% of consumers believe a company's ability to keep their data private is very important.

At the same time, machine learning improves products, delivering user benefits such as increased personalization, tailored experiences, and less time manually filling in forms. But machine learning requires data to train the system: without data, it can’t function. So, businesses say they face a conundrum: how can they increase user privacy while still building products powered by machine learning?

As the IT decision makers for their organizations, chief information officers must embrace the idea that privacy is not just an on/off switch where they either collect and use all or none of the data. There are new methods that allow increased user privacy while still preserving the accuracy of machine learning systems. Here are three practical options CIOs can introduce to increase user privacy.

1. Limit the personal data you collect

One of the simplest ways to increase user privacy is to limit the amount of personal data that is collected in the first place. My team and I created an internal prototype that is based on the principle that privacy should be a sliding scale, not just an on-off switch.

Our idea is an adjustable software feature -- a privacy dial -- that lets users, or their companies, adjust how much information is gathered by removing different levels of personally identifiable information. Developers can give users a control for choosing how much privacy they want, accompanied by an explanation of the benefits of each option. By understanding how levels of data sharing affect their user experience, users gain greater knowledge and control.

Image: Concur Labs Designer Jessica Park

At the lower dial settings, personal data that can directly identify a person is removed. As the setting increases, data that cannot directly identify a single person but still reveals additional information about an individual is removed as well. In most cases, personally identifiable information is not useful for a model's predictions, so removing it does not affect the accuracy of the final model.
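To make the idea concrete, here is a minimal sketch of how such a dial could work; the field names and level definitions are hypothetical illustrations, not the actual Concur Labs design:

```python
# Illustrative "privacy dial": higher settings strip more personally
# identifiable fields before a record is used for model training.
# Field names and levels are hypothetical examples.
PII_BY_LEVEL = {
    1: {"name", "email", "employee_id"},                    # direct identifiers
    2: {"name", "email", "employee_id", "phone"},
    3: {"name", "email", "employee_id", "phone",
        "home_city", "merchant_name"},                      # indirect identifiers too
}

def apply_privacy_dial(record: dict, level: int) -> dict:
    """Return a copy of the record with the fields for this dial level removed."""
    to_remove = PII_BY_LEVEL.get(level, set())
    return {k: v for k, v in record.items() if k not in to_remove}

expense = {"name": "A. User", "email": "a@example.com", "employee_id": "123",
           "phone": "555-0100", "home_city": "Seattle",
           "merchant_name": "Cafe X", "amount": 42.50, "category": "meals"}
print(apply_privacy_dial(expense, level=2))  # amount and category kept for the model
```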

Federated learning is a more complex option for limiting the amount of data collected from users: A model is trained on a user’s device, and only the trained model is then sent to central storage. This means the raw data never leaves a user’s personal device, but it still allows for high accuracy.
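A minimal sketch of the federated-averaging idea (illustrative only, not a production implementation): each device runs a few training steps on its own data, and only the resulting model weights, never the raw records, are sent to the server to be averaged.

```python
# Federated averaging sketch: raw data stays on each "device";
# only locally trained weights are shared and averaged.
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One device: a few gradient steps of linear regression on its own data."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean squared error
        w -= lr * grad
    return w                                     # only the weights leave the device

def federated_round(global_weights, device_datasets):
    """Server: average the locally trained weights; raw data is never collected."""
    local_weights = [local_update(global_weights, X, y) for X, y in device_datasets]
    return np.mean(local_weights, axis=0)

# Hypothetical usage with three devices holding synthetic data
rng = np.random.default_rng(0)
devices = [(rng.normal(size=(20, 3)), rng.normal(size=20)) for _ in range(3)]
w = np.zeros(3)
for _ in range(10):
    w = federated_round(w, devices)
```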

2. Only use a subset of the data

It’s also possible to increase user privacy at the stage where data is selected to train a machine learning model. One way to do this is to use k-anonymity to make users indistinguishable from others.

K-anonymity is achieved by aggregating or removing data that could indirectly reidentify a person (for example, the location of a business expense) until a certain number of entries are identical. “K” refers to the number of indistinguishable people in a dataset, so if k=3, at least three entries in the dataset share an identical combination of these quasi-identifying attributes. However, this method can greatly decrease the accuracy of a machine learning model and does not provide a strong guarantee of privacy.
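One simple way to apply this filter is to group records by their quasi-identifying columns and drop any group with fewer than k members. The sketch below uses pandas with hypothetical expense-report fields:

```python
# Minimal k-anonymity filter (illustrative): keep only rows whose combination
# of quasi-identifying columns appears at least k times in the dataset.
import pandas as pd

def enforce_k_anonymity(df: pd.DataFrame, quasi_identifiers: list, k: int) -> pd.DataFrame:
    group_sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return df[group_sizes >= k]

df = pd.DataFrame({
    "expense_city": ["Seattle", "Seattle", "Seattle", "London", "Paris"],
    "job_level":    ["L5",      "L5",      "L5",      "L7",     "L6"],
    "amount":       [12.0,      30.0,      18.5,      200.0,    75.0],
})
print(enforce_k_anonymity(df, ["expense_city", "job_level"], k=3))
# Only the three Seattle/L5 rows remain; unique combinations are suppressed.
```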

3. Prevent data leaks in the model’s predictions

Machine learning models can expose rare examples from their training data in their predictions, causing a possible loss of privacy to users. Differential privacy can prevent this. Differential privacy is a mathematical definition that guarantees that for any transformation of data, the probability of any specific result being returned is nearly the same, whether an individual is in a dataset or not. So, a differentially private machine learning model makes virtually the same predictions whether a person’s data is included or not -- it learns about the population, not the individual.
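In its standard textbook form (not specific to this article), (ε, δ)-differential privacy requires that for any two datasets D and D′ that differ in one individual’s data, and for any set of possible outputs S, a randomized mechanism M satisfies:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[M(D') \in S] + \delta
```

A smaller ε means the two probabilities are closer together, so the output reveals less about whether any one individual’s data was used.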

Google released TensorFlow Privacy, which allows the training of differentially private models. This module for the popular machine learning framework offers strong mathematical guarantees that prevent individual users’ data from being memorized, while still maximizing model accuracy. This is a big step forward because it fits easily into existing workflows.
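A minimal sketch of training a differentially private Keras model with TensorFlow Privacy follows; module paths and arguments may differ between library versions, and the hyperparameter values are illustrative, not recommendations:

```python
# Sketch of differentially private training with TensorFlow Privacy's DP optimizer.
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer_keras import DPKerasSGDOptimizer

model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dense(2),
])

optimizer = DPKerasSGDOptimizer(
    l2_norm_clip=1.0,        # clip each per-example gradient to this L2 norm
    noise_multiplier=1.1,    # Gaussian noise added to the clipped gradients
    num_microbatches=32,     # must evenly divide the batch size
    learning_rate=0.15,
)

# The loss must be returned per example (no reduction) so gradients can be
# clipped and noised per microbatch.
loss = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

model.compile(optimizer=optimizer, loss=loss, metrics=["accuracy"])
# model.fit(x_train, y_train, epochs=5, batch_size=32)
```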

With these methods available, there is no conundrum: increasing privacy doesn’t stop businesses from using machine learning. It’s up to CIOs to expand their view of privacy and guide product developers and data scientists to start incorporating these methods into their products.

Catherine Nelson is a Senior Data Scientist for Concur Labs at SAP Concur, where she explores innovative ways to use machine learning to improve the experience of the business traveler. She is particularly interested in privacy-preserving ML and applying deep learning to enterprise data. In her previous career as a geophysicist she studied ancient volcanoes and explored for oil in Greenland. Nelson has a PhD in geophysics from Durham University and a master’s in Earth Sciences from Oxford University.
