Data privacy is a major concern for any company that uses personal data, and recent legislative activity, including the possibility of a new federal privacy law, has brought it to the forefront. Consumer concern is also growing: IBM reports that 78% of consumers believe a company's ability to keep their data private is very important.
At the same time, machine learning improves products, delivering user benefits such as increased personalization, tailored experiences, and less time manually filling in forms. But machine learning requires data to train the system: without data, it can’t function. So, businesses say they face a conundrum: how can they increase user privacy while still building products powered by machine learning?
As the IT decision makers for their organizations, chief information officers must embrace the idea that privacy is not just an on/off switch where they either collect and use all of the data or none of it. There are new methods that allow increased user privacy while still preserving the accuracy of machine learning systems. Here are three practical options CIOs can introduce to increase user privacy.
1. Limit the personal data you collect
One of the simplest ways to increase user privacy is to limit the amount of personal data that is collected in the first place. My team and I created an internal prototype based on the principle that privacy should be a sliding scale, not just an on/off switch.
Our idea is an adjustable software feature -- a privacy dial -- that lets users, or their companies, increase or decrease the amount of information gathered by removing different levels of personally identifiable information. Developers can give users a control for choosing how much privacy they want, accompanied by an explanation of the benefits of each option. By understanding how levels of data sharing affect their user experience, users have greater knowledge and control.
At the lower dial settings, only the personal data that can be used to directly identify a person is removed. At higher settings, the dial also removes data that cannot directly identify a single person but can still provide additional information about an individual. In most cases, personally identifiable information is not useful for a model’s predictions, so removing it does not affect the accuracy of the final model.
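As a rough illustration of the dial, the sketch below drops fields from a record according to a chosen level. The field groupings, the level numbers, and the function name `apply_privacy_dial` are assumptions made for this example; they are not taken from the prototype described above.

```python
from typing import Any, Dict

# Hypothetical grouping of fields by how identifying they are. A real product
# would define its own categories; these names are assumptions.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}
INDIRECT_IDENTIFIERS = {"location", "job_title", "date_of_birth"}


def apply_privacy_dial(record: Dict[str, Any], level: int) -> Dict[str, Any]:
    """Return a copy of the record with fields removed for the given dial level.

    level 0: keep everything
    level 1: remove direct identifiers
    level 2: also remove indirect identifiers
    """
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1, or 2")
    removed = set()
    if level >= 1:
        removed |= DIRECT_IDENTIFIERS
    if level >= 2:
        removed |= INDIRECT_IDENTIFIERS
    return {key: value for key, value in record.items() if key not in removed}


# Illustrative expense record (made-up data).
expense = {
    "name": "Alex Doe",
    "email": "alex@example.com",
    "location": "Seattle",
    "amount": 42.50,
    "category": "meals",
}
low_setting = apply_privacy_dial(expense, 1)   # direct identifiers removed
high_setting = apply_privacy_dial(expense, 2)  # indirect identifiers removed too
```

Filtering the record before it is stored, rather than after, means the removed fields never reach the training data at all.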
Federated learning is a more complex option for limiting the amount of data collected from users: a model is trained on a user’s device, then the trained model, rather than the data, is sent to a central server. This means the raw data never leaves a user’s personal device, but it still allows for high accuracy.
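To make the federated idea concrete, here is a minimal sketch of federated averaging for a one-parameter model, y = w * x. The toy data, learning rate, and function names are assumptions for illustration only; production systems add secure aggregation, client sampling, and much more.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs that stay on one device


def local_train(w: float, data: Dataset, lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient descent steps on one device for the model y = w * x."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global: float, devices: List[Dataset]) -> float:
    """One round: each device trains locally, the server averages the weights.

    Only the trained weight leaves each device, never the (x, y) pairs.
    The average is weighted by how many examples each device holds.
    """
    total = sum(len(data) for data in devices)
    updates = [(local_train(w_global, data), len(data)) for data in devices]
    return sum(w * n for w, n in updates) / total


# Made-up data on three devices, all generated from y = 3 * x.
devices = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(1.5, 4.5), (2.5, 7.5), (0.5, 1.5)],
]
w = 0.0
for _ in range(50):
    w = federated_round(w, devices)
```

After enough rounds the shared weight approaches the value that fits all devices' data, even though the server only ever sees model weights.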
2. Only use a subset of the data
It’s also possible to increase user privacy at the stage where data is selected to train a machine learning model. One way to do this is to use k-anonymity to make users indistinguishable from others.
K-anonymity is achieved by aggregating or removing data that could indirectly reidentify a person (for example, the location of a business expense) until a certain number of entries are identical. “K” refers to the minimum number of entries that share the same combination of these indirectly identifying attributes, so if k=3, then every such combination appears in at least three entries in the dataset. However, this method can greatly decrease the accuracy of a machine learning model and does not provide a strong guarantee of privacy.
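The sketch below shows one way to measure k for a small table and to raise it by aggregating locations into coarser regions. The column names, the city-to-region lookup, and the sample rows are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

Row = Dict[str, str]


def k_of(rows: List[Row], quasi_identifiers: Sequence[str]) -> int:
    """Return the size of the smallest group of rows that share the same
    values for the indirectly identifying columns."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values())


# Hypothetical lookup used to aggregate a city into a coarser region.
CITY_TO_REGION = {
    "Seattle": "US-West",
    "Portland": "US-West",
    "Boston": "US-East",
    "New York": "US-East",
}


def generalize_location(rows: List[Row]) -> List[Row]:
    """Replace each city with its region so that more rows become identical."""
    return [{**row, "location": CITY_TO_REGION[row["location"]]} for row in rows]


# Made-up expense entries.
expenses = [
    {"location": "Seattle", "category": "meals"},
    {"location": "Portland", "category": "meals"},
    {"location": "Seattle", "category": "meals"},
    {"location": "Boston", "category": "meals"},
    {"location": "New York", "category": "meals"},
    {"location": "Boston", "category": "meals"},
]
columns = ["location", "category"]
k_before = k_of(expenses, columns)                      # some cities appear once
k_after = k_of(generalize_location(expenses), columns)  # regions group rows together
```

The trade-off mentioned above is visible here: raising k discards the city-level detail that a model might otherwise have used.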
3. Prevent data leaks in the model’s predictions
Machine learning models can expose rare examples from their training data in their predictions, causing a possible loss of privacy to users. Differential privacy can prevent this. Differential privacy is a mathematical definition that guarantees that for any transformation of data, the probability of any specific result being returned is nearly the same, whether an individual is in a dataset or not. So, a differentially private machine learning model makes virtually the same predictions whether a person’s data is included or not -- it learns about the population, not the individual.
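For a simple counting query, one standard way to meet this definition is the Laplace mechanism: add noise scaled to how much one person can change the answer. The sketch below is a minimal illustration, not a production implementation; the function name `private_count`, the sample ages, and the epsilon value are assumptions.

```python
import random
from typing import Callable, Iterable


def private_count(
    values: Iterable[int],
    predicate: Callable[[int], bool],
    epsilon: float,
    rng: random.Random,
) -> float:
    """Return a noisy count of the values that match the predicate.

    Adding or removing one person changes a count by at most 1 (its
    sensitivity), so Laplace noise with scale 1 / epsilon is added.
    Smaller epsilon means more noise and stronger privacy.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0
    rate = epsilon / sensitivity
    # The difference of two exponential draws with the same rate is
    # Laplace-distributed with scale 1 / rate.
    noise = rng.expovariate(rate) - rng.expovariate(rate)
    return true_count + noise


rng = random.Random(0)
ages = [23, 35, 41, 52, 29, 60]  # made-up data

# The same query with and without the last person's record.
with_person = private_count(ages, lambda a: a >= 40, epsilon=0.5, rng=rng)
without_person = private_count(ages[:-1], lambda a: a >= 40, epsilon=0.5, rng=rng)
```

Because the noise is comparable in size to one person's contribution, an observer of either answer cannot tell with confidence whether that person was in the dataset.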
Google released TensorFlow Privacy, which allows the training of differentially private models. The new module for this popular machine learning framework offers strong mathematical guarantees that prevent individual users’ data from being memorized, while still maximizing model accuracy. This is a big step forward because it fits easily into existing workflows.
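The general technique behind differentially private training is to limit each example's influence on a gradient step and then add noise. The plain-Python sketch below illustrates that idea only; it does not use or reproduce the TensorFlow Privacy API, and the function and parameter names are assumptions.

```python
import math
import random
from typing import List


def clip(grad: List[float], max_norm: float) -> List[float]:
    """Scale a gradient down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_average_gradient(
    per_example_grads: List[List[float]],
    max_norm: float,
    noise_multiplier: float,
    rng: random.Random,
) -> List[float]:
    """Clip each example's gradient, sum, add Gaussian noise, then average.

    Clipping bounds how much any one example can move the model; the noise
    (standard deviation noise_multiplier * max_norm) masks what remains.
    """
    clipped = [clip(g, max_norm) for g in per_example_grads]
    dim = len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [value / len(clipped) for value in noisy]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]  # made-up per-example gradients
update = dp_average_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

The first gradient has norm 5 and is scaled down to norm 1, so an unusual training example cannot dominate the update.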
With these methods available, there is no conundrum: increasing privacy doesn’t stop businesses from using machine learning. It’s up to CIOs to expand their view of privacy and guide product developers and data scientists to start incorporating these methods into products.
Catherine Nelson is a Senior Data Scientist for Concur Labs at SAP Concur, where she explores innovative ways to use machine learning to improve the experience of a business traveler. She is particularly interested in privacy-preserving ML and applying deep learning to enterprise data. In her previous career as a geophysicist she studied ancient volcanoes and explored for oil in Greenland. Nelson has a PhD in geophysics from Durham University and a Masters of Earth Sciences from Oxford University.