With a handful of social media posts or other geotagged data sources, researchers have demonstrated that they can identify an individual's purchases in an anonymized set of metadata. In a paper published Friday in Science, researchers from MIT, Aarhus University, and Rutgers show that with just four metadata points that link an individual to a location and/or time -- a Tweet tied to a place or a timestamped receipt, for example -- that individual's credit card transactions can be identified 90% of the time in an anonymized database of 1.1 million credit card records.
"A data set's lack of names, home addresses, phone numbers, or other obvious identifiers … does not make it anonymous or safe to release to the public and to third parties," the paper explained.
The significance of these findings can't be overstated. Large-scale anonymized sets of data and metadata have become indispensable in academia, industry, and government, according to the paper. Epidemiologists rely on anonymized data to track disease outbreaks. Retailers rely on anonymized data sets for business insights. Netflix uses anonymized viewing data to make recommendations. Google uses anonymized location data to provide traffic information.
To underscore the value of data for businesses, Greg Corrado, senior research scientist at Google, put it thus at the ReWork Deep Learning Summit in San Francisco on Thursday: "If you don't have a mountain of data, you probably should have a mountain of data."
The paper's authors -- MIT graduate student Yves-Alexandre de Montjoye, MIT professor Alex "Sandy" Pentland, Rutgers assistant professor Vivek Singh, and Tel Aviv University post-doctoral student Laura Radaelli -- use data and metadata interchangeably. Though there's an arguable distinction -- metadata is data that describes other data -- that distinction is more political than fundamental: it serves to define the parameters of allowable surveillance.
Data is the fuel that powers the knowledge economy, but as the paper's authors point out, the transformational power of data depends on its availability. Such data may become more difficult to obtain if it can be de-anonymized.
This study isn't the first to note the ease with which anonymous data sets can be de-anonymized to identify specific individuals. Researchers found ways to identify people from the AOL search query dataset released in 2006.
A paper published by researchers at the University of Texas at Austin in 2008 describes how the Netflix Prize dataset was de-anonymized. In 2013, one of the MIT paper's authors, Yves-Alexandre de Montjoye, participated in prior work on identifying mobile phone users from four points of spatiotemporal data. That same year, Latanya Sweeney showed that anonymous Personal Genome Project listings can be identified by name using data in public records.
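The four-point re-identification idea described above can be illustrated with a toy experiment. The Python sketch below is not the researchers' code and uses no real records: it generates entirely synthetic traces of (shop, day) visits, then measures what fraction of people are the only match for a few points drawn from their own trace. The function names, the parameters, and the data sizes are all assumptions made for illustration.

```python
import random


def build_toy_traces(num_people=200, num_shops=30, num_days=30, visits=20, seed=0):
    """Generate synthetic traces: person ID -> set of (shop, day) points.

    All values are made up for illustration; nothing here is real data.
    """
    rng = random.Random(seed)
    return {
        person: {
            (rng.randrange(num_shops), rng.randrange(num_days))
            for _ in range(visits)
        }
        for person in range(num_people)
    }


def matches(traces, known_points):
    """Return the IDs of every trace that contains all of the known points."""
    return [pid for pid, trace in traces.items() if known_points <= trace]


def unicity(traces, num_points, seed=1):
    """Fraction of people uniquely matched by num_points of their own points."""
    rng = random.Random(seed)
    unique = 0
    for pid, trace in traces.items():
        # An outside observer knows only a few points from this person's trace.
        k = min(num_points, len(trace))
        known = set(rng.sample(sorted(trace), k))
        if matches(traces, known) == [pid]:
            unique += 1
    return unique / len(traces)


if __name__ == "__main__":
    traces = build_toy_traces()
    for n in range(1, 5):
        print(f"{n} known point(s): {unicity(traces, n):.0%} uniquely identified")
```

In a sketch like this, no names or account numbers are needed: the combination of a few places and times acts as the identifier. The exact percentages depend on the synthetic parameters chosen here and say nothing about the figures reported in the paper.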