In 2008, when Google began tracking flu-related search terms as a way to estimate flu infections, researchers were optimistic about the potential of the Internet as a medium for data mining. Since then, Google Flu Trends hasn't performed as well as hoped.
Now it's Twitter's turn. Scientists from Pennsylvania State University claim to have developed a way to identify Twitter posts that are viral in the medical sense of the word.
In a recently published paper, "On the Ground Validation of Online Diagnosis with Twitter and Medical Records," Penn State researchers say they have created "a system for making an accurate influenza diagnosis based on an individual's publicly available Twitter data."
The researchers obtained information from Penn State University's Health Services about 104 individuals who had been diagnosed with influenza by a medical professional during the 2012-2013 flu season. They also obtained data about 122 people who had not been diagnosed with the flu during this period. After discarding the data of a handful of individuals for a variety of reasons, the researchers analyzed the tweets from both groups to determine whether they could diagnose influenza from Twitter posts.
The researchers demonstrated that they could indeed make that determination with greater than 99% accuracy, by combining text analysis, anomaly detection, and social network analysis.
There are related projects underway: The Parkinson's Voice Initiative, for example, is an effort to detect Parkinson's symptoms from voice analysis. But voice analysis involves active user participation; Twitter data is published and awaiting data miners.
The implications from a healthcare perspective are promising, as the Penn State research suggests a further method to complement traditional epidemiological data collection.
The implications from a privacy perspective, however, are rather chilling: "It would seem that simply avoiding discussing an illness is not enough to hide one's health in the age of big data," the researchers conclude.
The Penn State researchers note that although they focused on remotely reconstructing a confidential diagnosis of influenza, this technique could be used to identify diseases that carry greater social stigma, such as HIV. Social media now clearly has a potential social cost.
At the same time, awareness of this technique could undermine it. That was part of the problem with Google Flu Trends -- news reports about influenza and about the way researchers were trying to correlate Google search queries with influenza cases made Google Flu Trends less accurate. There was more to it than that, however.
Reports in Nature in 2013 and Science in 2014 took issue with the accuracy of Google Flu Trends data during the 2011-2012 and 2012-2013 flu seasons. The paper that appeared in Science, "The Parable of Google Flu: Traps in Big Data Analysis," cited problems with Google's algorithm and what the paper's authors called "big data hubris," the assumption that online data collection can replace, rather than augment, traditional data collection methods.
Google has been taking steps to improve Flu Trends. But the authors of the Science paper, David Lazer and Gary King of Harvard, Ryan Kennedy of the University of Houston, and Alessandro Vespignani of Northeastern University, claim in a separate paper, "Google Flu Trends Still Appears Sick: An Evaluation of the 2013-2014 Flu Season," that the issues identified with Google Flu Trends have gotten worse.
Despite some positive effects from Google's effort to dampen anomalous data spikes, the researchers say a major issue is Google's lack of transparency and lack of communication with researchers, who want access to Google's data to check its results. "[Google Flu Trends] has not been very forthcoming with [its data] in the past, going so far as to release misleading example search terms in previous publications," the authors write.
"We review the Flu Trends model each year to determine how we can improve. We welcome feedback on how we can refine Flu Trends to help estimate flu levels and complement existing surveillance systems," a Google spokesperson said via email.
Social media data mining might provide unprecedented insight into undisclosed medical conditions, but it also provides ample opportunity for errors and raises profound privacy questions.