Want to know if--and when--you'll catch the flu bug this season? TwitterHealth, a research project at the University of Rochester in New York, can predict with uncanny accuracy which Twitter users will become ill--simply by studying their tweets.
The project, designed to show how researchers can use data mining and machine learning to build knowledge systems, was a topic of discussion at the Rochester Big Data Forum, a three-day conference of computer scientists held Oct. 4-6. The forum was the first event of the university's newly launched RocData, or Rochester Big Data Initiative, which is designed to inspire collaboration between data scientists and researchers in other fields, such as medicine and education.
TwitterHealth began as a research project to explore various ways of analyzing geographic information, such as GPS data from cellphones, noted Henry Kautz, chair of the University of Rochester's computer science department and organizer of the Big Data Forum.
"We realized that more social media sites are including geographic data automatically in the posts you make," said Kautz in a phone interview with InformationWeek. "So when people post to Twitter from their cellphones, by and large you get the location, and you can download that data."
Kautz's students set up a network of computers that could download tweets from major metropolitan areas. They then had to determine what actionable information was contained in this large volume of data, which included Twitter users' geographic locations.
"One thing we realized was that people often tweet about the state of their health," said Kautz. "They'll report they have a running nose. They have a cold. They're not feeling well. We said, 'Can we use this to track seasonal flu?'"
The group began training a series of machine-learning algorithms, starting with a few hundred tweets that were "hand-labeled examples, as in 'these are tweets about feeling sick,'" Kautz said.
The resulting system was able to determine with 99% accuracy whether a given Twitter user was reporting a flu-like illness. In fact, the automated, real-time model was nearly as accurate as humans who analyzed the text, and it delivered results faster than the Centers for Disease Control and Prevention (CDC).
"From this data, we can track the spread of seasonal flu, and do so with very good accuracy--comparable accuracy that you get with the CDC data," Kautz said.
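The general idea of learning from hand-labeled tweets can be illustrated with a small sketch. The article does not say which algorithms the Rochester group used, so the example below substitutes a simple bag-of-words naive Bayes classifier, a common baseline for text classification. The sample tweets, the labels "sick" and "other," and the function names are all invented for illustration and are not taken from the project.

```python
import math
import re
from collections import Counter

# Hypothetical hand-labeled examples, invented for this sketch (not project data).
LABELED_TWEETS = [
    ("i have a cold and a runny nose", "sick"),
    ("feeling sick today staying in bed with a fever", "sick"),
    ("ugh the flu got me not feeling well", "sick"),
    ("coughing all night my throat hurts", "sick"),
    ("great coffee downtown this morning", "other"),
    ("watching the game with friends tonight", "other"),
    ("traffic on the bridge is terrible today", "other"),
    ("new phone arrived and it looks great", "other"),
]


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def train(examples):
    """Count words per label; returns (word_counts, label_counts, vocab)."""
    word_counts = {}
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in tokenize(text):
            counts[word] += 1
            vocab.add(word)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Log prior: fraction of training tweets carrying this label.
        score = math.log(label_counts[label] / total_examples)
        label_total = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (label_total + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(LABELED_TWEETS)
print(classify("stuck in bed with a fever and a cold", model))
print(classify("great game tonight with friends", model))
```

A toy model trained on eight examples says nothing about the 99% figure reported for the real system; it only shows how a handful of labeled tweets can seed a classifier that is then applied to new, unlabeled posts.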
The success of TwitterHealth has led some students who work on the project to launch a startup company, which has licensed the technology from the university. Their goal is to take the same algorithmic approach to track other types of trends.
"There are commercial applications. Instead of health reports, it might track people's interest in fashion... and how ideas about popular culture spread from place to place," said Kautz.
But Kautz is particularly intrigued by the technology's potential in healthcare. "Gathering health data by surveys is very slow and expensive," he said. TwitterHealth also shows promise as a way to combat depression and suicide, and as a health alert system for cities.
"From analyzing these data sets, we've discovered that if you are on certain streets, and spend time in certain restaurants and at certain places, that greatly increases your chance of getting the flu," said Kautz.
The public nature of Twitter posts makes them ideal for big data analytics, but Facebook's more private approach to social networking poses a problem. One option is to convince Facebook users to sign up for a TwitterHealth-style service; another is to convince Facebook to provide access to its members' private posts.
"One of our students has had some conversations with Facebook, but there's nothing that's come to fruition yet--maybe in the future," said Kautz.