The reliability of modern data mining and analytics approaches in predicting the progression of deadly diseases is being fully tested by the Ebola outbreak in West Africa.
Since the current outbreak began in March, most of the official forecasts on Ebola have, not surprisingly, come from organizations such as the US Centers for Disease Control and Prevention (CDC) and the World Health Organization (WHO). Both the CDC and WHO have relied mostly on conventional epidemiological approaches and measures to arrive at their estimates of how far and how quickly the disease is likely to spread.
WHO's grim forecasts this week, for instance, of 70% fatality rates and 1,000 new infections per week, are based on data gathered from hospitals and clinics in Sierra Leone, Liberia, Guinea, and Nigeria about people who have died from Ebola or are reporting Ebola-like symptoms.
The organization has been supplementing such data with more informal reports gathered from medical diagnostic facilities and burial grounds in the affected region to try to build a more complete picture of the nature and scope of the deadly threat.
Similarly, the CDC's EbolaResponse models are based on information gathered about patients in various stages of the Ebola lifecycle, including those who are susceptible, infectious, incubating, or recovering from the disease.
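The four stages named above resemble the compartments of a textbook SEIR (susceptible, exposed, infectious, removed) model. The sketch below is only an illustration of that general idea, not the CDC's EbolaResponse tool: the function name `simulate_seir`, the daily time step, and every parameter value are assumptions chosen for readability, and the final compartment lumps together everyone who is no longer infectious.

```python
# Minimal discrete-time SEIR sketch. This is NOT the CDC's model; all
# parameter values below are illustrative assumptions, not published figures.

def simulate_seir(population, initial_infectious, beta, sigma, gamma, days):
    """Step a simple SEIR model forward one day at a time.

    beta  : new infections caused per infectious person per day (assumed)
    sigma : 1 / average incubation period in days (assumed)
    gamma : 1 / average infectious period in days (assumed)
    Returns a list of (S, E, I, R) tuples, one per day, starting with day 0.
    """
    s = float(population - initial_infectious)  # susceptible
    e = 0.0                                     # incubating (exposed)
    i = float(initial_infectious)               # infectious
    r = 0.0                                     # removed (no longer infectious)
    history = [(s, e, i, r)]
    for _ in range(days):
        new_exposed = beta * s * i / population  # S -> E
        new_infectious = sigma * e               # E -> I
        new_removed = gamma * i                  # I -> R
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_removed
        r += new_removed
        history.append((s, e, i, r))
    return history


if __name__ == "__main__":
    # Hypothetical scenario: one million people, ten initial cases.
    run = simulate_seir(population=1_000_000, initial_infectious=10,
                        beta=0.3, sigma=1 / 9, gamma=1 / 7, days=120)
    for day in range(0, 121, 30):
        s, e, i, r = run[day]
        print(f"day {day:3d}: susceptible={s:,.0f} incubating={e:,.0f} "
              f"infectious={i:,.0f} removed={r:,.0f}")
```

Changing `beta` is the simplest way to see how strongly a forecast of this kind depends on the assumed transmission rate, which is one reason such estimates are revised as new case data arrives.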
But with the outbreak threatening to become a pandemic, some researchers are trying to better predict where the disease might spread next by also looking at more real-time data.
One example is HealthMap, a disease-monitoring website maintained by a team of researchers and epidemiologists at Boston Children's Hospital. The site provides early detection and real-time surveillance of emerging health threats by aggregating and analyzing information from multiple sources, including social media streams, online news stories, travel sites, and official reports from sources such as WHO.
HealthMap hit the news earlier this year when it became one of the first sites anywhere to pick up mentions of the current Ebola outbreak. On March 14, about nine days before government authorities in Guinea had even informed the World Health Organization of the outbreak, HealthMap had picked up mentions of the first new infections from a local newspaper in Macenta prefecture in Guinea.
Another example is an effort by researchers at Germany's Humboldt University to predict the likelihood of Ebola spreading to different countries by looking at global air transportation networks and how passengers might use them to travel between countries. The researchers have developed a model that computes the relative risk of Ebola entering a specific country based on what they call the "effective distance" by air of that country from major Ebola infection zones.
Instead of looking only at the geographic distance between two countries, the model also looks at other measures -- like the frequency of air traffic, transit points, and the number of travelers between specific airports in different countries -- to calculate the likelihood of Ebola entering a given country. "In a nutshell, places that exchange lots of passengers are closer than places that exchange a few," the site reads.
So Paris's Charles de Gaulle airport has a higher relative risk of importing Ebola from Conakry, Guinea than New York's John F Kennedy because it is closer, processes more daily passengers, and is served by a direct flight, among other factors.
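The article does not give the researchers' formula, so the following is a rough sketch of how an "effective distance" could be computed. It assumes one formulation from the published literature on the concept: the length of a single hop is 1 - ln(P), where P is the fraction of travelers leaving one airport who fly to the next, and the effective distance is the shortest total length over all routes. The function name, the placeholder airport labels, and the passenger counts are all hypothetical and do not describe real traffic.

```python
import heapq
import math


def effective_distances(passengers, source):
    """Shortest-path effective distance from `source` to every reachable node.

    passengers[a][b] is the number of travelers per day from a to b
    (hypothetical numbers). Each hop a -> b has length 1 - ln(P), where P is
    the share of a's outgoing travelers who go to b. This hop length is an
    assumed formulation, not something stated in the article.
    """
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, math.inf):
            continue  # stale queue entry; a shorter route was already found
        outgoing = passengers.get(node, {})
        total = sum(outgoing.values())
        for neighbor, flow in outgoing.items():
            if flow <= 0:
                continue  # no travelers on this leg
            step = 1.0 - math.log(flow / total)  # always >= 1
            candidate = d + step
            if candidate < dist.get(neighbor, math.inf):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


if __name__ == "__main__":
    # Made-up network: ORIGIN is the outbreak airport, HUB_* are transit points.
    network = {
        "ORIGIN": {"HUB_A": 300, "HUB_B": 100},
        "HUB_A": {"CITY_X": 500, "CITY_Y": 500},
        "HUB_B": {"CITY_Y": 50, "CITY_Z": 950},
    }
    ranked = sorted(effective_distances(network, "ORIGIN").items(),
                    key=lambda item: item[1])
    for airport, distance in ranked:
        print(f"{airport:7s} effective distance = {distance:.2f}")
```

In this toy network, HUB_A ends up "closer" to the origin than HUB_B simply because it receives three times as many of the origin's passengers, which mirrors the quoted idea that places exchanging many passengers are closer than places exchanging few.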
Meanwhile, a team of Oxford University scientists in the United Kingdom is trying to predict where future outbreaks of Ebola might happen in Africa by looking at animal populations in which the virus may be residing. According to a description of the effort, researchers at Oxford looked at aggregated data involving reports of human and animal infections since 1976 to see if there are any common factors, such as temperature, vegetation, or geography, linking them.
"The researchers were then able to create a map identifying similar areas where the virus is likely to be carried by animals and there is a risk of transmission to humans triggering future outbreaks," the description noted.
The jury is still out on the reliability of such methods in predicting the progression of infectious diseases like Ebola, according to Kalev Leetaru, a Yahoo fellow-in-residence at Georgetown University and a member of the World Economic Forum's Global Agenda Council on the Future of Government. Looking at social media streams, news reports, herd migrations, travel patterns, shipping routes, and other unconventional data can provide early information on the arrival or spread of an infectious disease.
But there are caveats, Leetaru pointed out. Social media monitoring, for example, is of limited value or might provide an incomplete picture in areas with limited use of social media tools like Twitter and Facebook. Similarly, he added, many early mentions of a disease outbreak might be in a language other than English and therefore not always picked up by monitoring tools.
For instance, though HealthMap is credited with being the first to detect mentions of the current Ebola outbreak, Leetaru noted that government officials in Guinea had already publicly announced the outbreak a day earlier. "In the forested areas of Guinea it is unlikely that a lot of people are going about live-tweeting [details of the Ebola epidemic]," he said. Often, the information that does get out is via traditional government sources.
"Data modeling can be quite accurate and useful in predicting the outbreak of contagious diseases, but it must be continually refined and cross-checked," added Michael Hendrix, director of emerging research and issues at the US Chamber of Commerce Foundation. "Think of big data as offering a trip wire of sorts for alerting first responders," he said.
Reports on social media, while not always 100% reliable, often carry early warning signals well before official data sources do, Hendrix noted. Such information can be useful in alerting responders to a new threat but works best when combined with official data sources.
"Big data doesn't replace traditional data sources or surveillance networks in watching for outbreaks -- it helps make them better," Hendrix said. "And when the worst happens, data helps medical professionals and public health experts do their job better."
Jai Vijayan is a seasoned technology reporter with over 20 years of experience in IT trade journalism. He was most recently a Senior Editor at Computerworld, where he covered information security and data privacy issues for the publication.