Last year, Google Flu Trends made a mountain out of a molehill by overestimating the incidence of influenza. Blame the media.
Google made its name by counting online links as votes for the most relevant answer to search queries. In response, Internet users began gaming Google's election count by voting early and often -- creating extra links to make their websites rank higher in Google's index -- and Google was forced to take countermeasures to defend against manipulation.
Yet the company had to learn this lesson again with its Flu Trends website. Google created Flu Trends in 2008, based on the insight that searches about the flu have some correlation with the number of people dealing with the flu.
"[I]f we tally each day's flu-related search queries, we can estimate how many people have a flu-like illness," the company said in 2008 when it launched the service.
Google's laudable goal was to provide people with more timely information about the spread of the flu than traditional epidemiological surveillance data compiled by the Centers for Disease Control. But the company had to revise its approach to ensure that its data, in addition to being timely, is accurate.
During the 2012-13 flu season, Google Flu Trends got it wrong. As the company documents in a recently published analysis of its approach to disease tracking, Google overestimated the incidence of flu in the U.S. by more than six percentage points, almost six times higher than the highest estimation error seen since the site launched. In the week of Jan. 13, 2013, Google put the incidence of flu in the U.S. at 10.56% of the population. The CDC put the number at only 4.52%.
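The size of the miss can be checked from the two figures quoted above. The short Python sketch below does only that arithmetic; the function name is made up for illustration and nothing here comes from Google's analysis beyond the two percentages.

```python
# Figures quoted in the article for the week of Jan. 13, 2013.
GOOGLE_ESTIMATE_PCT = 10.56
CDC_ESTIMATE_PCT = 4.52


def overestimate(google_pct, cdc_pct):
    """Return (gap in percentage points, ratio of the two estimates)."""
    gap = google_pct - cdc_pct
    ratio = google_pct / cdc_pct
    return gap, ratio


gap_points, ratio = overestimate(GOOGLE_ESTIMATE_PCT, CDC_ESTIMATE_PCT)
print(f"gap: {gap_points:.2f} percentage points")
print(f"ratio: {ratio:.2f}x the CDC figure")
```

The gap works out to 6.04 percentage points, which means Google's estimate was more than double the CDC's.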
What went wrong? Two words: The media. Google says it has concluded that its disease-detection algorithms "were susceptible to heightened media coverage."
This probably wasn't a difficult conclusion to reach because Google has been aware of the problem since it launched Flu Trends. Following the website's debut in 2008, the New York Times published an article about Google Flu Trends and included an example query that Google was actually monitoring in its flu prediction model. As a result, many Internet users tried that search term, driving up query volume and skewing Google's results.
The lesson here is rich with irony: To effectively assess data from a public source, the algorithm must remain private, or someone will attempt to introduce bias.
Google has been relying on "spike detectors" to compensate for surges of "inorganic" search traffic. But it turns out that Google underestimated the influence of the media. The company anticipated that search query spikes would last three days to a week. During the 2012-13 flu season, they lasted for months.
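Google has not published its spike detectors in a form that can be reproduced here, so the following Python sketch is purely hypothetical. It shows one way a detector tuned for short surges could miss a long one: each day is compared with the median of the preceding two weeks, and the function name, window length, threshold, and query volumes are all invented for illustration.

```python
from statistics import median


def flag_spikes(volumes, window=14, threshold=1.5):
    """Flag day i when its volume exceeds threshold * the median of the
    previous `window` days. Days without a full window are never flagged.
    Hypothetical detector; not Google's actual method."""
    flags = []
    for i, volume in enumerate(volumes):
        if i < window:
            flags.append(False)
            continue
        baseline = median(volumes[i - window:i])
        flags.append(volume > threshold * baseline)
    return flags


# Invented daily query counts: two quiet weeks, then a surge.
quiet = [100] * 14
short_surge = quiet + [300] * 5 + [100] * 10   # surge lasts 5 days
long_surge = quiet + [300] * 60                # surge lasts 60 days

short_flags = flag_spikes(short_surge)
long_flags = flag_spikes(long_surge)
print(sum(short_flags), "of 5 short-surge days flagged")
print(sum(long_flags), "of 60 long-surge days flagged")
```

In this toy setup every day of the five-day surge is flagged, but the 60-day surge is flagged only for its first week. After that, the inflated traffic becomes the baseline and passes as normal.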
Google also notes in its analysis that it did not update its flu prediction model annually because the one built in 2009 had been performing well.
So to make its flu forecast more accurate, Google adjusted its spike detection algorithm to better assess the influence of the media. It also modified its algorithm by applying a statistical method called Elastic Net. Using these techniques, the gap between Google Flu Trends and CDC data last season would have been only about one percentage point.
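For readers unfamiliar with the term, Elastic Net is a regularized linear regression that blends the lasso (L1) and ridge (L2) penalties. One common way to write it is shown below. The symbols are generic, and treating y as the CDC-reported flu rates and X as search-query volumes is an assumption for illustration, not a detail from Google's analysis.

```latex
\hat{\beta} = \arg\min_{\beta} \left\{
  \frac{1}{2n} \lVert y - X\beta \rVert_2^2
  + \lambda \left( \alpha \lVert \beta \rVert_1
  + \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 \right)
\right\}
```

Here n is the number of observations, lambda sets the overall strength of the penalty, and alpha (between 0 and 1) sets the mix: alpha = 1 gives the lasso and alpha = 0 gives ridge regression. The L1 term can shrink the weights of unhelpful queries to exactly zero, while the L2 term keeps the weights of correlated queries stable.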
Google Flu Trends is likely to remain a useful complement to traditional epidemiological surveying. But Google and other companies looking to leverage data harvested from the Internet might need to start treating what they gather not as low-hanging fruit but as something already poisoned.