October 25, 2013
Google made its name by counting online links as votes for the most relevant answer to search queries. In response, Internet users began gaming Google's election count by voting early and often -- creating extra links to make their websites rank higher in Google's index -- and Google was forced to take countermeasures to defend against manipulation.
Yet the company had to learn this lesson again with its Flu Trends website. Google created Flu Trends in 2008, based on the insight that searches about the flu have some correlation with the number of people dealing with the flu.
"[I]f we tally each day's flu-related search queries, we can estimate how many people have a flu-like illness," the company said in 2008 when it launched the service.
Google's laudable goal was to provide people with more timely information about the spread of the flu than traditional epidemiological surveillance data compiled by the Centers for Disease Control. But the company had to revise its approach to ensure that its data, in addition to being timely, is accurate.
During the 2012-13 flu season, Google Flu Trends got it wrong. As the company documents in a recently published analysis of its approach to disease tracking, Google overestimated the incidence of flu in the U.S. by more than six percentage points, almost six times higher than the highest estimation error seen since the site launched. In the week of Jan. 13, 2013, Google put the incidence of flu in the U.S. at 10.56% of the population. The CDC put the number at only 4.52%.
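The size of that miss can be checked directly from the two figures above. The short Python sketch below does the arithmetic; the variable and function names are illustrative and are not taken from Google's analysis.

```python
# Figures reported above for the week of Jan. 13, 2013.
google_estimate_pct = 10.56  # Google Flu Trends estimate, % of U.S. population
cdc_estimate_pct = 4.52      # CDC figure, % of U.S. population


def overestimate(google_pct, cdc_pct):
    """Return (gap in percentage points, ratio of Google's figure to the CDC's)."""
    return google_pct - cdc_pct, google_pct / cdc_pct


gap_points, ratio = overestimate(google_estimate_pct, cdc_estimate_pct)
print(f"Gap: {gap_points:.2f} percentage points")          # Gap: 6.04 percentage points
print(f"Google's estimate was {ratio:.2f}x the CDC's")     # Google's estimate was 2.34x the CDC's
```

In other words, the gap of "more than six percentage points" means Google's estimate was more than double the CDC's number.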
What went wrong? Two words: The media. Google says it has concluded that its disease-detection algorithms "were susceptible to heightened media coverage."
This probably wasn't a difficult conclusion to reach because Google has been aware of the problem since it launched Flu Trends. Following the website's debut in 2008, the New York Times published an article about Google Flu Trends and included an example query that Google was actually monitoring in its flu prediction model. As a result, many Internet users tried that search term, driving up query volume and skewing Google's results.
The lesson here is rich with irony: To effectively assess data from a public source, the algorithm must remain private, or someone will attempt to introduce bias.
Google has been relying on "spike detectors" to compensate for surges of "inorganic" search traffic. But it turns out that Google underestimated the influence of the media. The company anticipated that search query spikes would last three days to a week. During the 2012-13 flu season, they lasted for months.
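Google has not published its spike detectors in this article, so the following Python sketch is only an assumption about how such a detector might work: it flags any day whose query count far exceeds the median of the preceding week. The function name, the seven-day window, the 2x threshold, and the sample counts are all hypothetical.

```python
from statistics import median


def find_spikes(daily_counts, window=7, threshold=2.0):
    """Return indices of days whose count exceeds threshold x the trailing median.

    Hypothetical illustration only; not Google's actual spike detector.
    """
    spikes = []
    for i in range(window, len(daily_counts)):
        baseline = median(daily_counts[i - window:i])
        if baseline > 0 and daily_counts[i] > threshold * baseline:
            spikes.append(i)
    return spikes


# Made-up daily query counts with a two-day surge on days 8 and 9.
counts = [100, 104, 98, 101, 99, 103, 100, 102, 350, 340, 105, 101]
print(find_spikes(counts))  # [8, 9]
```

A detector like this is tuned for short surges. With a seven-day window, a surge that persists beyond about four days pulls the trailing median up with it and stops being flagged, which is one plausible way an approach built for week-long spikes could miss one that lasts months.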
Google also notes in its analysis that it did not update its flu prediction model annually because the one built in 2009 had been performing well.
So to make its flu forecast more accurate, Google adjusted its spike detection algorithm to better assess the influence of the media. It also modified its algorithm by applying a statistical method called Elastic Net. Using these techniques, the gap between Google Flu Trends and CDC data last season would have been only about one percentage point.
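Elastic Net is a regularized regression technique that penalizes a model's coefficients with a blend of two terms: the sum of their absolute values (the L1, or lasso, penalty) and the sum of their squares (the L2, or ridge, penalty). The article does not say how Google configured it, so the Python sketch below only shows the penalty itself in one common parameterization; the names `lam` and `l1_ratio` and the sample coefficients are assumptions for illustration.

```python
def elastic_net_penalty(coefs, lam=1.0, l1_ratio=0.5):
    """Elastic Net penalty: lam * (l1_ratio * L1 + 0.5 * (1 - l1_ratio) * L2).

    l1_ratio=1.0 gives a pure lasso penalty; l1_ratio=0.0 gives a pure ridge penalty.
    """
    l1 = sum(abs(c) for c in coefs)   # sum of absolute values
    l2 = sum(c * c for c in coefs)    # sum of squares
    return lam * (l1_ratio * l1 + 0.5 * (1.0 - l1_ratio) * l2)


# Hypothetical weights for three search-query features.
weights = [0.5, -0.2, 0.0]
print(elastic_net_penalty(weights))                 # even blend of L1 and L2
print(elastic_net_penalty(weights, l1_ratio=1.0))   # lasso only
print(elastic_net_penalty(weights, l1_ratio=0.0))   # ridge only
```

A fitting procedure adds this penalty to the prediction error it is minimizing. The L1 part tends to push the weights of unhelpful features to exactly zero, while the L2 part keeps the weights of correlated features from swinging wildly, a property that matters when many search queries rise and fall together.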
Google Flu Trends is likely to remain a useful complement to traditional epidemiological surveying. But Google and other companies looking to leverage data harvested from the Internet might need to start treating what they gather not as low-hanging fruit but as something already poisoned.
About the Author(s)
Editor at Large, Enterprise Mobility
Thomas Claburn has been writing about business and technology since 1996, for publications such as New Architect, PC Computing, InformationWeek, Salon, Wired, and Ziff Davis Smart Business. Before that, he worked in film and television, having earned a not particularly useful master's degree in film production. He wrote the original treatment for 3DO's Killing Time, a short story that appeared in On Spec, and the screenplay for an independent film called The Hanged Man, which he would later direct. He's the author of a science fiction novel, Reflecting Fires, and a sadly neglected blog, Lot 49. His iPhone game, Blocfall, is available through the iTunes App Store. His wife is a talented jazz singer; he does not sing, which is for the best.