Google Computes News Quality

A recently filed patent application suggests that Google is taking steps to promote news produced by major media companies on Google News.

Thomas Claburn, Editor at Large, Enterprise Mobility

November 7, 2009

Whenever a newspaper dies, Google turns up on the list of suspects.

The evidence of Google's involvement tends to be sketchy. A close examination of the crime scene typically points to a different villain -- the classified revenue killer known as Craigslist, parasitic news sites that siphon potential visitors, declining subscription and ad revenue, management that can't adapt, or the hyper-competition created by the Internet's ability to collapse distance and divert attention.

But Google nonetheless has been forced to defend itself. In May, Marissa Mayer, Google's VP of search products and user experience, testified before the Senate Subcommittee on Communications, Technology, and the Internet that Google is the hand that feeds media companies, channeling over one billion clicks every month to online publishers through Google Search and Google News.

She suggested that Wikipedia, with its constantly updated articles, might offer a better model for journalism in the Internet age than a series of separate articles. And she proposed that online publishers might be failing their readers by presenting them with Web pages that lack engaging social features.

At a Web 2.0 Summit panel discussion last month, Mayer delivered a similar message: Google is a friend, not a foe, to publishers.

But Mayer's olive branch was rebuffed by fellow panelist Robert Thomson, managing editor of The Wall Street Journal. "Google wants to be the home page. It wants to be the front page. And Marissa unintentionally encourages promiscuity," he said.

It was a provocative instance of metonymy -- Mayer standing for Google -- that impugned by double entendre, even as Thomson revealed the real gripe of the old guard: Google News threatens the newspaper editor as the arbiter of what's newsworthy.

It's about the money too, of course, but that follows from influence and respect. News organizations that spend a lot on reporting don't want to be lumped into the same basket as all the other news outlets that write reports based on their reporting. They chafe at the sight of bloggers who quote their reports liberally, add two cents, and collect more than that in Google AdSense revenue.

Google has been addressing these issues for years now. Last week, it took another step to improve the quality of Google News through its guidelines for Google News Sitemaps, files published by news sites to help Google index their content. The revised guidelines require that news publishers label content PressRelease, Satire, Blog, OpEd, Opinion, or UserGenerated, if appropriate.

Google doesn't explain why it wants this information. But presumably these identifiers can be used to help make sure a blog post or Wikipedia entry isn't being featured in a prominent position on Google News at the expense, say, of a Wall Street Journal article.

A Google patent application filed last week provides a clearer explanation of the company's goals for Google News.

"Systems And Methods For Improving The Ranking Of News Articles" explains that Internet users rely on search engines to find news, but "the news sources associated with these hits, however, may not be of uniform quality. For example, CNN and BBC are widely regarded as high quality sources of accuracy of reporting, professionalism in writing, etc., while local news sources, such as hometown news sources, may be of lower quality."

The patent application describes a way to compute the quality of a news article.

Google recently launched a Google News section called Spotlight that attempts to emphasize "news and in-depth pieces of lasting value." The articles appearing in Spotlight could be, to some degree, the product of the technology described in Google's patent application.

This is what Google had to say about its patent application: "We file patent applications on a variety of ideas that our employees come up with. Some of those ideas later mature into real products or services, some don't. Prospective product announcements should not necessarily be inferred from our patent applications. As for Google News, stories are selected and ranked by computers based on more than a hundred factors, including the freshness, location, relevance and diversity of their content. We're always looking for ways to make the algorithm even better."

What does Google measure to determine the quality of an article? The patent application describes the following metrics:

* The number of articles produced by the news source during a given time period.

This may be counted in terms of the number of articles a news source produces in a set period or the number of original sentences.

* The average length of an article from the news source.

Article length matters, though it's not clear whether it matters in all cases or only when an article exceeds the publication's average length.

* The importance of coverage from the news source.

A measurement of a story's significance based on the number of other news sources covering the story.

* The breaking news score.

The patent application explains that "the breaking score is a number that is a high value if the article was published soon after the news event happened and a low value if the article was published after much time had elapsed since the news story broke." This threshold value appears to vary.

* The usage pattern.

Google measures how much traffic gets referred to news articles through Google News and generates a value that reflects story popularity. The patent application suggests Google may normalize that value based on opportunities to click on a link rather than actual clicks, in order to compensate for the natural tendency to click on familiar news brands.

* Human opinion.

Google has acknowledged that human evaluation plays a role in the prominence of news articles on Google News. Among the factors weighed are opinion polls, evaluations by other news organizations (as measured by Pulitzer Prizes, for example), and the age of a publication.

* Circulation statistics.

Online and offline circulation figures may be used as a metric for article quality determination.

* The size of the staff associated with the news source.

This appears to be based on a count of the number of different bylines at a given publication. In June, Google began highlighting the names of journalists in Google News stories.

* The number of news bureaus associated with the news source.

Google's patent offers no explanation as to how this number is computed.

* The number of original named entities the news source produces within a cluster of articles.

Google defines a named entity as a person, place, or organization. Google weighs named entities as a possible sign of original reporting. "If a news source generates a news story that contains a named entity that other articles within the same cluster (hence on the same topic) do not contain, this may be an indication that the news source is capable of original reporting," the patent application explains.

* The breadth of coverage.

A measure of the range of categories -- technology, arts, politics -- covered by a news organization.

* International diversity.

A measure of the global popularity of a news source.

* Writing style.

A rating for articles based on spelling, grammar, and reading level.
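
Two of the metrics listed above are concrete enough to sketch in code. The Python below is an illustration only, not Google's implementation: the function names, the sample cluster, and the choice of a simple clicks-per-impression ratio are all assumptions. It counts the named entities that only one source in a cluster mentions, and it normalizes clicks by the number of opportunities to click.

```python
from typing import Dict, Set


def original_entity_counts(cluster: Dict[str, Set[str]]) -> Dict[str, int]:
    """For each source in a cluster, count named entities no other source mentions."""
    counts = {}
    for source, entities in cluster.items():
        # Collect every entity mentioned by the other sources in the cluster.
        others: Set[str] = set()
        for other_source, other_entities in cluster.items():
            if other_source != source:
                others |= other_entities
        counts[source] = len(entities - others)
    return counts


def normalized_usage(clicks: int, impressions: int) -> float:
    """Clicks per opportunity to click; 0.0 if the link was never shown."""
    return clicks / impressions if impressions > 0 else 0.0


# Hypothetical cluster: three sources covering the same story.
cluster = {
    "source_a": {"Marissa Mayer", "Google", "Senate"},
    "source_b": {"Google", "Senate"},
    "source_c": {"Google", "Robert Thomson"},
}

print(original_entity_counts(cluster))
print(normalized_usage(clicks=50, impressions=1000))
```

In this made-up cluster, the sources that name a person nobody else mentions score higher than the one that repeats only shared names, which is the signal of original reporting the patent application describes.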

These aren't the only metrics. As Google's spokesperson stated, there are over 100. Considered together, these metrics allow Google to determine which articles represent the kind of journalism that it wants to promote.
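
How the metrics are combined is not spelled out here, so the following is a guess at the simplest possible approach: a weighted average of scores that have already been scaled to the range 0 to 1. The weights, metric names, and sample sources are hypothetical and chosen only to show the mechanics.

```python
from typing import Dict

# Hypothetical weights; the article does not say how Google weighs each metric.
WEIGHTS: Dict[str, float] = {
    "article_count": 0.2,
    "avg_length": 0.1,
    "breaking_score": 0.3,
    "original_entities": 0.4,
}


def source_quality(metrics: Dict[str, float],
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average of metrics scaled to [0, 1]; missing metrics count as 0."""
    total_weight = sum(weights.values())
    weighted = sum(w * metrics.get(name, 0.0) for name, w in weights.items())
    return weighted / total_weight


# Made-up scores for two imaginary sources.
wire_service = {
    "article_count": 0.9,
    "avg_length": 0.6,
    "breaking_score": 0.8,
    "original_entities": 0.7,
}
aggregator = {
    "article_count": 0.9,
    "avg_length": 0.3,
    "breaking_score": 0.2,
    "original_entities": 0.0,
}

print(source_quality(wire_service))
print(source_quality(aggregator))
```

Under these assumed weights, a source that publishes early and introduces new named entities outranks one that produces the same volume of articles but adds nothing original.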

Google declined to comment on whether it was using the techniques described in its patent application. But it appears that Google News is increasingly favoring major news outlets, which presumably score well in algorithmic quality assessments.

According to News Knife, a site that tracks Google News articles, the top 20 news sites have been appearing more frequently on the Google News home page. In the period from September 2007 through August 2009, the top 20 news sites rose from 40% of the sites on the Google News home page to 60%.

Perhaps Thomson should thank Mayer for rekindling the interest of readers who'd deserted him.

About the Author(s)

Thomas Claburn

Editor at Large, Enterprise Mobility

Thomas Claburn has been writing about business and technology since 1996, for publications such as New Architect, PC Computing, InformationWeek, Salon, Wired, and Ziff Davis Smart Business. Before that, he worked in film and television, having earned a not particularly useful master's degree in film production. He wrote the original treatment for 3DO's Killing Time, a short story that appeared in On Spec, and the screenplay for an independent film called The Hanged Man, which he would later direct. He's the author of a science fiction novel, Reflecting Fires, and a sadly neglected blog, Lot 49. His iPhone game, Blocfall, is available through the iTunes App Store. His wife is a talented jazz singer; he does not sing, which is for the best.
