Google Data, Statistics, and the Semanticized Web

Seth Grimes, Contributor

September 20, 2010

I imagine that Google employs many hundreds of data scientists, folks whose job is to study and turn to good use the huge masses of data the search-advertising-application services giant, and its users, generate. Each document indexed, each search, each ad served, each service call creates data. This data is used to create a better Google: easier to use, faster, more accurate, effective, and functional, and yes, more profitable. Google Instant, out a couple of weeks ago, is the latest initiative toward these ends. I'm salivating (figuratively) at the thought of the new data it generates.
This article uses Google Instant as a jumping-off point, so if you're not yet familiar with Instant, the video mashup Google Instant with Bob Dylan will show you how it works; it's worth 46 seconds of your time.

Googlers Alon Halevy, Peter Norvig, and Fernando Pereira wrote last year in The Unreasonable Effectiveness of Data,

The first lesson of Web-scale learning is to use available large-scale data rather than hoping for annotated data that isn't available. For instance, we find that useful semantic relationships can be automatically learned from the statistics of search queries and the corresponding results or from the accumulated evidence of Web-based text patterns and formatted tables, in both cases without needing any manually annotated data.

This and similar statements -- and they certainly fit my worldview as an analytics guy -- caused a bit of a stir. They were rightfully seen as a dig at the hyper-annotated Semantic Web -- not that SemWeb, unlike the Google and Bing and Search Engine X semanticized web, exists in any significant usable form; not that any right-thinking semanticist isn't interested in and respectful of Semantic Web technologies and data assets (witness Google's acquisition of Metaweb); and not that Freebase is truly part of any large-scale Semantic Web. Lots of nots there.
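By way of a toy illustration of the Halevy, Norvig, and Pereira point -- and emphatically not a description of Google's method -- here is roughly what "learned from the statistics of search queries" can look like: a simple pointwise mutual information (PMI) calculation over an invented query log, surfacing related terms with no annotation at all.

```python
# Toy sketch: inferring term relatedness from raw query statistics alone.
# The query log and the PMI approach are illustrative assumptions, not
# anything Google has published about its own pipeline.
import math
from collections import Counter
from itertools import combinations

query_log = [                      # hypothetical stand-in data
    "cheap flights paris",
    "cheap hotels paris",
    "paris hotels reviews",
    "flights to london",
    "london hotels",
]

term_counts = Counter()
pair_counts = Counter()
for q in query_log:
    terms = set(q.split())
    term_counts.update(terms)
    pair_counts.update(frozenset(p) for p in combinations(sorted(terms), 2))

n = len(query_log)

def pmi(a, b):
    """PMI of two terms co-occurring in the same query; higher means more related."""
    joint = pair_counts[frozenset((a, b))] / n
    if joint == 0:
        return float("-inf")
    return math.log(joint / ((term_counts[a] / n) * (term_counts[b] / n)))

print(pmi("paris", "hotels"))   # terms that share queries score positively...
print(pmi("paris", "london"))   # ...terms that never co-occur fall to -inf
```

Scale the invented five-line log up to billions of real queries and you have the flavor of the argument: the relationships emerge from counting, not from anyone's markup.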

How would you capture, in mark-up, statistics about:

  • The number of characters a searcher types before taking advantage of Google Suggest's as-you-type query completions -- "Did You Mean" delivered on the fly -- and choosing a suggested search query. (A toy sketch of this measurement follows the list.)

  • Typing patterns! How fast do searchers type certain words, common or uncommon, and particular sequences of characters? I could see this data as useful in password-security analyses.

  • Do people correct themselves in response to suggestions and Instant results? Does this correlate with operating system... e.g. so that the Android on-screen keyboard can be improved?

  • How does use of suggestions and of Instant results vary by search and by searcher demographics as known (if the searcher is logged in), by detectable information (browser and operating system, location indicated by IP address or reported by a mobile device), or as inferred from query type (searches, say, on "menstrual cramps" vs. "Rogaine")?
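None of these measurements calls for markup at all; they fall straight out of instrumented interaction logs. Here is a toy sketch of the first bullet, with a log format and numbers invented purely for illustration:

```python
# Hypothetical sketch: characters typed before a searcher accepts a
# Google Suggest completion. The log records and field layout are my own
# invention; real instrumentation would look different.
from statistics import mean, median

interaction_log = [
    # (characters typed when the suggestion was accepted, accepted query)
    (3, "weather boston"),
    (5, "menstrual cramps"),
    (2, "facebook"),
    (7, "rogaine side effects"),
    (4, "rogaine"),
]

prefix_lengths = [chars for chars, _ in interaction_log]
print("mean chars before acceptance:  ", mean(prefix_lengths))
print("median chars before acceptance:", median(prefix_lengths))

# The same records can be sliced by query, location, OS, and so on --
# exactly the kind of segmentation the bullets above describe.
```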

There is no end of interesting questions, and of course the search-engine optimization folks are working them from the outside, seeking to reverse-engineer the Google experience. Complicating the job for SEO mavens is that the service tailors suggestions to the searcher's location. These features are, again, data-driven. Google is making strong efforts to deliver location-sensitive search results, guided I'm sure by statistical analysis of past, location-classified search interactions. Further, the volume of this data is huge. The overhead implied by highly verbose RDF triple storage of semantically uniform data renders analysis impossible; data tables, nowadays in a columnar, parallel DBMS, are the way to go for big data, or try a hybrid such as Google's Bigtable.
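To make that overhead claim concrete, here is a back-of-the-envelope sketch -- my own toy comparison, not a benchmark -- of the same semantically uniform records stored as RDF-style triples and as columns:

```python
# Toy comparison of storage shapes for semantically uniform data:
# RDF-style triples repeat a subject and predicate for every cell value,
# while a columnar layout stores each attribute as one dense array.
records = [
    {"query": "rogaine", "chars_typed": 4, "os": "Android"},
    {"query": "weather boston", "chars_typed": 3, "os": "Windows"},
]

# Triple representation: one (subject, predicate, object) per attribute value.
triples = [
    (f"interaction:{i}", predicate, value)
    for i, rec in enumerate(records)
    for predicate, value in rec.items()
]

# Columnar representation: one array per attribute.
columns = {key: [rec[key] for rec in records] for key in records[0]}

print(len(triples), "triples, each repeating a subject and a predicate string,")
print("versus", len(columns), "dense columns of", len(records), "values each")
```

With two records the difference is trivial; with billions of uniform interaction records, carrying a subject and predicate on every cell is exactly the verbosity the column-oriented world avoids.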

Facts are situational. How they are filtered, framed, and presented does and should depend on user intent. A Web of Linked Data? Yes, that would be great. True, data can be described in terms that are, for all practical purposes, universal and invariant. But human communications -- both the medium and the message, the means of delivering and receiving information (including, for instance, the number of characters typed in a query) and the information content itself -- are dependent on context. A Semantic Web that prejudges user intent cannot anticipate that context.

The description of a talk given last Friday by Unreasonable Effectiveness co-author Peter Norvig, director of research at Google, includes the text:

In the modern day, it is more common to think of language modeling as an exercise in probabilistic inference from data: We observe how words and combinations of words are used, and from that build computer models of what the phrases mean. This approach is hopeless with a small amount of data, but somewhere in the range of millions or billions of examples, we pass a threshold, and the hopeless suddenly becomes effective, and computer models sometimes meet or exceed human performance.

I will relate this blurb back to Google Instant by citing Norvig's comment on Daniel Tunkelang's Noisy Channel blog: "If they want the head, a two-word query is sufficient conversation; if they want the tail it will take more." That was in April 2009; perhaps Norvig would, this month, rephrase it as "If they want the head, a two-letter query is sufficient." The data will tell. Folks who see the immense inherent business value in the many, many questions like this will work it to find out.
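A toy version of how the data would tell is easy to sketch. The query frequencies below are invented, and the approach is my own illustration rather than anything Google has described: given a frequency table, how many characters must a searcher type before the top-ranked completion of the prefix is the query they actually want?

```python
# Toy sketch of the head-vs-tail point. Frequencies are invented.
query_freq = {
    "facebook": 1_000_000,            # head query
    "fantasy football": 400_000,
    "faroe islands ferry times": 40,  # tail query
}

def chars_needed(target, freq_table):
    """Shortest prefix of `target` whose top-ranked completion is `target` itself."""
    for n in range(1, len(target) + 1):
        prefix = target[:n]
        candidates = {q: f for q, f in freq_table.items() if q.startswith(prefix)}
        if candidates and max(candidates, key=candidates.get) == target:
            return n
    return len(target)

print(chars_needed("facebook", query_freq))                   # head: 1 character
print(chars_needed("faroe islands ferry times", query_freq))  # tail: 3 characters
```

Head queries win the ranking after a character or two; tail queries need more of the string, which is exactly the distinction Norvig is drawing, now measurable keystroke by keystroke.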

If you haven't already, please check out Smart Content, a conference I'm organizing, October 19 in New York. Smart Content focuses on content analytics, which I'll define as the application of semantic and analytical technologies to create findable, reusable, enriched content -- news, social, and enterprise -- boosting business value for content producers and consumers alike. We've extended the early-bird registration discount through Friday, September 24. I'm grateful, by the way, for Intelligent Enterprise's media sponsorship!


About the Author(s)

Seth Grimes

Contributor

Seth Grimes is an analytics strategy consultant with Alta Plana and organizes the Sentiment Analysis Symposium. Follow him on Twitter at @sethgrimes.
