4/14/2008
04:21 PM

Google Comes Knocking In Search Of Hidden Data

By crawling using HTML forms (and abiding by robots.txt), Google claims it leads search engine users to documents that otherwise would not be easily found -- but privacy concerns remain.

Google on Friday said that it has been testing ways to index data that is normally hidden from search engine crawlers, a change that should broaden the information available through Google.

The so-called "hidden Web" that Google has begun indexing refers to data beyond static Web pages, such as pages generated dynamically from a database in response to input submitted through a Web form.

"This experiment is part of Google's broader effort to increase its coverage of the Web," Google engineers Jayant Madhavan and Alon Halevy said in a blog post. "In fact, HTML forms have long been thought to be the gateway to large volumes of data beyond the normal scope of search engines. The terms Deep Web, Hidden Web, or Invisible Web have been used collectively to refer to such content that has so far been invisible to search engine users. By crawling using HTML forms (and abiding by robots.txt), we are able to lead search engine users to documents that would otherwise not be easily found in search engines, and provide Webmasters and users alike with a better and more comprehensive search experience."

Robots.txt is a file Web publishers place on their servers that specifies what data can or can't be accessed by crawling programs, should those programs choose to abide by its rules.
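
As a rough illustration of how a well-behaved crawler might consult such a file, here is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleBot` user agent, and the example.com URLs are all hypothetical, and this is not a description of how Googlebot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to stay out of /search.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is an assumed user-agent name for illustration only.
allowed_home = parser.can_fetch("ExampleBot", "http://example.com/index.html")
allowed_search = parser.can_fetch("ExampleBot", "http://example.com/search?q=books")

print("home page allowed:", allowed_home)
print("search results allowed:", allowed_search)
```

Nothing in the file enforces these rules; the check only has an effect if the crawler runs it and honors the result.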

In their post, Madhavan and Halevy twice mention that Google follows robots.txt rules, perhaps to allay fears that Google's more curious crawler will expose sensitive data. Google's wariness of being seen as an invader of privacy is underscored by the fact that its two engineers characterize the Google crawler as "the ever-friendly Googlebot."

"Needless to say, this experiment follows good Internet citizenry practices," Madhavan and Halevy said in their post. "Only a small number of particularly useful sites receive this treatment, and our crawl agent, the ever-friendly Googlebot, always adheres to robots.txt, nofollow, and noindex directives. That means that if a search form is forbidden in robots.txt, we won't crawl any of the URLs that a form would generate. Similarly, we only retrieve GET forms and avoid forms that require any kind of user information."

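The behavior the engineers describe, generating URLs from a GET form and skipping any that robots.txt forbids, could be sketched roughly as follows. The form action, field names, candidate values, and the `ExampleBot` user agent are assumptions made for illustration; Google has not published its implementation in this post, and how it chooses input values is not covered here.

```python
from itertools import product
from urllib.parse import urlencode
from urllib.robotparser import RobotFileParser


def candidate_urls(action, fields):
    """Yield one URL per combination of candidate values for a GET form.

    `fields` maps each input name to a list of values to try.
    """
    names = list(fields)
    for combo in product(*(fields[name] for name in names)):
        yield f"{action}?{urlencode(dict(zip(names, combo)))}"


def crawlable_urls(action, fields, robots_txt, agent="ExampleBot"):
    """Keep only the generated URLs that the site's robots.txt permits."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        url
        for url in candidate_urls(action, fields)
        if parser.can_fetch(agent, url)
    ]


# Hypothetical search form with two inputs.
FORM_ACTION = "http://example.com/listings"
FORM_FIELDS = {"state": ["CA", "NY"], "type": ["house"]}

open_site = crawlable_urls(FORM_ACTION, FORM_FIELDS, robots_txt="")
closed_site = crawlable_urls(
    FORM_ACTION, FORM_FIELDS, robots_txt="User-agent: *\nDisallow: /listings\n"
)
```

Because a GET form encodes its inputs in the URL's query string, each submission corresponds to an ordinary address that can be checked against robots.txt like any other page.
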
Given that Google has been, and continues to be, accused of disregarding privacy concerns -- a charge it has rebutted and continues to rebut -- such prudence is understandable.

In a 2001 paper, Michael K. Bergman, CTO of BrightPlanet, estimated that the hidden Web was 400 to 550 times larger than the exposed Web. Though it's not immediately clear whether this ratio still holds after seven years, Google's decision to explore the hidden Web more thoroughly should make its massive index even more useful, and perhaps even more controversial.

Indeed, not everyone has been won over. In a blog post, Robin Schuil, a software developer at eBay, criticized what Google was doing for creating an extra burden on sites.

He said it's "really awfully close to what some of the search engine spammers do: targeted scraping of Web sites."
