Surpassing even E-mail, search has become the Internet's killer app. Many PC users don't go a day without logging on to Google. And selling ads linked to search results has become a growth business on the Web. Just this month, Microsoft previewed software it plans to test this year that links Web searches to ads, joining Google and Yahoo. Last week, IAC/InterActiveCorp, Barry Diller's E-commerce company, bought search-engine company Ask Jeeves Inc. for $1.9 billion. Inside companies, workers spend perhaps 30% of their time looking for information, according to an IBM estimate.
But what works for finding things on the Web--the keyword search engines offered by America Online, Google, Microsoft, Yahoo, and others--often fails inside companies' networks, which contain not only familiar Web pages, Office documents, and Adobe files, but more obscure data that lives in specialized mainframe databases or CAD systems, some of which date back decades.
Google may have the market lead looking for Web pages, but fast-growing business and government investment in emerging IT areas such as Internet phone calls, electronic medical records, and anti-terrorism technology is driving demand for new ways of searching digital information. The goal is to extract information from databases, Web pages, documents, or audio and video clips automatically; recognize the names of people, places, organizations, dates, and dollar amounts; and find the relationships among them. Mining sounds and images for meaning is also important as companies expand call centers and switch to Internet-based phone calls and as the government pours money into IT for intelligence and homeland security.
"We have very good systems for counting how often keywords appear on pages and using that to rank documents," says Christopher Manning, assistant professor of computer science and linguistics at Stanford University. But search algorithms still can't glean the same context that people can with just a glance. "Human beings can get an enormous amount of information by looking at a three-line snippet of a page that somehow computers aren't getting," he says. New research aims to help computers understand what they're missing.
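To make the keyword-counting approach Manning describes concrete, here is a minimal sketch in Python. It is a toy illustration, not any vendor's actual ranking code: the function names, file names, and sample text are all invented for the example, and real engines add many refinements on top of raw counts.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank_by_keyword_count(query, documents):
    """Score each document by how often the query's words appear in it.

    `documents` maps a (hypothetical) file name to its text. Documents that
    contain none of the query words are left out of the results.
    """
    query_terms = tokenize(query)
    scored = []
    for name, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((name, score))
    # Highest count first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Invented sample documents, for illustration only.
docs = {
    "memo.txt": "Quarterly sales report: sales rose in the East region.",
    "cad_notes.txt": "Bracket drawing revision notes for the East plant.",
    "faq.txt": "How to file an expense report.",
}
results = rank_by_keyword_count("sales report", docs)
print(results)
```

The sketch also shows the limitation Manning points to: the expense-report FAQ matches the query only because it shares the word "report," something a person would dismiss at a glance.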
The first place most PC users still look for search results is Google or Yahoo. Their keyword search model works well on the Internet because it uses the links people build between Web pages as votes for relevance. It also can be used across corporate networks of computers, even though business documents aren't linked together like Web pages. The problem is, having become conditioned to Google's simplicity and speed, most people expect the most relevant information right away, with minimal effort. Inside business networks, that equation doesn't always compute.
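The idea of treating links as votes can be sketched with the simplified, textbook form of PageRank shown below. This is an assumption-laden illustration, not Google's production algorithm: the damping value of 0.85, the iteration count, the function name, and the four-page sample "web" are all choices made for the example.

```python
def link_votes(links, damping=0.85, iterations=50):
    """Textbook-style PageRank: each page splits its score among its links.

    `links` maps a page to the list of pages it links to. The damping
    factor and iteration count are illustrative defaults.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small baseline score.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page's "vote" is divided evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Invented four-page web: three pages link to "c", and "c" links back to "a".
web = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_votes(web)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))

# A set of documents with no links at all, as on many corporate networks.
intranet = link_votes({"plan.doc": [], "budget.xls": []})
print(intranet)
```

In the second call every document ends up with the same score, which illustrates why link-based voting alone has little to say about business documents that aren't linked together.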
Google's answer so far is a special server "appliance" for companies that can index their data and expose it via Google's familiar user interface. In response to competitors who say the company's bread-and-butter PageRank algorithm doesn't work as well for data as for Web documents, Google Enterprise general manager Dave Girouard says PageRank relies on more than 100 variables to decide what's relevant, and only one of those measures link structure. For businesses that buy the search appliance, other variables are given more weight. That means Google can serve both the mass market of PC users and enterprise customers who buy its data-searching servers. "A lot of people dismiss it entirely," Girouard says of PageRank, "but it is certainly of value."
Google is working on algorithms that can analyze audio files and video clips. It's also refining software that can sort data from different IT systems into easy-to-understand categories, a technique used on its Google News site.
There's certainly no shortage of data for businesses to reckon with. According to a 2003 study by the University of California, Berkeley's computer-science school, the volume of data on the Web tripled between 2000 and 2003, from less than 50 terabytes to 167 terabytes. In 2002, print, film, magnetic, and optical storage media yielded about 5 quintillion (that's 5 times 10 to the 18th power) bytes of new data, 37,000 times the amount of information in the Library of Congress. The trend--30% annual growth in the volume of information produced--shows no sign of slowing.
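For a sense of what those figures imply, the short calculation below works through the arithmetic. It uses only the numbers quoted above; the variable names are invented for the example, and the Library of Congress size is merely what the 37,000-to-1 ratio implies, not a figure taken from the study.

```python
import math

# Figures quoted in the article.
annual_growth = 0.30            # 30% annual growth in information produced
new_data_bytes = 5e18           # about 5 quintillion bytes of new data in 2002
library_multiple = 37_000       # "37,000 times" the Library of Congress

# At a steady 30% a year, how long until the annual volume doubles?
doubling_years = math.log(2) / math.log(1 + annual_growth)

# Growth factor after three years of compounding at 30%.
three_year_factor = (1 + annual_growth) ** 3

# Library of Congress size implied by the article's ratio (derived, not sourced).
implied_library_bytes = new_data_bytes / library_multiple

print(f"Doubling time: {doubling_years:.2f} years")
print(f"Three-year growth factor: {three_year_factor:.2f}x")
print(f"Implied Library of Congress size: {implied_library_bytes / 1e12:.0f} terabytes")
```

At that pace the yearly output of new information doubles in well under three years.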