The Perfect Search

Google-style search is all right for some, but greater accuracy in the enterprise demands a mix of techniques, including content tagging and taxonomy development, and technologies such as entity, concept and sentiment extraction tools.

Sometimes Web masters help determine relevancy. Google's search engine creates page ranks based on how frequently people link to a given piece of content. The downside to this is that most companies' documents and data sources have little or no record of content linking. "That this is lost on most people is a triumph of branding and makes page-rank-free Google somewhat akin to caffeine-free Jolt as a product," says Dave Kellogg, CEO of Mark Logic.
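To see why link-based ranking loses its footing inside the firewall, consider a deliberately simplified sketch. It is not Google's actual algorithm; it only counts inbound links, and the page names and the `inbound_link_scores` function are made up for illustration.

```python
from collections import Counter


def inbound_link_scores(links, pages):
    """Score each page by the number of distinct pages that link to it."""
    # Deduplicate (source, target) pairs and ignore self-links.
    counts = Counter(target for source, target in set(links) if source != target)
    return {page: counts.get(page, 0) for page in pages}


# On the public Web, links give the ranker something to work with.
web_links = [("a", "c"), ("b", "c"), ("c", "a")]
web_scores = inbound_link_scores(web_links, ["a", "b", "c"])

# A typical document repository has no links, so every score ties at zero.
intranet_scores = inbound_link_scores([], ["policy.doc", "claims.pdf"])
```

With no links between documents, every item gets the same score, and the ranking carries no information.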

Clustering tools, such as those from Engenium or Vivisimo, create an ad hoc taxonomy by grouping search results into categories on the fly (search engines from Inxight Software and Siderean also cluster results). With clustering, a search for the term "life insurance" on an insurance company's site would display results grouped under headings such as Whole Life, Term Life and Employee Benefits. It's a fast and efficient way to categorize content, but it's not always accurate; there's no consistent set of categories, and the results can be strange because there's no human involvement.
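A rough sketch of on-the-fly grouping follows. It assumes a very simple rule (label each result with the non-query word it shares with the most other results), which is far cruder than what commercial clustering engines do. The function names, stopword list and sample titles are hypothetical.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "for", "of", "the", "to"}


def tokens(title, query_terms):
    """Return the distinct words in a title, minus stopwords and query terms."""
    words = re.findall(r"[a-z]+", title.lower())
    return {w for w in words if w not in STOPWORDS and w not in query_terms}


def cluster_results(query, titles):
    """Group titles under the remaining word that appears in the most titles."""
    query_terms = set(query.lower().split())
    token_sets = [tokens(title, query_terms) for title in titles]
    doc_freq = Counter(word for words in token_sets for word in words)
    clusters = defaultdict(list)
    for title, words in zip(titles, token_sets):
        # Ties on frequency are broken alphabetically so the output is stable.
        label = max(words, key=lambda w: (doc_freq[w], w)) if words else "other"
        clusters[label].append(title)
    return dict(clusters)


titles = [
    "Whole Life Insurance Rates",
    "Term Life Insurance Quotes",
    "Whole Life Policy Guide",
    "Term Life Insurance for Employees",
]
clusters = cluster_results("life insurance", titles)
```

Because the labels are derived from whatever results come back, a different query or a different result set yields different categories, which is the inconsistency described above.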

Combining Search Tools

The next step to intelligent search is to apply text analytics tools. Several small companies are providing analytics software for entity, concept and sentiment extraction.

Sentiment extraction, or sentiment monitoring, the newest of these tools, tries to identify the emotions behind a set of results. If, for example, a search uncovers 5,000 news articles about the Segway, sentiment extraction could narrow the set down to only those articles that are favorable. Products from Business 360, Fast, Lexalytics, NStein and Symphony all provide sentiment extraction. IBM has layered NStein technology on its OmniFind enterprise search platform to support "reputation monitoring," so companies can know when their public image is becoming tarnished.
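The Segway example can be sketched with a toy word-list scorer. This is an assumption-laden stand-in, not how any of the named products work: the `POSITIVE` and `NEGATIVE` word lists and the sample headlines are invented for illustration.

```python
import re

# Hypothetical cue words; a real lexicon would be far larger and weighted.
POSITIVE = {"great", "innovative", "love", "impressive"}
NEGATIVE = {"recall", "unsafe", "flop", "disappointing"}


def sentiment_score(text):
    """Count positive cue words minus negative cue words."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def favorable(articles):
    """Keep only the articles whose score is above zero."""
    return [article for article in articles if sentiment_score(article) > 0]


articles = [
    "The Segway is an impressive, innovative ride.",
    "Critics call the Segway a disappointing flop.",
    "The Segway went on sale today.",
]
favorable_articles = favorable(articles)
```

A word-count approach like this misses negation and sarcasm, which is one reason sentiment extraction is sold as a specialized product rather than a search-engine checkbox.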

Entity extraction uses various techniques to identify proper names and tag and index them. Inxight and ClearForest are the two leading providers of entity-extraction software, and many search tools embed or work with their technology.
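One of the simplest of those techniques is a lookup against a list of known names. The sketch below assumes that approach; the `GAZETTEER` contents, the document IDs and the sample sentences are illustrative only, and commercial tools combine lookups with statistical and linguistic methods.

```python
import re
from collections import defaultdict

# Hypothetical list of known proper names and their types.
GAZETTEER = {
    "IBM": "organization",
    "Mayo Clinic": "organization",
    "Dave Kellogg": "person",
}


def extract_entities(text):
    """Return (name, type) pairs for gazetteer names found as whole words."""
    found = []
    for name, kind in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            found.append((name, kind))
    return found


def build_entity_index(documents):
    """Map each extracted name to the set of document IDs that mention it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for name, _kind in extract_entities(text):
            index[name].add(doc_id)
    return index


documents = {
    "doc1": "IBM has layered new analytics onto its search platform.",
    "doc2": "The Mayo Clinic is working with IBM on text analytics.",
}
entity_index = build_entity_index(documents)
```

Once names are tagged and indexed this way, a search for a person or company can return every document that mentions it, regardless of the other keywords in the text.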

Concept search tools put results in context, as in Paris the city versus Paris the person or Apple the company versus Apple the fruit. These tools use natural-language understanding techniques to make such distinctions. Autonomy and Engenium are two vendors of concept search software.
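The Apple distinction can be illustrated with a crude cue-word overlap, used here as a stand-in for real natural-language understanding. The `SENSES` table and the `disambiguate` function are hypothetical, not taken from any vendor's product.

```python
import re

# Hypothetical context cues for each sense of an ambiguous term.
SENSES = {
    "apple": {
        "company": {"iphone", "mac", "shares", "ceo"},
        "fruit": {"pie", "orchard", "juice", "tree"},
    },
}


def disambiguate(term, text):
    """Pick the sense whose cue words overlap most with the text, if any do."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(cues & words) for sense, cues in SENSES[term].items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


company_sense = disambiguate("apple", "Apple shares rose after the CEO spoke.")
fruit_sense = disambiguate("apple", "Bake an apple pie with fruit from the orchard.")
unknown_sense = disambiguate("apple", "Apple.")
```

When no surrounding words offer a clue, the sketch returns `None` rather than guessing, which mirrors the problem a bare keyword query poses for any concept tool.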

Adding a Backbone

Assuming you need more than one search technology, how do you knit disparate solutions together? IBM's answer is Unstructured Information Management Architecture. Recently published on, UIMA is an XML standard framework whose source code is available to third-party search technologies. It acts as a backbone into which text analytics and taxonomy tools can be plugged.

UIMA may sound like a gimmick to promote IBM's OmniFind enterprise search product, but because its business is driven by services more than software, IBM is willing to pull in other, sometimes competing applications. "No single vendor can address all analytics needs or all requirements to understand unstructured information," says Marc Andrews, director of search and discovery strategy. "Companies need different analytics for different sets of content; [what's] relevant to the life sciences community will not be relevant to the financial services industry. And even within an organization, the analytics relevant to warranty claims and customer service data will be different from the analytics relevant to marketing, HR and generic interest."

UIMA provides a common language so search results can be interpreted by different applications or analytics engines. The framework defines a common analysis structure whereby any content — whether it be an HTML page, a PDF, a free-form text field, a blob out of a database or a Word document — can be pulled into a common format and sent to a search tool. Results are fed back into the analysis structure and passed along to the next search tool. The final results are output in a common format that any UIMA-compliant application can use.
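The flow described above can be sketched in a few lines. This is not the UIMA API; the `Analysis` class, the two annotators and the `run_pipeline` helper are invented names that only mimic the idea of a shared structure handed from one tool to the next.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """A shared structure: normalized text plus annotations added by each tool."""
    text: str
    annotations: list = field(default_factory=list)


def from_html(html):
    """Pull HTML into the common format by naively stripping tags."""
    text = re.sub(r"<[^>]+>", " ", html)
    return Analysis(text=" ".join(text.split()))


def entity_annotator(analysis):
    """Tag each mention of a known name with its character offset."""
    for match in re.finditer(r"\bSegway\b", analysis.text):
        analysis.annotations.append(("entity", match.group(), match.start()))
    return analysis


def sentiment_annotator(analysis):
    """Build on earlier results: only score text that already has an entity."""
    has_entity = any(kind == "entity" for kind, *_ in analysis.annotations)
    if has_entity and "impressive" in analysis.text.lower():
        analysis.annotations.append(("sentiment", "favorable", None))
    return analysis


def run_pipeline(analysis, annotators):
    """Pass the same structure through each tool in turn."""
    for annotate in annotators:
        analysis = annotate(analysis)
    return analysis


result = run_pipeline(
    from_html("<p>The <b>Segway</b> is impressive.</p>"),
    [entity_annotator, sentiment_annotator],
)
```

Because every tool reads and writes the same structure, swapping one vendor's annotator for another's means changing one entry in the list rather than rewriting the integration.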

Can UIMA become a universally accepted backbone that holds search tools together? Some think UIMA is on its way to becoming a de facto standard. So far, the Mayo Clinic, Sloan Kettering and the Defense Advanced Research Projects Agency are adopting the framework, and 15 vendors, including Attensity, ClearForest, Cognos, Inxight, NStein and Siderean, have agreed to make their search tools UIMA-compliant.

In a case of co-opetition, Endeca will support UIMA in an upcoming release of its enterprise search software even though the company competes with iPhrase, which was acquired last year by IBM. "UIMA will uncomplicate the world," says Phil Braden, Endeca's director of product management. "As more and more people adopt UIMA as the standard for how structured and unstructured data is supposed to look and how these components are supposed to integrate, it becomes that much easier to pull data from these different systems into Endeca."

There's little to challenge UIMA other than a couple of XML initiatives that also address the standardization of data formats for search engines. One such initiative is Exchangeable Faceted Metadata Language, an open XML format for publishing and connecting faceted metadata between Web sites, but that standard doesn't have the momentum of something being pushed by IBM.

Not every company, of course, will go to all the lengths described here to architect accurate search. For some, keyword search and placement of documents in well-labeled electronic folders will suffice. The sophisticated search pioneers are e-commerce sites, pharmaceutical companies and government agencies, which have the most to gain: greater sales, faster drug development, detection of terrorist activity. Call centers are getting search makeovers so that multiple search tools can mine unstructured content and databases together and give reps all the information they need to close calls. What could broader and more accurate searches achieve in your company?
