The Perfect Search - InformationWeek

The Perfect Search

Google-style search is all right for some, but greater accuracy in the enterprise demands a mix of techniques including content tagging and taxonomy development and technologies such as entity, concept and sentiment extraction tools.

If you want to find out what Brad and Angelina are up to, Google is a great search tool. Type in the celebrity names and poof, you get a list of the latest stories about the Brangelina baby-to-be. But if you need a technical or business-oriented search, Internet-style search technology doesn't cut it. Accurate enterprise search depends on intelligent use of state-of-the-art taxonomies, metatags, semantics, clustering and analytics that find concepts and meaning in your data and documents.

The idea that the enterprise can't be searched like the Web sounds foreign to many business executives. "Why can't we use Google?" says the CEO. IT obediently buys Google's search appliance, turns it on and the problem is solved. Or is it? "For some companies, Google is fine," says Laura Ramos, Gartner analyst. But where many repositories of non-Web content and documents need to be searched or critical information must be found quickly, companies need to design searches that approximate human reasoning.

No one product can do this. But by mixing and matching the latest taxonomy, clustering, and entity, concept and sentiment extraction tools, you can get close. What's helping is the rise of XML: As more companies realize the benefits of reading and sharing information in standard XML formats, such as RDF, ebXML and XBRL, more products roll out to convert documents, databases and other content into XML. The information provided in XML tags and formatting brings a level of intelligence about documents and content hitherto unavailable. Next-generation search technologies are taking advantage of XML formatting and metadata to provide searches informed by insider information and structure.

Structuring Content

The main trend adding power to enterprise search is the increase in semistructured information: content that has some kind of structure to it, generally through the use of metatags that describe content. E-mail, which is structured with "To," "From" and "Subject" fields, is one example of semistructured data. XML is also expanding the universe of semistructured content as industries adopt XML schemas, such as ACORD (Association for Cooperative Operations Research and Development) in the insurance industry and XBRL (Extensible Business Reporting Language) in the financial services arena. Such schemas help businesses exchange and analyze data in a standardized way.
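To make the e-mail example concrete, here is a minimal sketch in Python of how the header fields of a message can be separated from its free-text body, using the standard library's email module. The message itself and the helper name extract_fields are invented for illustration; a real indexing pipeline would handle multipart messages, encodings and many more headers.

```python
from email import message_from_string

# Hypothetical message, invented for illustration only.
RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Q3 claims report

The draft report is attached for review.
"""


def extract_fields(raw):
    """Split a raw e-mail into structured header fields and free-text body."""
    msg = message_from_string(raw)
    return {
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        # Assumes a simple, single-part plain-text message.
        "body": msg.get_payload().strip(),
    }


fields = extract_fields(RAW_MESSAGE)
print(fields["subject"])
```

Once the fields are separated, a search can be limited to, say, the subject line or the sender rather than the whole message text.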

Basic structure is provided by metatagging. An author or software program identifies the elements of a document, such as headline, abstract, byline, first paragraph, second paragraph and so on to modestly improve search results.

Using content structure in the display of search results is useful. If a search engine can present the headline, abstract, graphics and the first and last paragraphs of an article, the user gets a good idea of what it's about — much better than the typical document "snippet" that's often no use at all.
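As a rough illustration of this kind of structure-aware result display, the Python sketch below pulls the headline, abstract and first and last paragraphs out of an XML-tagged article. The element names (article, headline, abstract, body, p) and the function result_summary are assumptions made up for this example, not a schema used by any product named in this article.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged article; the element names are assumptions.
ARTICLE_XML = """
<article>
  <headline>The Perfect Search</headline>
  <abstract>Enterprise search needs more than keyword matching.</abstract>
  <body>
    <p>First paragraph.</p>
    <p>Middle paragraph.</p>
    <p>Last paragraph.</p>
  </body>
</article>
"""


def result_summary(xml_text):
    """Build a search-result summary from a document's structural tags."""
    root = ET.fromstring(xml_text)
    paragraphs = [p.text for p in root.findall("./body/p")]
    return {
        "headline": root.findtext("headline"),
        "abstract": root.findtext("abstract"),
        "first_paragraph": paragraphs[0] if paragraphs else None,
        "last_paragraph": paragraphs[-1] if paragraphs else None,
    }


summary = result_summary(ARTICLE_XML)
for label, text in summary.items():
    print(f"{label}: {text}")
```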

A few vendors are using XQuery, a command-oriented, SQL-like standard for creating search statements, to exploit the structure of XML-tagged content. Mark Logic, for example, converts documents and databases to XML, provides structural metatagging, and indexes the content and tags in a database where they can be mined by a variety of text analytics tools. Similarly, Siderean Software's Seamark Metadata Assembly Process Platform converts unstructured and structured data to RDF (Resource Description Framework), generates metadata such as page title and date, and organizes the content and tags into relational tables. Entity and concept extraction can be applied to create tags, and metatags can be suggested to an editorial team, which can approve and refine them. Content and metadata are then pulled into a central repository where they can be organized according to corporate vocabularies or ontologies and mined using the tagging results.
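The benefit of querying against structure can be sketched without XQuery itself. The Python example below is not XQuery and does not reflect how Mark Logic or Siderean work internally; it uses the standard library's ElementTree to restrict a term search to one tagged field. The collection, its element names and the function search_field are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical tagged collection, invented for illustration.
COLLECTION_XML = """
<collection>
  <doc id="1">
    <headline>Taxonomy basics</headline>
    <body>Search is mentioned here only in passing.</body>
  </doc>
  <doc id="2">
    <headline>Search tuning</headline>
    <body>Notes on relevance ranking.</body>
  </doc>
</collection>
"""


def search_field(xml_text, field, term):
    """Return ids of documents whose named field contains the term."""
    root = ET.fromstring(xml_text)
    hits = []
    for doc in root.findall("./doc"):
        text = doc.findtext(field, default="")
        if term.lower() in text.lower():  # case-insensitive substring match
            hits.append(doc.get("id"))
    return hits


headline_hits = search_field(COLLECTION_XML, "headline", "search")
body_hits = search_field(COLLECTION_XML, "body", "search")
print(headline_hits, body_hits)
```

A plain full-text search for "search" would return both documents; limiting the query to the headline returns only the one that is actually about search.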

Building Taxonomies

With metatags and some structure in place, the next logical step to improving an enterprise search is to build a taxonomy. For as long as search technology has existed, it's been obvious that the first step toward getting something more accurate than 500,000 useless hits is to create context or navigation for the search, such as a taxonomy — a classification according to a predetermined system. A taxonomy can be as basic as organizing documents by month or client, or it can be a sophisticated scheme of concepts within topics.
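A basic taxonomy of the "by client, then by month" kind can be sketched in a few lines of Python. The documents, field names and the functions build_taxonomy and browse are all invented for illustration; a production taxonomy would typically be maintained separately from the documents it classifies.

```python
# Hypothetical documents with metatags already assigned.
DOCUMENTS = [
    {"title": "Kickoff notes", "client": "Acme", "month": "2006-01"},
    {"title": "Status report", "client": "Acme", "month": "2006-02"},
    {"title": "Proposal", "client": "Globex", "month": "2006-01"},
]


def build_taxonomy(documents, levels):
    """Nest document titles under one category level per metadata field.

    `levels` must name at least one field, e.g. ["client", "month"].
    """
    taxonomy = {}
    for doc in documents:
        node = taxonomy
        for field in levels[:-1]:
            node = node.setdefault(doc[field], {})
        node.setdefault(doc[levels[-1]], []).append(doc["title"])
    return taxonomy


def browse(taxonomy, path=()):
    """List what sits under a category path: subcategories or titles."""
    node = taxonomy
    for category in path:
        node = node[category]
    return sorted(node)


by_client = build_taxonomy(DOCUMENTS, ["client", "month"])
print(browse(by_client))
print(browse(by_client, ["Acme"]))
print(browse(by_client, ["Acme", "2006-01"]))
```

Browsing from the top shows the clients, then the months for a client, then the documents themselves, which is the navigation a flat hit list cannot offer.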

"Categorization lets you sharpen the search and do concept-based retrieval as well as browsing," says Sue Feldman at IDC. "It lets a user answer questions that can't be answered by search alone, such as, 'What's in this collection?' or 'I'm interested in going on a vacation, but I don't know where; what are some interesting places?'"

With a taxonomy in place, users can browse through categories and discover information they need but didn't know how to look for (indeed, few people understand how to write effective search queries or ask the right questions of a search engine). The tricky part is deciding who will build the taxonomy. Who is willing, able and blessed with sufficient free time to decide what the structure should be and where each new piece of content fits in?

The most straightforward answer is to have authors categorize and apply the proper metatags and keywords to their content. Publishers of magazines and technical publications, for instance, take a structured-authoring approach using marked-up templates. But this laborious practice is not for everyone, and in a typical company, most users lack the time and inclination to fill out forms describing each document.

A more lightweight method of categorizing, called "folksonomy," is becoming popular on the Web, where sites such as Flickr provide those submitting photos or lists with easy-to-use tools to annotate their content. "By combining annotation across many different distributors, you gain insight into useful information and get around some [of the] problems with more traditional approaches to metadata management," says Brad Allen of Siderean.

With an active community of users assigning categories and metatags, valid new terms, initiatives and projects are easily added to the existing taxonomy, making it more dynamic than a rigid taxonomy created by a librarian. "It's sloppy and it's chaotic, but the degree to which it improves precision in the retrieval process can be quite significant," Allen says.
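One way to picture how community tagging can sharpen retrieval is to count how many distinct users applied each tag to each item, then require some level of agreement before returning a match. The Python sketch below does this with made-up annotations; the data, the normalization rule (trim and lowercase) and the function names are assumptions, not a description of any site's actual method.

```python
from collections import Counter, defaultdict

# Hypothetical (user, item, tag) annotations.
ANNOTATIONS = [
    ("ann", "photo-1", "sunset"),
    ("bob", "photo-1", "sunset"),
    ("bob", "photo-1", "beach"),
    ("cat", "photo-2", "beach"),
    ("ann", "photo-2", "Beach"),
]


def aggregate_tags(annotations):
    """Count, per item, how many distinct users applied each tag."""
    counts = defaultdict(Counter)
    seen = set()
    for user, item, tag in annotations:
        normalized = tag.strip().lower()  # fold trivial spelling variants
        key = (user, item, normalized)
        if key in seen:  # ignore a user repeating the same tag
            continue
        seen.add(key)
        counts[item][normalized] += 1
    return counts


def items_for_tag(counts, tag, min_users=1):
    """Return items that at least `min_users` users labeled with the tag."""
    tag = tag.strip().lower()
    return sorted(
        item for item, tags in counts.items() if tags[tag] >= min_users
    )


tag_counts = aggregate_tags(ANNOTATIONS)
print(items_for_tag(tag_counts, "beach"))
print(items_for_tag(tag_counts, "beach", min_users=2))
```

Raising min_users trades recall for precision: a tag applied by only one person may be idiosyncratic, while one applied independently by several users is more likely to be a useful category.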

Formal taxonomies are usually created by a librarian or cataloger trained in library science. This can be effective, but it's expensive, time-consuming and hard to keep up-to-date.
