Search in Focus: Implementing a Taxonomy

Search engines don't know the difference between reading glasses and drinking glasses, but a taxonomy puts your query in context. We outline several ways to build taxonomies, ranging from the tough but potentially more accurate approach of building from scratch to the easier but potentially compromised approach of buying a prebuilt taxonomy or using automated clustering software.

Penny Crosman, Contributor

November 21, 2006

InformationWeek

Mention the word "taxonomy" and some people will think you mean stuffing dead animals (as in taxidermy). Although the term may not be well known, taxonomies (or sets of categories) are used to organize large quantities of information on the Internet, in portals and in enterprise data repositories. Taxonomies bring context to words, topic areas and search results.

Finding a piece of information within a large collection of data without a taxonomy is like driving in unknown territory without the benefit of a map or road signs: You may eventually stumble upon your destination, but chances are you'll encounter a lot of dead ends and detours first. A taxonomy provides a hierarchical structure of categories, from general to specific. In biology, for instance, dogs are classified under the kingdom Animalia, the phylum Chordata, the class Mammalia, the order Carnivora, the family Canidae, the genus Canis, and the species Canis familiaris.
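The general-to-specific structure can be sketched in a few lines of Python. This is only an illustration: the parent-pointer dictionary and the `path_to_root` helper are assumptions made for this example, not part of any product mentioned in this article.

```python
# Each category points to its broader parent (the biology example above).
PARENT = {
    "Canis familiaris": "Canis",
    "Canis": "Canidae",
    "Canidae": "Carnivora",
    "Carnivora": "Mammalia",
    "Mammalia": "Chordata",
    "Chordata": "Animalia",
}


def path_to_root(category, parent=PARENT):
    """Return the chain of categories from most general to `category`."""
    chain = [category]
    # Walk upward until we reach a category with no recorded parent.
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return list(reversed(chain))


if __name__ == "__main__":
    print(" > ".join(path_to_root("Canis familiaris")))
```

Storing only the parent of each category keeps the structure easy to edit; the full path is derived on demand rather than stored with every item.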

When combined with metatagging tools, text analytics and search software, enterprise taxonomies support accurate search and guided navigation that could not be achieved with search engines alone. As data volumes increase, so, too, does the need for taxonomy. If you have 100 documents, almost any search technique will work, but if you have a terabyte's worth of documents, you need sophisticated search guided by a taxonomy.

We outline several ways to build taxonomies, ranging from the tough but potentially more accurate approach of building from scratch to the easier but potentially compromised approach of buying a prebuilt taxonomy or using automated clustering software. We also examine deployment and ongoing maintenance practices, as well as the role of ontologies, which might come into play in merger and acquisition scenarios.

ASSESS THE NEED

An enterprise taxonomy attempts to classify virtually all information in an organization and bring it under one structure. Despite the many benefits (see "10 Reasons To Use a Taxonomy"), building an enterprise-wide taxonomy is easier said than done. Inevitably, each department has its own priorities, terminology and preferred structure for its body of information, so it's hard to get everyone to agree on one core set of categories. "Customers say this takes a long time, and they talk about people in a room yelling at each other," says Fern Halper, a partner at the research and consulting firm Hurwitz & Associates.

In some settings, universal taxonomies are an absolute must. At the Department of Homeland Security and public safety agencies, for example, taxonomies help tie together clues, establish relationships between crucial tidbits of information and spot broader security or safety threats.

Your company may or may not need an organization-wide taxonomy depending on the problems you're trying to solve. "If your application is simply to enable better retrieval of documents or better kinds of communication with structured data in databases, it may not be necessary," says Josh Powers, principal ontologist at search vendor Convera. "But if your goal is better communication throughout the company, you need to come to some agreement."

When it's time to build, there are two approaches: the tough road of trying to create and enforce a taxonomy through task forces, management edicts, training and so on; or the appeasement route, in which you create mappings between differing points of view. If the sales organization looks at the market in a different way than the product management group, you would choose the latter approach, and automated mappings could reconcile the two views with a central taxonomy (perhaps with the aid of an ontology, but more on that later).
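The mapping approach might look something like the following sketch. The department names, local terms and central category labels are all hypothetical, chosen only to show two groups' vocabularies resolving to one shared category.

```python
# Hypothetical local vocabularies mapped onto one central taxonomy.
MAPPINGS = {
    "sales": {
        "SMB accounts": "Customers/Small Business",
        "Key accounts": "Customers/Enterprise",
    },
    "product management": {
        "Entry tier": "Customers/Small Business",
        "Premium tier": "Customers/Enterprise",
    },
}


def to_central(department, local_term, mappings=MAPPINGS):
    """Translate a department's own term into the central category, or None."""
    return mappings.get(department, {}).get(local_term)


def same_category(dept_a, term_a, dept_b, term_b, mappings=MAPPINGS):
    """True when two departments' terms resolve to the same central category."""
    a = to_central(dept_a, term_a, mappings)
    b = to_central(dept_b, term_b, mappings)
    return a is not None and a == b
```

Each group keeps its own words, and only the mapping table has to be agreed on and maintained.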

REACH CONSENSUS

If you want the consistency of a taxonomy that applies across the organization, you must reach companywide consensus on what terms should be used and how information should be organized. When Factiva consults with companies on taxonomy development, it offers workshops on determining the business need, articulating the value proposition, identifying the return on investment, setting boundaries, building a team, achieving consensus, choosing a software vendor and establishing best practices.

To get the most accurate results, information-intensive businesses rely on teams of trained taxonomists who debate the merits of various terms and headings before finalizing distinctions and relationships. Graeme McCracken, COO of Reed Business Search, which employs about 20 taxonomists, tells of a meeting in which taxonomists discussed the difference between chicken and chickens. "They came to the conclusion that chicken is dead and chickens are alive," he says. Such discussions are an important part of the process. Reed has developed three taxonomies and a product and service ontology with more than 200,000 core terms (see "Field Report").

Taxonomists often need to consult with subject-matter experts, the people who create and use the content. At gas and chemical supplier Air Products and Chemicals, the innovation group (which combines R&D, marketing and technology staff) recently requested a taxonomy to help find relevant research and product information in several repositories as part of a larger initiative to capture and reuse knowledge. The company's four-person taxonomy team interviewed departmental staff, held weekly task-force meetings and gradually built a taxonomic framework. They also did extensive testing with subject-matter experts before taking the new taxonomy live. From a technology standpoint, the company has a dtSearch crawler for federated searching, and it used concept extraction, entity extraction and metadata generation software from Inxight Software in the taxonomy development effort.

A less time-consuming and more democratic (though sometimes less accurate) approach to developing categories is to use folksonomies or social bookmarking services (like Flickr.com and Del.icio.us). Here, contributors and users assign categories or tags to content as they see fit. Unless they're mischievous, careless with words or lazy, folksonomists can fulfill the critical role of the taxonomist. But be warned that the tags may not be consistent or normalized.
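To see why normalization matters, consider a minimal sketch that folds case, spacing and punctuation variants of user-supplied tags together before counting them. The rules here are assumptions for illustration; a real service would likely also handle plurals, misspellings and synonyms.

```python
import re
from collections import Counter


def normalize_tag(tag):
    """Lowercase a tag and treat runs of spaces, hyphens and underscores alike."""
    tag = tag.strip().lower()
    return re.sub(r"[\s_\-]+", " ", tag)


def tag_counts(tags):
    """Count tags after normalization, skipping blank entries."""
    return Counter(normalize_tag(t) for t in tags if t.strip())


if __name__ == "__main__":
    raw = ["Taxonomy", "taxonomy ", "TAXONOMY", "search-engine", "Search_Engine"]
    print(tag_counts(raw))
```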

Dogear is a social bookmarking service designed for business use. It lets users create, organize and categorize bookmarks from Internet and intranet sources alike. Developed by IBM Research labs, Dogear helps users filter and bookmark (essentially tag by category) large amounts of data for reuse by others in the enterprise. Users can share relevant and timely content via bookmarks, identify communities of interest and share expertise. IBM says Dogear will be available as a standalone Lotus offering and integrated into the Lotus portfolio in 2007.

BUY OFF THE SHELF

If you want to save time, prebuilt industry- and topic-centered taxonomies are readily available. For instance, the National Library of Medicine provides Medical Subject Headings (MeSH) that are used to index medical journals. Factiva offers the Taxonomy Warehouse, which provides prefab taxonomies (some for sale, some free) from sources ranging from publishers to The Library of Congress. Convera offers taxonomies on such subjects as genetics, finance and business, and technology.

Off-the-shelf taxonomies can be a great launching pad. "The head start of having a taxonomy that says, 'this is the organization, this is a standard way of looking at things,' always helps when you're looking at new information," Gartner analyst Rita Knox says.

Some companies customize prebuilt taxonomies to suit their specific needs. "As long as you're not thinking [a prebuilt taxonomy] is going to solve everything for you and you're willing to change it, it's a template to build on," says Halper of Hurwitz & Associates. If the taxonomy is too detailed, parts of it can be suppressed or ignored. However, the MeSH and Library of Congress taxonomies can be intricate, warns practitioner Deborah Silverman, associate director for resource management at the University of Pittsburgh Health Sciences Library System. "If you play with them you almost always break them," she says, adding that the university uses these taxonomies just as they are.

Some companies adapt taxonomies developed by trade magazines or conference producers. These professionals have put time and effort into determining what the key topic areas are, so why not take advantage of their efforts?

Technology also is available to generate taxonomies by analyzing vast stores of documents and deciphering hierarchies of concepts. Autonomy, Convera, Endeca Technologies and Teragram all offer automated tools that can help you build, test and manage taxonomies, though all require at least a modicum of human guidance and input.

PUT IT TO WORK

A taxonomy is only useful if it can be consistently applied. There are two challenges in matching information to categories: tagging the item and then matching tags to the appropriate categories.

Tagging content so it can be placed in the right "buckets" can be a major chore. "Placing things into a taxonomy isn't as easy as it may seem, and it's certainly not just based on keyword occurrence, because the context of the keyword occurrence can be very misleading or cryptic," says Martin Boyd, vice president of marketing at Silver Creek, which offers software that cleans and categorizes product data. As an example, "product descriptions tend to be very short and very ambiguous."

Even when librarians use MeSH or Library of Congress headings, categorizing books is labor intensive, the University of Pittsburgh's Silverman says, even more so for those who have to build their own schema.

Automated tools can alleviate this chore. For instance, the word Columbia in a document might refer to a university, a record company or a small town in Maryland, but an entity tagger can analyze that document and guess at the context by looking at surrounding words. Offered by vendors including Attensity, Convera, Endeca, Inxight, Recommind and Teragram, entity extraction tools automatically identify elements such as people, places and organizations, and they're often integrated with a search engine. FAST, for example, employs Teragram's categorization technology.
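The idea of guessing context from surrounding words can be shown with a toy example. The cue-word lists below are invented for illustration, and commercial entity extractors rely on far richer linguistic and statistical models than this simple overlap count.

```python
import re

# Hypothetical cue words for each possible meaning of "Columbia".
SENSES = {
    "Columbia University": {"university", "campus", "professor", "students"},
    "Columbia Records": {"record", "album", "label", "artist"},
    "Columbia, Maryland": {"maryland", "town", "county", "residents"},
}


def disambiguate(text, senses=SENSES):
    """Pick the sense whose cue words overlap most with the text, or None."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(cues & words) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # With no cue words present, refuse to guess.
    return best if scores[best] > 0 else None
```

Returning `None` when nothing matches leaves the ambiguous cases for a human reviewer instead of forcing a wrong tag.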

A keyword-based search engine can recognize taxonomies in a limited fashion by matching keywords in documents to the same keywords in your taxonomy. But it won't know, for instance, that a document about Vietnam belongs under the heading "Southeast Asia" because it doesn't understand what those two concepts mean. Concept extraction or concept recognition technologies, offered by vendors including Inxight and Convera, can uncover such relationships. One reason IBM developed its Unstructured Information Management Architecture (UIMA) open-source search framework was to enable search technologies (like keyword and concept search) to work together.
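The difference between the two kinds of matching can be sketched as follows. The `BROADER` table is a hand-built assumption for this example; concept-extraction products derive such relationships in more sophisticated ways.

```python
# Hypothetical "broader term" links, as a taxonomy might record them.
BROADER = {
    "Vietnam": "Southeast Asia",
    "Thailand": "Southeast Asia",
    "Southeast Asia": "Asia",
}


def keyword_match(doc_terms, heading):
    """A plain keyword engine only finds the heading if it appears literally."""
    return heading in doc_terms


def concept_match(doc_terms, heading, broader=BROADER):
    """Also match when any document term rolls up to the heading."""
    for term in doc_terms:
        current = term
        while current is not None:
            if current == heading:
                return True
            current = broader.get(current)  # climb one level; None at the top
    return False


if __name__ == "__main__":
    doc = {"Vietnam", "exports"}
    print(keyword_match(doc, "Southeast Asia"), concept_match(doc, "Southeast Asia"))
```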

Another quality that can help a search engine make sense of taxonomies is the ability to decipher metatags--for instance, knowing that a tagged number is a product ID number rather than just a random number. Endeca, i411, Mark Logic and Siderean Software are among vendors whose search engines are capable of understanding metadata.

Once you've deployed your taxonomy, it will need training and maintenance to continue to work properly. Terms change, and domains grow and morph into new topics. McCracken of Reed Business Search recalls reviewing a corporate taxonomy at a previous employer and wondering why "Data Processing Machines" was a top-level category and "Computers" a level down; clearly the taxonomy needed updating.

One tool Reed Business uses to keep its taxonomies up to date is a homegrown program called Harvester that regularly crawls targeted Web sites representing specific markets to pick up new terms and metatags that should be included. Factiva's Synaptica and Inxight's Taxonomy Workbench are two tools designed to help maintain taxonomies.
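The general idea of spotting new vocabulary can be illustrated with a short sketch. This is not Reed's Harvester, whose workings are not described here; the function name, stopword list and frequency threshold are assumptions, and the crawling step is left out entirely (the input is text that has already been fetched).

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "on", "with"}


def candidate_terms(pages, known_terms, min_count=2):
    """Return words that recur across fetched pages but are not yet in the taxonomy."""
    known = {t.lower() for t in known_terms}
    counts = Counter()
    for text in pages:
        for word in re.findall(r"[a-z]+", text.lower()):
            if word not in known and word not in STOPWORDS:
                counts[word] += 1
    # Most frequent first; rare words are dropped as likely noise.
    return [term for term, n in counts.most_common() if n >= min_count]
```

The output is a list of candidates for a taxonomist to review, not terms to be added automatically.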

CONSIDER CLUSTERING

For those who can't or don't wish to invest the time and effort necessary to build taxonomies, clustering offers an alternative. For a demonstration of how a clustering engine works, go to www.clusty.com. Type a query in the search field and you'll see how Vivisimo's clustering algorithm groups the results by subject heading (and number of hits within each) along the left side of the screen. If you choose a "tag cloud" presentation, you'll see the topic clusters displayed as a field of words, with the largest, boldest fonts indicating the highest hit results. The categories may not be exactly what you would have expected, but it's a useful filter that helps you quickly find the information you're after.

Endeca's clustering tool groups terms generated by its term-extraction module. Vivisimo's Velocity clustering engine analyzes all search result words and phrases (including strings of up to six words) to come up with themes. "When I first heard about this, I was skeptical because I'm a subject cataloger and I thought it would mean my job," recalls the University of Pittsburgh's Silverman. "I tried it out and tried to break it, but couldn't. It really does work because there are certain relationships between words and the words they generally appear with." The university uses the Vivisimo clustering engine on some of its Web sites. The school's Health Sciences Library System also uses it on top of its library catalog so users see manual tagging and automatic clustering results in combination. In a different approach, the American Society of Mechanical Engineers uses Vivisimo's clustering engine to analyze metadata and place associated content into a predefined taxonomy.
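A toy version of result clustering shows the principle that words tend to travel with certain other words. It is not Vivisimo's or Endeca's algorithm; it simply groups result snippets under any single word that at least two of them share, and the names and threshold are assumptions for this example.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "on", "with"}


def cluster_by_theme(snippets, query, min_docs=2):
    """Group snippets under the most widely shared non-query word they contain."""
    query_words = set(query.lower().split())
    doc_terms = []
    for snippet in snippets:
        words = set(re.findall(r"[a-z]+", snippet.lower()))
        doc_terms.append(words - STOPWORDS - query_words)
    # Document frequency: in how many snippets does each word appear?
    df = Counter(term for terms in doc_terms for term in terms)
    themes = {term for term, n in df.items() if n >= min_docs}
    clusters = defaultdict(list)
    for snippet, terms in zip(snippets, doc_terms):
        shared = terms & themes
        # Ties are broken alphabetically because max() keeps the first maximum.
        label = max(sorted(shared), key=lambda t: df[t]) if shared else "other"
        clusters[label].append(snippet)
    return dict(clusters)


if __name__ == "__main__":
    results = [
        "Reading glasses for small print",
        "Reading glasses with blue light filter",
        "Drinking glasses set of six",
        "Crystal drinking glasses",
        "Glasses repair kit",
    ]
    for theme, items in cluster_by_theme(results, "glasses").items():
        print(theme, len(items))
```

For a query on "glasses," the snippets separate into "reading" and "drinking" groups without anyone having defined those categories in advance.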

STEP UP TO ONTOLOGIES

Ontologies come into play when accuracy is vital. An ontology not only organizes information, it provides precise definitions of terms and logical rules for relationships between terms. It can help you integrate or communicate between two sets of data or disparate taxonomies by establishing a shared understanding.

"A taxonomy is only concerned with putting terms into buckets," says Bill Andersen, chief scientist at Ontology Works. "An ontology can represent the structure of things categorized in a taxonomy. In fact, most ontologies include a taxonomy as part of their skeleton."

An ontology understands the categories and how they're related to other information. For instance, a geographic ontology would not only recognize Columbia as a city, it would know that it's in Maryland, that it's within the United States and that Maryland adjoins the Chesapeake Bay and is south of New York.
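One simple way to picture this is as a set of subject-relation-object statements plus a rule for following them. The sketch below is an assumption-laden illustration: the relation names are invented, and real ontology software uses formal languages and reasoners, not a Python set.

```python
# The geographic facts from the example above, as (subject, relation, object).
TRIPLES = {
    ("Columbia", "is_a", "city"),
    ("Columbia", "located_in", "Maryland"),
    ("Maryland", "located_in", "United States"),
    ("Maryland", "adjoins", "Chesapeake Bay"),
    ("Maryland", "south_of", "New York"),
}


def objects(subject, relation, triples=TRIPLES):
    """Everything directly related to `subject` by `relation`."""
    return {o for s, r, o in triples if s == subject and r == relation}


def located_in(place, triples=TRIPLES):
    """All regions containing `place`, following located_in links transitively."""
    found = set()
    frontier = [place]
    while frontier:
        current = frontier.pop()
        for region in objects(current, "located_in", triples):
            if region not in found:
                found.add(region)
                frontier.append(region)
    return found


if __name__ == "__main__":
    print(sorted(located_in("Columbia")))
```

Nothing states outright that Columbia is in the United States; the rule that containment is transitive lets the program infer it, which is the kind of logic a plain taxonomy does not carry.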

Ontology Works and Teragram both offer ontology software, and it's in growing demand. The federal government uses ontologies in intelligence applications. Pharmaceutical companies use them to manage genomic data and drug design. In these roles ontologies have replaced data models, offering an enhanced, more explicit form of data model.

Compliance is another target application for ontologies. "If you have 100 databases in an enterprise and want to find out if you're in compliance with Sarbanes-Oxley regulations, [you'll have] very high-level questions about how your business is being run," Andersen says. "Those questions [can't be] answered by your databases. Somehow you have to translate the high-level conceptual vocabulary of what Sarbanes-Oxley compliance means into the low-level terms your databases are using."

Similarly in biomedicine, databases often record low-level experimental data, but researchers are trying to find intervention targets for new drugs at a high level. "How do you get from that very high-level question down to the data that can help you answer it?" Andersen asks. "Until a lot of the recent work in ontology, [you had to have] very intelligent humans help that along."

Ontologies are more expensive to build than data models, but Andersen says they offer durability and extensibility that data models can't match. Thus, new product lines or data applications can be easily added without disruption. Future Semantic Web applications will rely on sophisticated ontologies that bring meaning and context to information, so applications can quickly find and use specific data points.

LOOK FOR BREAKTHROUGHS

Vendors are pursuing advances in taxonomy and combinations with search technology. IBM has incorporated its WebFountain taxonomy tool into its OmniFind search software and continues to add partners to UIMA. Oracle is enabling its Secure Enterprise Search to build taxonomies based on those already embedded in applications, making it possible to retrieve application data such as purchase orders and customer records. Most advanced is the ontology work, which we'll see in future discussions of the Semantic Web.

10 REASONS TO USE A TAXONOMY

1. Narrow enterprise search. "On the Internet, you have Web pages and links between them, and those links allow you to perceive relationships between pages," notes Yves Schabes, president of natural language search vendor Teragram. "[That's what] made page-rank algorithms famous and Google so successful."

In contrast, there are no links between Word, Excel, PowerPoint and other types of documents within the enterprise, so Web search techniques don't work well. By tagging information according to an enterprise taxonomy--with the aid of extraction and categorization technologies--results can be quickly narrowed down within categories.

2. Improve site navigation. Make sure people coming to your site can actually find the products or information they're looking for. Some search engines provide administrative tools that record when customers have looked for something and not found it. Upon investigation, it often turns out that available products or services simply weren't listed under the right headings.

3. Eliminate redundancy. One international utility company discovered it unknowingly had identical projects under way in the United Kingdom and the United States because the two teams were using different words to describe the efforts. A taxonomy provides companywide terminology, encompassing synonyms and alternative expressions, as well as structure to which information from many sources can be mapped.

4. Maximize the value of intellectual assets. In knowledge-intensive industries, such as publishing, consulting and financial services, intellectual assets gain value the more they're used. A taxonomy organizes and eases discovery of assets, thereby maximizing reuse.

5. Support customer-facing employees. Salespeople can be much more effective if they can quickly find pertinent information before calling on existing or potential customers. And in the call center, time is money, yet customer-service representatives constantly talk to customers who don't know the proper nomenclature for the company's products and services. A taxonomy can help CSRs interpret queries and find requested data.

6. Make corporate resources more accessible. HR, IT and other support areas on corporate intranets are often loaded with terminology only understood within those departments. Taxonomies standardize terminology and can help publishers present information in a logical way. "What's important is that there's an organization scheme you can depend on that most people in the organization carry around in their heads," Gartner analyst Rita Knox says.

7. Ease mergers and acquisitions. When two companies merge, it can be hard to meld product lines and cultures; people in different organizations use disparate vocabularies. A unified taxonomy can help provide a common view.

8. Support globalization and localization. Translation and localization efforts are difficult enough. By establishing a global taxonomy, you can lower translation costs, maximize content reuse and avoid inconsistencies in brand building and corporate communications.

9. Streamline business processes. The amount of paperwork involved in drug trials, legal proceedings, legislative or regulatory proceedings and other complex processes can be overwhelming. The hierarchy inherent in taxonomy can at least ease navigation, and help researchers and analysts avoid working at cross purposes.

10. Speed legal discoveries. Lawsuits often lead to discovery requests for all documents related to a specific product or customer within a specified time period. Judges expect swift compliance, yet many companies pay steep fines for failing to comply in a timely fashion. A taxonomy can narrow and speed the search.

Executive Summary

Visionaries say the Semantic Web/Web 3.0 will some day be informed by a massive ontology that will provide a common way for machines to process information on the Web and understand its meaning. In the meantime, ontologies and their simpler, more down-to-earth cousins, taxonomies, are being deployed within companies, in portals and in intranets to make documents and content easier to find and understand.

Reed Business Search is using taxonomies to outperform Google in searching and delivering its content (see "Field Report"). Pharmaceutical companies, manufacturers and others are discovering the difference a well-constructed taxonomy can make in search and knowledge management efforts.

Taxonomies can help you improve site navigation, maximize reuse of content, and ease mergers and acquisitions, but they're not easy to build. You'll have to get departments and far-flung business units to agree on terms and hierarchies. You can save time and effort by starting with prebuilt taxonomies or by using clustering tools. When accuracy is imperative, ontologies define terms and apply rules for relationships between topics. Read on to make the right choices for your enterprise.
