June 22, 2007
IBM-Yahoo, Microsoft, Oracle, SAP ... why are these big dogs scrapping with specialized vendors like dtSearch, Vivisimo and X1 Technologies for a market Gartner pegged at a measly $370 million in 2006 worldwide revenue?
One word: Mindshare. We've all heard end users ask, "Have you Googled it?" You can't buy that kind of name recognition, and incumbent application vendors want to make darn sure IT groups don't begin to equate enterprise search with those shiny yellow Google appliances.
Moreover, it's clear that profit margins are anorexic, mainly because Google and IBM-Yahoo are exerting downward pricing pressure. Not that we're complaining--undercutting competitors is a time-honored, and customer-friendly, tactic. And yet, even at relatively attractive prices, this isn't a fast-growing market. Why is that?
After testing search products from dtSearch, Google, IBM, ISYS Search Software, Mondosoft, Thunderstone Software, Vivisimo and X1 in our Green Bay, Wis., Real-World Labs®, we think we know: Current products do not do a good job of providing relevant results given large amounts of typical enterprise data. That includes Google's killer PageRank algorithm, which transformed the Web into a truly useful source of information. PageRank simply does not translate well to the enterprise search market. Web search is different from desktop search is different from federated enterprise search, and so far, no one vendor has pulled it all together.
We also found security concerns. An enterprise-search engine goes to work when a user points it at a file share. The software opens every document on the share--even those with sensitive information. Text and metadata from each file are extracted and placed in an inverted index; most of the products we tested also cache document text or entire documents. We trust you see how that's a problem.
Worth The Price?
Big questions from the CIO: Why do we need this? What do these products really give an enterprise, aside from making it easier for sales folks to pull together proposals? Are they just pricey insurance against inevitable subpoenas? And how does cost stack up against functionality?
In a nutshell, end users are demanding the ability to gather knowledge from all corners of the enterprise, and as more information becomes digitized, being without sophisticated search will hamper productivity.
In addition, while recent changes in the Federal Rules of Civil Procedure didn't in themselves make implementing enterprise search a priority, they certainly jolted many IT groups out of complacency. Companies from Bank of America Securities to Philip Morris have shelled out billions in penalties for failing to provide business records. On a smaller scale, as information stores--databases, CMSs (content-management systems), file and Web servers--overflow with information, employees waste precious time digging for files that a federated-search engine might return in a fraction of a second.
As to the question of how well they work, the eight products we tested have some big customer names, strong staying power ... and not a whole lot of differentiation. You'll find some variation when looking at advanced federated search features that typically involve indexing systems like databases and CMSs, but beware of paying a premium: The biggest bang for your buck is in indexing file and Web servers. The advanced features vendors will try to sell you yield only marginally better results.
How It Works
At the heart of most search systems are three core pieces: A crawling/indexing engine, a query engine and a ranking/relevancy engine. Search vendors have varying names and different ways of segmenting these three pieces, but the paradigm is the same in the end. We did find an exception: X1 does not provide advanced search algorithms, such as fuzzy search or word stemming, or relevancy ranking. It approaches enterprise search with a unique, and more easily secured, desktop client-server architecture.
The crawling/indexing engine is responsible for retrieving documents and data from a source, say a database, file server or CMS, and placing the information into a data structure that can be searched efficiently. In most cases, the data structure is an inverted index. The crawling/indexing engine is also responsible for creating document caches, which are used for creating document "summaries" that are displayed on search-result pages.
The query engine searches for occurrences of keywords in the index and creates a list of documents that contain them. The ranking/relevancy engine is responsible for ordering documents so that, ideally, those most useful to the user are at the top of the list.
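To make the three pieces concrete, here is a minimal Python sketch of an inverted index, a query step and a naive ranking step. It is an illustration only, not how any of the tested products is implemented; the function names, the sample documents and the scoring rule (total keyword occurrences) are all assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Indexing step: map each term to {doc_id: occurrence count}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def query(index, text):
    """Query step: return ids of documents that contain every keyword."""
    terms = tokenize(text)
    if not terms:
        return set()
    postings = [set(index.get(term, {})) for term in terms]
    return set.intersection(*postings)


def rank(index, text, doc_ids):
    """Ranking step (assumed rule): most keyword occurrences first."""
    terms = tokenize(text)

    def score(doc_id):
        return sum(index.get(term, {}).get(doc_id, 0) for term in terms)

    return sorted(doc_ids, key=lambda d: (-score(d), d))


# Hypothetical documents standing in for a crawled file share.
docs = {
    "proposal.doc": "Sales proposal for the search appliance. Search pricing attached.",
    "memo.txt": "Memo about search appliance maintenance.",
    "notes.txt": "Meeting notes on pricing.",
}
index = build_index(docs)
hits = query(index, "search appliance")
print(rank(index, "search appliance", hits))
```

Real products layer far more on top of this, but the division of labor is the same: the index is built once per crawl, and each query touches only the posting lists for its keywords.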
On top of these core pieces, vendors add algorithms that help improve search accuracy. An advanced indexing engine, for example, might index document metadata as well as text. A "fuzzy search" capability may find keywords that are misspelled in documents, while an advanced relevancy algorithm would give more weight to documents with more occurrences of keywords in a smaller area.
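The two add-ons mentioned above can be sketched in a few lines of Python. This is a rough illustration under our own assumptions, not any vendor's algorithm: fuzzy matching is approximated with the standard library's `difflib.get_close_matches`, and the proximity score (keyword occurrences divided by the size of the smallest span containing every keyword) is a formula invented for the example.

```python
import difflib
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fuzzy_terms(keyword, vocabulary, cutoff=0.8):
    """Return indexed terms that closely resemble the keyword."""
    return difflib.get_close_matches(keyword.lower(), vocabulary, n=5, cutoff=cutoff)


def smallest_window(tokens, keywords):
    """Length of the shortest token span containing every keyword, or None."""
    needed = set(keywords)
    best = None
    for start, token in enumerate(tokens):
        if token not in needed:
            continue
        seen = set()
        for end in range(start, len(tokens)):
            if tokens[end] in needed:
                seen.add(tokens[end])
                if seen == needed:
                    width = end - start + 1
                    if best is None or width < best:
                        best = width
                    break
    return best


def proximity_score(text, keywords):
    """Assumed scoring rule: more occurrences in a smaller area score higher."""
    tokens = tokenize(text)
    window = smallest_window(tokens, keywords)
    if window is None:
        return 0.0
    occurrences = sum(tokens.count(k) for k in keywords)
    return occurrences / window


# Hypothetical vocabulary taken from an index, and two sample passages.
vocabulary = ["appliance", "application", "relevance"]
print(fuzzy_terms("applaince", vocabulary))

tight = "enterprise search pricing"
loose = "enterprise customers often complain that search is slow"
print(proximity_score(tight, ["enterprise", "search"]),
      proximity_score(loose, ["enterprise", "search"]))
```

The cutoff of 0.8 is arbitrary; lowering it catches more misspellings at the cost of more false matches, which is the same trade-off a fuzzy-search setting exposes in a real product.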
How It's Built
We noted three basic architectures among the tested products. Most popular were Web-based offerings that are accessed solely through a browser. The products from Google, IBM, Mondosoft, Thunderstone and Vivisimo all fall under this category, with the Google and Thunderstone products configured as appliances.
In a desktop-based architecture, indexes are shared through file shares. This architecture is easier to program because developers don't have to write a complicated client-server communication protocol. Search clients access the index directly. Of course, this can create security problems if you don't carefully control what documents are contained in a shared index. Products from dtSearch and ISYS fall into this category.
Finally, X1 submitted a desktop client-server architecture, where the interface for entering queries is separate from the search engine. That means indexes can be secured with more fine-grained control; in fact, the query interface need not have any access to the index itself, because every query goes through the search engine.
Which is best for you? If you plan to use the search product as an interface to your file servers, instead of using a file explorer, choose a product with good desktop integration, as opposed to a Web interface, or one that provides a powerful API for creating custom search clients. If desktop integration isn't important or you'd rather avoid desktop search clients, go with a Web-based offering.
Once you decide on architecture, choosing a search product is easy, right? The product that gives the best search results is the one you want. Going into this review, our premise was simple: ID the killer algorithm that would give one entry an edge in providing relevant results.
Problem is, we didn't find any killer algorithms. We spent hours running the products over various data sets, but didn't even find much variance with advanced algorithms among the products tested. Where we found the most differentiation: how security is handled and product architecture.
Articles Of Federation
Google was clearly the odd duck in our lineup, not because its offering lacks features or competitive pricing--in fact, enterprises can thank Google for the price wars that have made search a relative bargain.
What's different is its algorithm. The beauty of PageRank is that it enlists millions of humans to do what a program does poorly--make subjective decisions. Every link to a Web page is, in effect, one person's judgment that the page is worth reading. PageRank assigns a higher relevance to Web pages that have more pages linking to them, so the program itself deals only in objective data, and it's extremely easy to write a program that makes objective decisions. So, if 100 people link to page A, which contains content Z, but only five people link to page B, which also contains content Z, chances are page A has more accurate information about content Z and should thus have a higher ranking than page B. Taking the example one step further, page B might have more occurrences of keywords, and those keywords might be bold and in headings more often, but page A will still have a higher relevance.
Problem is, the Web and a typical enterprise's cache of data are totally different animals (and Google may be the only vendor in both games right now). Consider the task of indexing the Web and returning relevant results. The Web consists mostly of structured documents. When parsing an HTML file, it's easy for the parser to determine which words are highlighted--bold, italicized, headers--and to give these terms higher relevance. The parsed text is placed in an inverted index, which is used by a query engine to find matching results.
HTML documents also are structured in the sense that they contain links to other HTML documents. This second form of structure, using link analysis--essentially peer review--is the basis of Google's PageRank algorithm.
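The page A versus page B example can be worked through with a simplified, textbook-style link-analysis calculation. The Python sketch below is not Google's production algorithm; the damping factor of 0.85, the fixed iteration count, the handling of pages with no outbound links and the page names are all assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified link analysis. `links` maps a page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page passes its rank, in equal shares, to the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outbound links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: 100 pages link to A, five pages link to B.
links = {f"fan{i}": ["A"] for i in range(100)}
links.update({f"other{i}": ["B"] for i in range(5)})
ranks = pagerank(links)
print(ranks["A"] > ranks["B"])
```

Note that the calculation never looks at page content: keyword counts, bold text and headings play no part, which is why page A outranks page B regardless of how B is written.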
Before link analysis became mainstream, strict keyword searches did an OK job of finding relevant documents. But keyword searches are easy to fool by adding hidden text to HTML pages, and simply counting the number of times certain keywords appear in a document does not give a good measure of relevance. Link-analysis algorithms help solve both problems.
Of course, there will be far fewer results returned in the enterprise. In addition, end users often will have a good idea about document names, and your file servers are most likely partitioned by department, further reducing the number of documents users must wade through. The search products we tested have a number of automatic and configurable features that will help improve search-result relevance. All were easy to install and configure, with even the most challenging taking less than an hour.
Don't Look Now
Indexing information across the enterprise can undo all the security controls you've put in place to keep attackers at bay and employees honest, not to mention compliance with regs like SOX and HIPAA.
The first problem has more to do with knowing what information is on your file servers than it does with security. File servers often contain sensitive information, say a document containing passwords or an offer of employment. Indexing a file server will dredge this up quickly.
Obviously, your search system shouldn't return documents that a user wouldn't normally have access to. Better still, don't even let users know certain forbidden fruit exists--showing summary info for documents they can't open will open a can of worms.
This type of problem would most commonly occur with Web-based products, such as IBM's OmniFind Yahoo! Edition, where the client is a Web browser. If users aren't authenticated by the Web server, or credentials aren't passed to the Web server in some manner, the search software won't be able to check whether the user has rights to the documents it lists. Security is enforced only later, when the user actually selects a document and a file protocol is used to retrieve it from the server.
Once the user is authenticated, the search products use three methods for checking privileges: cache security information (ACLs and/or LDAP objects) on the search application and check privileges against the cache, check privileges against an LDAP server or ACL from the originating server, or use the vendor's security API.
All three methods provide the same results, but there's an extra gotcha: Caching security information can boost performance by giving the search solution a fast, local place to verify user privileges, eliminating the need to go to the originating or LDAP server to test credentials for each document. However, cached security information isn't updated in real time. Updates occur only when files are recrawled, creating a lag between when rights are granted and revoked on the originating server and when the cached security info is updated on the search app.
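The lag is easy to see in a toy model. In the Python sketch below, the `OriginServer` and `SearchApp` classes, the user name and the file name are hypothetical stand-ins written for this example; no product's real security API is shown.

```python
class OriginServer:
    """Stand-in for the originating file server, which holds the live ACLs."""

    def __init__(self):
        self.acl = {}  # document path -> set of users allowed to read it

    def grant(self, user, doc):
        self.acl.setdefault(doc, set()).add(user)

    def revoke(self, user, doc):
        self.acl.get(doc, set()).discard(user)

    def can_read(self, user, doc):
        return user in self.acl.get(doc, set())


class SearchApp:
    """Stand-in for a search product that can check rights two ways."""

    def __init__(self, origin):
        self.origin = origin
        self.cached_acl = {}

    def recrawl(self):
        # Cached security info is refreshed only when files are recrawled.
        self.cached_acl = {doc: set(users) for doc, users in self.origin.acl.items()}

    def filter_hits(self, user, hits, use_cache):
        if use_cache:
            # Fast local check against the copy taken at the last crawl.
            return [d for d in hits if user in self.cached_acl.get(d, set())]
        # Slower check against the originating server for every document.
        return [d for d in hits if self.origin.can_read(user, d)]


origin = OriginServer()
origin.grant("alice", "offer_letter.doc")
app = SearchApp(origin)
app.recrawl()

origin.revoke("alice", "offer_letter.doc")  # rights change on the origin
hits = ["offer_letter.doc"]
stale = app.filter_hits("alice", hits, use_cache=True)
live = app.filter_hits("alice", hits, use_cache=False)
app.recrawl()
fresh = app.filter_hits("alice", hits, use_cache=True)
print(stale, live, fresh)
```

Between the revocation and the next recrawl, the cached check still lists the document for the user while the live check does not. How long that window stays open depends entirely on the recrawl schedule.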
If you need the performance that using cached security information gives but can't budge on the security implications, all is not lost. The products we reviewed, except IBM OmniFind, let IT create multiple indexes and assign rights to those indexes. Then, if the product allows, security checking at the document level can be turned off. Say you use single sign-on to grant access to a search utility to only those employees who have access to all the indexed content, for instance. Then, the search appliance doesn't have to check user privileges on every document returned in a query. Some of the search products also let you grant individual users and groups access to particular indexes.
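As a rough sketch of index-level security, the Python below checks a user's groups once per index and then skips per-document checks entirely. The `Index` class, the group names and the sample documents are assumptions made for the example, not a description of how any reviewed product stores its indexes.

```python
from dataclasses import dataclass, field


@dataclass
class Index:
    name: str
    allowed_groups: set
    documents: dict = field(default_factory=dict)  # document id -> text


def search(indexes, user_groups, keyword):
    """Search only the indexes the user's groups may open; no per-document checks."""
    results = []
    for idx in indexes:
        if not (idx.allowed_groups & user_groups):
            continue  # the whole index is off limits to this user
        for doc_id, text in idx.documents.items():
            if keyword.lower() in text.lower():
                results.append((idx.name, doc_id))
    return results


# Hypothetical indexes partitioned by department.
hr_index = Index("hr", {"hr"}, {"offer.doc": "Offer of employment"})
sales_index = Index("sales", {"sales", "hr"}, {"quote.doc": "Special offer pricing"})

print(search([hr_index, sales_index], {"sales"}, "offer"))
print(search([hr_index, sales_index], {"hr"}, "offer"))
```

This only works if everyone granted an index is entitled to every document in it, which is why the partitioning of content into indexes has to mirror the access rules on the originating systems.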
Another option is to provide multiple search servers. Using one index to provide search access to all employees isn't good practice--one index means one super-user account that can access all indexed content.
Bottom line, ensure your search mechanism is in lockstep with the security of the originating systems. Never forget that the search product stores content, maybe even a copy of the entire document, on a server separate from the originating server--meaning it is no longer governed by the rules that govern the original content.
Another area of consideration is privacy, concerning two areas where content is stored: e-mail and users' desktops. As with all indexed content, ensure that proper authentication and privilege checking are in place when providing search capabilities to an e-mail database.
Users' desktops may also contain sensitive information, but a bigger issue is that to index a local desktop and let people other than the local user search the index, you'd have to create shares on every desktop--a potential security nightmare if your LAN gets hit with a virus that spreads through file shares. Desktops are better served with local indexing and search programs like Google Desktop, OmniFind Yahoo! Edition or X1 Enterprise Client.
In our reviews, we discuss how each product handles security, including whether it supports caching ACLs.
SEARCH BY THE NUMBERS
40: Percentage of unit licenses for enterprise search sold that Google will provide by 4Q07. Source: Gartner
Less than 5 percent: Global 2000 companies that will have selected Google as their primary information access software vendor by 4Q07. Source: Gartner
10: Percentage of unit licenses for enterprise search sold that Microsoft will provide by 4Q08. Source: Gartner
$30,000: Cost for Google Search Appliance capable of searching as many as 500,000 documents. Source: Google
$57,670: Estimated price for Microsoft SharePoint Server 2007 for Enterprise Search. Source: Microsoft
$0: Cost for the OmniFind Yahoo! Edition to index as many as 500,000 documents (download at omnifind.ibm.yahoo.com). Support can be purchased for $1,999 per year. Source: IBM
CACHE AND CARRY
To decide whether you need enterprise search now or can wait for offerings to mature, you need an idea of how much time your employees spend searching for content, the location of the content that employees are seeking and what the information is used for.
For salespeople who must pull results from e-mail, a file server and a Web server to build a proposal, federated search can bring a lot of value. For remote employees who keep a lot of data on their local drives, a search app that integrates tightly with the desktop, like X1 Enterprise Client, is ideal. Such products have a multilevel architecture--desktop agents plus server-based indexing. The client can index the local computer and communicate with a "cluster" to search server file shares.
If you plan to turn on caching, you'll have to determine if you want just text cached, with no images, or the entire document. Obviously, the latter will greatly increase the amount of space needed. On the flip side, if full-document caching is enabled, users will still be able to query and view documents when the originating source is down, provided security information is also cached, or the infrastructure is such that the search engine doesn't have to verify rights against the originating source.
Another caveat of caching depends on how the software indexes a document. Some vendors don't index common words, which helps reduce the size of the index. The downside is, a separate document cache must be created, by caching the whole document or just the document text. The cache is needed to generate page summaries. Other vendors index every word at the cost of a larger index and the benefit of not needing to create a separate document cache. Because every word is contained in the index, summaries can be generated from the index, but this method of generating summaries can be slower than generating summaries from a document cache.
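The trade-off can be illustrated with a toy positional index in Python. The stop-word list, the function names and the sample sentence below are assumptions for this sketch; real products use their own index formats and summary logic.

```python
import re

STOP_WORDS = {"the", "a", "of", "and", "to", "is"}  # illustrative list only


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_positional_index(text, skip_stop_words):
    """Map each term to its word positions, optionally dropping common words."""
    index = {}
    for pos, term in enumerate(tokenize(text)):
        if skip_stop_words and term in STOP_WORDS:
            continue
        index.setdefault(term, []).append(pos)
    return index


def summary_from_cache(cached_tokens, hit, radius=2):
    """Slice the cached text around the hit: a direct lookup."""
    return " ".join(cached_tokens[max(0, hit - radius): hit + radius + 1])


def summary_from_index(index, hit, radius=2):
    """Rebuild the words around the hit by scanning every term in the index."""
    wanted = range(max(0, hit - radius), hit + radius + 1)
    found = {pos: term
             for term, positions in index.items()
             for pos in positions if pos in wanted}
    return " ".join(found[pos] for pos in sorted(found))


text = "The cost of the appliance is a matter of debate"
full = build_positional_index(text, skip_stop_words=False)
sparse = build_positional_index(text, skip_stop_words=True)
hit = full["appliance"][0]

print(summary_from_cache(tokenize(text), hit))
print(summary_from_index(full, hit))
print(summary_from_index(sparse, hit))
```

The index that drops common words is half the size here, but a summary rebuilt from it is missing words, so a separate cache is required. The full index can reproduce the passage, but only by walking every term, which hints at why index-generated summaries can be slower.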
Having an API available to plug into the search engine also may be of benefit. All the products we tested provide APIs to modify the behavior of some aspect of the product, like indexing and querying the index, and depending on the functionality provided by the API, developers could hook the search product into an ERP or SFA app. Thunderstone provides an XSL interface, for example, while dtSearch offers an API available in C/C++, COM, .Net and Java.
Ben Dupont is a systems engineer for WPS Resources in Green Bay, Wis. He specializes in software development. Write to him at [email protected].