Search Engine Finds Needles In Haystacks

Organize results from multiple sources to provide a single list of indexed results for any search without using taxonomies.

Sean Doherty, Contributor

May 2, 2005

Search engines are supposed to speed data retrieval, but the reality is that results of intranet, extranet and Internet searches commonly span multiple pages that devour your time and blur your vision. We need a search tool that makes sense of disparate resources and returns relevant results for quick review. By clustering search results on the fly, Vivisimo Velocity 4.2 gives you a magnet to pull the needle from the haystack without a costly taxonomy.

Velocity is a search engine with clustering and metasearch capabilities. It supports multiple platforms and uses XML configuration files. Its query-meta CGI script uses a C library that has been ported to Windows, Linux, FreeBSD, Sun Solaris and other Unix platforms. As a result, the clustering and metasearch components run on Linux, Solaris and Windows, but only Linux can run the search engine at this time.

I downloaded and installed Velocity on a dual-processing PIII server with 1,024 MB of RAM in our Syracuse University Real-World Labs®. The server ran a copy of Red Hat Linux 7.3 (Valhalla) and Apache Web server 1.3.23. This setup easily met Velocity's minimum requirements of a PIII server with 256 MB of RAM.

A Linux Story

I installed a shell script on the server from the command line without problems. Velocity's Web administration can be summed up in three Cs: clean, compact and configurable. Velocity defines global options that control clustering, metasearch and general setup for search sources and collections, as well as global variables used for XSL transformations. The default settings were sufficient to get me started, but if you plan to use Vivisimo's API integration you may have to modify the "query-meta" project.

Vivisimo lets you choose external search sources for your network to use, including general search engines such as GigaBlast and Google and specific Web sites like www.nwc.com. In accordance with your configuration, Velocity parses the resulting XML feed or HTML output with XSL to provide clustered search results.

I tested this by creating a source for GigaBlast. Although the process was intimidating, I got started without diving too deep into the documentation. From the admin page, I entered a source URL (http://www.gigablast.com/search?raw=8), a GET method to obtain input, and several parameters to identify the query string and set the number of results per page. The template included advanced options, declarations, testing and XML. An advanced section in the source configuration delved into mapping logical operators to the source's own syntax: AND to plus (+), NOT to minus (-), phrase to quotes (" ") and so on.
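The operator mapping and URL assembly described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Velocity's configuration format: the base URL and raw=8 come from the test above, while the parameter names `q` and `n` and both function names are assumptions made for the example.

```python
import re
from urllib.parse import urlencode

BASE_URL = "http://www.gigablast.com/search"
FIXED_PARAMS = {"raw": "8"}   # taken from the source URL used in the test
QUERY_PARAM = "q"             # assumed name of the query-string parameter
PER_PAGE_PARAM = "n"          # assumed name of the results-per-page parameter


def translate_query(query):
    """Map AND to a '+' prefix and NOT to a '-' prefix on the next term.

    Quoted phrases are kept intact, quotes included.
    """
    # Split into quoted phrases and bare words.
    tokens = re.findall(r'"[^"]*"|\S+', query)
    out = []
    prefix = ""
    for tok in tokens:
        if tok == "AND":
            prefix = "+"
        elif tok == "NOT":
            prefix = "-"
        else:
            out.append(prefix + tok)
            prefix = ""
    return " ".join(out)


def build_source_url(query, per_page=10):
    """Assemble the GET URL for one query against the source."""
    params = dict(FIXED_PARAMS)
    params[QUERY_PARAM] = translate_query(query)
    params[PER_PAGE_PARAM] = str(per_page)
    return BASE_URL + "?" + urlencode(params)
```

Calling `translate_query('cats AND dogs NOT "hot dogs"')` yields `cats +dogs -"hot dogs"`, which `build_source_url` then URL-encodes into the request.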

Next, I made a search collection of all the content on the Network Computing Web site by adding a seed URL (http://www.nwc.com) and restricting the page output to the nwc domain. On a business day, Velocity's crawler snatched more than 23,000 URLs and indexed them in approximately 547 MB of disk space.
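Restricting a crawl to one domain comes down to a filter applied to every link the crawler discovers. The Python sketch below shows one plausible way to do it; the function names are hypothetical, and Velocity's crawler presumably applies its own rules from the collection configuration.

```python
from urllib.parse import urljoin, urlparse

SEED_URL = "http://www.nwc.com"
ALLOWED_DOMAIN = "nwc.com"


def in_scope(url, allowed_domain=ALLOWED_DOMAIN):
    """True if the URL's host is the allowed domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == allowed_domain or host.endswith("." + allowed_domain)


def filter_links(page_url, hrefs):
    """Resolve links found on page_url and keep unique, in-domain URLs."""
    seen = set()
    kept = []
    for href in hrefs:
        # Make relative links absolute, then drop any #fragment.
        absolute = urljoin(page_url, href).split("#", 1)[0]
        if absolute not in seen and in_scope(absolute):
            seen.add(absolute)
            kept.append(absolute)
    return kept
```

Matching on the host name rather than on a substring of the URL keeps look-alike hosts out of the collection.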

Velocity includes a staging area to eliminate downtime while sites are crawled and indexed. It maintains both a staging version and a live version of a collection. When a collection is being updated on staging, the live version stays up and available. When the staging version completes its crawl and index, it replaces the live version.
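The staging-and-live arrangement is a familiar build-then-swap pattern. Here is a minimal in-memory sketch in Python, assuming a toy word-to-URL index; the class and method names are invented for illustration and say nothing about how Velocity stores its collections.

```python
import threading


class Collection:
    """Serves searches from a live index while a staging index is rebuilt."""

    def __init__(self):
        self._live = {}                 # word -> set of URLs
        self._lock = threading.Lock()

    def search(self, term):
        # Take a reference to the current live index; a swap that happens
        # afterward does not disturb this query.
        with self._lock:
            index = self._live
        return sorted(index.get(term.lower(), set()))

    def rebuild(self, documents):
        # Build the staging index off to the side; searches keep using
        # the live index the whole time.
        staging = {}
        for url, text in documents.items():
            for word in set(text.lower().split()):
                staging.setdefault(word, set()).add(url)
        # Promote staging to live in one step.
        with self._lock:
            self._live = staging
```

Because the swap is a single reference assignment, a search sees either the old index or the new one, never a half-built mix.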

Good

• Federated search across multiple resources
• Categorizes search results without taxonomy
• Highly configurable using XML

Bad

• Requires XML and XSL knowledge
• Search engine limited to Linux
• No wizards for configuration

Vivisimo Velocity 4.2 Enterprise Search Platform, starts at $10,000. Vivisimo, (412) 422-2499. www.vivisimo.com

Clustered Results

To configure parsing and the results display, I went to the Parser tab and selected the XSL Parser for a direct conversion of XML input to XSL output. I also specified data element locations in the XML input as XPath expressions to output results in the form of URLs, titles, summaries or snippets, and total number of results or response hits. For more advanced tasks, such as adding metadata to documents or combining content to create logical documents, you must create the style sheets yourself.
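To make the idea of locating fields by path concrete, here is a small Python sketch that pulls URLs, titles, snippets and the hit count out of an XML feed. It swaps in Python's xml.etree.ElementTree, which supports only a subset of XPath, in place of Velocity's XSL parser. The feed layout and element names (`response`, `hits`, `result`, `sum`) are made up for the example and are not GigaBlast's actual output.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed; a real source defines its own element names.
SAMPLE_FEED = """<response>
  <hits>2</hits>
  <result>
    <url>http://www.example.com/a</url>
    <title>Sewing needles</title>
    <sum>Sizes and types of sewing needles.</sum>
  </result>
  <result>
    <url>http://www.example.com/b</url>
    <title>Needle care</title>
    <sum>How to store needles safely.</sum>
  </result>
</response>"""

# Path expressions, relative to the root or to each result node.
TOTAL_PATH = "./hits"
RESULT_PATH = "./result"
FIELD_PATHS = {"url": "url", "title": "title", "snippet": "sum"}


def parse_feed(xml_text):
    """Return (total_hits, list of result dicts) from an XML feed."""
    root = ET.fromstring(xml_text)
    total = int(root.findtext(TOTAL_PATH, default="0"))
    results = []
    for node in root.findall(RESULT_PATH):
        results.append({
            name: (node.findtext(path) or "").strip()
            for name, path in FIELD_PATHS.items()
        })
    return total, results
```

Keeping the paths in a table separate from the parsing loop mirrors the configuration-driven approach: a new source needs new paths, not new code.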

After I added the Network Computing search collection to my other search sources, I set up a search for "sewing needles" using all my sources. The results were displayed on a page that showed clusters on the left organized by topic as set by my configs, and a general laundry list on the right organized by relevance rankings and date. When I clicked on a cluster, I was given results for that topic, organized by concept. Then I set up a search of "How we tested" and selected only nwc.com as a source. The results were clustered or grouped into discrete topics of How We Tested accelerators, infrastructure devices and so on.

Getting Velocity up and running to give users metasearch capabilities with clustered results was easy, and the results are powerful. This time-saver will show immediate ROI, but get familiar with XML/XSL technology to get the most out of it.

Sean Doherty is a senior technology editor and lawyer based at our Syracuse University Real-World Labs®. Write to him at [email protected].
