The unnamed technology--Xerox refers to it as a categorizing tool--is available now and can be licensed by companies that want to incorporate it into existing document systems, as well as by third-party software vendors in the document-management, customer-relationship-management, and information-retrieval markets, Xerox said.
The tool, said Eric Gaussier, a researcher at the Grenoble facility, uses a hierarchical model able to understand the dependency between multiple categories, unlike "flat" search-and-retrieval tools that treat each category separately. Biochemistry and biophysics, for example, are closely related--and are treated as such by Xerox's solution--while flat retrieval systems would consider them separate and thus not cross-link documents in each.
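The difference between flat and hierarchical lookup can be illustrated with a minimal sketch. It is not Xerox's model: the category tree, the document names, and both functions are hypothetical, and the "hierarchy" here is reduced to a single parent link per category.

```python
# Hypothetical category tree (child -> parent); not Xerox's actual taxonomy.
PARENT = {
    "biochemistry": "life sciences",
    "biophysics": "life sciences",
    "contract law": "law",
}

# Hypothetical documents, each already filed under one category.
DOCS = {
    "doc1": "biochemistry",
    "doc2": "biophysics",
    "doc3": "contract law",
}


def flat_search(category):
    """Return only the documents filed under exactly this category."""
    return sorted(d for d, c in DOCS.items() if c == category)


def hierarchical_search(category):
    """Return documents in this category or in any sibling category
    that shares the same parent in the tree."""
    parent = PARENT.get(category)
    return sorted(
        d
        for d, c in DOCS.items()
        if c == category or (parent is not None and PARENT.get(c) == parent)
    )
```

With these assumed inputs, a flat search for "biochemistry" finds only doc1, while the hierarchical search also returns doc2, because biophysics sits under the same parent.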
The result, Gaussier said, is faster, better searches and a virtually hands-off way to digest and disseminate digital documents throughout an organization.
In the pilot program that Xerox ran with the Swiss Institute of Bioinformatics, an academic nonprofit foundation, "their traditional search engines for medical articles often presented the most pertinent documents at the end of the list," said Gaussier. "Using our software, they were much more successful at finding what they were looking for, and typically had to browse less than half of the list to find the information."
Xerox's new software, written in Java and suitable for deploying on Unix, Linux, and Windows, is the result of four years of steady work in linguistic modeling, semantics, and machine learning, said Gaussier.
It can be used out of the box by adding it to a company's existing document-management applications, he added. In that approach, "with a set of categories already established, the software takes documents already categorized and, using our models, 'learns' how to automatically classify new documents."
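Xerox has not published the models involved, but the general idea of learning from already-categorized documents can be sketched with a standard stand-in technique, a multinomial naive Bayes classifier with add-one smoothing. Everything below, including the function names `train` and `classify` and the whitespace tokenization, is an assumption made for illustration.

```python
import math
from collections import Counter, defaultdict


def train(labeled_docs):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)  # category -> word frequencies
    doc_counts = Counter()              # category -> number of documents
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the category with the highest naive Bayes log-score."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category, counts in word_counts.items():
        # Log prior: share of training documents in this category.
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

A caller would pass the existing, hand-categorized documents to `train` once, then call `classify` on each new document as it arrives.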
In a fresh environment not already equipped with a document management and routing solution, Xerox's tool walks users through the process of creating categories, then classifies documents as part of one or more of those categories.
In either case, the technology is bright enough to learn new categories on its own as it comes across additional documents. "After a while, if the system doesn't cover all the new topics that have emerged, it will tell you where it's not up to date," and dynamically suggest new categories, Gaussier said.
The software can handle documents written in up to 20 different languages. It also serves as an automatic router, shunting categorized documents to the right person--via E-mail attachments, for instance--based on a pre-set user profile that administrators establish. "This can be used, for example, to route incoming mail to the person responsible for a given topic and eliminate mail in your in-box you aren't interested in," Gaussier said. "Imagine clients' complaints going directly to the person responsible for handling them and your E-mail in-box containing only what you're interested in."
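Profile-based routing of this kind can be sketched in a few lines. This is an illustrative assumption about how such a router might work, not a description of Xerox's product: the `route` function, the profile layout (a mapping from address to a set of categories), and the example addresses are all hypothetical.

```python
def route(doc_categories, profiles):
    """Return the sorted list of recipients whose profile shares at
    least one category with the document.

    doc_categories: iterable of category names assigned to the document.
    profiles: dict mapping a recipient address to a set of categories
              that an administrator has assigned to that person.
    """
    wanted = set(doc_categories)
    return sorted(
        user for user, interests in profiles.items() if interests & wanted
    )


# Hypothetical profiles set up by an administrator.
PROFILES = {
    "alice@example.com": {"complaints"},
    "bob@example.com": {"billing", "complaints"},
    "carol@example.com": {"research"},
}
```

Under these assumed profiles, a document categorized as a complaint would go to alice@example.com and bob@example.com, and a document matching no profile would go to no one.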