The Word on Text Mining

Text analytics provide concept discovery, automated classification, and innovative displays for volumes of unstructured documents.

Many companies and government agencies already use text mining, albeit for very specialized applications. Because competitiveness and security concerns will only grow in the coming years and text mining extends well-understood search and data mining concepts, the scope and pervasiveness of text-mining applications are bound to grow rapidly.

Techniques and Vendors

Text mining is a two-stage process of categorization and classification. First, you figure out how to describe documents and their contents including the concepts they contain, and then you bin documents into the descriptive categories and map inter-document relationships according to the newly detected concepts. This approach is similar to segmentation and classification through data mining; I see data mining's clusters as analogous to text-mining-generated concepts. Once you have classified according to categories, you can do something akin to OLAP-style slice-and-dice analysis of multidimensional data sets in order to tease interesting details — anomalous or exceptional information — out of the larger document sets.
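The two stages can be sketched in a few lines of Python. This is a toy illustration, not any vendor's method: it assumes a "concept" is simply a term that appears in at least two documents, and the sample documents, stopword list, and function names are all invented for the example.

```python
import re
from collections import Counter, defaultdict

# Hypothetical stopword list, kept tiny for the example.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "for", "is", "on"}

def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def discover_concepts(docs, min_docs=2):
    """Stage 1: treat any term found in at least min_docs documents as a concept."""
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))
    return {term for term, n in doc_freq.items() if n >= min_docs}

def bin_documents(docs, concepts):
    """Stage 2: file each document under every concept it mentions."""
    bins = defaultdict(set)
    for doc_id, text in docs.items():
        for term in set(tokenize(text)) & concepts:
            bins[term].add(doc_id)
    return bins

def slice_bins(bins, *concepts):
    """Slice and dice: documents filed under all of the given concepts."""
    sets = [bins.get(c, set()) for c in concepts]
    return set.intersection(*sets) if sets else set()

docs = {
    "d1": "Merger talks between the bank and the insurer",
    "d2": "The bank reported fraud in wire transfers",
    "d3": "Insurer flags fraud in merger filings",
}
concepts = discover_concepts(docs)
bins = bin_documents(docs, concepts)
print(sorted(concepts))
print(sorted(slice_bins(bins, "fraud", "merger")))
```

Intersecting the "fraud" and "merger" bins plays the role of the OLAP-style slice: it narrows three documents to the one that mentions both concepts. Real products use far richer linguistic and statistical analysis to find concepts than a document-frequency threshold.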

Barak Pridor, president of text-mining vendor ClearForest, describes text-mining steps as "semantic, statistical, and structural analysis that classifies documents and discovers buried persistent entities, events, facts, and relationships" in a process he calls "intelligent hybrid tagging." Pridor distinguishes document-level tags (descriptive elements like subject and author) from "inner document tags" that work with families of entity types (that is, with conceptual groupings).
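The difference between the two kinds of tags is easy to picture as a data structure. The sketch below is purely illustrative and is not ClearForest's implementation: the TaggedDocument class, the tiny lookup table of entity types, and the exact-match tagging are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical lookup table grouping known names into entity types.
ENTITY_TYPES = {
    "Company": ["ClearForest", "Autonomy", "Convera"],
    "Person": ["Barak Pridor", "Claude Vogel"],
}

@dataclass
class TaggedDocument:
    text: str
    document_tags: dict                              # document-level: subject, author, ...
    inner_tags: list = field(default_factory=list)   # (entity type, name, offset)

def tag_entities(doc):
    """Add an inner tag for every exact match of a known name in the text."""
    for entity_type, names in ENTITY_TYPES.items():
        for name in names:
            for match in re.finditer(re.escape(name), doc.text):
                doc.inner_tags.append((entity_type, name, match.start()))
    doc.inner_tags.sort(key=lambda tag: tag[2])  # order by position in the text
    return doc

doc = TaggedDocument(
    text="Barak Pridor leads ClearForest.",
    document_tags={"subject": "text mining", "author": "staff"},
)
tag_entities(doc)
print(doc.document_tags)
print(doc.inner_tags)
```

Document-level tags describe the document as a whole, while each inner tag points at a span inside it and assigns that span to an entity type. A production system would recognize entities through linguistic analysis rather than a fixed list of names.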

Text-mining offerings are by no means uniform. For example, many implementations, such as those from Autonomy, derive or import taxonomies (hierarchical knowledge representations that include concept definitions) for use in classifying and relating documents. Autonomy's director of technology strategy, Ron Kolb, claims, "Autonomy is unique in being mathematically based, using pattern matching and statistical analysis across multiple languages and multiple platforms." Autonomy uses Bayesian statistics, which assess relevance based on prior probabilities, and Claude Shannon's information theory to facilitate extracting concepts from document sets. The aim is to limit the effect that the vagaries of human language have on the results.
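To make the two ideas concrete, here is a minimal sketch of a naive Bayes text classifier plus Shannon's self-information measure. It is a textbook illustration under stated assumptions, not Autonomy's algorithm: the training sentences, category names, and function names are invented, and word probabilities use simple add-one (Laplace) smoothing.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Count words per category and documents per category."""
    word_counts = {}
    doc_counts = Counter()
    for category, text in labeled_docs:
        doc_counts[category] += 1
        word_counts.setdefault(category, Counter()).update(text.lower().split())
    return word_counts, doc_counts

def log_posteriors(text, word_counts, doc_counts):
    """Score each category: log prior plus summed log word likelihoods."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    scores = {}
    for category, counts in word_counts.items():
        score = math.log(doc_counts[category] / total_docs)  # prior probability
        denom = sum(counts.values()) + len(vocab)             # add-one smoothing
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / denom)
        scores[category] = score
    return scores

def classify(text, word_counts, doc_counts):
    """Pick the category with the highest posterior score."""
    scores = log_posteriors(text, word_counts, doc_counts)
    return max(scores, key=scores.get)

def self_information(word, word_counts):
    """Shannon self-information in bits; the word must occur in the training data."""
    totals = Counter()
    for counts in word_counts.values():
        totals.update(counts)
    return -math.log2(totals[word] / sum(totals.values()))

training = [
    ("finance", "bank merger shares rise"),
    ("finance", "shares fall after bank audit"),
    ("security", "network breach exposes passwords"),
    ("security", "audit finds network breach"),
]
word_counts, doc_counts = train(training)
print(classify("bank shares", word_counts, doc_counts))
print(classify("network breach", word_counts, doc_counts))
print(self_information("merger", word_counts) > self_information("bank", word_counts))
```

The prior reflects how common each category was in training, and the word likelihoods update it for the document at hand. The self-information function captures the information-theory intuition that a rare word ("merger" here) tells you more than a common one ("bank").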

Not everyone agrees that a statistically focused approach to categorization is best. Claude Vogel, CTO of Convera, told me, "You cannot build high-level taxonomies and ontologies that way. You can't escape the manual librarian-style work." (Roughly put, an ontology provides meaning for a knowledge domain, while a taxonomy organizes that knowledge.) That doesn't mean that you need an army of taxonomy builders to work with Convera's RetrievalWare because, as with Autonomy's products, you can import XML-expressed taxonomies. Convera also shares with Autonomy the distinction of searching media such as audio, images, and video in addition to text.
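Importing a hand-built taxonomy expressed in XML might look something like the following sketch. The XML layout here (nested concept elements with a name attribute and term children) is a made-up format for illustration; it is not RetrievalWare's or Autonomy's actual schema, and the matching is plain word lookup.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxonomy file contents; the element and attribute names are assumptions.
TAXONOMY_XML = """
<taxonomy>
  <concept name="Finance">
    <term>bank</term>
    <concept name="Mergers">
      <term>merger</term>
      <term>acquisition</term>
    </concept>
  </concept>
  <concept name="Security">
    <term>breach</term>
  </concept>
</taxonomy>
"""

def load_taxonomy(xml_text):
    """Map each term to its full concept path, such as 'Finance/Mergers'."""
    term_paths = {}

    def walk(node, path):
        for child in node:
            if child.tag == "concept":
                walk(child, path + [child.get("name")])
            elif child.tag == "term":
                term_paths[child.text.strip().lower()] = "/".join(path)

    walk(ET.fromstring(xml_text.strip()), [])
    return term_paths

def categorize(text, term_paths):
    """Return the sorted concept paths whose terms occur as words in the text."""
    words = set(text.lower().split())
    return sorted({path for term, path in term_paths.items() if term in words})

term_paths = load_taxonomy(TAXONOMY_XML)
print(categorize("Regulators review the bank merger", term_paths))
```

The librarian-style work lives in the XML file; the software only walks the hierarchy and files documents under the concepts whose terms they contain.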

Autonomy has focused on its mining engine, offering options such as weighting, supporting a large number of languages, and providing interfaces that integrate its products with BI, CRM, ERP, and other enterprise applications. Inxight Software, by contrast, is a notable vendor that, like ClearForest, has devoted significant resources to developing front ends. Inxight's Star Tree, for example, lets you explore network maps via hyperbolic visualization where segment details are enlarged or collapsed as you move the focus from one map node to another. Inxight, like Autonomy, provides back-end categorization and taxonomy management software to other companies including ClearForest and SAS.

SAS dominates the high-end data analysis market. Its Text Miner incorporates Inxight technology for linguistic analysis and concept extraction but gives the results a statistical spin that few other vendors can match. According to product manager Manya Mayes, Text Miner and the Enterprise Miner data-mining tools are fully integrated, so textual-analysis results become available as structured data to which a full range of traditional analytic approaches can be applied.
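One common way to turn text into structured data is a document-term table, in which each document becomes a row and each term a numeric column. The sketch below shows the idea in plain Python; it is not how Text Miner represents its results, and the documents, vocabulary, and function name are invented for the example.

```python
from collections import Counter

def to_rows(docs, vocabulary):
    """Turn each document into a record with one count column per vocabulary term."""
    rows = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        row = {"doc_id": doc_id}
        row.update({term: counts[term] for term in vocabulary})
        rows.append(row)
    return rows

docs = {
    "d1": "bank fraud alert bank",
    "d2": "merger talks stall",
}
vocabulary = ["bank", "fraud", "merger"]
rows = to_rows(docs, vocabulary)

# Once the text is tabular, ordinary aggregate analysis applies.
column_totals = {term: sum(row[term] for row in rows) for term in vocabulary}
print(rows)
print(column_totals)
```

Rows like these can sit alongside conventional fields such as customer ID or transaction amount, which is what lets clustering, regression, and other standard techniques run over text-derived variables.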


Although still in its infancy, text mining promises rapid advances in the scope of applications and in the effectiveness, comprehensiveness, interoperability, and usability of software implementations. The field won't be mature until commercial tools offer closed-loop analytics (that is, actionable results rather than just visualizations), analytics that are well integrated with data mining, and statistical analysis systems that use all of an organization's information assets. Although the techniques seem fairly well established, maturity will also bring standardized interfaces and input and output formats, extension to a spectrum of rich media in addition to plain text, scalability to world-size applications, and predictive capabilities. The implementations available show that researchers and vendors are on the right track.

Seth Grimes heads Alta Plana Corp., a Washington, D.C.-based consultancy specializing in analytic computing systems and demographic, economic, and marketing statistics.