Data integration will be a top story in information technology in 2011.
Whether your interests are in business intelligence, information access, or operations, there are clear and compelling benefits in linking enterprise data -- customer profiles and transactions, product and competitive information, weblogs -- to business-relevant content drawn from the ever-growing social/online information flood.
ETL (extract, transform, load) to data stores, together with the younger, load-first variant ELT, will remain the leading integration approaches. But they'll be complemented by new, dynamic capabilities provided by mash-ups and by semantic integration, driven by data profiles (type, distribution, and attributes of values) rather than by rigid, application-specific data definitions.
These newer, beyond-ETL approaches constitute a New Data Integration, developed to provide easy-to-use, application-embedded, end-user-focused integration capabilities.
The New Data Integration responds to the volume and diversity of data sources and needs and to growing demand for do-it-yourself data analysis. I explored these ideas last year in an article on 'NoETL'. In this follow-up I consider five examples, with capsule reviews of same-but-different approaches at Tableau, Attivio, FirstRain, Google, and Extractiv. Each example illustrates paths to the new data integration.
Tableau: Easy Exploration
No BI vendor better embodies the DIY spirit than Tableau Software. The company's visual, exploratory data analysis software lets end users delve into structured data sources and share and publish analyses. By "structured data sources," I mean anything ranging from Excel spreadsheets to very large databases managed with high-end data-warehousing systems. Tableau's power and ease of use have won the company an enthusiastic following.
Tableau's Data Blending capability, new in November's Tableau 6.0 release, caught my attention. The software will not only suggest joins for data fields across sources, by name and characteristics; according to Dan Jewett, Tableau VP of Product Management, it will also aggregate values, for instance rolling up months to quarters, to facilitate fusing like data stored at different aggregation levels.
The software also supports "alias values" for use in blending relationships. For instance, it can match state names to abbreviations, part numbers to part names, and coded values such as 0 and 1 for "male" and "female."
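The mechanics behind this kind of blending can be illustrated with a short sketch. The code below is not Tableau's implementation; it is a minimal, hypothetical example (with made-up sales and budget figures) of the two operations described above: rolling monthly values up to quarters, and resolving alias values such as state abbreviations, before joining data held at different aggregation levels.

```python
from collections import defaultdict

# Hypothetical sample data: monthly actuals and quarterly budget targets.
monthly_sales = [("2011-01", "CA", 100), ("2011-02", "CA", 120),
                 ("2011-03", "CA", 90), ("2011-04", "CA", 110)]
quarterly_budget = {("2011-Q1", "California"): 300, ("2011-Q2", "California"): 320}

# Alias table mapping state abbreviations to the full names the budget sheet uses.
STATE_ALIASES = {"CA": "California", "NY": "New York"}

def month_to_quarter(ym):
    """Roll a YYYY-MM key up to a YYYY-Qn key."""
    year, month = ym.split("-")
    return f"{year}-Q{(int(month) - 1) // 3 + 1}"

def blend(monthly, budget, aliases):
    """Aggregate monthly rows to quarters, resolve aliases, then join to budget."""
    rollup = defaultdict(int)
    for ym, state, amount in monthly:
        rollup[(month_to_quarter(ym), aliases[state])] += amount
    return {key: (actual, budget.get(key)) for key, actual in rollup.items()}

print(blend(monthly_sales, quarterly_budget, STATE_ALIASES))
# ('2011-Q1', 'California') maps to (310, 300): actuals vs. budget
```

The point of the sketch is that neither source had to be restructured in advance; the aggregation and alias resolution happen at blend time.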
Usage scenarios include comparing budget and sales projections to actuals, matching spreadsheet-held values against corporate records. The software also supports blending of external-source information into corporate data.
"Marketing organizations often get data feeds from suppliers and partners they want to join in with the in-house CRM system data," Jewett explains. "These are often ad-hoc feeds, so structured processes that IT likes to build don't support this case."
Tableau can pull data from Web sources via application programming interfaces (APIs) adhering to the Open Data Protocol (OData) standard. This capability will help users keep up with the growing volume of online data.
Tableau, like the vast majority of BI applications, works exclusively with "structured" data. That focus must and will change as users confront an imperative to tap online and social sources, via search- and text-analytics enhanced BI.
Attivio: Universal and Unified
Enterprise search and BI have each been around for decades, largely operating in information silos, one restricted to documents and the other to data collected from operational and transactional systems. Attivio's aim, dating to the company's 2007 founding by refugees of FAST (a Microsoft subsidiary since 2008), has been to break down the database-document barrier by providing a search interface that relies on a single, unified index. Attivio delivers results in familiar BI dashboards and analysis widgets.
Attivio pulls data from a wide variety of disparate sources: files and databases, as well as e-mail, content-management, and enterprise-application systems, via APIs and connectors supplied by the company and its partners.
The Attivio Active Intelligence Engine (AIE) will extract content (text, metadata, structure information), manipulate it, enrich it, and link or join it. "Enrichment components such as sentiment and entity extraction and classification can be used to add intelligence to the integration process," says company co-founder and CTO Sid Probstein. "They require some setup work, mostly training on the customer's data."
Attivio performs "dynamic schema creation" based on discovered data values and types, and "we have a number of components that identify and report on integration opportunities after a small data set is processed," Probstein says.
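To make "dynamic schema creation" concrete, here is a minimal sketch of the general idea, not Attivio's actual engine: scan a small sample of records, guess each field's type from its observed values, and fall back to plain text when values conflict. The sample records and type names are my own illustrative assumptions.

```python
from datetime import datetime

def infer_type(value):
    """Guess a type for one string value: integer, float, date, or text."""
    for cast, name in ((int, "integer"), (float, "float")):
        try:
            cast(value)
            return name
        except ValueError:
            pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"

def infer_schema(records):
    """Derive a field->type mapping from a small sample of records.
    A field keeps a specific type only if every observed value agrees."""
    schema = {}
    for record in records:
        for field, value in record.items():
            t = infer_type(value)
            if schema.setdefault(field, t) != t:
                schema[field] = "text"  # conflicting guesses fall back to text
    return schema

sample = [{"id": "101", "posted": "2011-01-08", "title": "Q4 results"},
          {"id": "102", "posted": "2011-01-09", "title": "2011 outlook"}]
print(infer_schema(sample))  # {'id': 'integer', 'posted': 'date', 'title': 'text'}
```

Profiling a small sample first, as Probstein describes, is what lets this kind of system surface "integration opportunities" before any full load.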
Attivio AIE's dynamic schemas support ad-hoc integration of diverse data, but it is by no means the only credible search-BI technology on the market. Endeca's Information Access Platform (IAP) uses similar techniques to provide similar capabilities, targeting online and mobile commerce and publishing in addition to search-BI. Other, specialized platforms adapt these integration techniques to focused business problems and information domains.
FirstRain Senses Time
FirstRain is a business-information search and monitoring tool that mines and integrates information from the open Web -- news, blogs, and industry, government, scientific, and academic sources -- in addition to a set of key corporate-information databases. The aim, per the company's Web site, is to "derive relationships, spot changes in management or business structure, and track trends across industries."
"The application of semantic analysis that is 'business structure aware' is crucial to be able to identify and deliver relevant business information that is scattered throughout [disparate] sources," says the company's technology vice president, Marty Betz. Also crucial is the ability to synthesize time sequence from pages found on the open Web.
(Time sequence is important! Indeed, the number-three result returned by a Google search for "us senator pennsylvania" was now-former Senator Arlen Specter's now-disappeared Senate Web page.)
"By analyzing the flow of content through our pipeline, the system can dynamically model and adjust its understanding of the market ecosystems around companies and industries," Betz says.
Betz describes the use of trending and anomaly detection, applied to unstructured narrative content from a variety of sources, to enable a different class of questions to be systematically asked and analyzed -- questions whose answers require "connecting the dots."
So in FirstRain we have broad-but-selective content acquisition and integration, with the application of goal-relevant organizing principles, to respond to a high-value business need: timely access to corporate developments.
Google Sees Similarities
Google is, of course, king of open-Web mining, aiming to index the Internet-accessible universe. By adopting search facets like Endeca's, exploiting metadata, structure, and context in crawled content (for instance, page/document type, publication date, and localization), and applying content-analytics techniques that include sentiment analysis, Google is transforming itself from a search engine into an information-access provider.
Rather than subject readers to a disquisition on Google, I'll point you to a 2009 paper that explains technical ins and outs of Data Integration for Many Data Sources using Context-Sensitive Similarity Metrics. Quoting the introduction:
"An important step in integrating heterogeneous datasets is determining a mapping between objects from one source and objects from another source, a step variously known as record linkage, matching, and deduping (among other terms) in the literature. One useful matching strategy is to use an appropriately thresholded similarity function, i.e., to consider objects as identical if they are 'similar enough'."
The Google authors talk of applications such as "merging the catalogs of many merchants" -- clearly their interest is in improving Google as an online comparison-shopping tool. Like FirstRain, they make application-appropriate assumptions: in this paper, that a catalog from a given merchant won't list a given product more than once.
The authors discuss "soft joins" based on statistical similarity measures, and they also see advantages if you can use "hard identifiers -- strings that can act as clear and unambiguous identifiers." They offer ISBN (publication number), UPC (product code), and Web URL examples.
A Uniform Resource Locator (URL) is a Uniform Resource Identifier (URI) that specifies an access mechanism such as HTTP or FTP. And URIs are the key to Linked Data integration across the nascent Semantic Web.
Extractiv Spots Entities
Extractiv is a new company and service that joins Web-crawling technology from 80legs with semantic annotation and analysis software from Language Computer Corporation. Extractiv delivers software-as-a-service (SaaS) text analytics that identify "entities," such as names of individuals, companies, and places, as well as sentiment and relationships among entities found in source text.
A number of services, like Extractiv, support over-the-Web text (and sentiment) analysis; examples include Orchestr8's AlchemyAPI, Clarabridge, Evri, Lexalytics, OpenAmplify, Saplo, Thomson Reuters' OpenCalais, and Zemanta. Some offer access to the Linked Data Web. You can see this capability via Extractiv. Here, I am using a Web API to annotate a White House blog page on the recent Tucson, Arizona shootings. Click on Gabrielle Giffords in the center pane and scroll until you see the Details area on the right. That area links, via a URI for Rep. Giffords, to a DBpedia page on the congresswoman, whom I chose for this example in order to honor her. This is an example of content enrichment made possible by the New Data Integration.
The DBpedia database contains structured information (e.g., data tables) extracted from Wikipedia. Check out the Gabrielle Giffords page directly and you'll see strengths that derive from the use of integration-friendly Semantic Web formats and weaknesses that include the incompleteness and the sloppiness, let's call it, of the presented information.
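What makes DBpedia so integration-friendly is that its resource URIs follow Wikipedia's page-naming convention, so an entity label maps to a URI almost mechanically. The sketch below shows that mapping; a real entity linker, of course, must also disambiguate among candidate pages, which this toy function does not attempt.

```python
from urllib.parse import quote

def dbpedia_uri(label):
    """Construct a DBpedia resource URI from an entity label, following the
    Wikipedia naming convention: spaces become underscores, first letter
    capitalized. (A real linker would also disambiguate among candidates.)"""
    page = label.strip().replace(" ", "_")
    page = page[:1].upper() + page[1:]
    return "http://dbpedia.org/resource/" + quote(page)

print(dbpedia_uri("Gabrielle Giffords"))
# http://dbpedia.org/resource/Gabrielle_Giffords
```

Once an annotator emits such a URI, any consumer can join its output to DBpedia's structured tables -- content enrichment by shared identifier rather than by custom mapping code.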
Click around the Extractiv-annotated page a bit more and you'll see other strengths and weaknesses. The service resolves expressions (recognizing that "almost 40 years" is a time duration), coreferences (the use of "she" and "her" to refer to the representative), and relationships (correctly parsing "Her husband, Mark Kelly"). Yet it did not identify "Gabby Giffords" and "Gabrielle Giffords" as a single person.
The Integration Road Ahead
There's work to be done to improve every system that deals with diverse, complex data. As my New Data Integration examples show, however, a variety of companies have made very significant progress meeting technical and business challenges. Expect more of the same -- advances toward easy-to-use, application-embedded, end-user-focused integration capabilities -- in the year to come.