Commentary
1/11/2011
06:32 PM
Seth Grimes

5 Paths To The New Data Integration

Embedded, automatic and easy new approaches meet growing demands for do-it-yourself data analysis.

Google Sees Similarities

Google is, of course, king of open-Web mining, aiming to index the Internet-accessible universe. By adopting search facets like Endeca's, exploiting metadata, structure, and context in crawled content (for instance, page/document type, publication date, and localization), and applying content analytics techniques that include sentiment analysis, Google is transforming itself from a search engine into an information-access provider.

Rather than subject readers to a disquisition on Google, I'll point you to a 2009 paper that explains the technical ins and outs of Data Integration for Many Data Sources using Context-Sensitive Similarity Metrics. Quoting the introduction:

"An important step in integrating heterogeneous datasets is determining a mapping between objects from one source and objects from another source, a step variously known as record linkage, matching, and deduping (among other terms) in the literature. One useful matching strategy is to use an appropriately thresholded similarity function, i.e., to consider objects as identical if they are 'similar enough'."

The Google authors talk of applications such as "merging the catalogs of many merchants" -- clearly their interest is in improving Google as an online comparison-shopping tool. Like FirstRain, they make application-appropriate assumptions; in this paper, for instance, they assume that a catalog from a given merchant won't list a given product more than once.
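To make the "similar enough" idea concrete, here is a minimal sketch of thresholded matching between two merchant catalogs. It is not the paper's method: Python's standard `difflib.SequenceMatcher` stands in for the authors' context-sensitive metrics, and the function names, the 0.8 threshold, and the product titles are all assumptions chosen for illustration. The greedy assignment reflects the assumption that a catalog lists a given product at most once.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score, ignoring case (a stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_catalogs(catalog_a, catalog_b, threshold=0.8):
    """Pair product titles across two catalogs when they are 'similar enough'.

    Assumes each catalog lists a product at most once, so every title is
    used in at most one pair. The threshold value is hypothetical.
    """
    # Score every cross-catalog pair and keep those at or above the threshold.
    candidates = [
        (similarity(a, b), a, b)
        for a in catalog_a
        for b in catalog_b
    ]
    candidates = [c for c in candidates if c[0] >= threshold]

    # Greedy assignment: take the highest-scoring pairs first.
    candidates.sort(reverse=True)
    used_a, used_b, pairs = set(), set(), []
    for _score, a, b in candidates:
        if a not in used_a and b not in used_b:
            pairs.append((a, b))
            used_a.add(a)
            used_b.add(b)
    return pairs


if __name__ == "__main__":
    merchant_1 = ["Canon PowerShot S95", "Garden Hose"]
    merchant_2 = ["canon powershot s95", "USB Cable"]
    print(match_catalogs(merchant_1, merchant_2))
```

Raising the threshold trades missed matches for fewer false ones; where to set it depends on the data and the application.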

The authors discuss "soft joins" based on statistical similarity measures, and they also see advantages if you can use "hard identifiers -- strings that can act as clear and unambiguous identifiers." They offer ISBN (publication number), UPC (product code), and Web URL examples.
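The difference between the two approaches can be sketched as follows: try a hard identifier first and fall back to a soft join on titles only when no identifier matches. This is an illustration under stated assumptions, not the authors' algorithm. The record schema (`title`, optional `isbn`), the `link_records` and `normalize_isbn` names, the 0.85 threshold, and the sample records are all hypothetical, and `difflib` again stands in for a real statistical similarity measure.

```python
from difflib import SequenceMatcher


def normalize_isbn(isbn: str) -> str:
    """Strip hyphens and spaces so differently formatted ISBNs compare equal."""
    return "".join(ch for ch in isbn if ch.isalnum()).upper()


def link_records(left, right, threshold=0.85):
    """Link records by hard identifier where possible, else by soft join.

    Each record is a dict with a 'title' and an optional 'isbn'
    (a hypothetical schema). Returns (left_title, right_title, how) tuples.
    """
    # Index the right-hand source by its hard identifier.
    by_isbn = {
        normalize_isbn(r["isbn"]): r for r in right if r.get("isbn")
    }
    links = []
    for rec in left:
        isbn = rec.get("isbn")
        if isbn and normalize_isbn(isbn) in by_isbn:
            # Hard join: an exact, unambiguous identifier match.
            other = by_isbn[normalize_isbn(isbn)]
            links.append((rec["title"], other["title"], "hard"))
            continue

        # Soft join: pick the most similar title, if it clears the threshold.
        best, best_score = None, 0.0
        for cand in right:
            score = SequenceMatcher(
                None, rec["title"].lower(), cand["title"].lower()
            ).ratio()
            if score > best_score:
                best, best_score = cand, score
        if best is not None and best_score >= threshold:
            links.append((rec["title"], best["title"], "soft"))
    return links


if __name__ == "__main__":
    # Illustrative sample records only.
    left = [
        {"title": "Moby Dick", "isbn": "978-0-14-243724-7"},
        {"title": "Data Integration Handbook"},
        {"title": "Cooking Basics"},
    ]
    right = [
        {"title": "Moby-Dick; or, The Whale", "isbn": "9780142437247"},
        {"title": "data integration handbook"},
    ]
    for link in link_records(left, right):
        print(link)
```

The hard join succeeds even though the two titles differ substantially, which is the advantage the authors point to; the soft join is only as good as the similarity measure and threshold behind it.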

A Uniform Resource Locator (URL) is a Uniform Resource Identifier (URI) that specifies an access mechanism such as HTTP or FTP. And URIs are the key to Linked Data integration across the nascent Semantic Web.
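As a rough illustration of the distinction, the sketch below uses Python's standard `urllib.parse.urlparse` to read a URI's scheme and label the URI a URL when that scheme names an access mechanism. The `describe_uri` function, the short list of schemes, and the sample URIs are assumptions made for this example; the real classification is subtler than a scheme lookup.

```python
from urllib.parse import urlparse

# Schemes treated here as access mechanisms (a deliberately short,
# hypothetical list for illustration).
ACCESS_SCHEMES = {"http", "https", "ftp"}


def describe_uri(uri: str):
    """Return (scheme, kind), where kind is 'URL' if the scheme locates a resource."""
    scheme = urlparse(uri).scheme.lower()
    kind = "URL" if scheme in ACCESS_SCHEMES else "URI (name only)"
    return scheme, kind


if __name__ == "__main__":
    for uri in (
        "http://example.org/products/123",
        "ftp://example.org/catalog.csv",
        "urn:isbn:0451450523",
    ):
        print(uri, describe_uri(uri))
```

In Linked Data practice, HTTP URIs are favored precisely because the identifier doubles as a way to look the thing up.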

Extractiv Spots Entities

Extractiv is a new company and service that joins Web-crawling technology from 80legs with semantic annotation and analysis software from Language Computer Corporation. Extractiv delivers software-as-a-service (SaaS) text analytics that identify "entities," such as names of individuals, companies, and places, as well as sentiment and relationships among entities found in source text.

A number of services, like Extractiv, support over-the-Web text (and sentiment) analysis; examples include Orchestr8's AlchemyAPI, Clarabridge, Evri, Lexalytics, OpenAmplify, Saplo, Thomson Reuters' OpenCalais, and Zemanta. Some offer access to the Linked Data Web. You can see this capability via Extractiv. Here, I am using a Web API to annotate a White House blog page on the recent Tucson, Arizona shootings. Click on Gabrielle Giffords in the center pane and scroll until you see the Details area on the right. That area links, via a URI for Rep. Giffords, to a DBpedia page on the congresswoman, whom I chose for this example in order to honor her. This is an example of content enrichment made possible by the New Data Integration.

The DBpedia database contains structured information (e.g., data tables) extracted from Wikipedia. Check out the Gabrielle Giffords page directly and you'll see strengths that derive from the use of integration-friendly Semantic Web formats and weaknesses that include the incompleteness and the sloppiness, let's call it, of the presented information.

Click around the Extractiv-annotated page a bit more and you'll see other strengths and weaknesses. The service resolves expressions (recognizing that "almost 40 years" is a time duration), coreferences (such as the use of "she" and "her" to refer to the representative), and relationships (correctly parsing "Her husband, Mark Kelly"). Yet Extractiv did not identify "Gabby Giffords" and "Gabrielle Giffords" as a single person.

The Integration Road Ahead

There's work to be done to improve every system that deals with diverse, complex data. As my New Data Integration examples show, however, a variety of companies have made significant progress in meeting technical and business challenges. Expect more of the same -- advances toward easy-to-use, application-embedded, end-user-focused integration capabilities -- in the year to come.

Seth Grimes is an analytics strategist with Washington, DC-based Alta Plana Corporation and chair of the Sentiment Analysis Symposium.
