Big Data // Big Data Analytics
Commentary
12/9/2013 12:00 AM
Irene Polikoff

Variety's The Spice Of Life – And Bane Of Big Data

Advances in hardware, especially memory technology, make volume and velocity easier to deal with all the time. But that doesn't help with variety. For that, you need RDF and optional schemas.

Taking advantage of big data is often defined as the ability to wring business value from the huge volume, variety, and velocity of information becoming available. Yet a recent Gartner report on big data adoption shows that many organizations find the “variety” dimension of big data a much bigger challenge than volume or velocity. When asked about the dimensions of data organizations struggle with most, 49% answered variety, while 35% said volume, and 16% replied velocity.

It’s worth coming to grips with variety. The ability to combine multiple data sources into a whole that is greater than the sum of its parts can let the business glean new insights. So how can IT combine multiple data sources with different structures while retaining the flexibility to add new ones into the mix as you go along? How do you query the combined data? How do you look for patterns across subsets of data that came from different sources?

With a single, controlled dataset, a central schema can guide users and applications in how to find and use data. With an evolving mix of ad hoc data sources, constantly updating a central schema to tie them all together is impractical. While IT people in the NoSQL camp of the big data world might argue over schema-based vs. schema-less processing, RDF standards (sometimes referred to as "semantic web technologies") offer the perfect compromise: optional schemas.

When data is accessible using the simple RDF triples model, you can mix data from different sources and use the SPARQL query language to find connections and patterns with no need to predefine a schema. Leveraging RDF doesn't require data migration; middleware tools can dynamically expose relational databases, spreadsheets, and other data sources as triples. Schema metadata can be pulled dynamically from the data, and it is stored and queried in exactly the same way as the data itself.
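To make this concrete, here is a minimal sketch using rdflib, an open source Python library for working with RDF. The vocabulary URIs and sample data below are invented for illustration; the point is that merging two independently structured sources is just a matter of parsing both into one graph, after which a SPARQL query can join across them without any predefined schema.

# A minimal sketch using the open-source rdflib library (pip install rdflib).
# The example vocabularies and sample data are invented for illustration.
from rdflib import Graph

# Triples from two independently designed sources, expressed in Turtle.
source_a = """
@prefix crm: <http://example.com/crm/> .
crm:cust42 crm:name "Acme Corp" ;
           crm:region "EMEA" .
"""

source_b = """
@prefix erp: <http://example.com/erp/> .
erp:order7 erp:customerName "Acme Corp" ;
           erp:total 1250.00 .
"""

# Merging is just parsing both into one graph -- no shared schema required.
g = Graph()
g.parse(data=source_a, format="turtle")
g.parse(data=source_b, format="turtle")

# A SPARQL query can look for connections across the combined triples,
# here joining the two sources on matching name values.
results = g.query("""
    PREFIX crm: <http://example.com/crm/>
    PREFIX erp: <http://example.com/erp/>
    SELECT ?customer ?region ?total WHERE {
        ?customer crm:name ?n ;
                  crm:region ?region .
        ?order erp:customerName ?n ;
               erp:total ?total .
    }
""")
for row in results:
    print(row.customer, row.region, row.total)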

Metadata can be as simple or as rich as you need (including the use of additional descriptive powers of OWL, the Web Ontology Language). It doesn't have to be an overarching specification that covers every structure that may come up; it can describe only the bits that interest you, and it can evolve. Some of these bits can describe relationships among other bits, so integration of a fifth data source into a set of four, for example, may require no more than describing how a few properties within the fifth one relate to some of the properties in the first four.
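Here is a sketch of that idea, again with rdflib and invented example vocabularies. A single schema triple maps the new source's property onto one we already query, and a SPARQL property path follows that mapping at query time, so no data is migrated and no reasoner is required.

# A sketch of describing only the bits that interest you: one mapping
# triple relates a fifth source's property to a property we already use.
# All names below are invented for illustration.
from rdflib import Graph

data = """
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix a:    <http://example.com/sourceA/> .
@prefix e:    <http://example.com/sourceE/> .

a:p1 a:drugName "Examplin" .
e:p9 e:compound "Examplin" .

# The only integration work: one triple relating the new source's
# property to one from the existing sources.
e:compound rdfs:subPropertyOf a:drugName .
"""

g = Graph()
g.parse(data=data, format="turtle")

# ?p rdfs:subPropertyOf* a:drugName matches a:drugName itself plus
# anything mapped onto it, so both sources answer one query.
q = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX a: <http://example.com/sourceA/>
    SELECT ?subject ?name WHERE {
        ?p rdfs:subPropertyOf* a:drugName .
        ?subject ?p ?name .
    }
"""
for row in g.query(q):
    print(row.subject, row.name)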

W3C, the standards body behind RDF, highlights these important qualities by saying that “RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed.”

Having worked with RDF since its original standardization in 2004, I have noticed how the increasing challenges of big data variety are speeding adoption of RDF-based approaches across several industries. For pharmaceutical companies, data variety adds to the time and cost of developing drugs. Data sets from different clinical trials may not have been designed to be used together, but when they cover related subjects there are good reasons to aggregate them and look for connections. Adding in selections from the wide range of standardized taxonomies and ontologies in this field can bring more benefits, but also more challenges, because of the different data models.

The RDF approach of selectively describing and mapping only the necessary parts of these data models gets the job done faster and in a reusable way, which means that life sciences researchers can ultimately query across data sets much more quickly.

Like the life sciences, energy industries often seek insights from combinations of independently designed datasets. Oil and gas companies in particular have been using RDF-based technology to combine sets of production, exploration, and environmental data from different parties to create reports with flexibility not possible before.

Taking full advantage of semantic web technology still has a few challenges, however.

Because the SPARQL query language and the specialized databases known as triplestores are still relatively young, the algorithms and other techniques that optimize them for speed still have progress to make. In the early days of relational databases, die-hard hierarchical and network database administrators insisted that although the extra levels of indirection in relational databases offered some nice flexibility, they cost too much in processing cycles. Since then, commercial and academic work to optimize relational databases has come far, and similar work is already under way for SPARQL and triplestores at several universities and major software companies.

Another challenge is the educational effort required for people to learn this new style of modeling. On one hand, it doesn't always follow a straightforward, deterministic procedure like getting relational tables into third normal form. On the other hand, unlike object-oriented modeling, for example, it doesn't force you to determine all of your structures and relationships before you can start building working applications. You can start small and work up from there.

In fact, you can start as small as you want because this technology scales down as well as up. Getting Hadoop running on even a single node requires a lot of setup and configuration, but commercial and open source tools are available that let you work with RDF schemas and ontologies using just a few text files on your hard disk. Being standards-based, these tools are interoperable, and they can let you get your feet wet as you explore this promising new approach to combining a variety of data sources in ways that let you get more value out of that data.
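As a sketch of how far down this scales, the same rdflib library will happily work from a couple of Turtle files on your hard disk, with no server or cluster involved (the file names below are placeholders):

# Scaled all the way down: an open-source library, a couple of Turtle
# files on disk, and no server or cluster. File names are invented.
from rdflib import Graph

g = Graph()
g.parse("schema.ttl", format="turtle")   # a small, optional ontology
g.parse("data.ttl", format="turtle")     # instance data from any source

print(f"{len(g)} triples loaded")

# Because the formats are standards-based, the same files round-trip
# through any conforming tool; here we re-serialize to another syntax.
g.serialize("combined.nt", format="nt")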

Have you taken a shine to SPARQL? Tell me about it in the comments.

Comments
SteveS489
12/21/2013 | 12:16:52 PM
OSLC takes a shining to SPARQL
Like your oil and gas example, as well as the benefits you quote from W3C, we have been exploiting this as one of the primary use cases when integrating various system and software development tools together, most notably in the work I've done with IBM and Open Services for Lifecycle Collaboration (OSLC). By embracing and leveraging the RDF data model within various tools, we are able to pull data from tools across various disciplines (requirements, architecture, modeling, resource planning, project management, incident management, ..) and provide comprehensive insight into the end-to-end process of system and software delivery. System and software development and delivery is just one more use case of application integration leveraging Linked Data and RDF.

To reinforce your points: having data is very important, but having data with some structure helps you understand it further. Data with some schema and semantics furthers the intelligence, and even the efficiency, of your queries and reports. Additionally, getting the data to leverage standards-based schemas and vocabularies raises the bar even further, easing work with the data through well-known meanings and semantics across data sources. Even if various data sources don't initially support the full depth of standards-based vocabularies, they can gradually support them, or another application can augment them by supplying the additional mappings.

- Steve Speicher