Taking advantage of big data is often defined as the ability to wring business value from the huge volume, variety, and velocity of information becoming available. Yet a recent Gartner report on big data adoption shows that many organizations find the “variety” dimension of big data a much bigger challenge than volume or velocity. When asked which dimension of data their organizations struggle with most, 49% answered variety, 35% said volume, and 16% replied velocity.
It’s worth coming to grips with variety. The ability to combine multiple data sources into a whole that is greater than the sum of its parts can let the business glean new insights. So how can IT combine multiple data sources with different structures while retaining the flexibility to add new ones into the mix as you go along? How do you query the combined data? How do you look for patterns across subsets of data that came from different sources?
With a single, controlled dataset, a central schema can guide users and applications in how to find and use data. With an evolving mix of ad hoc data sources, constant updating of a central schema capable of tying them all together is impractical. While IT people in the NoSQL camp of the big data world might argue about schema-based processing vs. schema-less processing, RDF standards (sometimes referred to as "semantic web technologies") offer the perfect compromise: optional schemas.
When data is accessible using the simple RDF triples model, you can mix data from different sources and use the SPARQL query language to find connections and patterns with no need to predefine a schema. Leveraging RDF doesn't require data migration; instead, you can take advantage of middleware tools that dynamically make relational databases, spreadsheets, and other data sources available as triples. Schema metadata can be pulled dynamically from the data, and it is stored and queried the same way as the data itself.
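To make the triples idea concrete, here is a minimal sketch in plain Python. It is not SPARQL and not any real RDF library: tuples stand in for triples, the match and query helpers are hypothetical, and every identifier (the ex:, crm:, and sup: names) is invented for illustration. It only shows how data from two independently designed sources can be merged and queried as one set without declaring a schema first.

```python
# Illustrative sketch only: plain tuples stand in for RDF triples.
# All identifiers (ex:, crm:, sup:) are made up for this example.

def match(triples, pattern):
    """Yield variable bindings for one (subject, predicate, object) pattern.
    Terms that start with '?' are treated as variables."""
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated in one pattern must match the same value.
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield binding

def query(triples, patterns, binding=None):
    """Join several patterns, carrying bindings from one pattern to the next."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    # Substitute variables that earlier patterns have already bound.
    bound = tuple(binding.get(term, term) for term in first)
    for new_binding in match(triples, bound):
        yield from query(triples, rest, {**binding, **new_binding})

# Source 1: imagined as rows exposed from a relational customer database.
crm = {
    ("ex:cust42", "crm:name", "Acme Corp"),
    ("ex:cust42", "crm:region", "EMEA"),
}

# Source 2: imagined as rows exposed from a support-ticket spreadsheet.
tickets = {
    ("ex:ticket7", "sup:raisedBy", "ex:cust42"),
    ("ex:ticket7", "sup:severity", "high"),
}

# Merging is just a set union; no combined schema is declared anywhere.
merged = crm | tickets

# Find the names of customers who raised high-severity tickets.
results = list(query(merged, [
    ("?t", "sup:severity", "high"),
    ("?t", "sup:raisedBy", "?c"),
    ("?c", "crm:name", "?name"),
]))
names = sorted(r["?name"] for r in results)
print(names)
```

The query crosses both sources because they happen to share the identifier ex:cust42. In a real deployment, a triplestore and SPARQL would do this work, with middleware supplying the triples.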
Metadata can be as simple or as rich as you need (including the use of additional descriptive powers of OWL, the Web Ontology Language). It doesn't have to be an overarching specification that covers every structure that may come up; it can describe only the bits that interest you, and it can evolve. Some of these bits can describe relationships among other bits, so integration of a fifth data source into a set of four, for example, may require no more than describing how a few properties within the fifth one relate to some of the properties in the first four.
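The fifth-data-source scenario can be sketched the same way. This is a simplified illustration in plain Python, not a real RDFS or OWL reasoner: the infer_superproperties helper and the core:, lab:, and ex: names are hypothetical, and the string "rdfs:subPropertyOf" is used only as a label for the kind of property-to-property relationship that RDF schema vocabularies let you state. The point is that the mapping is itself a triple, stored alongside the data it describes.

```python
# Illustrative sketch only, not a real RDFS/OWL reasoner.
# All names (core:, lab:, ex:) are invented for this example.

SUBPROP = "rdfs:subPropertyOf"  # used here purely as a label

def infer_superproperties(triples):
    """Return the input triples plus every (s, q, o) implied by
    (p, subPropertyOf, q) together with (s, p, o)."""
    inferred = set(triples)
    changed = True
    while changed:  # repeat until no new triples appear
        changed = False
        mappings = [(s, o) for s, p, o in inferred if p == SUBPROP]
        for sub, sup in mappings:
            for s, p, o in list(inferred):
                if p == sub and (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred

# Data already integrated from the first four sources (abbreviated).
existing = {
    ("ex:visit1", "core:participant", "ex:p001"),
    ("ex:visit2", "core:participant", "ex:p002"),
}

# The fifth source names the same relationship differently.
fifth = {
    ("ex:sample9", "lab:patientRef", "ex:p001"),
}

# Integration is one extra triple describing how the properties relate.
mapping = {
    ("lab:patientRef", SUBPROP, "core:participant"),
}

combined = infer_superproperties(existing | fifth | mapping)

# A lookup written against the original property now also finds the new data.
records_for_p001 = sorted(
    s for s, p, o in combined
    if p == "core:participant" and o == "ex:p001"
)
print(records_for_p001)
```

Nothing in the fifth source was rewritten or migrated; only the one mapping triple was added, and it could be removed or refined later as the model evolves.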
W3C, the standards body behind RDF, highlights these important qualities by saying that “RDF has features that facilitate data merging even if the underlying schemas differ, and it specifically supports the evolution of schemas over time without requiring all the data consumers to be changed.”
Having worked with RDF since its original standardization in 2004, I have noticed how the growing challenges of big data variety are now speeding adoption of RDF-based approaches across several industries. For pharmaceutical companies, data variety increases the time and cost of developing drugs. Data sets from different clinical trials may not have been designed to be used together, but when they cover related subjects there are good reasons to aggregate them and look for connections. Adding in selections from the wide range of standardized taxonomies and ontologies in this field can bring more benefits, but also more challenges because of the different data models.
The RDF approach of selectively describing and mapping only the necessary parts of these data models gets the job done more quickly and in a way that is reusable. This means that life sciences researchers can start querying across data sets much sooner.
Like the life sciences, energy industries often seek insights from combinations of independently designed datasets. Oil and gas companies in particular have been using RDF-based technology to combine sets of production, exploration, and environmental data from different parties to create reports with flexibility not possible before.
Taking full advantage of semantic web technology still poses a few challenges, however.
Because the SPARQL query language and the specialized databases known as triplestores are still relatively young, the algorithms and other techniques that optimize them for speed still have progress to make. In the early days of relational databases, die-hard hierarchical and network database administrators insisted that although the extra levels of indirection in relational databases may have offered some nice flexibility, they cost too much in processing cycles. Since then, commercial and academic work to optimize relational databases has come far, and similar work is already under way at several universities and major software companies for SPARQL and triplestores.
Another challenge is the educational effort necessary for people to learn this new style of modeling. On the one hand, it doesn't always follow a straightforward deterministic procedure, like getting relational tables into third normal form. On the other hand, unlike object-oriented modeling, for example, it doesn't require you to determine all of your structures and relationships before you can start building working applications. You can start small and work up from there.
In fact, you can start as small as you want because this technology scales down as well as up. Getting Hadoop running on even a single node requires a lot of setup and configuration, but commercial and open source tools are available that let you work with RDF schemas and ontologies using just a few text files on your hard disk. Being standards-based, these tools are interoperable, and they can let you get your feet wet as you explore this promising new approach to combining a variety of data sources in ways that let you get more value out of that data.
Have you taken a shine to SPARQL? Tell me about it in the comments.