Big Data: How To Pick Your Platform - InformationWeek



Big Data: How To Pick Your Platform

Hadoop? A high-scale relational database? NoSQL? Event-processing technology? One size doesn't fit all. Here's how to decide.


Feel stuck in neutral? Don't worry. Big data success stories tend to start slowly, for two reasons.

First, there's the drag exerted by relational database administrators who badly want to stick to what they know. Second, big data problems have just as much to do with changing how you do data querying and processing as they do with handling the oft-cited "three V's" -- the big data parameters of volume, variety, and velocity. The good news is that once you pick up some steam, big data opens the door to myriad amazing business possibilities you hadn't even considered, and it starts to generate its own momentum.

Here's how to get unstuck: Consider the problems you're trying to solve with relational databases and whether other technologies might be more appropriate from a feature perspective. Tackle the limits around the "three V's." And start exploring comprehensive data platforms that can take you beyond simply knowing what a customer is doing to understanding why.

Predict the click
A typical first foray into big data involves analyzing massive amounts of log or event data to identify causal patterns, an approach known as clickstream analysis when the events are user clicks. What are the top three things mobile users do immediately before they uninstall your app? Can IT identify suspicious behaviors in server logs before someone steals data? How do you detect changes in sensor output that are significant enough to trigger dispatching a technician?

We've used relational databases to tackle these questions for many years, both directly and through enterprise data warehouses. GE Power & Water, for example, has monitored industrial turbines and used that data to predict maintenance needs for a decade.

However, a conventional data warehouse gets you only so far when you rack up 100 million hours of operating and maintenance data for 1,700 turbines, and it won't let you mash all that up with external data, such as weather information, to predict failures. GE Capital CIO Jim Fowler, speaking at the 2014 InformationWeek Conference (when he was still Power & Water CIO), said investments in new platforms, such as Hadoop and NoSQL databases, should net $66 billion in savings over the next 15 years. Those platforms crunch external sources together with the terabyte of data per day spinning off each of GE's sensor-equipped turbines.

"We've seen the cartel of database vendors broken up, and some great new entrants give us new capabilities that we've never had before at a cost that we've never seen," Fowler said, specifically calling out MongoDB, Talend, and Pivotal, in which GE has invested.

Those savings and that cartel breakup are key, as we'll discuss.

Volume isn't the only challenge, though. Using relational databases to find out "why" is also challenging because of the sheer amount of work it takes to formulate, ask, and answer questions.

For example, to create useful queries about website clickstreams and user application activity logs, you need to "sessionize" the data -- that is, take data in which every row is an event and group together all events from a single "session," so you can ask what happened prior to a particular type of event, such as your mobile app getting uninstalled or a turbine going offline. HP Vertica and Hadoop have offered sessionization features for several years; ParAccel (which underpins Redshift from Amazon Web Services) introduced it last year; and as of earlier this year, Oracle 12c is on board.
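To make the idea concrete, here is a minimal sketch of sessionization in plain Python. It is not how Vertica, Hadoop, Redshift, or Oracle implement the feature; it only illustrates the grouping step described above. The event layout (user ID, timestamp in seconds, event type), the 30-minute inactivity timeout, and the function names sessionize and events_before are all assumptions made for this illustration.

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Assumed inactivity timeout; the article does not specify one.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp, event_type) rows into per-user sessions.

    A new session starts whenever a user's next event arrives more than
    `gap` seconds after that user's previous event.
    """
    sessions = []
    ordered = sorted(events, key=itemgetter(0, 1))  # by user, then time
    for _user, rows in groupby(ordered, key=itemgetter(0)):
        current = []
        for row in rows:
            if current and row[1] - current[-1][1] > gap:
                sessions.append(current)
                current = []
            current.append(row)
        sessions.append(current)
    return sessions


def events_before(sessions, target, n=3):
    """Count event types seen in the n events preceding each `target` event."""
    counts = Counter()
    for session in sessions:
        for i, (_user, _ts, kind) in enumerate(session):
            if kind == target:
                counts.update(e[2] for e in session[max(0, i - n):i])
    return counts


# Hypothetical sample data: one row per event.
events = [
    ("u1", 0, "open"),
    ("u1", 60, "view_settings"),
    ("u1", 120, "uninstall"),
    ("u2", 0, "open"),
    ("u2", 5000, "open"),
    ("u2", 5030, "crash"),
    ("u2", 5060, "uninstall"),
]

sessions = sessionize(events)
print(events_before(sessions, "uninstall").most_common(3))
```

Once rows are grouped this way, "what happened before the uninstall?" becomes a simple lookup within each session rather than a self-join across the whole event table, which is the work a built-in sessionization feature saves you.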

To read the rest of this story, download the entire InformationWeek Tech Digest issue, distributed in an all-digital format (registration required).
Joe Emison is a serial technical cofounder, most recently with BuildFax, the nation's premier aggregator and supplier of property condition information to insurers, appraisers, and real estate agents. After BuildFax was acquired by DMGT, Joe worked with DMGT's portfolio ...
