Big Data: How To Pick Your Platform
Hadoop? A high-scale relational database? NoSQL? Event-processing technology? One size doesn't fit all. Here's how to decide.
Feel stuck in neutral? Don't worry. Big data success stories tend to start slowly, for two reasons.
First, there's the drag exerted by relational database administrators who badly want to stick to what they know. Second, big data problems have just as much to do with changing how you do data querying and processing as they do with handling the oft-cited "three V's" -- the big data parameters of volume, variety, and velocity. The good news is that once you pick up some steam, big data opens the door to business possibilities you hadn't even considered, and it starts to generate its own momentum.
Here's how to get unstuck: Consider the problems you're trying to solve with relational databases and whether other technologies might be more appropriate from a feature perspective. Tackle the limits around the "three V's." And start exploring comprehensive data platforms that can take you beyond simply knowing what a customer is doing to understanding why.
Predict the click
A typical first foray into big data is clickstream analysis: combing through massive amounts of log or event data to identify causal patterns. What are the top three things mobile users do immediately before they uninstall your app? Can IT spot suspicious behavior in server logs before someone steals data? How do you detect changes in sensor output that are significant enough to trigger dispatching a technician?
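Take the sensor question. One common approach, independent of any particular product, is to compare each new reading against a rolling baseline and flag statistically large deviations. Here's a minimal Python sketch; the window size, z-score threshold, and dispatch rule are illustrative assumptions, not a recommendation:

```python
from collections import deque
from statistics import mean, stdev

WINDOW = 50        # illustrative: how many past readings form the baseline
THRESHOLD = 3.0    # illustrative: how many standard deviations count as "significant"

def significant_changes(readings):
    """Yield (index, value, z_score) for readings that break from the rolling baseline."""
    baseline = deque(maxlen=WINDOW)
    for i, value in enumerate(readings):
        if len(baseline) == WINDOW:
            mu, sigma = mean(baseline), stdev(baseline)
            if sigma > 0 and abs(value - mu) / sigma > THRESHOLD:
                yield i, value, (value - mu) / sigma
        baseline.append(value)

# Synthetic demo: a steady signal with one abrupt shift that should raise a flag.
signal = [100.0 + 0.5 * (i % 3) for i in range(200)]
signal[150] = 130.0
for idx, val, z in significant_changes(signal):
    print(f"reading {idx}: value={val:.1f}, z={z:+.1f} -> consider dispatching a technician")
```

In production this logic runs continuously over a stream of readings rather than over a list in memory, which is where event-processing platforms come in.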
We've used relational databases to tackle questions like these for many years, both directly and through enterprise data warehouses. GE Power & Water, for example, has monitored industrial turbines and used that data to predict maintenance needs for a decade.
However, a conventional data warehouse gets you only so far once you rack up 100 million hours of operating and maintenance data across 1,700 turbines, and it won't let you mash all that up with external data, such as weather information, to predict failures. Speaking at the 2014 InformationWeek Conference, GE Capital CIO Jim Fowler (then the Power & Water CIO) said that investments in new platforms such as Hadoop and NoSQL databases, used to crunch those external sources alongside the terabyte of data per day spinning off each sensor-equipped turbine, should net $66 billion in savings over the next 15 years.
"We've seen the cartel of database vendors broken up, and some great new entrants give us new capabilities that we've never had before at a cost that we've never seen," Fowler said, specifically calling out MongoDB, Talend, and Pivotal, in which GE has invested.
Those savings and that cartel breakup are key, as we'll discuss.
Volume isn't the only challenge, though. Using relational databases to find out "why" is also hard because of the sheer amount of work it takes to formulate, ask, and answer the questions.
For example, to create useful queries about website clickstreams and user application activity logs, you need to "sessionize" the data -- that is, take data in which every row is an event and group together all events from a single "session," so you can ask what happened prior to a particular type of event, such as your mobile app getting uninstalled or a turbine going offline. HP Vertica and Hadoop have offered sessionization features for several years; ParAccel (which underpins Redshift from Amazon Web Services) introduced it last year; and as of earlier this year, Oracle 12c is on board.
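To make sessionization concrete, here's a minimal sketch in Python rather than in any of those products' SQL dialects. The events.csv file, its column names, and the 30-minute session gap are all illustrative assumptions: sort a flat event log by user and time, start a new session whenever the gap between consecutive events exceeds the cutoff, then count what immediately precedes an uninstall.

```python
from collections import Counter
from csv import DictReader
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # illustrative cutoff between sessions

# Hypothetical flat log, one row per event: user_id, timestamp (ISO 8601), event
events = []
with open("events.csv") as f:
    for row in DictReader(f):
        events.append((row["user_id"], datetime.fromisoformat(row["timestamp"]), row["event"]))

# Sessionize: order by user and time, then start a new session whenever
# the gap since that user's previous event exceeds SESSION_GAP.
events.sort(key=lambda e: (e[0], e[1]))
sessions = []
for user, ts, event in events:
    last = sessions[-1] if sessions else None
    if last and last["user"] == user and ts - last["last_ts"] <= SESSION_GAP:
        last["events"].append(event)
        last["last_ts"] = ts
    else:
        sessions.append({"user": user, "last_ts": ts, "events": [event]})

# The "why" question: what most often happens right before an uninstall?
preceding = Counter()
for s in sessions:
    ev = s["events"]
    preceding.update(ev[i - 1] for i, e in enumerate(ev) if e == "uninstall" and i > 0)

for event, count in preceding.most_common(3):
    print(f"{event}: seen immediately before uninstall {count} times")
```

The built-in sessionization features mentioned above express this same grouping in SQL, so the heavy lifting happens inside the database engine instead of in client code.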