Hadoop is steadily gaining adoption as an enterprise platform for capturing high-scale and highly variable data that's not easy or economically viable to store in relational databases. What's less clear is just how companies are going to analyze all this data.
A recent Forrester report declared that Hadoop is "no longer optional" for large enterprises. Our data suggests that train hasn't left the station yet: Only 4% of companies use Hadoop extensively, while 18% say they use it on a limited basis, according to our just-released 2015 InformationWeek Analytics, Business Intelligence, and Information Management Survey. That is up from 3% reporting extensive use and 12% reporting limited use of Hadoop in our survey last year. Another 20% plan to use Hadoop, though that still leaves 58% with no plans to use it.
But there's no doubt that interest in Hadoop is rising. The top draw is the platform's "ability to store and process semi-structured, unstructured, and variable data," cited by 31% of the 374 respondents to our survey involved with information management technology. Another 30% cited Hadoop's ability to handle "massive volumes of data," while 25% pointed to Hadoop's "lower hardware and storage scaling costs" compared with conventional relational database management systems.
That's the IT, data-management perspective on the need for Hadoop. But why is the business looking to capture and analyze big data in the first place? The top driver, cited by 48% of respondents using or planning to deploy data analytics, BI, or statistical analysis software, is finding correlations across multiple, disparate data sources, like Internet clickstreams, geospatial data, and customer-transaction data. Next in line are predicting customer behavior, cited by 46%, and predicting product or service sales, cited by 40% of respondents (multiple responses allowed, see chart below). Other motivations include predicting fraud and financial risks, analyzing social network comments for customer sentiment, and identifying security risks.
In each of these examples, companies are analyzing big data sets in search of insights they couldn't discover by parsing only the same old data they've long held in transactional systems. Capturing and analyzing clickstreams, server log files, social network streams, and geospatial data from mobile apps is a recent, big-data-era phenomenon for most organizations attempting it, and they're gaining insights and seeing correlations that just weren't available in the enterprise data warehouse.
But pulling insight out of this new data will require some new tools, ones that work alongside Hadoop -- which is, at its core, nothing more than a highly distributed file system. Here are the three categories of options associated with Hadoop, along with product examples.
Hadoop-native data-processing and analysis options: These include Apache Hive (provides SQL-like data access -- think data warehousing meets Hadoop); Apache Mahout (supports machine learning on top of Hadoop -- think finding patterns in data); Apache MapReduce (for searching, filtering, sorting, and other forms of processing large data sets in Hadoop -- ways to boil down really big data to find the useful nuggets); and Apache Pig (a language for writing MapReduce jobs).
Alternative SQL access/analysis options: Hive is slow by relational database standards, and it doesn't support all SQL-analysis capabilities. These alternatives are designed to make BI professionals feel more at home, giving them accustomed performance, SQL- or SQL-like querying, and compatibility with current BI tools. Examples include Actian Analytics Platform SQL Hadoop Edition, Apache Drill, Cloudera Impala, HP Vertica For SQL on Hadoop, IBM Big SQL, Microsoft SQL Server Polybase, Oracle Big Data SQL, Pivotal HAWQ, and Teradata Query Grid.
Analytics and BI options designed to run on Hadoop: These tools blend SQL and BI-type querying with big-data-oriented and advanced analytics capabilities. Examples include Apache Spark, Apache Storm, Datameer, Platfora, and SAS Visual Analytics. Many of these analysis engines now run on Hadoop 2.0's YARN resource-management system.
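To make the MapReduce idea from the first category concrete, here is a minimal, single-machine sketch of the pattern in plain Python. It is not Hadoop code and uses no Hadoop API. The log format, the sample records, and every function name (map_phase, shuffle, reduce_phase, page_hits) are assumptions made up for illustration. It shows the filter, group, and aggregate steps that a real job would spread across a cluster to boil a clickstream down to hit counts per page.

```python
from collections import defaultdict


def map_phase(log_lines):
    """Map: emit a (page, 1) pair for each successful request."""
    for line in log_lines:
        # Hypothetical log format: "<user> <page> <http_status>"
        user, page, status = line.split()
        if status == "200":  # filtering happens in the map step
            yield page, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's values into a single total."""
    return {key: sum(values) for key, values in groups.items()}


def page_hits(log_lines):
    """Run map, shuffle, and reduce, then sort by hits (descending)."""
    counts = reduce_phase(shuffle(map_phase(log_lines)))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented sample clickstream records, for illustration only.
sample_log = [
    "u1 /home 200",
    "u2 /home 200",
    "u1 /cart 200",
    "u3 /cart 404",
    "u2 /checkout 200",
    "u3 /home 200",
]

print(page_hits(sample_log))
```

On a real cluster the map and reduce steps would run in parallel on the nodes that hold the data, and the framework would handle the shuffle. The sketch only shows the shape of the computation.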
The first thing to note is that the SQL and SQL-like options -- including Hive, Impala, Drill, the various relational databases ported to run on Hadoop (Actian, HP, Pivotal), and the various SQL-access options (Microsoft, Oracle, Teradata) -- give you the basics of SQL query and analysis, but these are not alternatives to analytics workbenches or business intelligence suites. As noted, a key point of these query and access tools is making Hadoop compatible with incumbent SQL-connected products like BusinessObjects, Cognos, MicroStrategy, OBIEE, Tableau Software, and so on.
Businesses are demanding compatibility with tools that they already have on hand. This helps explain why there were so many SQL-on-Hadoop announcements from both Hadoop vendors (Cloudera, Hortonworks, MapR) and database incumbents (Actian, Hewlett-Packard, Oracle) over the last year.
But companies need more than SQL. The value in big data analysis is often in finding correlations among disparate data sets or insights hidden in semi-structured or highly variable data sources, such as
Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...