Oracle's new big data tool won't cover all the analysis bases, but it will enable SQL-savvy professionals to query Hadoop and NoSQL sources.
Oracle announced last week that it will open up access to Hadoop and NoSQL data with Oracle Big Data SQL, a feature to be added to the Oracle Big Data Appliance in the third quarter. The new tool has some limitations, as this article describes, but the good news is that it will enable Oracle Database shops to take better advantage of big data using existing skills and expertise.
We were on the right track last week when we compared Oracle Big Data SQL to Teradata Query Grid and Microsoft PolyBase. All three technologies are about SQL querying across databases and big data platforms, and all three ultimately move data to the vendor's respective SQL database. But there are differences under the hood that will matter to Oracle customers. We'll get to these nuances in a moment, but what's encouraging is that Oracle is not presenting this SQL tool like a hammer and all big-data-analysis challenges like nails. The idea is simply to enable SQL-trained professionals to do as much as possible with information from Hadoop and NoSQL sources from the familiar environs of Oracle Database.
Like many Oracle customers, we watched last week's Oracle Big Data SQL launch presentation and heard about all the advantages of this feature. In a follow-up interview with Oracle executives Dan McClary, product manager, and Neil Mendelson, VP of product management, we asked about limitations and got more detail on how this feature works. We also got a frank assessment of what Oracle Big Data SQL can and can't do. For example, McClary and Mendelson were clear in saying that Oracle Big Data SQL is not a SQL-on-Hadoop tool intended to replace Hive, Impala, or other analysis options that operate exclusively on Hadoop.
Oracle Big Data SQL was used to create this geospatial correlation of Twitter sentiment data stored on Hadoop with customer profitability data managed in Oracle Database.
Here, then, are five key points would-be customers should know about Oracle Big Data SQL:
1. Access is limited to Oracle's appliance, Cloudera's software, and, at first, Oracle NoSQL Database. Oracle Big Data SQL is a feature of the Oracle Big Data Appliance, so that's the only place it can run. At this point it's not planned to be available as stand-alone software for use with Hadoop deployed on non-Oracle hardware. What's more, Oracle execs said there are no plans to make it run on any Hadoop distribution other than Cloudera -- the software bundled with the Oracle Big Data Appliance.
On the NoSQL side, the feature will initially work only with the Oracle NoSQL Database, which is the other software component in the Oracle Big Data Appliance bundle. Here, at least, there are plans to open up access to non-Oracle products, including Cassandra, HBase, and MongoDB.
"The Hadoop community has been very good about coming up with data storage handlers for Hive, so we'll use those to consume data from a number of other NoSQL data stores," McClary explained. This move is "at the top of our list," he said, but it will have to wait for a subsequent release of Oracle Big Data SQL.
The sooner Oracle can add support for the most popular NoSQL databases the better. Teradata Query Grid, by contrast, offers direct access to MongoDB. As for the limitation of working only with the Oracle Big Data Appliance and Cloudera software, we think Oracle should rethink this approach, as many companies have deployed Cloudera and other Hadoop distributions without using Oracle's appliance. Teradata Query Grid and Microsoft PolyBase are not limited to specific big data appliances or Hadoop distributions. Why not bundle Oracle Big Data SQL with Oracle Database instead of the Big Data Appliance?
2. Oracle Smart Scan minimizes data movement. Oracle made a virtue of necessity when it developed the Smart Scan feature for the Exadata appliance. The technology gave Oracle the power of distributed processing at a storage-tier level, boosting scalability without changing Oracle Database itself.
Smart Scan effectively prescreens data on the storage tier and brings only that which is relevant up to the database level. Oracle Big Data SQL will run Smart Scan on Hadoop using the metadata generated by Hive. Once again the feature minimizes data movement, in this case from Hadoop to Oracle Database.
During Oracle's launch presentation, McClary shared the example of correlating Twitter data from Hadoop with customer transaction data in Oracle Database. Smart Scan first filtered out Tweets without discernible sentiments, eliminating more than 50% of the original data, and it then eliminated Tweets that lacked latitude and longitude information. The final subset represented less than 1% of the total Twitter stream in Hadoop, cutting data movement to Oracle Database (and thus query time) by 99%. All of this was accomplished with a single SQL query, according to McClary, and the final result was visualized with a map (shown above) pinpointing sentiment correlated with sales profitability by location.
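To make the filter-first idea concrete, here is a minimal Python sketch of the pattern McClary described: screen rows where the data lives, then ship only the survivors to the database side for the join. This is an illustration of the general technique, not Oracle's implementation. The `Tweet` record layout, the function names, the sample data, and the profitability figures are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass
class Tweet:
    # Hypothetical record layout; the real schema was not described.
    customer_id: int
    sentiment: Optional[float]  # None means no discernible sentiment
    lat: Optional[float]
    lon: Optional[float]


def storage_tier_scan(tweets: Iterable[Tweet]) -> Iterator[Tweet]:
    """Filter rows on the storage side so only relevant rows are moved."""
    for t in tweets:
        if t.sentiment is None:              # first filter: no sentiment
            continue
        if t.lat is None or t.lon is None:   # second filter: no location
            continue
        yield t


def correlate(
    moved: Iterable[Tweet], profitability: Dict[int, float]
) -> List[Tuple[float, float, float, float]]:
    """Database-side join of the moved rows with customer profitability."""
    return [
        (t.lat, t.lon, t.sentiment, profitability[t.customer_id])
        for t in moved
        if t.customer_id in profitability
    ]


def fraction_moved(total: int, moved: int) -> float:
    """Share of the original rows that had to leave the storage tier."""
    return moved / total if total else 0.0


def build_sample_tweets(n: int = 1000) -> List[Tweet]:
    """Made-up data: half lack sentiment, and only 1 in 200 has coordinates."""
    return [
        Tweet(
            customer_id=i,
            sentiment=0.5 if i % 2 == 0 else None,
            lat=40.0 if i % 200 == 0 else None,
            lon=-74.0 if i % 200 == 0 else None,
        )
        for i in range(n)
    ]


tweets = build_sample_tweets()
moved = list(storage_tier_scan(tweets))

# Hypothetical profitability table held on the database side.
profitability = {0: 120.0, 200: -15.0, 400: 80.0}
rows = correlate(moved, profitability)

print(f"moved {len(moved)} of {len(tweets)} rows "
      f"({fraction_moved(len(tweets), len(moved)):.1%})")
```

In this made-up sample, five of 1,000 rows survive both filters, so the join on the database side touches half a percent of the original data. In the real product, such filters would presumably be derived from the SQL query itself rather than hand-coded as they are here.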
Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...