Commentary
7/21/2014
01:42 PM
Doug Henschen

Oracle Big Data SQL: 5 Key Points

Oracle's new big data tool won't cover all the analysis bases, but it will enable SQL-savvy professionals to query Hadoop and NoSQL sources.



Oracle announced last week that it will open up access to Hadoop and NoSQL data with Oracle Big Data SQL, a feature to be added to the Oracle Big Data Appliance in the third quarter. The new tool has some limitations, as this article describes, but the good news is that it will enable Oracle Database shops to take better advantage of big data using existing skills and expertise.

We were on the right track last week when we compared Oracle Big Data SQL to Teradata Query Grid and Microsoft PolyBase. All three technologies are about SQL querying across databases and big data platforms, and all three ultimately move data to the vendor's respective SQL database. There are differences under the hood that will matter to Oracle customers. We'll get to these nuances in a moment, but what's encouraging is that Oracle is not presenting this SQL tool as a hammer and every big-data-analysis challenge as a nail. The idea is simply to enable SQL-trained professionals to do as much as possible with information from Hadoop and NoSQL sources from the familiar environs of Oracle Database.

[Want more on the Spark option for big data analysis? Read Databricks Spark Plans: Big Data Q&A.]

Like many Oracle customers, we watched last week's Oracle Big Data SQL launch presentation and heard about all the advantages of this feature. In a follow-up interview with Oracle executives Dan McClary, product manager, and Neil Mendelson, VP of product management, we asked about limitations and got more detail on how this feature works. We also got a frank assessment of what Oracle Big Data SQL can and can't do. For example, McClary and Mendelson were clear in saying that Oracle Big Data SQL is not a SQL-on-Hadoop tool intended to replace Hive, Impala, or other analysis options that operate exclusively on Hadoop.

Oracle Big Data SQL was used to create this geospatial correlation of Twitter sentiment data stored on Hadoop with customer profitability data managed in Oracle Database.

Here, then, are five key points would-be customers should know about Oracle Big Data SQL:

1. Access is limited to Oracle's appliance, Cloudera's software, and, at first, Oracle NoSQL Database. Oracle Big Data SQL is a feature of the Oracle Big Data Appliance, so that's the only place it can run. At this point it's not planned to be available as stand-alone software for use with Hadoop deployed on non-Oracle hardware. What's more, Oracle execs said there are no plans to make it run on any Hadoop distribution other than Cloudera -- the software bundled with the Oracle Big Data Appliance.

The feature will also be limited to working with the Oracle NoSQL Database, which is the other software component in the Oracle Big Data Appliance bundle. Here, at least, there are plans to open up access to non-Oracle products, including Cassandra, HBase, and MongoDB.

"The Hadoop community has been very good about coming up with data storage handlers for Hive, so we'll use those to consume data from a number of other NoSQL data stores," McClary explained. This move is "at the top of our list," he said, but it will have to wait for a subsequently release of Oracle Big Data SQL.

The sooner Oracle can add support for the most popular NoSQL databases, the better. Teradata Query Grid, by contrast, offers direct access to MongoDB. As for the limitation of working only with the Oracle Big Data Appliance and Cloudera software, we think Oracle should rethink this approach, as many companies have deployed Cloudera and other Hadoop distributions without using Oracle's appliance. Teradata Query Grid and Microsoft PolyBase are not limited to specific big data appliances or Hadoop distributions. Why not bundle Oracle Big Data SQL with Oracle Database instead of the Big Data Appliance?

2. Oracle Smart Scan minimizes data movement. Oracle made a virtue of necessity when it developed the Smart Scan feature for the Exadata appliance. The technology gave Oracle the power of distributed processing at a storage-tier level, boosting scalability without changing Oracle Database itself.

Smart Scan effectively prescreens data on the storage tier and brings only that which is relevant up to the database level. Oracle Big Data SQL will run Smart Scan on Hadoop using the metadata generated by Hive. Once again the feature minimizes data movement, in this case from Hadoop to Oracle Database.

During Oracle's launch presentation, McClary shared the example of correlating Twitter data from Hadoop with customer transaction data in Oracle Database. Smart Scan first filtered out Tweets without discernible sentiments, eliminating more than 50% of the original data, and it then eliminated Tweets that lacked latitude and longitude information. The final subset represented less than 1% of the total Twitter stream in Hadoop, cutting data movement to Oracle Database (and thus query time) by 99%. All of this was accomplished with a single SQL query, according to McClary, and the final result was visualized with a map (shown above) pinpointing sentiment correlated with sales profitability by location.
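The single query McClary described might look roughly like the following sketch. Oracle did not publish the actual schema, so every table and column name here is hypothetical; the point is only to show the shape of a query in which the filters can be pushed down to the Hadoop storage tier:

```sql
-- Hypothetical sketch only; the real schema was not shown in the demo.
-- "tweets" stands for an external table over Hive-managed Twitter data
-- in Hadoop; "customers" is an ordinary Oracle Database table.
-- Smart Scan would evaluate the WHERE predicates at the storage tier,
-- so only the matching ~1% of tweets moves into Oracle Database.
SELECT c.region,
       AVG(t.sentiment_score) AS avg_sentiment,
       SUM(c.profit)          AS total_profit
FROM   tweets t
JOIN   customers c
       ON t.customer_id = c.customer_id
WHERE  t.sentiment_score IS NOT NULL   -- drop tweets with no discernible sentiment
AND    t.latitude  IS NOT NULL         -- drop tweets lacking geolocation
AND    t.longitude IS NOT NULL
GROUP  BY c.region;
```

From the application's point of view this is just one SQL statement; the data reduction happens transparently before the join.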



3. DBAs get Oracle Database-style security controls. So Oracle Big Data SQL opens up table-based access to the data in Hadoop, but with access comes risk. Thus, Oracle gave this feature a way to apply the same kinds of grants, permissions, and policies that DBAs apply when they set up Oracle Database. You might have an "analyst" role defined in Oracle Database that is allowed to see and query some columns but not others, while certain fields of data might be redacted.

"If I want to expose that group of analysts to a set of data that's in Hadoop, I can create an external table in Oracle Database over that data in Hadoop and grant whatever permissions and policies you deem appropriate," McClary explained.


4. Oracle Big Data SQL is not a SQL-on-Hadoop tool. This is an important distinction. Oracle Big Data SQL is not just a way to use Oracle SQL against Hadoop. It's a way to query Oracle Database, Hadoop, and NoSQL sources simultaneously.

"SQL on Hadoop is a great idea and we'll continue to ship solutions that provide that, including Impala, Hive, and future efforts to bring Hive on top of Spark," McClary said. "What we're trying to do here is solve a different and perhaps bigger problem, which is integrating big data with the rest of the enterprise architecture."

Oracle Big Data SQL is not a SQL-on-Hadoop option; it's a way to run SQL queries against Oracle Database, Hadoop, and NoSQL sources simultaneously.

Describing Oracle Big Data SQL as "democratizing big data" and "making it consumable by people outside of Silicon Valley," McClary said the point is bringing the value found in big data sources "home into the business." Home, in this case, means into Oracle Database, where it can be analyzed by the many SQL-savvy professionals instead of just a priesthood of PhD-level data scientists.

5. Oracle Big Data SQL will not do everything. It was refreshing to hear Oracle grant that not everything can be expressed or discerned through SQL. Options like Apache Spark and the R language, for example, support machine learning and advanced analytical data manipulations and workflows that are "all valid," said McClary. "There's a place for SQL in reasoning and operating on large sets of data and there's a place for other languages in doing what they're best suited to handle," he said.

The point of Oracle Big Data SQL is accessing and analyzing data in Hadoop and NoSQL sources without requiring a new set of people with a new set of skills. "It's not enough to have big data experiments, you have to be able to operationalize it," said Mendelson. "That means that the people who are used to running your systems need to be able to provide secure access not just to the privileged few, but potentially to everyone."


Doug Henschen is Executive Editor of InformationWeek, where he covers the intersection of enterprise applications with information management, business intelligence, big data and analytics. He previously served as editor in chief of Intelligent Enterprise, editor in chief of ...