On Monday, Sept. 28, Cloudera unveiled RecordService, which provides unified security management across multiple Hadoop data access apps. In addition, the company detailed a second product called Kudu, a storage engine designed to support both fast analytics and fast data updates.
Kudu and RecordService are currently in beta. They are being offered for free as open source apps, and are to be donated to the Apache Software Foundation eventually.
Kudu is a high-speed storage engine that bridges HBase (an open source, non-relational database) and HDFS (Hadoop Distributed File System). "Kudu is the culmination of a three-year R&D effort," said Matt Brandwein, director of product marketing at Cloudera.
Without Kudu, HBase and HDFS are hobbled by limitations. HDFS cannot change data once it is written, though it can append data to files. Updating means deleting and re-adding the files, Brandwein said. HBase is designed for rapid updating, but "it's not good for analytics."
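The trade-off Brandwein describes can be illustrated with a toy model. The sketch below is not the HDFS or HBase API. The class and function names (AppendOnlyFile, update_by_rewrite, KeyedStore) are hypothetical and exist only to contrast a write-once, append-only file, where an update means rewriting the whole file, with a keyed store that can change a single row in place.

```python
class AppendOnlyFile:
    """Toy model of a write-once file: records can be appended, never modified."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Store a copy so callers cannot mutate a record after it is written.
        self._records.append(dict(record))

    def scan(self):
        # Full sequential read, returning copies of every record.
        return [dict(r) for r in self._records]


def update_by_rewrite(old_file, key, new_fields):
    """Simulate an update on append-only storage by building a replacement file."""
    new_file = AppendOnlyFile()
    for rec in old_file.scan():
        if rec["id"] == key:
            rec = {**rec, **new_fields}
        new_file.append(rec)
    return new_file  # the caller would then discard old_file


class KeyedStore:
    """Toy model of a store built for fast single-row updates."""

    def __init__(self):
        self._rows = {}

    def upsert(self, key, fields):
        # Insert the row if it is missing, otherwise update it in place.
        self._rows.setdefault(key, {}).update(fields)

    def get(self, key):
        return dict(self._rows[key])


# Append-only: changing one record requires rewriting every record.
orders_file = AppendOnlyFile()
orders_file.append({"id": 1, "status": "new"})
orders_file.append({"id": 2, "status": "new"})
rewritten_file = update_by_rewrite(orders_file, 1, {"status": "shipped"})

# Keyed store: the same change touches a single row.
orders_store = KeyedStore()
orders_store.upsert(1, {"id": 1, "status": "new"})
orders_store.upsert(1, {"status": "shipped"})
```

In this model the rewrite touches every record to change one, while the keyed store touches only the row being updated. That is the gap a single engine handling both updates and analytics would need to close.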
Kudu "enables the combination of updating and analytics," he said. It also simplifies Hadoop architecture by collapsing two workloads into one, while still keeping the strengths of HDFS (storage) and HBase (building online applications). Bridging the two makes it possible to build applications such as a real-time online dashboard.
RecordService provides consistent security management across different data access apps, like Spark, Hive, and Impala. Without RecordService, each of these engines enforces its own set of security guarantees. Impala and Hive require control of "fine-grained data," while Spark gets by on coarser data security over rows and columns, Brandwein explained.
To solve this challenge, RecordService "sits between storage in Hadoop and accesses all engines in Hadoop." It brokers data requests, looking up permissions in Apache Sentry and presenting only the data the user is allowed to see. "In effect, it brings universal access control and enforcement to the system."
As a result, there are no loopholes a person could exploit by switching from one data access engine to another. Every request must follow the same pathway, passing through RecordService's filter.
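The brokering idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not RecordService's or Apache Sentry's actual API. The names Policy, AccessBroker, sql_style_engine, and batch_style_engine, along with the sample table and user, are all hypothetical. The point is only that when every engine reads through one broker, every engine sees the same filtered rows and columns.

```python
from typing import Callable, Dict, Iterator, List, NamedTuple, Set, Tuple

Row = Dict[str, object]


class Policy(NamedTuple):
    """Hypothetical permission entry: readable columns plus a row-level filter."""
    columns: Set[str]
    row_filter: Callable[[Row], bool]


class AccessBroker:
    """Toy stand-in for a layer that sits between storage and the engines."""

    def __init__(self, tables: Dict[str, List[Row]],
                 policies: Dict[Tuple[str, str], Policy]) -> None:
        self._tables = tables
        self._policies = policies

    def read(self, user: str, table: str) -> Iterator[Row]:
        # Look up the permission for this user and table; deny if none exists.
        policy = self._policies.get((user, table))
        if policy is None:
            raise PermissionError(f"{user} may not read {table}")
        for row in self._tables[table]:
            if policy.row_filter(row):
                # Return only the columns the policy allows.
                yield {c: v for c, v in row.items() if c in policy.columns}


def sql_style_engine(broker: AccessBroker, user: str) -> List[Row]:
    # Stand-in for an interactive query engine.
    return list(broker.read(user, "orders"))


def batch_style_engine(broker: AccessBroker, user: str) -> List[Row]:
    # Stand-in for a batch engine; same broker, so the same restrictions apply.
    return sorted(broker.read(user, "orders"), key=lambda r: r["id"])


tables = {
    "orders": [
        {"id": 1, "region": "EU", "card": "4111"},
        {"id": 2, "region": "US", "card": "4222"},
    ]
}
policies = {
    ("alice", "orders"): Policy({"id", "region"}, lambda r: r["region"] == "EU"),
}
broker = AccessBroker(tables, policies)

sql_rows = sql_style_engine(broker, "alice")
batch_rows = batch_style_engine(broker, "alice")
```

Both stand-in engines return the same restricted view for the sample user, and a user with no policy entry is refused regardless of which engine issues the request.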
Hadoop customers want to store and analyze data on one platform, and use one architecture instead of different architectures on different servers. Completing that single platform is the challenge. "The pieces are there," Brandwein said. It is more a question of Hadoop reaching maturity, where those pieces are all in their proper place, working together.
"Hadoop is rapidly completing. I don't think we are there yet," he said. "The vision is not Hadoop being another database. We are reinventing how analytics are done."
Hadoop began life as a way to store and process big data. Its most common use is ETL (extract, transform, and load), according to a recent study done by AtScale. Now the goal is to provide "an end-to-end analysis chain," collecting data in one place and working with it in multiple ways, Brandwein said.