The Apache Hadoop big-data platform is still adolescent, but Hadoop distributor Cloudera on Wednesday introduced a maturity milestone in the form of Cloudera Sentry, a new role-based security access control project that will enable companies to set rules for data access down to the level of servers, databases, tables, views and even portions of underlying files.
Hadoop already has provisions for perimeter security, with options including open-source Kerberos, Oozie and Knox for user authentication. But once users are in, Hadoop has lacked a way to define which users have access to what. That has left security-conscious organizations such as banks, insurance companies, healthcare organizations and government agencies with two bad options: restricting access to certain data sets to a select few users, or keeping certain types of data off Hadoop clusters entirely.
With Sentry, Cloudera says it can support four common security requests. First, security administrators can use Sentry to set specific access control privileges for authenticated users. Second, it provides for fine-grained access to subsets of data within files based on defined roles. A fine-grained view might let users see certain columns related to customers while preventing access to their financial information.
Third, role-based rules can be established whereby a fraud-detection group might get access to financial records whereas a business analyst group would not have access to that information. Finally, Sentry also supports multi-tenant security administration, which enables customers of service providers to set their own security controls without having to go through a higher-level administrator.
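To make the role-based model concrete, here is a minimal sketch in Python of how groups, roles and column-level privileges could fit together. It is an illustration only, not Sentry's actual API or policy format: the group names, role names, table and column names, and the helper functions `visible_columns` and `can_read` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Privilege:
    """Read access to a set of columns in one table (hypothetical model)."""
    table: str
    columns: frozenset


# Hypothetical roles: each role bundles one or more privileges.
ROLES = {
    "customer_read": [
        Privilege("customers", frozenset({"name", "region"})),
    ],
    "financial_read": [
        Privilege("customers", frozenset({"account_balance"})),
    ],
}

# Hypothetical mapping of user groups to roles.
GROUP_ROLES = {
    "business_analysts": ["customer_read"],
    "fraud_detection": ["customer_read", "financial_read"],
}


def visible_columns(group, table):
    """Return the union of columns the group's roles expose for a table."""
    columns = set()
    for role in GROUP_ROLES.get(group, []):
        for privilege in ROLES.get(role, []):
            if privilege.table == table:
                columns |= privilege.columns
    return columns


def can_read(group, table, column):
    """Deny by default: a column is readable only if some role grants it."""
    return column in visible_columns(group, table)
```

In this sketch, `can_read("fraud_detection", "customers", "account_balance")` is allowed while the same request from `business_analysts` is denied, which mirrors the fraud-detection versus business-analyst example above. A group with no roles sees nothing.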
"Sentry will enable our customers to store more sensitive data within Hadoop and open up access to information to more users knowing that they have control over more use cases and applications," said Justin Erickson, Cloudera's director of product management, in a phone interview with InformationWeek.
For now, Sentry works with Apache Hive, through HiveServer2, and Cloudera Impala, through a new Impala 1.1 release also announced Wednesday. Cloudera plans to go beyond Hive and Impala to extend security controls to other components of the Hadoop framework, according to Erickson. Hive and Impala were chosen as a starting point because they support SQL-style access to data, both directly by users and through business intelligence applications and ETL tools.
Hive is a well-established open-source query infrastructure that runs on top of Hadoop, but it's notoriously slow because it relies on MapReduce processing running behind the scenes. Impala is a Cloudera-developed, SQL-on-Hadoop component that supports direct querying of data in the Hadoop Distributed File System (HDFS) and HBase (NoSQL database) indexes. Cloudera says Impala querying is three to 30 times faster than Hive.
Cloudera has contributed Impala to the open-source community, but it's the only vendor likely to support it. For one thing, management and monitoring of Impala queries is something you do through Cloudera's subscription-based commercial management console. For another, all of Cloudera's rivals have introduced or are working on their own SQL-on-Hadoop tools. The list includes Hortonworks-supported Stinger, MapR-supported Drill, Pivotal's proprietary HAWQ engine and IBM-supported BigSQL.
Cloudera said Sentry, too, will be contributed to the open-source community and will be an Apache-licensed project. Cloudera isn't the only vendor working on Hadoop security, and this is an area where a consistent approach across all vendors will be crucial to Hadoop's long-term success. Hortonworks, Cloudera's biggest rival, could not be reached in time for comment.