12 Hadoop Vendors To Watch In 2012
Promising low cost and unheard-of scalability, Hadoop has been called the next-generation platform for data processing. Check out the vendors taking Hadoop to the next level.
Hadoop has been called the next-generation platform for data processing because it offers low cost and the ultimate in scalability. But Hadoop is still immature and will need serious work by the community--including the 12 vendors described here--to turn this fledgling baby elephant into an industry colossus.
Hadoop is at the center of this decade's big data revolution. This Java-based framework is actually a collection of software and subprojects for distributed processing of huge volumes of data. The core approach is MapReduce, a technique used to boil down tens or even hundreds of terabytes of Internet clickstream data, log-file data, network traffic streams, or masses of text from social network feeds.
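The map-and-reduce flow described above can be sketched without a cluster. The following is a minimal illustration in Python of the MapReduce pattern itself, not of Hadoop's API; Hadoop distributes these phases across many machines and sorts map output between them, and the log lines here are invented for the example.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Emit a (key, 1) pair for each token, as a streaming mapper would.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Hadoop sorts map output by key before reducing;
    # groupby over sorted pairs then sums the counts per key.
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield (key, sum(count for _, count in group))

logs = ["GET /index.html", "GET /about.html", "GET /index.html"]
counts = dict(reduce_phase(map_phase(logs)))  # e.g. counts["GET"] == 3
```

The same two-phase shape scales from three lines to hundreds of terabytes because each phase is embarrassingly parallel: mappers never see each other's input, and each reducer sees only its own keys.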
Excitement has been building around Hadoop since its release as an Apache open source project in 2008, thanks to its combination of low cost, scalability, and flexibility to handle any data without building predefined schemas. Many people see in Hadoop the potential to usher in a whole new generation of data-processing capabilities, just as Structured Query Language (SQL) ushered in a revolution in data computing more than 30 years ago.
But Hadoop is immature and, in some ways, downright crude compared to SQL. Pioneers, most of whom started working on the framework at Internet giants such as Yahoo, have already put at least six years into developing Hadoop. But success has brought mainstream demand for stability, robust administrative and management capabilities, and the kind of rich functionality available in the SQL world.
All eyes are now on Hadoop vendors, a fast-growing community, to deliver robust tools, capabilities, and innovations. Leading lights in that community include Cloudera and Amazon Web Services. Cloudera was the first and is now the largest source of Hadoop software with its CDH distribution and accompanying management software. It's also the largest provider of enterprise support and training for Hadoop. Amazon was an early mover in running Hadoop in a public cloud with its Amazon Elastic MapReduce service.
In 2011, MapR and Hortonworks, the latter a Yahoo spinoff, burst onto the scene with announcements about their own distributions of Hadoop software along with support, training services and, in MapR's case, proprietary twists aimed at delivering high performance. Competition is part of what it will take to improve Hadoop, so the availability of more distributions, and new support and training options should benefit everyone.
Data processing is one thing, but what most Hadoop users ultimately want to do is analyze the data. Enter Hadoop-specialized data access, business intelligence, and analytics vendors such as Datameer, Hadapt, and Karmasphere.
The clearest sign that Hadoop is headed mainstream is the fact that it was embraced by five major database and data management vendors in 2011, with EMC, IBM, Informatica, Microsoft, and Oracle all throwing their hats into the Hadoop ring. IBM and EMC released their own distributions last year, the latter in partnership with MapR. Microsoft and Oracle have partnered with Hortonworks and Cloudera, respectively. Both EMC and Oracle have delivered purpose-built appliances that are ready to run Hadoop. Informatica has extended its data-integration platform to support Hadoop, and it's also bringing its parsing and data-transformation code directly into the environment. Read on to learn more about what these influential vendors are doing with Hadoop.
Amazon Delivers MapReduce As A Service
No Johnny-come-lately to Hadoop, Amazon Web Services introduced Amazon Elastic MapReduce way back in 2009. So Amazon has intimate knowledge of Hadoop demand and applications, from the newbies running pilot projects to grizzled veterans tapping Elastic MapReduce for additional capacity as on-premises deployments hit demand overloads.
Elastic MapReduce is a rapidly scalable Web service that runs on the Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). This is no false cloud: you can instantly provision as much capacity as you need for data-intensive tasks such as Web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research.
Beyond data processing, you can also use a services-based version of Karmasphere Analyst, a visual workspace for analyzing data on Amazon Elastic MapReduce. Karmasphere provides visual tools to use SQL and other languages to do ad-hoc queries and analyses of structured and unstructured data located on Amazon S3, Amazon Elastic MapReduce job flows, or local file systems. You can also extract result files for use with databases or tools such as Microsoft Excel or Tableau.
Cloudera Makes Hadoop Safe For The Enterprise
The oldest and largest Hadoop software and services provider, Cloudera has been focused on making open-source Apache Hadoop a reliable platform for business use since 2008. The company has more than 100 customers, but that could jump dramatically in the year ahead given Cloudera's recent partnership with Oracle, the IT industry's dominant database supplier.
Cloudera adds two crucial elements to its distribution of Apache Hadoop Software: the Cloudera Manager console for administering and managing Hadoop deployments, and enterprise-grade support. Cloudera Manager offers wizard-based installation and configuration menus for deploying Hadoop. It then offers tools to help system managers monitor the health of the platform, diagnose problems, optimize performance, and make required configuration and security changes.
Cloudera support is available eight hours a day, five days a week, or 24 hours a day, seven days a week, with services including configuration checks, escalation and issue resolution, integration with third-party systems, and knowledge bases, articles, and other technical resources. Training and consulting round out the available services. The Cloudera Enterprise combination of the Hadoop software distribution, Cloudera Manager, and support is list priced at $4,000 per node, per year (not including hardware).
Datameer Applies Business Intelligence To Big Data
Datameer touts its Datameer Analytics Solution (DAS) as a business-user-focused business intelligence (BI) platform for Hadoop. But DAS doesn't treat Hadoop as an island of information; it can connect to any data source through JDBC, Hive, HTTP, or other standards. It includes a wizard-driven integration platform that lets you schedule loads and transform large structured, semi-structured, or unstructured datasets from any of these sources. Then you can apply any of more than 180 analytic functions through the spreadsheet-like DAS interface. Business users get drag-and-drop reporting and dashboarding capabilities. DAS runs on private or public clouds with a REST API available for data imports and exports.
EMC Delivers Single Platform For Data Analytics
EMC describes its EMC Greenplum Unified Analytics Platform (UAP) as a single software platform on which data and analytics teams can seamlessly share information and collaborate on analyses without having to work in--or move data between--separate silos. As such, UAP includes the EMC Greenplum relational database, the EMC Greenplum HD Hadoop distribution, and EMC Greenplum Chorus, which is a collaborative, social-network-style interface for data-analysis teams, from the PhD data scientists to the data-integration experts and BI analysts, to the DBAs and line-of-business users and managers.
EMC's hardware for big data is the modular EMC Data Computing Appliance (DCA), which is capable of running and scaling up both the Greenplum relational database and Greenplum HD nodes within a single box. The DCA offers a shared Command Center interface that lets administrators monitor, manage, and provision both Greenplum database and Hadoop system performance and capacity. The UAP software unifies data access, management, and workflow, and ties to other data sources, data-processing approaches, and analytic capabilities are expected to multiply as the platform matures.
Hadapt Unites Relational And Hadoop Environments
Hive, the Apache data warehousing component that runs on top of Hadoop, has a reputation for being slow. Enter Hadapt, which provides an all-in-one analytics environment designed to handle analysis across data in Hadoop, as well as conventional structured data in SQL environments. Hadapt says the usual approach of using two separate systems linked with bolt-on connectors introduces delays and promotes a siloed approach. Hadapt's platform, which is designed to be run on private or public cloud environments, provides access to all data from one environment, so existing SQL-based tools can be used as well as MapReduce processes and big-data analytics. Hadapt automatically splits query execution between the Hadoop and relational database layers, providing what Hadapt describes as an optimized environment that leverages the scalability of Hadoop and the speed of relational database technology.
Hortonworks Taps Yahoo's Hadoop Legacy
Hortonworks was spun out of Yahoo in 2011, bringing a core team of nearly 50 of Hadoop's earliest and most prolific contributors into an independent company focused entirely on advancing the open-source platform. Hortonworks executives assert that this Yahoo team developed a big share of the code behind the Hadoop platform and will be instrumental in guiding its future.
Hortonworks's first major vote of confidence (beyond attaining venture capital funding) was Microsoft's October partnership, through which Hortonworks will help the firm develop a Windows-compatible version of Hadoop that stays true to the Apache open source project. Hortonworks followed up in November with Hortonworks Data Platform (HDP) v1, a distribution of the Hadoop platform; a v2 release incorporating the latest (0.23) Apache Hadoop release is due in Q1 2012. Hortonworks also provides Hadoop support, training, and consulting, stepping up competition with Cloudera and MapR.
IBM Offers BigInsights, BigSheets And A Big Cloud
IBM started experimenting with Hadoop in its labs several years ago, but it took products and services into commercial release last year, before Oracle and Microsoft announced they, too, would embrace the platform. IBM introduced InfoSphere BigInsights software in May. The software package includes a distribution of Apache Hadoop, the Pig programming language for MapReduce programming, connectors to IBM's DB2 database, and IBM BigSheets, a browser-based, spreadsheet-metaphor interface for exploring data within Hadoop.
IBM followed up in October by making BigInsights and BigSheets available as a service through IBM's SmartCloud Enterprise infrastructure. Basic and enterprise versions of the service are available, and the big draw is learning about and experimenting with big-data processing and analysis without having to invest in supporting hardware or IT expertise. Customers can set up and move data into Hadoop clusters in less than 30 minutes, according to IBM, and data-processing rates start at 60 cents per cluster, per hour.
Informatica
A number of data-integration and data-management vendors (IBM, Oracle, Syncsort, Talend) have tackled the obvious: getting data into and out of Hadoop. Informatica went a step further in October when it introduced HParser, a data-transformation environment optimized for Hadoop. The software supports processing of any file format inside Hadoop with scale and efficiency, according to Informatica, giving Hadoop developers out-of-the-box parsing capabilities to address complex and varied data sources, including logs, documents, binary or hierarchical data, and industry standard formats (such as NACHA in banking, SWIFT for payments, FIX for financial data, and ACORD for insurance). Just as in-database processing speeds various analytic approaches, Informatica is putting parsing and, soon, other data-processing code inside Hadoop to take advantage of all that processing power.
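The kind of parsing HParser packages up can be illustrated with a hand-rolled equivalent. This Python sketch is a generic illustration, not Informatica's API; the simplified Apache-style log format, the regular expression, and the sample line are all assumptions for the example. It shows the per-line field extraction a streaming mapper might otherwise have to implement by hand.

```python
import re

# Pattern for a simplified Apache-style access log line (illustrative only).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+)'
)

def parse_line(line):
    # Return a dict of named fields, or None if the line does not match.
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

record = parse_line(
    '203.0.113.7 - - [10/Jan/2012:13:55:36 -0700] "GET /data.csv HTTP/1.0" 200 2326'
)
```

Multiply this by dozens of formats (binary, hierarchical, industry standards such as SWIFT or ACORD) and the appeal of out-of-the-box parsers that run where the data already lives becomes clear.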
Informatica aims to provide a single platform that can handle the sweep of data-management and data-integration needs with a consistent environment and approach. The company has more than 4,300 customer firms, and it estimates more than 10% are moving into the big-data realm (exceeding 100 terabytes). Market presence and innovation make Informatica a Hadoop-savvy vendor to watch.
Karmasphere Masters Hadoop Data Analysis
Lots of vendors from the conventional business intelligence world (Jaspersoft, Pentaho, Tableau Software, and others) are now pointing their tools and technologies at Hadoop as a data source. But Karmasphere has been helping data professionals mine and analyze Web, mobile, sensor, and social media data in Hadoop since 2010.
Karmasphere Analyst is a collaborative workspace that gives data professionals and analysts direct access to structured and unstructured data inside Hadoop; using SQL and other languages, users can create ad-hoc queries and then interact with the results. Karmasphere Studio gives developers a graphical environment in which to develop custom algorithms and create useful datasets for applications and repeatable production processes. Karmasphere has partnered with a Who's Who of Hadoop vendors, and Karmasphere Analyst and Studio for Amazon Elastic MapReduce bring the tools to one of the leading cloud-based MapReduce services.
MapR Technologies Claims Better Performance
MapR is a bit of a rabble-rouser in the Hadoop arena, offering a unique distribution that takes what the company wants from the open-source Apache project while discarding components it doesn't like. Most notably, it replaces HDFS, which MapR considers a single point of failure, with the Unix-based Network File System (NFS). This competitor to Cloudera and Hortonworks combines its M5 commercial Hadoop distribution with support, training, and consulting (an M3 distribution is free and 100% compatible with Apache Hadoop). MapR is partnered with EMC, which has adopted M5 as the basis of its EMC Greenplum HD Enterprise Edition.
The latest (0.23) version of Hadoop addresses many of MapR's complaints about the Hadoop architecture, but that hasn't stopped the company from continuing to push the performance envelope, claiming to offer faster performance than conventional Hadoop distributions while requiring half the hardware.
Microsoft
EMC, IBM, and Oracle embraced Hadoop in a big way in 2011, so it shouldn't be surprising that Microsoft is also getting in on the game. Microsoft introduced a beta Hadoop service on the Azure cloud platform last year, and this year it's promising a Windows-compatible Hadoop-based Big Data Solution as part of its Microsoft SQL Server 2012 release (debut date unknown).
Running on Windows will be a new trick for an open source platform that has heretofore run on Linux. Will Microsoft's release be free and open source? That has yet to be announced, and there's also no word on whether there will be supporting appliances on third-party hardware, as there are (with HP and others) for the SQL Server Parallel Data Warehouse. Microsoft executives insist the distribution will be "consistent and compatible with the Apache Hadoop core." That's likely to be true given that Microsoft has partnered with Hortonworks, a Yahoo spinoff that specializes in Hadoop, to develop the software distributions and propose contributions back to the Hadoop community.
Oracle
The Oracle Big Data Appliance, released in January, combines an Oracle-Sun distributed computing platform with Cloudera's distribution of Apache Hadoop and the Cloudera Manager admin and management console, an open-source distribution of R analytics software, and the Oracle NoSQL database. Oracle also includes connectors that enable data to be passed back and forth between the Big Data Appliance and Oracle Exadata, or conventional Oracle database deployments. Oracle provides first-line support for this combined hardware-software "engineered system," but if tough Hadoop challenges arise, Oracle will tap Cloudera's expertise, and it will also refer customers to Cloudera for Hadoop training and consulting.
Customers will be able to configure and use the Big Data Appliance bundled software as they like. It could be all Hadoop, all NoSQL, or a split of nodes on the same platform. The appliance is offered exclusively in full-rack configurations, with each rack having 864 gigabytes of main memory, 216 CPU cores, 648 terabytes of raw disk storage, and 40 gigabit-per-second InfiniBand internal connectivity between nodes. The hardware and software combined will sell for $450,000, with an annual support fee for both hardware and software of 12%. That's competitive, working out to less than $700 per terabyte.
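As a quick check on that claim, the per-terabyte figure follows directly from the list price and raw capacity quoted above:

```python
list_price = 450000    # dollars per rack, hardware plus software
raw_storage_tb = 648   # raw disk per rack, terabytes

price_per_tb = list_price / raw_storage_tb   # roughly $694 per raw terabyte
annual_support = list_price * 0.12           # 12% support fee: $54,000 per year
```

Note that $694 is per *raw* terabyte; usable capacity after Hadoop's default replication would be a fraction of that, so effective cost per stored terabyte is correspondingly higher.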