We sort through this week's clashing din of news from Teradata, SAS, Pivotal, Platfora and Hortonworks in search of the inside edge to big data breakthroughs.
Pivotal Plays Enterprise Card
A future-minded spinoff of EMC, Pivotal blends cloud, application development and big data capabilities. The "cloud fabric" is based on Cloud Foundry platform-as-a-service (PaaS) software and expertise from VMware. The application-development expertise and technology come from Pivotal Labs, contributed by EMC, and VMware's SpringSource unit. The big data and analytics capabilities blend Hadoop and EMC's Greenplum database.
The combination of Hadoop and Greenplum produced HAWQ (Hadoop with SQL), an SQL-on-Hadoop querying capability that's part of the company's Pivotal HD Hadoop distribution. This is an alternative that Pivotal says far surpasses both Hadoop's Hive component and Cloudera's Impala in performance.
Pivotal also has VMware's GemFire in-memory caching technology, which has been integrated with the Pivotal HD Hadoop distribution and introduced this week as Pivotal GemFire XD. The goal is to bring real-time, in-memory data services to Hadoop.
Pivotal freely admits that GemFire XD overlaps with Hadoop community offerings including HBase (the NoSQL database) and Spark (the Hadoop-based in-memory option), but it insists that customers are free to choose whatever components they want.
"The community is investing in technologies, and more often than not, they are good enough for the Internet companies," said Pivotal's Susheel Kaushik, senior director of technical product marketing, in an interview with InformationWeek. "When you look at enterprises, they're looking for reliability, failover, availability and standard interfaces. That's what we're providing to enterprises."
The community options will take time to evolve, Kaushik said. He vowed that Pivotal will contribute to efforts such as Spark and Storm (stream processing), but the contribution will stop short of donating assets such as GemFire XD to open source.
Pivotal also announced this week Pivotal Data Dispatch, which is described as an iTunes-like interface for discovering data within Hadoop as well as in other data stores. You can select the data sets of interest and create a big data sandbox.
Data Dispatch offers helpful controls and insights including access, rights and data lineage. It's a management framework rather than a data store, so it's not creating copies of data. Rather it's a Web portal to all available data for big data exploration.
The Hadoop community also is working on the problems of access controls, rights management and data lineage, but there's enough chaos (with different distributors proposing different tools) and immaturity for commercial vendors to exploit. Data Dispatch is not unlike the Entity-Centric Data Catalog introduced by Platfora, though Data Dispatch also catalogs data available outside of Hadoop.
Enterprise Vs. Internet Focus
The easiest way to understand these companies is to look at their customer bases. Where Teradata, SAS and Pivotal are clearly playing to their enterprise roots, Platfora has the clean-slate freedom to tackle those "over-the-horizon" thinkers trying to address more holistic big data opportunities. Both camps are offering commercial tools that fill gaps in current open source offerings.
While attending this week's Big Data Conference in Chicago, I was struck that a number of practitioners and panelists -- ACE Group Insurance, Tenet Healthcare, ThinkBig Analytics -- said their big data teams were quite separate from preexisting BI, data warehousing and data management teams. They work together and collaborate, certainly, but big data initiatives are about finding new insights and pioneering new applications, products and businesses.
"If you're going to be a pioneer, you better have some wilderness survival skills," said Scott Rose, VP of services at analytics consulting firm ThinkBig Analytics.
That made me think about this week's release of Hortonworks Data Platform 2.0, with entirely open source components including HBase, ZooKeeper, Pig, Hive, HCatalog, Sqoop, Flume and Mahout. If you're going to be a big data pioneer, maybe you should be prepared to deal with many of these, in some cases, still-primitive tools. They're not exactly stone knives and bearskins, but nor are they as slick, feature-packed and mature as some of the commercial offerings.
If you want to be a big data settler, you might hitch your wagon to a commercial vendor such as Teradata, SAS, Pivotal or even Platfora, with enterprise-focused options promising reliability, failover, availability and standard interfaces. You'll still be ahead of the crowd back in the land of purely structured data, but you'll get some of the creature comforts of civilization.