MapR's Google Deal Marks Second Big Data Cloud Win
Just two weeks after inking a deal with Amazon Web Services, MapR gets an exclusive to run Hadoop services on the Google Compute Engine.
June was a good month for Hadoop software distributor MapR, which landed not one but two high-profile deals to provide the software for Hadoop services in the cloud.
MapR's latest deal is tied to Google's big June 28 announcement of the Google Compute Engine, a new infrastructure-as-a-service (IaaS) offering that sets up the search giant as a public-cloud rival to Amazon Web Services (AWS). MapR is one of at least six partners debuting services on the Google infrastructure, which is currently in limited beta release. MapR and Google are signing up customers to join a private preview of the Hadoop services that will run on Google Compute Engine.
News of the Google partnership came just two weeks after MapR and Amazon announced that services based on MapR's M3 and M5 Hadoop software distributions would be available on AWS. Whereas Amazon's own Elastic MapReduce service runs on Apache Hadoop, the MapR-based services add high-availability features not yet supported on the standard open source software.
A key appeal of the AWS and Google services will likely be the ability to process and analyze data that already resides in the cloud. The MapR-based services on AWS, for example, are integrated with Amazon's Simple Storage Service (S3) and DynamoDB NoSQL database. Google AdWords and Google (Web) Analytics are both potentially rich, high-volume sources of search and click-stream data that Google Compute Engine customers could presumably tap without costly and time-consuming data-integration and data-movement steps.
"The big challenges in media are figuring out who to target, when to target, appropriate price points, and appropriate keyword bids, so you could easily see related digital media and advertising analyses performed on Google's cloud," MapR VP of marketing Jack Norris told InformationWeek.
By tapping compute capacity on demand, customers could potentially save money if they experience peaks and valleys in capacity utilization. To gauge Google Compute Engine performance, MapR recently tested its beta Hadoop service by setting up a 1,256-node cluster and running the industry-standard TeraSort benchmark, Norris said. The cloud-based system completed the job in one minute and 20 seconds, according to Norris, whereas the world record is one minute and two seconds.
"The record was set on a system that had twice as many cores, four times the number of disks, 200 more servers than the system we put together on the Compute Engine, and the cost of the infrastructure was in the neighborhood of $5 million," Norris said. "For the test that we ran on the Google Compute Engine, the cost would be about $16."
Comparable tests of MapR-based Hadoop clusters have not been performed on Amazon's infrastructure, Norris said. In the case of AWS, companies use the S3 services for everything from Web logs and click-through data to genomics data, and they use Amazon Elastic MapReduce and MapR-based Hadoop for analytics.
"The cloud is also an excellent target for business continuity, so instead of having a complete second data center, you can run Hadoop clusters in the cloud, with mirroring synchronized between your on-premises and cloud-based targets," Norris said.
Some analysts say cloud-based services will be prohibitively expensive for long-term storage at high scale, making them most attractive for pilot tests, brief projects, and cases where the data already exists in the cloud (as in the case of Google AdWords, Google Analytics, AWS S3, and DynamoDB). Norris took exception to that analysis.
"I think we're going to see generations of cloud services, and [costs at scale] are not going to be as much of a factor in the future," Norris said.
MapR distinguishes itself from Hadoop software distribution and support competitors Cloudera and Hortonworks by providing high-performance options not supported on standard Apache open source Hadoop software. MapR's M5 distribution, for example, replaces the Hadoop Distributed File System (HDFS) with a derivative of the Unix-based Network File System. M5 includes snapshotting, mirroring, and other high-availability features that aren't supported on the current (1.0) Hadoop code line.
MapR describes the AWS and Google services based on its distributions as an endorsement of its architecture, but there are plenty of options to run Cloudera and Hortonworks in the cloud. Hortonworks is the developer of the software used to run Hadoop on Microsoft's Azure public cloud. And multiple providers run Hadoop services on AWS and other public clouds using Cloudera's CDH Hadoop software distribution.
Responding to requests for comment on MapR's recent deals, Cloudera VP of product Charles Zedlewski said in a statement, "Cloudera has led the industry in support for Apache Hadoop on public clouds, supporting Rackspace, AWS, and Softlayer dating back to 2009. Every month, tens of thousands of CDH instances are created on top of various public cloud providers."
Zedlewski also noted that Cloudera developed Apache Whirr, software now used by Cloudera and its competitors to run Hadoop distributions on public clouds.
The entire Hadoop movement was actually inspired by Google, which was a pioneer in the use of MapReduce processing and published the white paper that guided the creators of Hadoop. Google still uses MapReduce processing extensively internally, but it does not distribute its software, and its approach to MapReduce is not made available as a service on the Google Compute Engine.
Pricing and service details have not been finalized for MapR's services on the Google Compute Engine. Basic compute pricing on the Compute Engine starts at $0.145 per hour for a single core with 3.75 gigabytes of memory. See our hands-on review of the Google Compute Engine private beta.
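To put the quoted base rate in context, the following is a rough back-of-the-envelope sketch in Python of how an on-demand job cost might be estimated. It uses the $0.145 per core-hour figure, the 1,256-node cluster size, and the 80-second run time reported above. The function name, the cores-per-node values, the linear per-core pricing, and the per-second proration are all assumptions made for illustration; the article does not state how the test cluster was configured or how Google bills partial hours.

```python
def estimate_cluster_cost(nodes, cores_per_node, runtime_seconds,
                          rate_per_core_hour=0.145):
    """Estimate the on-demand cost of one job in US dollars.

    Assumes (hypothetically) that price scales linearly with core count
    and that usage is prorated by the second; real billing may differ.
    """
    core_hours = nodes * cores_per_node * runtime_seconds / 3600.0
    return core_hours * rate_per_core_hour


if __name__ == "__main__":
    # 1,256 nodes and 80 seconds are taken from the article.
    # The cores-per-node values below are hypothetical.
    for cores in (1, 2, 4):
        cost = estimate_cluster_cost(1256, cores, 80)
        print(f"{cores} core(s) per node: ${cost:.2f}")
```

Under these assumptions, a configuration of four cores per node works out to roughly $16, in the same range as the figure Norris cited, though that does not confirm the configuration MapR actually used.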