Google announced the general availability of two big data products formerly in beta: Google Cloud Dataflow and Google Cloud Pub/Sub. The two tools complete Google's plan to bring its entire suite of internal big data tools into general availability.
Cloud Dataflow is a Google service for processing streaming big data on Google Compute Engine and App Engine without incurring the operational overhead of managing a large server cluster. Cloud Pub/Sub integrates applications and services with real-time analysis of data streams.
The two products join BigQuery, Google's existing SQL-based system for analyzing large data streams and data sets.
Adding Cloud Dataflow and Cloud Pub/Sub puts Google on a more equal footing with Amazon Web Services, which has proven light on its feet when it comes to introducing new cloud services. Google Cloud Dataflow has a rough counterpart in Amazon's existing Data Pipeline, Google Cloud Pub/Sub in Amazon Kinesis, and Google BigQuery in Amazon DynamoDB. Amazon also has a Hadoop-type service in Elastic MapReduce.
Google's announcement said the two new services are based on a decade of investment in data handling, including MapReduce for simple data processing on large clusters, FlumeJava's parallel data pipelines, and MillWheel's fault-tolerant processing of large data streams.
In addition, Google is offering some of the lessons it's learned from its in-house data handling in the new products. Cloud Dataflow "is specifically designed to remove the complexity of developing separate systems for batch and streaming data sources by providing a unified programming model," the Google announcement said. Dataflow is fault tolerant, highly available, and backed by a Google SLA.
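To illustrate what a unified programming model means in practice, here is a minimal sketch in plain Python. It does not use the Cloud Dataflow SDK; the function names (`sum_per_key`, `final_result`) and the sample events are hypothetical and chosen only for illustration. The point is that one piece of pipeline logic is written once and then fed either a bounded (batch) source or a source that produces events one at a time.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (key, amount); a hypothetical event shape


def sum_per_key(events: Iterable[Event]) -> Iterator[Dict[str, int]]:
    """Pipeline logic written once: emit updated totals after every event."""
    totals: Dict[str, int] = defaultdict(int)
    for key, amount in events:
        totals[key] += amount
        yield dict(totals)  # snapshot of the totals so far


def final_result(results: Iterable[Dict[str, int]]) -> Dict[str, int]:
    """Drain the pipeline output and keep the most recent snapshot."""
    latest: Dict[str, int] = {}
    for latest in results:
        pass
    return latest


# Bounded (batch) source: all of the data is available up front.
batch_events = [("alice", 3), ("bob", 5), ("alice", 4)]


def streaming_events() -> Iterator[Event]:
    """Stand-in for an unbounded source: events arrive one at a time."""
    for event in batch_events:
        yield event


batch_totals = final_result(sum_per_key(batch_events))
stream_totals = final_result(sum_per_key(streaming_events()))
print(batch_totals, stream_totals)
```

Because `sum_per_key` accepts any iterable, the same code serves both cases. A production system would also have to handle windowing, late data, and failures, which this sketch leaves out.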
The Cloud Dataflow service is two to three times faster than Hadoop when evaluated against classic MapReduce-based pipelines, such as Google PageRank and WordCount, the announcement said. In the cloud, optimized performance means less time spent on the compute servers, leading to lower charges, it said.
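For readers unfamiliar with WordCount, the benchmark counts how often each word appears across a set of documents. The sketch below shows the classic map, shuffle, and reduce phases in plain Python on a single machine. It is an illustration of the pattern only, not Google's or Hadoop's implementation, and the function names and sample documents are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(documents: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.lower().split():
            yield (word, 1)


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents: Iterable[str]) -> Dict[str, int]:
    return reduce_phase(shuffle_phase(map_phase(documents)))


counts = word_count(["the quick fox", "the lazy dog", "The end"])
print(counts["the"])
```

On a real cluster the map and reduce phases run in parallel across many machines, and the shuffle moves data between them over the network, which is where much of the running time in such benchmarks tends to go.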
Google pointed out that the Cloud Dataflow SDK includes connectors to Salesforce, Clearstory, Tamr, SpringML, Cloudera, and Data Artisans. Cloudera's Director 1.5, now integrated with Google Cloud Platform, became available Wednesday, Aug. 12, as well. Cloudera's Hadoop platform is now certified to run on Google's Compute Engine and App Engine, so users may run Hadoop clusters with Cloudera enterprise Hadoop software.
The Cloud Pub/Sub service is meant to allow a cloud system to deliver multiple messages to large numbers of users at high speeds. Instead of a hard-wired, one-to-one queue, Pub/Sub allows a single message to be "fanned out" to many subscribers at the same time, and allows multiple publishers to "fan in" many messages to the same destination. If recipients are not online when the message is sent, they will get it as soon as they log back in, Google said in the announcement.
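The fan-out, fan-in, and deferred-delivery behavior described above can be sketched with a small in-memory model. This is a toy written for illustration, not the Cloud Pub/Sub client library: the `Topic` class, its methods, and the subscriber and message names are all hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


class Topic:
    """Toy topic that keeps a separate backlog for each subscriber."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._backlogs: Dict[str, Deque[str]] = {}

    def subscribe(self, subscriber_id: str) -> None:
        # Each subscriber gets its own queue, so a slow or offline
        # subscriber does not hold up the others.
        self._backlogs.setdefault(subscriber_id, deque())

    def publish(self, message: str) -> None:
        # Fan-out: every subscriber receives its own copy of the message.
        for backlog in self._backlogs.values():
            backlog.append(message)

    def pull(self, subscriber_id: str) -> List[str]:
        # Deliver everything that accumulated since the last pull.
        backlog = self._backlogs[subscriber_id]
        messages = list(backlog)
        backlog.clear()
        return messages


topic = Topic("alerts")
topic.subscribe("dashboard")
topic.subscribe("mobile-app")

# Fan-in: two different publishers write to the same topic.
topic.publish("sensor-1: temperature high")
topic.publish("sensor-2: door open")

# The dashboard is online and pulls right away.
dashboard_messages = topic.pull("dashboard")

# The mobile app is offline while another message arrives...
topic.publish("sensor-1: temperature normal")

# ...and receives its whole backlog once it reconnects.
mobile_messages = topic.pull("mobile-app")
print(len(dashboard_messages), len(mobile_messages))
```

Keeping a backlog per subscriber is what separates this pattern from a one-to-one queue, where a message is consumed once and then gone. A hosted service would add durable storage, acknowledgements, and retries, none of which are modeled here.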
Pub/Sub, Cloud Dataflow, and other data services will be offered from Google data centers around the globe.