6/27/2014
10:09 AM

Google I/O: Hello Dataflow, Goodbye MapReduce

Google introduces Dataflow to handle streams and batches of big data, replacing MapReduce and challenging other public cloud services.


Google I/O this year was overwhelmingly dominated by consumer technology, the end user interface, and the extension of the Android universe into a new class of mobile devices: the computer you wear on your wrist.

At the same time, there were one or two enterprise-scale data handling and cloud computing gems scattered among all the end user announcements.

One was Cloud Dataflow, introduced at the San Francisco event during a keynote presentation Wednesday. When it comes to handling large amounts of unstructured data, one of Google's original contributions to the field was MapReduce. Combined with a distributed file system, it became the basis for a fundamental new data sorting, analysis, and storage mechanism of the era: Hadoop.

At this year's developer conference, Google executives said MapReduce was so 2004-ish. It's batch oriented, when what you really need is a system that can handle both a large amount of data set aside for a scheduled batch process and an ad hoc stream of unsorted data. In producing Dataflow, Google is attempting to steal a march on other public cloud services and provide a two-in-one data sorting and data analysis system.


At Amazon Web Services, for example, you might use Elastic MapReduce for the batch process and Kinesis, introduced last November at Amazon's Re:Invent event, for real-time streaming data. On Google App Engine or Compute Engine, you can use Cloud Dataflow for both tasks.

Urs Holzle, the Swiss native who fills the role of senior VP of engineering at Google, introduced Dataflow, saying it could build parallel pipelines to move data through a transformation and analysis system, regardless of the size of the data stream.

Dataflow is both a software development kit and a managed service that lets customers build the data capture and transformation process they wish to use. The Dataflow demonstration took a stream of tweets on the World Cup soccer games and converted the data to JSON objects, then transformed it using a Twitter API that provided the core data extraction, then analyzed it for fan sentiment using Alchemy's third-party service.
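To make the shape of such a pipeline concrete, here is a minimal sketch in plain Python. It is not the Dataflow SDK and does not reproduce the demonstration's code: the stage names, the "team" and "text" fields, and the word lists standing in for the third-party sentiment service are all assumptions made for illustration. The point it tries to show is that the same chain of stages can consume either a stored batch or a live stream.

```python
import json
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical word lists standing in for a real sentiment service.
POSITIVE = {"great", "love", "goal", "win"}
NEGATIVE = {"bad", "unfair", "robbed", "terrible"}


def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Stage 1: convert raw JSON lines into dictionaries."""
    for line in lines:
        yield json.loads(line)


def extract(records: Iterable[dict]) -> Iterator[Tuple[str, str]]:
    """Stage 2: keep only the (assumed) team tag and lowercased text."""
    for record in records:
        yield record["team"], record["text"].lower()


def score(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, int]]:
    """Stage 3: toy sentiment score, positive words minus negative words."""
    for team, text in pairs:
        words = [w.strip(".,!?") for w in text.split()]
        value = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
        yield team, value


def pipeline(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Chain the stages; works on a list (batch) or a generator (stream)."""
    return score(extract(parse(lines)))


def totals(scored: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum the scores per team."""
    result: Dict[str, int] = {}
    for team, value in scored:
        result[team] = result.get(team, 0) + value
    return result


# Made-up sample records, for illustration only.
batch = [
    '{"team": "BRA", "text": "Great goal"}',
    '{"team": "BRA", "text": "Unfair call, Croatia robbed"}',
    '{"team": "CRO", "text": "Love this team"}',
]

batch_totals = totals(pipeline(batch))          # batch input: a list
stream_totals = totals(pipeline(iter(batch)))   # stream-like input: an iterator
```

Because every stage is a generator, nothing is held in memory beyond the record being processed, which is what lets one definition serve both modes in this toy version. A managed service would add the scaling and scheduling that this sketch ignores.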

The system analyzed 5.2 million tweets before the event, then started adding 402 Twitter records a second to the results. The demonstration showed how, in the opening game between Brazil and Croatia, fan sentiment in favor of Brazil dipped after Brazil scored a goal. That was contrary to expectations; the record showed fan sentiment usually went up when a team scored. Further analysis showed fans largely disagreed with a referee's call that allowed Brazil to score.
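One common way to surface a dip like this in a continuous feed is a rolling average over the most recent scores. The sketch below is only an illustration of that general technique, not the method used in the demonstration; the scores are made up and the window size is an arbitrary assumption.

```python
from collections import deque
from typing import Deque, Iterable, Iterator


def rolling_mean(scores: Iterable[float], window: int) -> Iterator[float]:
    """Yield the mean of the most recent `window` scores after each new one."""
    recent: Deque[float] = deque(maxlen=window)  # old scores fall off automatically
    for value in scores:
        recent.append(value)
        yield sum(recent) / len(recent)


# Made-up per-tweet sentiment scores: positive at first, then negative.
scores = [1, 1, 1, -1, -1, -1]
means = list(rolling_mean(scores, window=3))

# Index of the first point where the rolling average turns negative.
first_negative = next(i for i, m in enumerate(means) if m < 0)
```

With these toy inputs the rolling average turns negative at the fifth score (index 4), one step after it first begins to fall, which shows the lag that any windowed measure introduces.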

Holzle said spotting anomalies like that in masses of big data can lead to a greater understanding of a customer base or of what appeals to the general public. For programmers, the task of creating the transformation points in the pipeline has been simplified by Dataflow, which "handles the scaling, does the scheduling, deploys the virtual machines, and does the monitoring for you."

Some enterprises working with big data might try to use MapReduce for the task, but "MapReduce, which we invented over a decade ago, would be too cumbersome for the task," he said.

"Information is being generated at an incredible rate. We want you to be able to analyze that information without worrying about scalability," he said.

Underneath Dataflow is a basic Google innovation, FlumeJava, which has the capability of applying "a modest number of operations" on parallel streams of data. FlumeJava is able to construct an execution plan, rather than merely try to expand a plan that's proving unequal to an increased data stream, according to its citation in the Association for Computing Machinery's Digital Library.
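The idea of building an execution plan before doing any work can be sketched in a few lines of Python. This is not FlumeJava's API, which is a Java library; the DeferredCollection class and its methods are hypothetical names invented here. The sketch only shows the general pattern of recording operations first and running them later in a single pass.

```python
from typing import Any, Callable, Iterable, List, Tuple

Step = Tuple[str, Callable[[Any], Any]]


class DeferredCollection:
    """Hypothetical collection that records operations instead of running them."""

    def __init__(self, source: Iterable[Any], steps: Tuple[Step, ...] = ()):
        self._source = source
        self._steps = steps  # the recorded plan: ("map" or "filter", function)

    def map(self, fn: Callable[[Any], Any]) -> "DeferredCollection":
        return DeferredCollection(self._source, self._steps + (("map", fn),))

    def filter(self, pred: Callable[[Any], bool]) -> "DeferredCollection":
        return DeferredCollection(self._source, self._steps + (("filter", pred),))

    def plan(self) -> List[str]:
        """Describe the recorded steps without executing anything."""
        return [kind for kind, _ in self._steps]

    def run(self) -> List[Any]:
        """Execute every recorded step in one pass over the source."""
        results = []
        for item in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):  # a filter step rejected the item
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls: List[int] = []


def double(x: int) -> int:
    calls.append(x)  # record that real work happened
    return x * 2


numbers = DeferredCollection(range(5)).map(double).filter(lambda x: x > 4)
planned = numbers.plan()
calls_before_run = list(calls)  # still empty: nothing has executed yet
result = numbers.run()
```

Having the whole plan in hand before execution is what gives a system room to optimize, for example by fusing the map and filter into one loop as `run` does here. A real implementation would also decide how to split the work across machines, which this sketch does not attempt.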

Holzle also claimed the Google Cloud Platform "leads in price and performance" among IaaS providers, a claim Amazon might dispute, and has incorporated the progression of Moore's Law into its cloud pricing structure. With the two companies dropping prices as fast as they were earlier this year, and with Microsoft following suit, cloud pricing won't be leveling off anytime soon.

Prices for application storage services have dropped 30% to 53% at Google, for permanent storage by 68%, and for BigQuery data services by 85%, he said.

In another market sector, Xavier Ducrohet, the Android SDK lead, said it's now possible to import Eclipse projects into the Android Studio integrated development environment, which is better able to stage code according to which Android device it's going to run on.

Once inside Studio, the lengthy lists of file names that make up big projects can be made easier to explore, since the integrated development environment now lists all the elements in a file when it's highlighted. Targeting a device form factor amounts to selecting a check box, with the option of selecting as many as needed. Even the Wear form factor -- the new wrist computer, currently only available as a little rectangle -- has the option of checking off a circular Wear device. Wrist computers that look like round white watches are coming later this summer.

The IDE even gives developers the option of looking at their screen layout as a left-to-right presentation or a reversed right-to-left one, which might be adopted in some form factors. The IDE also contains a red underlining function that automatically highlights an API name when the version is inconsistent with the needs of the program. It can also identify known problems in the source code.

"The next step is building in better stability and performance," he said.


Charles Babcock is an editor-at-large for InformationWeek, having joined the publication in 2003. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week. He is a graduate of Syracuse ...

Comments
senorcarbone,
User Rank: Apprentice
7/3/2014 | 9:56:10 AM
Re: Dataflow implications for Hadoop community?
You can take a look at the following Apache incubator projects as open source equivalents for parallel dataflow processing:

- Apache Tez (https://tez.incubator.apache.org/)

- Apache Flink (formerly Stratosphere) (https://github.com/apache/incubator-flink)
Charlie Babcock,
User Rank: Author
6/27/2014 | 8:15:00 PM
Google's goal: a competitive service
Agreed, many implications for Hadoop and other open source communities, Doug. There was no discussion on those issues that I got to witness at Google I/O. The thing I see is Google using its data handling expertise to build a competitive cloud service, let the chips fall where they may.
D. Henschen,
User Rank: Author
6/27/2014 | 11:38:02 AM
Dataflow implications for Hadoop community?
Charlie, this Dataflow announcement would seem to have huge implications for the Hadoop community, given that that platform was inspired by Google's technologies and white papers. With the introduction of YARN, the Hadoop community, too, has been moving away from MapReduce, but what's the Hadoop world's equivalent of Dataflow? Was anybody drawing parallels/similarities to Apache Spark or Apache Storm, for example?