Google introduces Dataflow to handle streams and batches of big data, replacing MapReduce and challenging other public cloud services.

Charles Babcock, Editor at Large, Cloud

June 27, 2014

5 Min Read


Google I/O this year was overwhelmingly dominated by consumer technology, the end-user interface, and the extension of the Android universe into a new class of mobile device: the computer you wear on your wrist.

At the same time, there were one or two enterprise-scale data handling and cloud computing gems scattered among all the end user announcements.

One was Cloud Dataflow, introduced at the San Francisco event during a keynote presentation Wednesday. When it comes to handling large amounts of unstructured data, one of Google's original contributions to the field was MapReduce. Combined with a distributed file system, it inspired the fundamental new data sorting, analysis, and storage mechanism of the era: Hadoop.

At this year's developer conference, Google executives said MapReduce was so 2004-ish. It's batch-oriented, when what's really needed is a system that can handle both a large data set put aside for a scheduled batch process and an ad hoc stream of unsorted data. In producing Dataflow, Google is attempting to steal a march on other public cloud services by providing a two-in-one data sorting and data analysis system.

[Want to learn more about changes in the Android user interface? See Google I/O: Android Interface, Cloud Advances Star.]

At Amazon Web Services, for example, you might use Elastic MapReduce for the batch process and Kinesis, introduced last November at Amazon's Re:Invent event, for real-time streaming data. On Google App Engine or Compute Engine, you can use Cloud Dataflow for both tasks.

Urs Holzle, the Swiss native who fills the role of senior VP of engineering at Google, introduced Dataflow, saying it could build parallel pipelines to move data through a transformation and analysis system, regardless of the size of the data stream.

Dataflow is both a software development kit and a managed service that lets customers build the data capture and transformation process they wish to use. The Dataflow demonstration took a stream of tweets on the World Cup soccer games, converted the data to JSON objects, transformed it using a Twitter API that provided the core data extraction, then analyzed it for fan sentiment using Alchemy's third-party service.
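In rough outline, a pipeline like the one demonstrated might look like the sketch below, written against the Java SDK Google described (and later released). The bucket paths and the extractText and scoreSentiment helpers are hypothetical stand-ins for the Twitter extraction and Alchemy sentiment steps.

    // A sketch of a tweet-sentiment pipeline; paths and helpers are hypothetical.
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.io.TextIO;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.transforms.DoFn;
    import com.google.cloud.dataflow.sdk.transforms.ParDo;
    import com.google.cloud.dataflow.sdk.values.PCollection;

    public class TweetSentimentPipeline {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Read raw tweet records (JSON, one per line); the bucket is hypothetical.
        PCollection<String> rawTweets =
            p.apply(TextIO.Read.from("gs://example-bucket/tweets/*.json"));

        // Parse each record and score it for sentiment. Both helpers are
        // placeholders for the Twitter extraction and Alchemy calls in the demo.
        PCollection<String> scored = rawTweets.apply(
            ParDo.of(new DoFn<String, String>() {
              @Override
              public void processElement(ProcessContext c) {
                String text = extractText(c.element());  // hypothetical JSON parse
                double score = scoreSentiment(text);     // hypothetical sentiment call
                c.output(text + "\t" + score);
              }
            }));

        scored.apply(TextIO.Write.to("gs://example-bucket/sentiment/results"));
        p.run();
      }

      static String extractText(String json) { return json; }    // placeholder
      static double scoreSentiment(String text) { return 0.0; }  // placeholder
    }

Each ParDo stage is one transformation point in the pipeline; the service decides how widely to parallelize it.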

The system analyzed 5.2 million tweets before the event, then started adding 402 Twitter records a second to the results. The demonstration showed how, in the opening game between Brazil and Croatia, fan sentiment in favor of Brazil dipped after Brazil scored a goal. That was contrary to expectations; the record showed fan sentiment usually went up when a team scored. Further analysis showed fans largely disagreed with a referee's call that allowed Brazil to score.

Holzle said the spotting of anomalies like that in masses of big data can lead to a greater understanding of a customer base or of what appeals to the general public. For programmers, the task of creating the transformation points in the pipeline has been simplified by Dataflow, which "handles the scaling, does the scheduling, deploys the virtual machines, and does the monitoring for you."
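That division of labor shows up in how a job is handed off. In the Java SDK as Google later released it, pointing a pipeline at the managed service is a matter of configuring the runner; this is a minimal sketch, and the project id and staging bucket are hypothetical.

    // A minimal sketch, assuming the later-released Java SDK; names are hypothetical.
    import com.google.cloud.dataflow.sdk.Pipeline;
    import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
    import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
    import com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner;

    public class ManagedRun {
      public static void main(String[] args) {
        DataflowPipelineOptions options =
            PipelineOptionsFactory.create().as(DataflowPipelineOptions.class);
        options.setRunner(DataflowPipelineRunner.class);            // run on the managed service
        options.setProject("example-gcp-project");                  // hypothetical project id
        options.setStagingLocation("gs://example-bucket/staging");  // hypothetical bucket

        Pipeline p = Pipeline.create(options);
        // ... apply reads, ParDo transforms, and writes here ...
        p.run(); // the service deploys VMs, scales, schedules, and monitors the job
      }
    }

Once run() is called with the managed runner, the service provisions workers and scales them with the load, rather than the developer sizing a cluster up front.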

Some enterprises working with big data might try to use MapReduce for the task, but "MapReduce, which we invented over a decade ago, would be too cumbersome for the task," he said.

"Information is being generated at an incredible rate. We want you to be able to analyze that information without worrying about scalability," he said.

Underneath Dataflow is a basic Google innovation, FlumeJava, which can apply "a modest number of operations" to parallel streams of data. FlumeJava constructs an execution plan up front, rather than merely trying to expand a plan that's proving unequal to an increased data stream, according to its citation in the Association for Computing Machinery's Digital Library.
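The core idea in the FlumeJava paper is deferred evaluation: parallel operations record an execution plan rather than running immediately, and the plan is optimized, for instance by fusing adjacent per-element steps into one pass, before execution. The toy class below illustrates that idea in plain Java; it is a sketch of the concept, not the actual FlumeJava API.

    // A toy illustration of deferred execution (not the real FlumeJava API):
    // operations build a plan; run() fuses the steps and executes one pass.
    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    class DeferredCollection<T> {
      private final List<Object> source;                  // input data
      private final List<Function<Object, Object>> plan;  // recorded steps

      private DeferredCollection(List<Object> source, List<Function<Object, Object>> plan) {
        this.source = source;
        this.plan = plan;
      }

      static <T> DeferredCollection<T> of(List<T> data) {
        return new DeferredCollection<>(new ArrayList<>(data), new ArrayList<>());
      }

      // Record a per-element transform instead of executing it immediately.
      @SuppressWarnings("unchecked")
      <R> DeferredCollection<R> parallelDo(Function<T, R> fn) {
        List<Function<Object, Object>> next = new ArrayList<>(plan);
        next.add((Function<Object, Object>) (Function<?, ?>) fn);
        return new DeferredCollection<>(source, next);
      }

      // Optimize by fusing all recorded steps, then make one pass over the data.
      List<Object> run() {
        Function<Object, Object> fused = Function.identity();
        for (Function<Object, Object> step : plan) {
          fused = fused.andThen(step);  // step fusion
        }
        List<Object> out = new ArrayList<>();
        for (Object element : source) {
          out.add(fused.apply(element));
        }
        return out;
      }
    }

Recording two parallelDo calls here produces two plan steps, but run() executes a single fused loop over the data, which is the spirit of the plan optimization the paper describes.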

Holzle also claimed the Google Cloud Platform "leads in price and performance" among IaaS providers, a claim Amazon might dispute, and has incorporated the progression of Moore's Law into its cloud pricing structure. With the two companies dropping prices as fast as they were earlier this year, and with Microsoft following suit, cloud pricing won't be leveling off anytime soon.

Prices for application storage services at Google have dropped 30% to 53%, for permanent storage by 68%, and for BigQuery data services by 85%, he said.

In another market sector, Xavier Ducrohet, the Android SDK lead, said it's now possible to import Eclipse projects into the Android Studio integrated development environment, which is better able to stage code according to what Android device it's going to run on.

Once inside Studio, the lengthy lists of file names that make up big projects become easier to explore, since the IDE now lists all the elements in a file when it's highlighted. Targeting a device form factor amounts to selecting a check box, with the option of selecting as many as needed. Even the Wear form factor -- the new wrist computer, currently available only as a little rectangle -- offers a check box for a circular Wear device. Wrist computers that look like round white watches are coming later this summer.

The IDE even gives developers the option of looking at their screen layout as a left-to-right presentation or reverse right-to-left one, which might be adopted in some form factors. The IDE also contains a red underlining function that automatically highlights an API name when the version is inconsistent with the needs of the program. It can also identify known problems in the source code.

"The next step is building in better stability and performance," he said.


About the Author(s)

Charles Babcock

Editor at Large, Cloud

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week. He is a graduate of Syracuse University where he obtained a bachelor's degree in journalism. He joined the publication in 2003.
