Apache Arrow To Speed Up Big Data - InformationWeek

Apache Arrow To Speed Up Big Data

Apache Arrow has been accepted as a full-fledged project by the Apache Software Foundation. The technology is designed to improve the performance and speed of big data components that work together as part of a larger system. The project is backed by the founders of Dremio, who are also the force behind Apache Drill.

One constant theme running through IT is the quest to get everything to connect with everything else, seamlessly and without problems. The world of big data took a step in this direction with the announcement of Apache Arrow by the Apache Software Foundation.

The Apache Software Foundation expects Apache Arrow to boost the performance of analytical workloads by as much as a hundredfold and to cut communications overhead. Arrow will go forward as a "top-level project," skipping the incubation period.

Arrow code is available now for implementation in C, C++, Python, and Java, with implementations for R, JavaScript, and Julia due in one to two months, according to Jacques Nadeau, VP of the Arrow and Drill projects at the Apache Software Foundation and the co-founder and CTO of open source big data startup Dremio. "My role in driving this is getting all the users on the same page," he said.

Nadeau's company, Dremio, is something of a stealth big data startup that has specialized in the Apache Drill project until now. Nadeau and Dremio co-founder and CEO Tomer Shiran came to the company from MapR, an Apache Hadoop distribution company.

Arrow grew out of a need for improved performance in big data processing that many users were experiencing, Nadeau said. He explained the details in a blog post today and in an interview with InformationWeek.

"The core of Arrow is making processing systems faster," Nadeau continued. Arrow does this by enabling different big data components to talk to each other more easily. It defines a common in-memory representation of data that each big data system component can share, so that data does not have to be copied and converted as it moves from Spark to Cassandra, or from Apache Drill to Kudu, for example.
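The idea of sharing data without copying or converting it can be sketched in a few lines of Python. This is only an illustration using the standard library's array and memoryview types, not Arrow's own API; the producer and consumer functions are hypothetical stand-ins for two systems that agree on one memory layout.

```python
from array import array

# A column of signed 64-bit integers held in one contiguous buffer.
column = array("q", [10, 20, 30, 40])

def producer():
    # Hypothetical system A: hand out a view of the buffer,
    # not a serialized copy of it.
    return memoryview(column)

def consumer(view):
    # Hypothetical system B: read the values in place. No conversion is
    # needed because both sides agree on the layout.
    return sum(view)

view = producer()
total = consumer(view)

# Writing through the view changes the producer's buffer, which shows
# that both sides see the same memory rather than separate copies.
view[0] = 11
shared = column[0] == 11
```

Here the consumer never receives its own copy of the data; it works directly on the buffer the producer owns.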

(Image: bpalmer/iStockphoto)

Arrow also features columnar in-memory complex analytics. This is essentially a fusion of columnar data storage (like that provided by Apache Parquet) with systems that hold data in memory (like SAP HANA and Apache Spark), adding support for complex hierarchical and nested data structures (like JSON). Nadeau said that systems typically support one of these three and a few support two, but Arrow is the first to support all three. And it is all open source.
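A small Python sketch may help show how the three ideas fit together: columns instead of rows, data held in memory, and nested values. The records and field names below are invented for illustration, and the flattened values-plus-offsets encoding is one common way to store nested lists in columns; the article does not describe Arrow's exact format.

```python
from array import array

# Hypothetical records with a nested list field, as they might arrive as JSON.
rows = [
    {"id": 1, "tags": [7, 8]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": [9]},
]

# Columnar form: one contiguous typed array per field, all in memory.
ids = array("q")
tag_values = array("q")        # every nested value, flattened end to end
tag_offsets = array("q", [0])  # boundaries of each row's list in tag_values

for row in rows:
    ids.append(row["id"])
    tag_values.extend(row["tags"])
    tag_offsets.append(len(tag_values))

def tags_for(i):
    # Recover row i's list by slicing between its start and end offsets.
    return list(tag_values[tag_offsets[i]:tag_offsets[i + 1]])

# An analytic query over one field touches only that field's array.
id_total = sum(ids)
```

Summing the ids reads a single contiguous array and never touches the tags, which is the main appeal of a columnar layout for analytics.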

Further gains in processing speed are achieved by using CPUs more efficiently, Nadeau explained. "A lot of CPU cycles are wasted moving data between systems."

Arrow improves CPU efficiency by laying out data to match CPU instructions and cache locality, thus streamlining the flow of data into the CPU. The CPU can stick to processing rather than searching for and pulling in data.

This data alignment also permits the use of superword and SIMD (Single Instruction, Multiple Data) instructions, which further boosts performance. SIMD executes multiple operations in a single clock cycle, increasing performance by as much as two orders of magnitude, according to the ASF. By optimizing for cache locality, pipelining, and SIMD, performance gains of 10x to 100x can be achieved, Nadeau said.
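To make the SIMD point concrete, here is a rough Python simulation. Python does not issue SIMD instructions itself, so the code only counts how many steps a vectorized add would take. The 256-bit register width and 32-bit element size are assumptions chosen for illustration, and the function names are hypothetical.

```python
from array import array

def lanes(register_bits, element_bits):
    # Number of elements a single SIMD register can hold at once.
    return register_bits // element_bits

def vector_add(a, b, width):
    # Simulate SIMD: each loop iteration stands for one instruction
    # that adds up to `width` pairs of elements together.
    out = array(a.typecode)
    steps = 0
    for start in range(0, len(a), width):
        chunk_a = a[start:start + width]
        chunk_b = b[start:start + width]
        out.extend(x + y for x, y in zip(chunk_a, chunk_b))
        steps += 1
    return out, steps

a = array("i", range(16))
b = array("i", [1] * 16)

width = lanes(256, 32)               # assumed register and element sizes
result, simd_steps = vector_add(a, b, width)
scalar_steps = len(a)                # one add per element without SIMD
```

Under these assumptions the 16 additions take 2 simulated instructions instead of 16. This only works when the values sit side by side in memory, which is why the aligned layout matters.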

Apache Arrow should come into its own as users tap different tools for different missions in the realm of big data, Nadeau pointed out. "We are making each workload more efficient…[Arrow] will change a lot of things," he said.

William Terdoslavich is an experienced writer with a working understanding of business, information technology, airlines, politics, government, and history, having worked at Mobile Computing & Communications, Computer Reseller News, Tour and Travel News, and Computer Systems ...
