Hadoop Ecosystem Evolves: 10 Cool Big Data Projects
In the 10 years since developers created Hadoop to wrangle the challenges that came with big data, the ecosystem for these technologies has evolved. The Apache Software Foundation is teeming with open source big data technology projects. Here's a look at some significant projects, and a peek at some up-and-comers.
![](https://eu-images.contentstack.com/v3/assets/blt69509c9116440be8/bltf5d39468817de8fb/64cb3ab8cce374e3d45206c8/bigdata-iStock_000049675334_Medium.jpg?width=700&auto=webp&quality=80&disable=upscale)
Managing and analyzing big data -- the exponentially growing body of information collected from social media, sensors attached to "things" in the Internet of Things (IoT), structured data, unstructured data, and everything else that can be collected -- has become a massive challenge. To tackle the task, developers have created a new set of open source technologies.
The flagship software, Apache Hadoop, an Apache Software Foundation project, celebrated its 10th anniversary last month. A lot has happened in those 10 years. Many other technologies are now also a part of the big data and Hadoop ecosystem, mostly within the Apache Software Foundation, too.
Spark, Hive, HBase, and Storm are among the options developers and organizations are using to create big data technologies and contribute them to the open source community for further development and adoption.
Some of these technologies are in production at enterprises such as Netflix and LinkedIn. They enable organizations to work with massive amounts of data in real time and turn that data around to improve services for end customers.
[Want to learn more about Hadoop? Read Hadoop At 10: Milestones And Momentum.]
These big data technologies are often born within organizations trying to enhance the way big data technologies work and improve their speed. They represent an evolution of the ecosystem, and the next wave of open source technology, proof that development by a community of smart participants can be better than development within a proprietary corporate environment.
This modern era of open source and big data all started with Hadoop, most often described as an open source framework for distributed storage and processing of large sets of data on commodity hardware.
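The MapReduce model at Hadoop's core is easiest to see in the pattern's canonical example, a word count. The sketch below is a conceptual illustration in plain Python, not Hadoop code; a real job would spread the map and reduce phases across a cluster of commodity machines, with the framework handling the shuffle between them.

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit a (word, 1) pair for every word in every document.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # Shuffle + reduce: group the pairs by key, then sum each group's counts.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

docs = ["big data big ideas", "big data at scale"]
word_counts = reduce_phase(map_phase(docs))
# word_counts == {"big": 3, "data": 2, "ideas": 1, "at": 1, "scale": 1}
```

In real Hadoop the map and reduce functions are the parts you write; partitioning, scheduling, and fault tolerance come from the framework.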
"Hadoop created this center of gravity for a new data architecture to emerge," Shaun Connolly, VP of corporate strategy at Hadoop distribution company Hortonworks, told InformationWeek in an interview. "Hadoop has this ecosystem of interesting projects that have grown up around it."
And the evolution continues. New projects are accepted into the Apache Software Foundation's big data ecosystem all the time. Most recently, Apache Arrow became a Top-Level Project. Other projects may enter the ecosystem as part of the Apache Software Foundation's Incubator. IBM's SystemML machine learning engine for Spark gained acceptance as an Incubator project late last year.
There are many projects that are part of the Apache Software Foundation's big data ecosystem. Here's a look at some of the significant ones, and a peek at a few up-and-comers. Once you've reviewed our choices, let us know what you think in the comments section below. Are there any you prefer? Are there some we've missed? We'd love to hear from you.
Hadoop is really the flagship technology for open source big data. It grew out of a side project at Yahoo when developers needed a way to store and process the massive amount of data they collected with their new search engine. The technology was eventually contributed to the Apache Software Foundation. Today there are three major distributions from commercial companies -- Cloudera, Hortonworks, and MapR. One of Hadoop's creators, Doug Cutting, recently spoke with InformationWeek about the growth of his baby. We also recently put together a look at Hadoop's history.
Apache Hive was initially developed by Facebook and contributed to the Apache Software Foundation. The technology is a data warehouse infrastructure built on top of Hadoop to provide data summarization, query, and analysis.
Companies using Hive include CNET and eHarmony.
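The "data summarization" Hive provides is the kind of SQL-style aggregation shown below. This is a conceptual sketch with hypothetical data: the query in the comment is the sort of statement Hive accepts, and the Python code evaluates the equivalent aggregation in memory; Hive itself would compile such a query into jobs that run over data stored in Hadoop.

```python
# The kind of aggregation Hive expresses in SQL, e.g.:
#   SELECT page, COUNT(*) FROM visits GROUP BY page;
# evaluated here over an in-memory list of rows (hypothetical data).
from collections import Counter

visits = [
    {"page": "/home", "user": "a"},
    {"page": "/jobs", "user": "b"},
    {"page": "/home", "user": "c"},
]

page_counts = Counter(row["page"] for row in visits)
# page_counts["/home"] == 2
```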
Apache HBase grew out of a project at a company called Powerset, which was acquired by Microsoft in 2008. The goal was to process massive amounts of data for natural language search. The technology is an open source, non-relational, distributed database that is modeled after Google's BigTable and written in Java. HBase became an Apache Software Foundation project in 2010.
Companies using HBase today include Adobe, Facebook, Meetup, and Trend Micro.
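The BigTable model that HBase follows can be pictured as a sparse map from (row key, "family:qualifier") to a value, as in this conceptual sketch. Row keys and column names here are hypothetical; real HBase additionally versions each cell by timestamp and keeps rows sorted by key across distributed region servers.

```python
# Toy model of HBase's BigTable-style data layout.
table = {}

def put(row, column, value):
    # Store a cell under its row key and "family:qualifier" column name.
    table.setdefault(row, {})[column] = value

def get(row, column):
    # Missing rows or columns simply return None: the table is sparse.
    return table.get(row, {}).get(column)

put("user#1001", "info:name", "Ada")
put("user#1001", "stats:logins", 42)
put("user#1002", "info:name", "Grace")  # no "stats" cells: sparse rows cost nothing
```

The sparseness matters: unlike a relational table, every row can populate a different subset of columns without wasting storage.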
Apache Spark is the rising star of the big data ecosystem. The technology was originally developed at the AMPLab at the University of California, Berkeley. It can be used as a faster alternative to Hadoop's MapReduce because Spark keeps data in memory rather than writing intermediate results to disk, producing performance that can be up to 100 times faster, depending on the application.
Spark's developers now work at Databricks, which provides major support to the project within the Apache Software Foundation, and also offers a commercial Spark-as-a-Service. As of the end of 2015, Spark was the most active open source project in all of big data, with more than 600 contributors in the previous 12 months.
Many companies are using Spark today, including Amazon, Autodesk, eBay, Groupon, OpenTable, and TripAdvisor.
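Spark's core abstraction is a chain of transformations over an in-memory dataset. The sketch below mimics the semantics of Spark's RDD-style word count (roughly `rdd.flatMap(...).map(lambda w: (w, 1)).reduceByKey(add)`) in plain Python; the real engine runs each stage in parallel across a cluster and can cache intermediate results in memory between stages, which is where its speed advantage comes from.

```python
from itertools import groupby

lines = ["spark keeps data in memory", "memory makes spark fast"]

# flatMap + map: split lines into (word, 1) pairs.
pairs = [(word, 1) for line in lines for word in line.split()]

# reduceByKey: group the pairs by word, then sum each group's counts.
pairs.sort(key=lambda kv: kv[0])
counts = {word: sum(n for _, n in group)
          for word, group in groupby(pairs, key=lambda kv: kv[0])}
# counts["spark"] == 2, counts["memory"] == 2
```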
Apache Kafka was originally developed as a project within LinkedIn as a messaging system for brokering the massive quantity of real-time data generated and processed by the company's consumer-facing careers website and platform.
Kafka was donated to open source in 2011 and graduated from the Apache Incubator program in 2012. The LinkedIn developers who created Kafka became part of a new company spun out of LinkedIn called Confluent.
Kafka is used by LinkedIn, Twitter, Netflix, Pinterest, Goldman Sachs, and Coursera.
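Kafka's central idea is that a topic is an append-only log, and each consumer tracks its own read offset, so many consumers can replay the same stream independently. The class below is a toy in-memory model of that idea, not Kafka's API; real Kafka partitions and replicates these logs across a cluster of brokers.

```python
class Topic:
    """Toy append-only log modeling a single Kafka topic partition."""

    def __init__(self):
        self.log = []  # append-only sequence of messages

    def produce(self, message):
        self.log.append(message)

    def consume(self, offset):
        # Return all messages at and after `offset`, plus the next offset
        # the consumer should remember for its next read.
        return self.log[offset:], len(self.log)

events = Topic()
events.produce("user_signed_up")
events.produce("profile_viewed")

messages, next_offset = events.consume(0)  # a consumer reading from the start
# messages == ["user_signed_up", "profile_viewed"], next_offset == 2
```

Because the log is never mutated in place, a second consumer calling `consume(0)` later sees exactly the same history, which is what makes the model work as a broker between many producing and consuming systems.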
Apache Storm is described on its project page as a distributed real-time computation system that makes it easy to reliably process unbounded streams of data, doing for real-time processing what Hadoop did for batch processing.
The technology is sometimes described as an alternative to Spark. BackType, the company that developed Storm, was acquired by Twitter in 2011. Storm became a Top-Level Project at the Apache Software Foundation in 2014 after graduating from the Incubator.
Twitter has since developed its own in-house system for handling the tasks originally assigned to Storm. Companies using Storm include Yahoo and Spotify.
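Storm topologies are built from spouts, which emit an unbounded stream of tuples, and bolts, which process tuples one at a time while keeping running state. The names below are illustrative and the stream is finite for demonstration; a real topology is wired up through Storm's APIs and runs distributed across worker processes.

```python
from collections import Counter

def sentence_spout():
    # Stands in for an endless source, such as a live feed of messages.
    for sentence in ["storm processes streams", "streams never end"]:
        yield sentence

class WordCountBolt:
    """Toy bolt that keeps a running word count as tuples arrive."""

    def __init__(self):
        self.counts = Counter()

    def execute(self, sentence):
        for word in sentence.split():
            self.counts[word] += 1

bolt = WordCountBolt()
for tup in sentence_spout():
    bolt.execute(tup)
# bolt.counts["streams"] == 2
```

The key contrast with batch processing is that the bolt's state is updated per tuple as data arrives, rather than computed in one pass over a completed dataset.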
Apache NiFi, originally called Niagara Files, is a technology transfer project developed by the US National Security Agency (NSA) and spun out to the Apache Software Foundation as an Incubator project in November 2014. It became a Top-Level Project in 2015.
NiFi tackles the problem of how to automate the flow of data between systems. Its project page at the Apache Software Foundation says the technology "supports powerful scalable directed graphs of data routing, transformation, and system mediation logic."
It provides a Web-based user interface. And, as you might expect from an NSA-created project, it offers security features including SSL, SSH, HTTPS, encrypted content, and pluggable, role-based authentication and authorization.
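A "directed graph of data routing and transformation" can be pictured as records flowing through a chain of processing steps. The sketch below models two such steps as composed functions over hypothetical records; real NiFi builds these graphs visually in its web UI and handles back pressure, data provenance, and delivery guarantees on top.

```python
def route(records):
    # Routing step: keep only records addressed to the "analytics" system.
    return [r for r in records if r["dest"] == "analytics"]

def transform(records):
    # Transformation step: normalize the payload before handing it downstream.
    return [{**r, "payload": r["payload"].upper()} for r in records]

incoming = [
    {"dest": "analytics", "payload": "click"},
    {"dest": "archive", "payload": "pageview"},
]
delivered = transform(route(incoming))
# delivered == [{"dest": "analytics", "payload": "CLICK"}]
```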
The Apache Foundation accepted Apache Flink as a Top-Level Project in January 2015. The technology is a distributed data analysis engine for batch and streaming data that offers programming APIs in Java and Scala.
The project was born out of the Stratosphere research project in Berlin. Organizations using Flink include Capital One and Data Artisans.
Apache Arrow was accepted as a Top-Level Project by the Apache Software Foundation this month. The technology comes out of the company Dremio, which has also contributed the Apache Drill project. Dremio's founders came out of MapR, an Apache Hadoop distribution company.
Arrow was initially seeded by code from the Apache Drill project, according to the Apache Software Foundation. Arrow provides columnar in-memory analytics, according to Dremio co-founder and CTO Jacques Nadeau.
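The appeal of a columnar in-memory layout, Arrow's focus, is that an analytic aggregate over one field scans a single contiguous array instead of striding through whole rows. The sketch below contrasts the two layouts using Python lists and hypothetical data; Arrow itself defines a standardized binary columnar format that different engines can share without copying.

```python
rows = [  # row-oriented: each record's fields are stored together
    {"symbol": "AAA", "price": 10.0},
    {"symbol": "BBB", "price": 12.5},
    {"symbol": "AAA", "price": 11.0},
]

columns = {  # columnar: each field is stored as its own contiguous array
    "symbol": [r["symbol"] for r in rows],
    "price": [r["price"] for r in rows],
}

# An aggregate over one field touches only the "price" array.
total = sum(columns["price"])
# total == 33.5
```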
These are some of the highlights of the big data projects in the Hadoop ecosystem at the Apache Software Foundation. Many others have been donated. Development is ongoing for all these projects, which are fully documented at the Apache Software Foundation website.
"The Apache Way is community over code," Connolly told InformationWeek. "While technology is interesting, the Apache Way is about the community first. You check your [company's] badge at the door."