Apache Hadoop has come a long way since the first Hadoop Summit took place in 2007. From its humble origins as a promising open-source framework for managing data-intensive distributed applications, Hadoop has mushroomed into the leading big data platform, one doing real work at Fortune 500 corporations.
This year's Hadoop Summit, co-sponsored by Yahoo and Hortonworks, takes place June 26-27 in San Jose, Calif. The two-day event is expected to draw 2,500 to 3,000 attendees and will feature more than 90 breakout sessions on all things Hadoop, according to John Kreisa, vice president of strategic marketing for Hortonworks.
"I've been working with the technology for three or four years now, and over that time Hadoop has gone from the experimental, 'We've got a test cluster set up,' to 'OK, here's what we're going to do with it,'" Kreisa told InformationWeek.
The theme of this year's conference is Hadoop's "maturation," spotlighting the platform as a key component of the next generation of data architectures. "Effectively, Hadoop has matured now as a technology such that mainstream enterprises are using it for a wide variety of workloads," Kreisa said. Summit attendees will hear presentations from major corporations, including Cardinal Health, Home Depot, and Kohl's, that are using Hadoop for real workloads.
Despite Hadoop's growing popularity in the enterprise, however, it has its shortcomings, most notably a reputation for being difficult to use. There's also the problem of what to do with all that big data once you've collected it.
As InformationWeek's Doug Henschen writes, "In contrast to NoSQL, Hadoop seems to be getting all the credit it deserves and then some. By many accounts, it's the be-all and end-all of big data, despite the fact that the lion's share of deployments today are little more than digital landfills."
Kreisa counters that "digital landfill" is an interesting analogy, but not one that represents what he's seeing in the enterprise. "The term that we hear companies using, large financial services and telecommunications (firms), is 'data lake' or 'data reservoir,'" he said, adding that these organizations are able to "spin out" new analytic applications based on the data they're collecting.
Kreisa does acknowledge, however, that Hadoop has "a few rough edges that need to be sanded off," particularly in the areas of deployment and manageability. "These things continue to evolve," he said. "Hadoop is a large distributed system with lots of moving parts. A modern Hadoop platform will have 10 or 12 open-source projects as subcomponents."
Hadoop is arguably the best-known and most widely used big data management platform, but it certainly isn't the only option for enterprises. Should its proponents be worried?
"I don't see any serious competitors to Hadoop," Kreisa said. "There are lots of other technologies that fill different workload components, and part of it comes down to the underlying file system."
He continued, "Generally speaking, HDFS, the Hadoop Distributed File System, has almost really won the battle. If you look at other architectures, where people may try to replace the query engine on top of it … HDFS is still the underlying place where that data is coming to rest."
There's still a significant need for Hadoop training, Kreisa added, which in part is what this week's Summit is all about. "There needs to be growth in skills, because again, it's a complex distributed storage system that's not like the other things that people are using today."