Hadoop Big Data Startup Spins Out Of Yahoo

Startup Hortonworks will bring developers, plus a capital infusion from Yahoo, to speed up development of the open source code for big data analysis.

Charles Babcock, Editor at Large, Cloud

June 28, 2011

Slideshow: Yahoo's Hadoop Implementation

A core development group at Yahoo is being given venture capital backing and spun off to further the rapid enterprise-style development of Hadoop.

Within a few days, "something over 20" core committers and architects of Hadoop code will move off the Yahoo campus in Sunnyvale, Calif., into offices nearby to form Hortonworks, an independent company, said Eric Baldeschwieler, Yahoo's VP of software engineering for Hadoop, in an interview. He will become CEO of the new firm.

The name Hortonworks springs from a Dr. Seuss children's book about an elephant, Horton. Hadoop was originally named by its co-originator, Doug Cutting, after a stuffed toy elephant belonging to one of his children.

The move to create a self-sufficient company devoted to Hadoop commercialization comes on the heels of a LexisNexis announcement last week that its High Performance Computing Cluster (HPCC) big-data system will be publicly available as an open source project. HPCC is a future competitor of Hadoop in the big-data handling arena, its spokesmen said.

Jay Rossiter, senior VP of the cloud platform group at Yahoo, said Hortonworks not only has Yahoo's blessing but Yahoo is an investor in it, along with Benchmark Capital. The number of developers leaving Yahoo is a fraction of the total working on Hadoop. The two groups will "co-develop the next Hadoop release together," Rossiter said in an interview.

Rob Bearden, a partner at Benchmark, will become COO of Hortonworks. He is the former president of SpringSource, the firm behind the Spring Framework for Java developers, acquired in 2009 by VMware for $420 million. He is also the former COO of JBoss, the open source Java application server that was sold to Red Hat. He is the current chairman of Pentaho, an open source business intelligence system supplier.

Hortonworks will continue core development of Hadoop and also design ease-of-installation and ease-of-use features, Bearden said in an interview. All its development will be contributed to the Apache Software Foundation's Hadoop open source project, an effort that Yahoo has backed in full since February. Hadoop was created in 2005 by Cutting and a partner, Mike Cafarella, when Cutting was an engineer at Yahoo. Yahoo is one of the world's largest users of Hadoop, and its developers are believed to have contributed about 70% of its code.

Cutting left Yahoo in 2009 for an early Hadoop startup, Cloudera, which has established itself as a packager and ease-of-implementation vendor for Hadoop. Hortonworks and Cloudera are potential competitors. In May, another Hadoop startup emerged, Datameer, with $9.25 million in venture funding. No figure was disclosed for the funding behind Hortonworks.

Prior to February, Yahoo tested its own production version of Hadoop. Its knowledge of testing and patching was recognized as thorough, and its production version was frequently adopted by other companies as Yahoo made it available. Now the updates or "builds" of Hadoop emanating from Apache are used as the most reliable versions.

Baldeschwieler said Yahoo will remain an important proving ground for changes and improvements to Hadoop. Yahoo uses 18 Hadoop systems on a total of 42,000 servers to perform such functions as indexing the contents of the Web, delivering personalized content to Yahoo site visitors, screening spam out of Yahoo's email service, and serving advertising to Yahoo search users. Through the application of Hadoop, Yahoo has been able to improve the click-through rate on its home page by 270% by coming up with content that matches individual interests, Rossiter said.

Baldeschwieler said Benchmark Capital wanted to make an investment in a Hadoop play and approached Yahoo to split out a team of leading developers. Yahoo agreed, he said, because it wants to see a vigorous community sustained around Hadoop and wide adoption in the enterprise. A firm devoted to creating enterprise software will further that goal.

Although Yahoo runs Hadoop on 42,000 servers, its largest single system spans 4,000 servers. Hadoop is a parallel file distribution system that maps where files are located on a cluster and then sends sorting and analysis work to the nodes that are closest to the data. Baldeschwieler said a complex problem of building maps of the United States using millions of small image tiles originally took six weeks with a Yahoo graphics processing system. When Hadoop was added to the process, it took five days.
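The idea of sending work to the nodes closest to the data can be illustrated with a minimal Python sketch. It is not Hadoop's actual scheduler: the block names, node names, `slots_per_node` limit, and the `assign_tasks` function are all hypothetical, chosen only to show a task landing on a node that already holds a copy of its data, with a fallback to any free node when no such node has capacity.

```python
from collections import defaultdict

# Hypothetical block map: which nodes hold a replica of each file block.
BLOCK_LOCATIONS = {
    "tiles-0001": ["node-a", "node-b"],
    "tiles-0002": ["node-b", "node-c"],
    "tiles-0003": ["node-c", "node-a"],
    "tiles-0004": ["node-a", "node-b"],
}


def assign_tasks(block_locations, slots_per_node):
    """Assign one task per block, preferring nodes that hold the block.

    Returns {block: (node, is_local)}. If every replica holder is full,
    the task falls back to the least-loaded node with a free slot.
    """
    load = defaultdict(int)  # tasks assigned to each node so far
    all_nodes = sorted({n for nodes in block_locations.values() for n in nodes})
    assignment = {}
    for block in sorted(block_locations):
        holders = block_locations[block]
        # First choice: nodes that already store this block and have capacity.
        local = [n for n in holders if load[n] < slots_per_node]
        candidates = local or [n for n in all_nodes if load[n] < slots_per_node]
        if not candidates:
            raise RuntimeError("no free task slots left in the cluster")
        # Break ties by load, then by name, so the result is deterministic.
        chosen = min(candidates, key=lambda n: (load[n], n))
        load[chosen] += 1
        assignment[block] = (chosen, chosen in holders)
    return assignment


if __name__ == "__main__":
    for block, (node, is_local) in assign_tasks(BLOCK_LOCATIONS, 2).items():
        print(block, "->", node, "(local)" if is_local else "(remote)")
```

In this toy cluster every task runs where its data already sits, so no block has to cross the network before work begins. That is the property the description above attributes to Hadoop.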

Hortonworks will focus on improving Hadoop performance, making it easier to install and providing APIs for third parties to use to attach monitoring and management systems, he said.

Yahoo will also remain in the Hadoop development picture, keeping a large number of developers committed to the project. "Yahoo will continue to provide thought leadership on Hadoop... we have unmatched domain expertise," Rossiter said. Yahoo will provide testing and a large-scale production environment where Hadoop changes will be driven to the max. Hadoop has over 1,000 users inside the company, he said.

While Hortonworks will be separated from Yahoo as a company, the Hadoop teams won't be far apart. The startup's headquarters will be "very close to the Yahoo campus," Baldeschwieler said.

"We anticipate that within five years, more than half the world's data will be stored in Apache Hadoop," said Baldeschwieler in the Hortonworks announcement.

About the Author(s)

Charles Babcock

Editor at Large, Cloud

Charles Babcock is an editor-at-large for InformationWeek and author of Management Strategies for the Cloud Revolution, a McGraw-Hill book. He is the former editor-in-chief of Digital News, former software editor of Computerworld and former technology editor of Interactive Week. He is a graduate of Syracuse University where he obtained a bachelor's degree in journalism. He joined the publication in 2003.
