7/27/2012 03:55 PM

How Hadoop Cuts Big Data Costs

Hadoop systems, including hardware and software, cost about $1,000 a terabyte, or as little as one-twentieth the cost of other data management technologies, says Cloudera exec.

[ See slideshow: 12 Hadoop Vendors To Watch In 2012. ]
Managing prodigious volumes of data is not only challenging from a technological standpoint, it's often expensive as well. Apache Hadoop is a data management system adept at bringing data processing and analysis to raw storage. It's a cost-effective alternative to a conventional extract, transform, and load (ETL) process, which extracts data from different systems, converts it into a structure suitable for analysis and reporting, and loads it into a database.
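That shift is easiest to see in the canonical WordCount example from the Apache Hadoop tutorial, condensed below. Instead of pulling records through a separate ETL server, the framework schedules map tasks on the nodes that already hold the data blocks in HDFS; the input and output paths are simply whatever HDFS directories the job is handed.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Minimal MapReduce job: counts word occurrences across files stored in HDFS.
// Map tasks run on the nodes holding the data blocks, so the processing moves
// to the storage rather than the data moving through an ETL pipeline.
public class WordCount {

  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);   // emit (word, 1) for each token
      }
    }
  }

  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();           // total the counts for this word
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // pre-aggregate on the map nodes
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```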

"Big data tends to overwhelm ETL as a process," said Charles Zedlewski, VP of product at Cloudera, during a Carahsoft-hosted webinar this week. Cloudera sells a data-management platform built on Hadoop open-source software.

"The opportunity for big data has been stymied by the limitations of today's current data management architecture," said Zedlewski, who called Hadoop "very attractive" for advanced analytics and data processing.

Enterprises that ingest massive amounts of data--50 terabytes per day, for instance--aren't well-served by ETL systems. "It's very common to hear about people who are starting to miss the ETL window," Zedlewski said. "The number of hours it takes to pre-process data before they can make use of it has grown from four hours to five or six. In many cases, the amount of time is exceeding 24 hours."

In other words, there aren't enough hours in the day to process the volume of data received in a 24-hour period. Hadoop, by comparison, performs advanced data processing and analysis at very high speeds. It's highly scalable and flexible, too.
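To put illustrative numbers on that (the throughput figure here is an assumption, not one Cloudera provided): an ETL pipeline that pre-processes 2 terabytes per hour needs 25 hours to clear a 50-terabyte day, so the backlog grows daily no matter when the window opens. Splitting the same 50 terabytes across 25 nodes, each processing its local share at the same rate, cuts the job to about an hour--and that scale-out approach is exactly how Hadoop attacks the problem.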

[ Read IT's Next Hot Job: Hadoop Guru. ]

"Scalability is obviously very essential for big data projects--the whole point is that it's big," Zedlewski said. With Hadoop it's possible to store--and actually ask questions of--100 petabytes of data. "That's something that was never before possible, and is arguably at least 10 times more scalable than the next best alternative," he added.

Apache Hadoop is more than six years old and was developed to help Internet-based companies deal with prodigious volumes of data. A Hadoop system typically integrates with databases or data warehouses. "It's common that Hadoop is used in conjunction with databases. In the Hadoop world, databases don't go away. They just play a different role than Hadoop does," said Zedlewski.

Hadoop's most powerful attribute is its flexibility. "This is probably the single greatest reason why people are attracted to the system," said Zedlewski. Hadoop lets you store and capture all kinds of different data, including documents, images, and video, and make it readily available for processing and analysis.
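That flexibility shows up at the API level: HDFS stores files of any type byte-for-byte, and structure is imposed later by whatever job reads them--schema on read rather than schema on load. Below is a minimal sketch using the standard Hadoop FileSystem API; the class name and paths are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copies an arbitrary local file -- text, image, or video -- into HDFS as-is.
// No schema is declared up front; structure is applied at read time by
// whatever job later processes the file.
public class HdfsIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads core-site.xml for the cluster address
    FileSystem fs = FileSystem.get(conf);
    Path src = new Path(args[0]);             // e.g. a local video file
    Path dst = new Path(args[1]);             // e.g. an HDFS directory such as /raw/video
    fs.copyFromLocalFile(src, dst);           // stored as replicated blocks across nodes
    fs.close();
  }
}
```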

The cost of a Hadoop data management system, including hardware, software, and other expenses, comes to about $1,000 a terabyte--about one-fifth to one-twentieth the cost of other data management technologies, Zedlewski estimated. Pre-existing data management technologies, by comparison, might make big data projects uneconomical.

"If you look at network storage, it's not unreasonable to think of a number on the order of about $5,000 per terabyte," said Zedlewski. "Sometimes it goes much higher than that. If you look at databases, data marts, data warehouses, and the hardware that supports them, it's not uncommon to talk about numbers more like $10,000 or $15,000 a terabyte."

And because legacy data management technologies often store multiple copies of the same data on different systems, the total cost might be more like $30,000 to $40,000 per terabyte, Zedlewski said.
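Scaled up, the gap is stark. At the figures Zedlewski cited, a 1-petabyte store works out to roughly $1 million on Hadoop ($1,000 x 1,000 terabytes), versus $10 million to $15 million on a conventional database or data warehouse stack--and $30 million to $40 million once those duplicate copies are counted.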

Hadoop isn't a cure-all for every use case, but it has proven effective in a variety of industries. In manufacturing, for instance, Hadoop is used to assess product quality. In telecommunications, it's used for content mediation. And it's popular among government agencies for applications including security, search, geospatial data, and location-based data push.

Big data places heavy demands on storage infrastructure. In the new, all-digital Big Storage issue of InformationWeek Government, find out how federal agencies must adapt their architectures and policies to optimize it all. Also, we explain why tape storage continues to survive and thrive.
