How Hadoop Cuts Big Data Costs - InformationWeek

InformationWeek is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them.Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


How Hadoop Cuts Big Data Costs

Hadoop systems, including hardware and software, cost about $1,000 a terabyte, or as little as one-twentieth the cost of other data management technologies, says Cloudera exec.

12 Hadoop Vendors To Watch In 2012
12 Hadoop Vendors To Watch In 2012
(click image for larger view and for slideshow)
Managing prodigious volumes of data is not only challenging from a technological standpoint, it's often expensive as well. Apache Hadoop is a data management system adept at bring data processing and analysis to raw storage. It's a cost-effective alternative to a conventional extract, transform, and load (ETL) process that extracts data from different systems, converts it into a structure suitable for analysis and reporting, and loads it onto a database.

"Big data tends to overwhelm ETL as a process," said Charles Zedlewski, VP of product at Cloudera, during a Carahsoft-hosted webinar this week. Cloudera sells a data-management platform built on Hadoop open-source software.

"The opportunity for big data has been stymied by the limitations of today's current data management architecture," said Zedlewski, who called Hadoop "very attractive" for advanced analytics and data processing.

Enterprises that ingest massive amounts of data--50 terabytes per day, for instance--aren't well-served by ETL systems. "It's very common to hear about people who are starting to miss the ETL window," Zedlewski said. "The number of hours it takes to pre-process data before they can make use out of it has grown from four hours to five, six. In many cases, the amount of time is exceeding 24 hours."

In other words, there aren't enough hours in the day to process the volume of data received in a 24-hour period. Hadoop, by comparison, performs advanced data processing and analysis at very high speeds. It's highly scalable and flexible, too.

[ Read IT's Next Hot Job: Hadoop Guru. ]

"Scalability is obviously very essential for big data projects--the whole point is that it's big," Zedlewski said. With Hadoop it's possible to store--and actually ask questions of--100 petabytes of data. "That's something that was never before possible, and is arguably at least 10 times more scalable than the next best alternative," he added.

Apache Hadoop is more than six years old and was developed to help Internet-based companies deal with prodigious volumes of data. A Hadoop system typically integrates with databases or data warehouses. "It's common that Hadoop is used in conjunction with databases. In the Hadoop world, databases don't go away. They just play a different role than Hadoop does," said Zedlewski.

Hadoop's most powerful attribute is its flexibility. "This is probably the single greatest reason why people are attracted to the system," said Zedlewski. Hadoop lets you store and capture all kinds of different data, including documents, images, and video, and make it readily available for processing and analysis.

The cost of a Hadoop data management system, including hardware, software, and other expenses, comes to about $1,000 a terabyte--about one-fifth to one-twentieth the cost of other data management technologies, Zedlewski estimated. Pre-existing data management technologies, by comparison, might make big data projects uneconomical.

"If you look at network storage, it's not unreasonable to think of a number on the order of about $5,000 per terabyte," said Zedlewski. "Sometimes it goes much higher than that. If you look at databases, data marts, data warehouses, and the hardware that supports them, it's not uncommon to talk about numbers more like $10,000 or $15,000 a terabyte."

And because legacy data management technologies often store multiple copies of the same data on different systems, the total cost might be more like $30,000 to $40,000 per terabyte, Zedlewski claims.

Hadoop isn't a cure-all for every use case, but it has proven effective in a variety of industries. In manufacturing, for instance, Hadoop is used to assess product quality. In telecommunications, it's used for content mediation. And it's popular among government agencies for a variety of applications, including security, search, geo-spatial data, and location-based push of data.

Big data places heavy demands on storage infrastructure. In the new, all-digital Big Storage issue of InformationWeek Government, find out how federal agencies must adapt their architectures and policies to optimize it all. Also, we explain why tape storage continues to survive and thrive.

We welcome your comments on this topic on our social media channels, or [contact us directly] with questions about the site.
Comment  | 
Print  | 
More Insights
2021 Outlook: Tackling Cloud Transformation Choices
Joao-Pierre S. Ruth, Senior Writer,  1/4/2021
Enterprise IT Leaders Face Two Paths to AI
Jessica Davis, Senior Editor, Enterprise Apps,  12/23/2020
10 IT Trends to Watch for in 2021
Cynthia Harvey, Freelance Journalist, InformationWeek,  12/22/2020
White Papers
Register for InformationWeek Newsletters
The State of Cloud Computing - Fall 2020
The State of Cloud Computing - Fall 2020
Download this report to compare how cloud usage and spending patterns have changed in 2020, and how respondents think they'll evolve over the next two years.
Current Issue
2021 Top Enterprise IT Trends
We've identified the key trends that are poised to impact the IT landscape in 2021. Find out why they're important and how they will affect you.
Flash Poll