Commentary
11/14/2012 10:23 AM
Doug Henschen

Sears Hadoop Plans: Check Out Data Warehousing's Future

Will Hadoop become the new enterprise data warehouse? Sears' CTO is not alone in seeing a shift in how we'll use relational databases.

Radio did not spell the end of newspapers, nor television the end of radio, nor the Internet the end of television. But each advance fundamentally changed the use of the prior platform. And so it will be with Hadoop and relational databases.

If the example of Sears can serve as our guide, Hadoop will become a popular central corporate data repository -- perhaps even the leading data repository eventually. It will take over that role not only because it can handle huge volumes of data more cost effectively than relational databases, but also because it easily ingests varied and complex data without first conforming it to a pre-defined schema, as you have to do when using a database. You can save all your data for the long term and apply schema when you need to use it, rather than imposing a schema before it's loaded onto the platform.
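The schema-on-read idea described above can be sketched in a few lines of Python. This is a hypothetical illustration with made-up retail records; in a real Hadoop deployment the raw files would sit in HDFS and the schema would typically be applied at query time by a tool such as Hive or Pig.

```python
import json

# Raw events are stored exactly as they arrive, one JSON document per
# line -- no schema is enforced at load time (schema-on-read).
raw_lines = [
    '{"store": 1021, "sku": "A-77", "amount": "19.99", "ts": "2012-11-14"}',
    '{"store": 1021, "sku": "B-03", "amount": "5.49"}',  # fields may be missing
]

def read_with_schema(lines):
    """Apply a schema only when the data is read for analysis."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "store": int(rec["store"]),
            "sku": rec["sku"],
            "amount": float(rec["amount"]),
            "ts": rec.get("ts"),  # optional field, defaulted at read time
        }

records = list(read_with_schema(raw_lines))
```

Because the schema lives in the reader rather than the storage layer, a new analysis can impose a different schema on the same raw files without reloading anything.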

At Sears, Hadoop was first deployed three years ago and it has since become the central hub of all data management activity for the retailer. CTO Phil Shelley tells InformationWeek that Hadoop is giving Sears the flexibility and scale to make use of all the company's data. "We keep all the raw, transactional data, and because there's enough horsepower in Hadoop, you can then transform it into any form you want whenever you want on the fly, rather than having to create cubes or aggregations," Shelley explains.
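The transform-on-demand pattern Shelley describes is essentially what MapReduce does: instead of maintaining pre-built cubes, each rollup is computed from the raw records when it is needed. A minimal single-process sketch, with hypothetical transaction data (a real job would distribute the map and reduce phases across the cluster):

```python
from collections import defaultdict

# Hypothetical raw transaction records, kept in full detail.
transactions = [
    {"store": 1021, "category": "tools", "amount": 19.99},
    {"store": 1021, "category": "apparel", "amount": 45.00},
    {"store": 3304, "category": "tools", "amount": 7.25},
]

def map_phase(records):
    """Map: emit (key, value) pairs for whatever rollup is needed right now."""
    for rec in records:
        yield (rec["category"], rec["amount"])

def reduce_phase(pairs):
    """Reduce: sum per key -- the aggregation is computed on demand,
    not maintained as a pre-built OLAP cube."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

totals = reduce_phase(map_phase(transactions))
```

Swapping the key in the map phase (store instead of category, say) yields a different aggregation from the same raw data, with no cube rebuild.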

[ Want the inside story on big data plans at Sears? Read Why Sears Is Going All-In On Hadoop. ]

Hadoop has essentially become the enterprise data store at Sears, but that's not quite the same thing as an enterprise data warehouse. The difference is analysis, some of which can be done with the batch MapReduce processing native to Hadoop. But the retailer is still using relational databases in many situations. InfoBright's columnar database, for example, is used for fast analysis of data aggregations that used to be created -- with much IT time and expense -- as multi-dimensional OLAP cubes. Cube building is now a thing of the past. Instead, fresh data sets are moved from Hadoop into InfoBright on a daily basis.

In another example, Sears' massive Teradata deployment continues to run high-scale, mission-critical analytical applications. "Teradata is still an important platform for us whenever we need a high-speed SQL interface," explains Shelley. "That could be when we're integrating with SAS [analytics] or doing custom analytics with SQL."

That puts Teradata in the role of analytic data mart, however, as opposed to its usual place as the enterprise data warehouse that holds all important data. Nonetheless, Sears is using more Teradata than ever, says Teradata, and perhaps that's because Hadoop enables the retailer to store and retain more data than ever. Sears is now saving data that it used to throw out and it's retaining indefinitely data that it used to keep for only 90 days or two years. More data for analysis brings more analysis.

Lots of Hadoop users share Shelley's perspective on how it can become a central hub for data management -- longtime Hadoop shop JP Morgan Chase started envisioning this role years ago. In fact, at last month's Strata New York event it seemed that the focus on Hadoop has shifted. The questions are no longer "what is Hadoop" and "does it make sense for my company?" People are now asking, "do I have the people I need to run Hadoop," and "how will I analyze and make use of all that information?"

For now, moving boiled-down data sets from Hadoop into existing relational environments will be part of the answer, but that approach involves data-movement delays that plenty of practitioners would like to avoid. "The BI industry has still got its head in the sand, mostly because they're all still thinking about moving and copying data," Shelley tells InformationWeek. "These vendors need to get their act together and write tools that run natively on Hadoop and don't copy the data and use ETL to move it into their environment."

Comments
JHADDAD3380
11/16/2012 | 2:05:06 AM
re: Sears Hadoop Plans: Check Out Data Warehousing's Future
I see Hadoop as a key component of a big data analytics strategy that complements and needs to integrate with the rest of an enterprise information management infrastructure that may include legacy systems (like the mainframe), relational databases, ERP, CRM, and cloud applications, data warehouse appliances, etc. Not only are the data volumes growing exponentially but the variety of data is increasing with social media, sensor devices, call detail records, industry standards data (e.g. HL7 in healthcare, FIX, SWIFT, and market data in Financial Services, etc.), log files, and the list goes on.

It certainly makes sense to store a lot of the raw multi-structured and unstructured data in Hadoop rather than a traditional relational database. However, even if you assume that over time more and more data will be stored in Hadoop, you still need to access an ever-increasing variety of data from multiple organizations, residing in different systems and formats, and then parse and transform it on Hadoop before you can do any useful analysis.

I'm hearing from data scientists that about 80% of the work in a big data project is data integration. In fact, in one study of 35 data scientists, one of them stated, "I spend more than half my time integrating, cleansing, and transforming data without doing any actual analysis. Most of the time I'm lucky if I get to do any 'analysis' at all." (Kandel, et al. Enterprise Data Analysis and Visualization: An Interview Study. IEEE Visual Analytics Science and Technology (VAST), 2012). The need for data integration is greater today than it ever has been. The challenge is to make data integration easier and more productive on emerging technologies such as Hadoop. Informatica's PowerCenter Big Data Edition (http://bit.ly/U25Cn8) provides a no-code development environment to visually design data integration flows and then execute them on Hadoop, so that data scientists can spend more of their time doing analysis rather than integrating data.