Doug Henschen, Executive Editor, InformationWeek
Sears Hadoop Plans: Check Out Data Warehousing's Future
If the example of Sears can serve as our guide, Hadoop will become a popular central corporate data repository -- perhaps even the leading data repository eventually. It will take over that role not only because it can handle huge volumes of data more cost-effectively than relational databases, but also because it easily ingests varied and complex data without first conforming it to a pre-defined schema, as you have to do when using a relational database. You can save all your data for the long term and apply a schema when you need to use it, rather than imposing a schema before the data is loaded onto the platform.
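To make the "apply schema when you need to use it" idea concrete, here is a minimal sketch in plain Python. It is not Sears' code and does not use any Hadoop API; the record layout, the field names (store, sku, amount) and the helper functions are all hypothetical, chosen only to show raw records being kept as-is and given a shape at read time.

```python
import json

# Hypothetical raw records, kept exactly as they arrived.
# No schema was enforced at load time, so fields vary from line to line.
RAW_LINES = [
    '{"store": "0142", "sku": "A17", "amount": "19.99", "ts": "2012-10-01T09:15:00"}',
    '{"store": "0142", "sku": "B03", "amount": "5.00"}',
    '{"store": "0977", "sku": "A17", "amount": "19.99", "coupon": "FALL10"}',
]

# A schema chosen at read time (assumption for illustration):
# field name -> (converter, default used when the field is missing).
SALES_SCHEMA = {
    "store": (str, None),
    "sku": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the schema as it is read."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


def total_by_store(rows):
    """Sum the amount field per store over rows that already have a schema."""
    totals = {}
    for row in rows:
        totals[row["store"]] = totals.get(row["store"], 0.0) + row["amount"]
    return totals


totals = total_by_store(read_with_schema(RAW_LINES, SALES_SCHEMA))
```

The extra fields (ts, coupon) stay in the raw data untouched, so a different question later can bring its own schema without reloading anything.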
At Sears, Hadoop was first deployed three years ago, and it has since become the central hub of all data management activity for the retailer. CTO Phil Shelley tells InformationWeek that Hadoop is giving Sears the flexibility and scale to make use of all the company's data. "We keep all the raw, transactional data, and because there's enough horsepower in Hadoop, you can then transform it into any form you want whenever you want on the fly rather than having to create cubes or aggregations," Shelley explains.
Hadoop has essentially become the enterprise data store at Sears, but that's not quite the same thing as an enterprise data warehouse. The difference is analysis, some of which can be done with the batch MapReduce processing native to Hadoop. But the retailer is still using relational databases in many situations. InfoBright's columnar database, for example, is used for fast analysis of data aggregations that used to be created -- with much IT time and expense -- as multi-dimensional OLAP cubes. Cube building is now a thing of the past. Instead, fresh data sets are moved from Hadoop into InfoBright on a daily basis.
In another example, Sears' massive Teradata deployment continues to run high-scale, mission-critical analytical applications. "Teradata is still an important platform for us whenever we need a high-speed SQL interface," explains Shelley. "That could be when we're integrating with SAS [analytics] or doing custom analytics with SQL."
That puts Teradata in the role of analytic data mart, however, as opposed to its usual place as the enterprise data warehouse that holds all important data. Nonetheless, Sears is using more Teradata than ever, says Teradata, and perhaps that's because Hadoop enables the retailer to store and retain more data than ever. Sears is now saving data that it used to throw out, and it's keeping indefinitely data that it used to retain for only 90 days or two years. More data for analysis brings more analysis.
Lots of Hadoop users share Shelley's perspective on how it can become a central hub for data management -- longtime Hadoop shop JP Morgan Chase started envisioning this role years ago. In fact, at last month's Strata New York event it seemed that the focus on Hadoop has shifted. The questions are no longer "What is Hadoop?" and "Does it make sense for my company?" People are now asking, "Do I have the people I need to run Hadoop?" and "How will I analyze and make use of all that information?"
For now, moving boiled-down data sets from Hadoop into existing relational environments will be part of the answer, but that approach involves data-movement delays that plenty of practitioners would like to avoid. "The BI industry has still got its head in the sand mostly because they're all still thinking about moving and copying data," Shelley tells InformationWeek. "These vendors need to get their act together and write tools that run natively on Hadoop and don't copy the data and use ETL to move it into their environment."