
Why Sears Is Going All-In On Hadoop

Doug Henschen
Executive Editor, InformationWeek



(Page 2 of 3)

'ETL Must Die'



Sears' move to Hadoop began as an experiment using a single node running on a netbook computer--the netbook that still sits on Shelley's office desk. Sears deployed its first production cluster of 20 to 30 nodes in early 2010. At the time, a major bottleneck in big data processing was extract, transform, and load (ETL) processing, and Shelley has become a zealot about eliminating ETL.

"ETL is an antiquated technique, and for large companies it's inefficient and wasteful because you create multiple copies of data," he says. "Everybody used ETL because they couldn't put everything in one place, but that has changed with Hadoop, and now we copy data, as a matter of principle, only when we absolutely have to copy."

Sears can't eliminate ETL overnight, so it has been moving the slowest and most processing-intensive steps within ETL jobs into Hadoop. Shelley cites an ETL process that took 20 hours to run using IBM DataStage software on a cluster of distributed servers. One step that took 10 hours to run in DataStage now can run in 17 minutes on Hadoop, he says.

One downside: It takes 90 minutes to FTP the job to Hadoop and then bring results back to the ETL servers. That FTP time is a trade-off in Sears' approach of picking off one ETL step at a time. Shelley intends to keep moving steps in that process until the entire data transformation workload is on Hadoop.
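As a back-of-the-envelope check on that trade-off, the figures Shelley cites can be added up in a few lines of Python. This is only a sketch: it assumes the FTP transfer and the Hadoop run happen one after the other, and that the rest of the 20-hour job is unchanged. Neither assumption is stated in the article, and the function and variable names are illustrative.

```python
# Figures quoted in the article; the serial-execution model is an assumption.
JOB_TOTAL_MIN = 20 * 60        # full ETL job on DataStage
STEP_DATASTAGE_MIN = 10 * 60   # the one step, run in DataStage
STEP_HADOOP_MIN = 17           # the same step, run on Hadoop
FTP_ROUND_TRIP_MIN = 90        # FTP to Hadoop and results back


def offloaded_step_minutes(hadoop_min: int, transfer_min: int) -> int:
    """Elapsed time for the step once transfer overhead is included.

    Assumes the transfer and the Hadoop run do not overlap.
    """
    return hadoop_min + transfer_min


def format_minutes(total_min: int) -> str:
    """Render a minute count as hours and minutes."""
    hours, minutes = divmod(total_min, 60)
    return f"{hours}h {minutes:02d}m"


step_after = offloaded_step_minutes(STEP_HADOOP_MIN, FTP_ROUND_TRIP_MIN)
# Assumes every other step in the job takes as long as before.
job_after = JOB_TOTAL_MIN - STEP_DATASTAGE_MIN + step_after

print("step before:", format_minutes(STEP_DATASTAGE_MIN))
print("step after: ", format_minutes(step_after))
print("job before: ", format_minutes(JOB_TOTAL_MIN))
print("job after:  ", format_minutes(job_after))
```

Under those assumptions the step drops from 10 hours to 1 hour 47 minutes even with the FTP overhead, and the transfer, not the Hadoop run, accounts for most of what remains.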

"The reason we do it this way is you get a very big hit quickly," he says, noting it takes less than two weeks to get each step into production. Shelley vows to get rid of ETL eventually, "but you do it in a very nondisruptive, non-scary way for the business."

5 'Pillars' From Sears Chairman Lampert

1. Lasting customer relationships: Sears launched a loyalty program in 2011, expanding personalized promotions
2. Productivity and efficiency: They're key to better profits, but Lampert says Sears has "fared very poorly"
3. Building brands: Kenmore and Craftsman are strong, but Lampert wants them to be the "Nike and Apple of appliances, tools, and lawn and garden"
4. Reinvent Sears with tech and innovation: Everyone, young and old, will use stores, online, and mobile, so Sears needs to make it easier
5. Values: More information sharing, more digital tools for store employees

Shelley's "ETL must die" view has its doubters. Coming to the defense of ETL, Mike Olson, CEO of Cloudera, the leading Hadoop software distributor, recently told InformationWeek, "Almost without exception, when we see Hadoop in real customer deployments, it is stood up next to existing infrastructure that's aimed at existing business problems."

Shelley sees Hadoop as part of a larger IT ecosystem, too, and says systems such as Teradata will continue to have an important, focused role at Sears. But he's on the far end of the spectrum in terms of how much of the legacy environment Hadoop might replace. Countering Shelley's sometimes sweeping predictions of legacy system replacement, Olson says: "It's unlikely that a brand-new entrant to the market [like Hadoop] is going to displace tools for established workloads."

Scaling Out

Sears' main Hadoop cluster has nearly 300 nodes, and it's populated with 2 PB of data--mostly structured data such as customer transaction, point-of-sale, and supply chain data. (By default, Hadoop stores two additional copies of the data, so the total environment is 6 PB.) To give a sense of how early Sears was to Hadoop development, Wal-Mart divulged early this year that it was scaling out an experimental 10-node Hadoop cluster for e-commerce analysis. Sears passed that size in 2010.
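The 6 PB figure follows from replication: 2 PB of data stored three times, the original plus two copies. A minimal sketch of that arithmetic, assuming the cluster runs at the stock HDFS replication factor of 3 (the article does not say what setting Sears actually uses, and the function name is illustrative):

```python
# HDFS default for dfs.replication; whether Sears uses the default is an assumption.
DEFAULT_REPLICATION = 3


def raw_capacity_pb(logical_pb: float, replication: int = DEFAULT_REPLICATION) -> float:
    """Disk space consumed when every block is stored `replication` times."""
    return logical_pb * replication


logical_pb = 2.0  # data volume quoted in the article
total_pb = raw_capacity_pb(logical_pb)
print(f"{logical_pb:g} PB of data occupies {total_pb:g} PB at replication {DEFAULT_REPLICATION}")
```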

Sears now keeps all of its data down to individual transactions (rather than aggregates) and years of history (rather than imposing quarterly windows on certain data, as it did previously). That's raw data, which Shelley says Sears can refactor and combine as needed quickly and efficiently within Hadoop.

Hadoop isn't a science project at Sears--critical reports run on the platform, including financial analyses; SEC reporting; logistics planning; and analyses of supply chains, products, and customer data. For ad hoc query and analysis, Sears uses Datameer, a spreadsheet-style tool that supports data exploration and visualization directly on Hadoop, without copying or moving data. Using Datameer, Sears can develop interactive reports in three days that used to take IT six to 12 weeks, Shelley says. The old approach required intensive IT support for ETL, data cubing, and associated testing. Now line-of-business power users are developing most of the new reports.
