Why Sears Is Going All-In On Hadoop - InformationWeek

Why Sears Is Going All-In On Hadoop

Sears pushes the cutting edge with some big data techniques, while trying to sell its big data services. Can emerging tech drive change in old-school companies?

'ETL Must Die'

Sears' move to Hadoop began as an experiment using a single node running on a netbook computer--the netbook that still sits on Shelley's office desk. Sears deployed its first production cluster of 20 to 30 nodes in early 2010. A major big data processing bottleneck then was extract, transform, and load processing, and Shelley has become a zealot about eliminating ETL.

"ETL is an antiquated technique, and for large companies it's inefficient and wasteful because you create multiple copies of data," he says. "Everybody used ETL because they couldn't put everything in one place, but that has changed with Hadoop, and now we copy data, as a matter of principle, only when we absolutely have to copy."

Sears can't eliminate ETL overnight, so it has been moving the slowest and most processing-intensive steps within ETL jobs into Hadoop. Shelley cites an ETL process that took 20 hours to run using IBM DataStage software on a cluster of distributed servers. One step that took 10 hours to run in DataStage now can run in 17 minutes on Hadoop, he says.

One downside: It takes 90 minutes to FTP the job to Hadoop and then bring results back to the ETL servers. That FTP time is a trade-off in Sears' approach of picking off one ETL step at a time. Shelley intends to keep moving steps in that process until the entire data transformation workload is on Hadoop.
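To make the trade-off concrete, here is a minimal back-of-the-envelope sketch using only the figures Shelley cites. It assumes the 90 minutes of FTP time is the total round trip and that every other step in the 20-hour job keeps its original runtime; the function and constant names are illustrative, not anything Sears uses.

```python
# Figures quoted in the article, all in minutes.
DATASTAGE_JOB_MIN = 20 * 60    # full ETL job on DataStage
DATASTAGE_STEP_MIN = 10 * 60   # the single step before it was offloaded
HADOOP_STEP_MIN = 17           # the same step running on Hadoop
FTP_ROUND_TRIP_MIN = 90        # assumed total: ship data to Hadoop and bring results back


def offloaded_step_minutes(hadoop_min: int, transfer_min: int) -> int:
    """Wall-clock cost of the step once it runs on Hadoop, transfer included."""
    return hadoop_min + transfer_min


def job_minutes_after_offload(job_min: int, old_step_min: int, new_step_min: int) -> int:
    """Assumption: all other steps in the job keep their original runtime."""
    return job_min - old_step_min + new_step_min


new_step = offloaded_step_minutes(HADOOP_STEP_MIN, FTP_ROUND_TRIP_MIN)
new_job = job_minutes_after_offload(DATASTAGE_JOB_MIN, DATASTAGE_STEP_MIN, new_step)
step_speedup = DATASTAGE_STEP_MIN / new_step

print(f"step: {DATASTAGE_STEP_MIN} min -> {new_step} min ({step_speedup:.1f}x faster)")
print(f"job:  {DATASTAGE_JOB_MIN} min -> {new_job} min")
```

Under those assumptions the step drops from 600 minutes to 107, and the whole job from 20 hours to roughly 11 hours 47 minutes. The FTP transfer, not the Hadoop run, dominates what remains of the step, which is why moving further steps onto Hadoop should shrink the overhead.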

"The reason we do it this way is you get a very big hit quickly," he says, noting it takes less than two weeks to get each step into production. Shelley vows to get rid of ETL eventually, "but you do it in a very nondisruptive, non-scary way for the business."

5 'Pillars' From Sears Chairman Lampert

1. Lasting customer relationships: Sears launched a loyalty program in 2011, expanding personalized promotions
2. Productivity and efficiency: They're key to better profits, but Lampert says Sears has "fared very poorly"
3. Building brands: Kenmore and Craftsman are strong, but Lampert wants them to be the "Nike and Apple of appliances, tools, and lawn and garden"
4. Reinvent Sears with tech and innovation: Everyone, young and old, will use stores, online, and mobile, so Sears needs to make it easier
5. Values: More information sharing, more digital tools for store employees

Shelley's "ETL must die" view has its doubters. Coming to the defense of ETL, Mike Olson, CEO of Cloudera, the leading Hadoop software distributor, recently told InformationWeek, "Almost without exception, when we see Hadoop in real customer deployments, it is stood up next to existing infrastructure that's aimed at existing business problems."

Shelley sees Hadoop as part of a larger IT ecosystem, too, and says systems such as Teradata will continue to have an important, focused role at Sears. But he's on the far end of the spectrum in terms of how much of the legacy environment Hadoop might replace. Countering Shelley's sometimes sweeping predictions of legacy system replacement, Olson says: "It's unlikely that a brand-new entrant to the market [like Hadoop] is going to displace tools for established workloads."

Scaling Out

Sears' main Hadoop cluster has nearly 300 nodes, and it's populated with 2 PB of data--mostly structured data such as customer transaction, point of sale, and supply chain. (Hadoop systems create two additional copies of the data, so the total environment is 6 PB.) To give a sense of how early Sears was to Hadoop development, Wal-Mart divulged early this year that it was scaling out an experimental 10-node Hadoop cluster for e-commerce analysis. Sears passed that size in 2010.
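The 6 PB figure follows from simple replication arithmetic. The sketch below reads "two copies" as two replicas in addition to the original, which matches HDFS's default replication factor of three; that reading, and the helper name, are assumptions rather than details Sears has confirmed.

```python
def raw_capacity_pb(logical_pb: float, replication_factor: int) -> float:
    """Raw storage consumed when every block is stored replication_factor times."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return logical_pb * replication_factor


LOGICAL_DATA_PB = 2     # data volume quoted in the article
REPLICATION_FACTOR = 3  # assumption: original plus two copies (the HDFS default)

total = raw_capacity_pb(LOGICAL_DATA_PB, REPLICATION_FACTOR)
print(f"{LOGICAL_DATA_PB} PB logical x {REPLICATION_FACTOR} replicas = {total} PB raw")
```

The same function shows why "copying data only when we absolutely have to" matters at this scale: each extra full copy of the data set would add another 2 PB of raw storage.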

Sears now keeps all of its data down to individual transactions (rather than aggregates) and years of history (rather than imposing quarterly windows on certain data, as it did previously). That's raw data, which Shelley says Sears can refactor and combine as needed quickly and efficiently within Hadoop.

Hadoop isn't a science project at Sears--critical reports run on the platform, including financial analyses; SEC reporting; logistics planning; and analyses of supply chains, products, and customer data. For ad hoc query and analysis, Sears uses Datameer, a spreadsheet-style tool that supports data exploration and visualization directly on Hadoop, without copying or moving data. Using Datameer, Sears can develop in three days interactive reports that used to take IT six to 12 weeks, Shelley says. The old approach required intensive IT support for ETL, data cubing, and associated testing. Now line-of-business power users are developing most of the new reports.

Soozy G. Miller
User Rank: Strategist
12/5/2012 | 3:34:31 PM
re: Why Sears Is Going All-In On Hadoop
"But will companies be interested in buying big data cloud and consulting services from Sears?"

So, what, is Sears planning on entering a whole new market? Then again, The Gap started by selling music and jeans.
Soozy G. Miller
User Rank: Strategist
12/5/2012 | 3:31:54 PM
re: Why Sears Is Going All-In On Hadoop
@srentner good point. It's great that Sears has a great big data solution; it's quite another to get the analytics to put that big data to use.
User Rank: Apprentice
10/31/2012 | 7:59:28 PM
re: Why Sears Is Going All-In On Hadoop
The final questions are extremely apropos and often cause the most confusion: "Could quick analytical access to an entire decade of medical record data change how doctors diagnose and treat patients? Could faster processing spot financial services fraud more effectively?" This is not what Hadoop does. It is not an analytics technology, as pointed out on page 1. Extracting this type of valuable insight from the data requires a new class of analytics technologies, and the more powerful the mathematical algorithms, the faster and more accurate the insight.
Ellis Booker
User Rank: Moderator
10/31/2012 | 3:07:19 PM
re: Why Sears Is Going All-In On Hadoop
This is a big story on a number of fronts. First, it clearly expresses the value of big data analysis for retailers. As one of the Sears executives puts it, "With Hadoop we can keep everything, which is crucial because we don't want to archive or delete meaningful data." Second, it addresses the oft-heard complaint that big data solutions are prohibitively expensive--in fact, Sears says it reduced mainframe costs by more than $500,000 per year. Finally, the installation moves the retailer closer to real-time analysis: "Sears can develop in three days interactive reports that used to take IT six to 12 weeks." --Ellis Booker, InformationWeek Community Editor