Like many retailers, Sears Holdings, the parent of Sears and Kmart, is trying to get closer to its customers. At Sears' scale, that requires big-time data analysis capabilities, but three years ago, Sears' IT wasn't really up to the task.
"We wanted to personalize marketing campaigns, coupons, and offers down to the individual customer, but our legacy systems were incapable of supporting that," says Phil Shelley, Sears' executive VP and CTO, in a meeting with InformationWeek editors and his team at company headquarters in suburban Chicago.
Improving customer loyalty, and with it sales and profitability, is desperately important to Sears as it faces fierce competition from Wal-Mart and Target, as well as online retailers such as Amazon.com. While revenue at Sears declined from $50 billion in 2008 to $42 billion in 2011, big-box rivals Wal-Mart and Target have grown steadily, and they're far more profitable. Meanwhile, Amazon has gone from $19 billion in revenue in 2008 to $48 billion last year, passing Sears for the first time.
A Shop Your Way Rewards membership program started by Sears in 2011 is part of a five-part strategy to get the company back on track. Behind the scenes is a cutting-edge implementation of Apache Hadoop, the high-scale, open source data processing platform driving the big data trend. Despite Sears' less-than-cutting-edge reputation as a retailer, the company has been an innovator in using big data. In fact, Shelley is leading a Sears subsidiary, MetaScale, that's pitching services to help companies outside retail use Hadoop.
But will companies be interested in buying big data cloud and consulting services from Sears? And can Sears' own big data efforts help the company regain its footing in the retail industry?
Fast And Agile
Sears' process for analyzing marketing campaigns for loyalty club members used to take six weeks on mainframe, Teradata, and SAS servers. The new process running on Hadoop can be completed weekly, Shelley says. For certain online and mobile commerce scenarios, Sears can now perform daily analyses. What's more, targeting is more granular, in some cases down to the individual customer. Whereas the old models made use of 10% of available data, the new models run on 100%.
"The Holy Grail in data warehousing has always been to have all your data in one place so you can do big models on large data sets, but that hasn't been feasible either economically or in terms of technical capabilities," Shelley says, noting that Sears previously kept data anywhere from 90 days to two years. "With Hadoop we can keep everything, which is crucial because we don't want to archive or delete meaningful data."
Sears is still the largest appliance retailer and appliance service provider in the U.S., for example, so it's in a strong position to understand customer needs, service trends, warranty problems, and more. But Sears has barely scratched the surface of the data available to it.
Enter Hadoop, an open source data processing platform gaining adoption on the strength of two promises: ultra-high scalability and low cost compared with conventional relational databases. A 200-terabyte Hadoop system costs about one-third as much as a comparable 200-TB relational platform, and the differential grows as scale increases into the petabytes, according to Sears. With Hadoop's massively parallel processing power, Sears sees little more than one minute's difference between processing 100 million records and 2 billion records.
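That near-flat scaling comes from the MapReduce pattern at Hadoop's core: each node runs the same map step on its own slice of the records, and a reduce step merges the partial results, so adding data mostly adds nodes rather than elapsed time. A minimal sketch of the pattern, using hypothetical (store_id, sale_amount) records and simulated slices in place of real cluster nodes:

```python
# Toy MapReduce sketch: hypothetical sales records as (store_id, sale_amount).
records = [(i % 50, (i % 7) + 0.5) for i in range(10_000)]

def map_partition(partition):
    """Runs independently on each node: total sales per store for one slice."""
    totals = {}
    for store_id, amount in partition:
        totals[store_id] = totals.get(store_id, 0.0) + amount
    return totals

def reduce_partials(partials):
    """Merges the per-node partial totals into one final answer."""
    merged = {}
    for partial in partials:
        for store_id, subtotal in partial.items():
            merged[store_id] = merged.get(store_id, 0.0) + subtotal
    return merged

# Simulate four nodes; with more nodes, each node's share of the work shrinks.
slices = [records[i::4] for i in range(4)]
totals = reduce_partials(map_partition(s) for s in slices)
```

Because the map step needs no shared state, the same code gives the same answer on 4 slices or 4,000 -- which is why the jump from 100 million to 2 billion records barely registers.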
The downside of Hadoop is that it's an immature platform, perplexing to many IT shops, and Hadoop talent is scarce. Sears learned Hadoop the hard way, by trial and error. It had few outside experts available to guide its work when it embraced the platform in early 2010.
The company is now in the enviable position of having big data experience among its employees in the U.S. and India. MetaScale will leverage Sears' data center capacity in Chicago and Detroit, just as Amazon Web Services takes advantage of Amazon's massive e-commerce compute capacity.
Open Source Moves In
Sears' embrace of an open source stack began at the operating system level, with Linux. Sears routinely replaces legacy Unix systems with Linux rather than upgrade them, Shelley says, and it has retired most of its Sun and HP-UX servers. Microsoft server and development technologies are also on the way out.
Moving up the stack, Sears is consolidating its databases to MySQL, InfoBright, and Teradata--EMC Greenplum, Microsoft SQL Server, and Oracle (including four Exadata boxes) are on their way out, Shelley says.
Hadoop's power comes from dividing workloads across many commodity Intel x86 servers, each with multiple CPUs and each CPU with multiple processor cores. Since early 2010, Sears has been moving batch data processing off its mainframes and into Hadoop. Cost is the big motivator, as mainframe capacity costs anywhere from $3,000 to $7,000 per MIPS per year, Shelley says, while Hadoop costs are a small fraction of that.
Sears says it has surpassed its initial target to reduce mainframe costs by $500,000 per year, while also delivering "at least 20, sometimes 50, up to 100 times better performance on batch times," Shelley says. Eliminating all of the mainframes in use would enable it to save "tens of millions" of dollars, he says.
'ETL Must Die'
Sears' move to Hadoop began as an experiment using a single node running on a netbook computer--the netbook that still sits on Shelley's office desk. Sears deployed its first production cluster of 20 to 30 nodes in early 2010. A major big data processing bottleneck then was extract, transform, and load processing, and Shelley has become a zealot about eliminating ETL.
"ETL is an antiquated technique, and for large companies it's inefficient and wasteful because you create multiple copies of data," he says. "Everybody used ETL because they couldn't put everything in one place, but that has changed with Hadoop, and now we copy data, as a matter of principle, only when we absolutely have to copy."
Sears can't eliminate ETL overnight, so it has been moving the slowest and most processing-intensive steps within ETL jobs into Hadoop. Shelley cites an ETL process that took 20 hours to run using IBM DataStage software on a cluster of distributed servers. One step that took 10 hours to run in DataStage now can run in 17 minutes on Hadoop, he says.
One downside: It takes 90 minutes to FTP the job to Hadoop and then bring results back to the ETL servers. That FTP time is a trade-off in Sears' approach of picking off one ETL step at a time. Shelley intends to keep moving steps in that process until the entire data transformation workload is on Hadoop.
"The reason we do it this way is you get a very big hit quickly," he says, noting it takes less than two weeks to get each step into production. Shelley vows to get rid of ETL eventually, "but you do it in a very nondisruptive, non-scary way for the business."
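The steps that move most cleanly are the stateless per-record transforms, since each record can be processed on any node with no coordination. A sketch of that shape of work, with a hypothetical pipe-delimited field layout standing in for Sears' actual feeds:

```python
from datetime import datetime

def transform(line: str):
    """One stateless per-record ETL transform -- the shape of step that
    parallelizes cleanly on Hadoop. Field layout here is hypothetical:
    store|date|amount in a pipe-delimited feed."""
    store, ts, amount = line.split("|")
    return {
        "store": int(store),
        "date": datetime.strptime(ts, "%m/%d/%Y").date().isoformat(),
        "amount_cents": round(float(amount) * 100),
    }

rows = [transform(l) for l in ["1402|07/28/2012|19.99", "0034|07/28/2012|649.00"]]
```

A step like this can be lifted out of a DataStage job and run as a map-only Hadoop job without touching the steps before or after it -- which is how one migration at a time stays nondisruptive.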
Shelley's "ETL must die" view has its doubters. Coming to the defense of ETL, Mike Olson, CEO of Cloudera, the leading Hadoop software distributor, recently told InformationWeek, "Almost without exception, when we see Hadoop in real customer deployments, it is stood up next to existing infrastructure that's aimed at existing business problems."
Shelley sees Hadoop as part of a larger IT ecosystem, too, and says systems such as Teradata will continue to have an important, focused role at Sears. But he's on the far end of the spectrum in terms of how much of the legacy environment Hadoop might replace. Countering Shelley's sometimes sweeping predictions of legacy system replacement, Olson says: "It's unlikely that a brand-new entrant to the market [like Hadoop] is going to displace tools for established workloads."
Sears' main Hadoop cluster has nearly 300 nodes, and it's populated with 2 PB of data--mostly structured data such as customer transaction, point-of-sale, and supply chain records. (Hadoop keeps each block in triplicate--the original plus two copies--so the total environment is 6 PB.) To give a sense of how early Sears was to Hadoop development, Wal-Mart divulged early this year that it was scaling out an experimental 10-node Hadoop cluster for e-commerce analysis. Sears passed that size in 2010.
Sears now keeps all of its data down to individual transactions (rather than aggregates) and years of history (rather than imposing quarterly windows on certain data, as it did previously). That's raw data, which Shelley says Sears can refactor and combine as needed quickly and efficiently within Hadoop.
Hadoop isn't a science project at Sears--critical reports run on the platform, including financial analyses; SEC reporting; logistics planning; and analyses of supply chains, products, and customer data. For ad hoc query and analysis, Sears uses Datameer, a spreadsheet-style tool that supports data exploration and visualization directly on Hadoop, without copying or moving data. Using Datameer, Sears can develop in three days interactive reports that used to take IT six to 12 weeks, Shelley says. The old approach required intensive IT support for ETL, data cubing, and associated testing. Now line-of-business power users are developing most of the new reports.
The MetaScale Mission
Shelley is still CTO of Sears, but if his portrayal of all the things Hadoop can do sounds a bit rosy, keep in mind that he's also now CEO of MetaScale, a division that Sears is hoping will make money from the company's specialized big data expertise.
The rarest commodity that MetaScale offers is Sears' experience in bringing mainframe data into the Hadoop world. Old-school Cobol programmers at Sears were initially Hadoop skeptics, Shelley says, but many turned out to be eager and highly skilled adopters of the Pig language for running MapReduce jobs on Hadoop. Tasks that required 3,000 to 5,000 lines of Cobol can be reproduced with a few hundred lines of Pig, he says. The company learned how to load data from IMS (mainframe) databases into Hadoop and bring result sets back into mainframe apps. That's not trivial work because it involves a variety of compressed data format transformations, and packing and unpacking of data.
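The packing and unpacking Shelley mentions is concrete, fiddly work: mainframe records mix EBCDIC text with packed-decimal (COMP-3) numbers, neither of which Hadoop tools read natively. A sketch of the kind of decoding involved, using a hypothetical fixed-width record layout (8-byte EBCDIC item code followed by a 3-byte COMP-3 price):

```python
import codecs

def unpack_comp3(data: bytes, scale: int = 0):
    """Decode an IBM packed-decimal (COMP-3) field: two BCD digits per
    byte, with the low nibble of the last byte holding the sign."""
    digits = []
    for b in data[:-1]:
        digits.append(str(b >> 4))
        digits.append(str(b & 0x0F))
    digits.append(str(data[-1] >> 4))
    sign = -1 if (data[-1] & 0x0F) == 0x0D else 1
    value = sign * int("".join(digits))
    return value / 10 ** scale if scale else value

def parse_record(record: bytes):
    """Split one fixed-width mainframe record. The layout is hypothetical:
    8 bytes of EBCDIC text, then a 3-byte COMP-3 price with 2 decimals."""
    item = codecs.decode(record[:8], "cp037").rstrip()
    price = unpack_comp3(record[8:11], scale=2)
    return item, price

# Build a sample record the way a mainframe would: EBCDIC text + packed decimal.
sample = "TV-100".ljust(8).encode("cp037") + b"\x49\x99\x9C"
print(parse_record(sample))  # → ('TV-100', 499.99)
```

Multiply that by every copybook layout in a 40-year-old portfolio and the value of having done it once, at scale, is clear.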
MetaScale's business model is to run Hadoop clusters for other companies as a subscription cloud service in Sears' data center. Or Sears will remotely manage clusters in a customer's data center, a setup that two early customers, one in healthcare and the other in financial services, both want for regulatory reasons. Monthly fees are based on the volume of terabytes supported, and customers can buy out deployments if they want to take them over and run them themselves.
MetaScale also offers data architecture, modeling, and management services and consulting. The big idea behind Hadoop is to bring in as much data as possible while keeping data structures simple. "People want to overcomplicate things by representing data and dividing things up into separate files," says Scott LaCosse, director of data management at Sears and MetaScale. "The object is not to save space, it's to eliminate joins, denormalize the data, and put it all in one big file where you can analyze it."
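LaCosse's "one big file" idea amounts to pre-joining: copy the dimension attributes onto every fact row up front, so analyses scan a single flat file instead of joining tables at query time. A minimal sketch, with hypothetical customer and order tables:

```python
# Hypothetical dimension table (customers) and fact rows (orders).
customers = {
    "c1": {"name": "Ann", "state": "IL"},
    "c2": {"name": "Bo", "state": "MI"},
}
orders = [
    {"customer": "c1", "sku": "TV-100", "amount": 499.99},
    {"customer": "c2", "sku": "DW-200", "amount": 649.00},
]

def denormalize(orders, customers):
    """Pre-join customer attributes onto every order row."""
    for order in orders:
        cust = customers[order["customer"]]
        # Redundant storage is the point: it eliminates the join later.
        yield {**order, **{"cust_" + k: v for k, v in cust.items()}}

flat = list(denormalize(orders, customers))
```

The redundancy that a relational modeler would normalize away is exactly what Hadoop's cheap storage makes affordable.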
It's an approach that's counterintuitive for a SQL veteran, so a big part of MetaScale's work is to help customers change their thinking: You apply schema as you pull data out to use it, rather than take the relational database approach of imposing a schema on data before it's loaded onto the platform. Hadoop holds data in its raw form, giving users the flexibility to combine and examine the data in many ways over time.
"If in three years you come up with a new query or analysis, it doesn't matter because there's no schema," Shelley says. "You just go get the raw data and transform it into any format you need."
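That schema-on-read flexibility can be sketched in a few lines: events land raw, with whatever fields they happen to carry, and each analysis projects only the fields it needs on the way out. The record layout below is hypothetical:

```python
import json

# Raw events stored as-is -- no upfront schema, fields vary per record.
raw_lines = [
    '{"customer": "c1", "sku": "TV-100", "amount": 499.99}',
    '{"customer": "c2", "sku": "DW-200", "amount": 649.0, "store": 1402}',
]

def read_with_schema(lines, fields):
    """Apply a schema only at read time: a new analysis years later just
    names different fields -- no reload, no migration."""
    for line in lines:
        record = json.loads(line)
        yield tuple(record.get(f) for f in fields)

rows = list(read_with_schema(raw_lines, ("customer", "amount")))
```

Contrast this with the relational approach, where a field nobody modeled up front simply isn't in the warehouse at all.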
For all of Shelley's boldness about replacing legacy systems, he's careful to describe Hadoop as part of an ecosystem. Sears still uses Teradata and InfoBright, for example, when applications call for fast analysis. But Hadoop is the center of Sears' data management strategy, handling the large-scale heavy lifting, while relational tools take tactical roles.
So where should Hadoop adopters begin?
"You have to go fast and be bold without taking stupid risks," Shelley says. Start with a business need "that causes enough pain that people will notice and they'll see tangible benefits."
Sears itself still has a lot to prove with its own use of Hadoop to solve huge business problems, such as offering customers personalized promotions. Shelley cites plenty of conceptual uses of Hadoop, and he sprinkles in details on speed-and-feed gains, but he doesn't offer clear cases of tangible benefits the retailer has realized. The company is well along in adopting Hadoop and in developing specialized expertise that might benefit MetaScale customers--particularly those using mainframes--but will Hadoop really help turn Sears around?
Sears' latest results for the quarter ended July 28 show that earnings before interest, taxes, depreciation, and amortization were up 163%, to $153 million, from $58 million in the year-earlier quarter. But same-store sales were down 2.9% at Sears and 4.7% at Kmart. Sears' spin is that it's selling fewer items more profitably, which could be in part because of smarter targeting and promotion. But Sears can't shrink its way back to greatness. As Wal-Mart and Target gain share, their buying power and ability to press Sears on margins only grows.
Would-be MetaScale customers in other industries will face different challenges as they consider embracing Hadoop. Could quick analytical access to an entire decade of medical record data change how doctors diagnose and treat patients? Could faster processing spot financial services fraud more effectively? Companies are focused on choosing and building out the next-generation platforms that will handle those big data jobs. Will Hadoop be that platform, and will Hadoop help turn MetaScale into a successful pioneer? That's a story that has yet to unfold.