'ETL Must Die'
Sears' move to Hadoop began as an experiment using a single node running on a netbook computer--the netbook that still sits on Shelley's office desk. Sears deployed its first production cluster of 20 to 30 nodes in early 2010. A major big data bottleneck at the time was extract, transform, and load (ETL) processing, and Shelley has since become a zealot about eliminating ETL.
"ETL is an antiquated technique, and for large companies it's inefficient and wasteful because you create multiple copies of data," he says. "Everybody used ETL because they couldn't put everything in one place, but that has changed with Hadoop, and now we copy data, as a matter of principle, only when we absolutely have to copy."
Sears can't eliminate ETL overnight, so it has been moving the slowest and most processing-intensive steps within ETL jobs into Hadoop. Shelley cites an ETL process that took 20 hours to run using IBM DataStage software on a cluster of distributed servers. One step that took 10 hours to run in DataStage can now run in 17 minutes on Hadoop, he says.
One downside: It takes 90 minutes to FTP the job to Hadoop and then bring results back to the ETL servers. That FTP time is a trade-off in Sears' approach of picking off one ETL step at a time. Shelley intends to keep moving steps in that process until the entire data transformation workload is on Hadoop.
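The pattern Shelley describes--carve one transform step out of the ETL flow, run it on Hadoop, and hand the results back--can be sketched roughly as follows. This is a minimal illustration, not Sears' actual pipeline: the paths, the streaming jar location, and the mapper.py/reducer.py scripts are all hypothetical, and the HDFS transfer steps stand in for the FTP hops Shelley mentions.

```python
import subprocess

# Hypothetical locations; Sears' actual jobs and servers are not public.
LOCAL_IN = "/staging/etl/daily_extract.csv"       # file landed from the ETL servers
LOCAL_OUT = "/staging/etl/daily_transformed.csv"  # result handed back to DataStage
HDFS_IN = "/etl/in/daily_extract"
HDFS_OUT = "/etl/out/daily_transformed"
STREAMING_JAR = "/usr/lib/hadoop/contrib/streaming/hadoop-streaming.jar"

def run(cmd):
    """Run a shell command and raise if it fails."""
    subprocess.check_call(cmd)

# 1. Ship the extract into HDFS (stands in for the inbound FTP hop).
run(["hadoop", "fs", "-mkdir", "-p", HDFS_IN])
run(["hadoop", "fs", "-put", LOCAL_IN, HDFS_IN])

# 2. Run the single offloaded transform step as a Hadoop Streaming job;
#    the actual logic would live in the (hypothetical) mapper.py and reducer.py.
run(["hadoop", "jar", STREAMING_JAR,
     "-input", HDFS_IN,
     "-output", HDFS_OUT,
     "-mapper", "mapper.py",
     "-reducer", "reducer.py",
     "-file", "mapper.py",
     "-file", "reducer.py"])

# 3. Merge and pull the output back for the remaining DataStage steps
#    (stands in for the outbound FTP hop).
run(["hadoop", "fs", "-getmerge", HDFS_OUT, LOCAL_OUT])
```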
"The reason we do it this way is you get a very big hit quickly," he says, noting it takes less than two weeks to get each step into production. Shelley vows to get rid of ETL eventually, "but you do it in a very nondisruptive, non-scary way for the business."
Shelley's "ETL must die" view has its doubters. Coming to the defense of ETL, Mike Olson, CEO of Cloudera, the leading Hadoop software distributor, recently told InformationWeek, "Almost without exception, when we see Hadoop in real customer deployments, it is stood up next to existing infrastructure that's aimed at existing business problems."
Shelley sees Hadoop as part of a larger IT ecosystem, too, and says systems such as Teradata will continue to have an important, focused role at Sears. But he's on the far end of the spectrum in terms of how much of the legacy environment Hadoop might replace. Countering Shelley's sometimes sweeping predictions of legacy system replacement, Olson says: "It's unlikely that a brand-new entrant to the market [like Hadoop] is going to displace tools for established workloads."
Sears' main Hadoop cluster has nearly 300 nodes, and it's populated with 2 PB of data--mostly structured data such as customer transaction, point-of-sale, and supply chain records. (Hadoop keeps three copies of each data block by default--the original plus two replicas--so the total environment is 6 PB.) To give a sense of how early Sears was to Hadoop development, Wal-Mart divulged early this year that it was scaling out an experimental 10-node Hadoop cluster for e-commerce analysis. Sears passed that size in 2010.
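The 2 PB versus 6 PB figures follow directly from that replication factor. A back-of-the-envelope sketch (the function name is ours; 3 is simply the default value of HDFS's dfs.replication setting):

```python
def raw_capacity_pb(logical_pb, replication=3):
    """Physical storage consumed when HDFS keeps `replication` copies of every block."""
    return logical_pb * replication

# 2 PB of logical data at HDFS's default replication factor of 3 -> 6 PB on disk.
print(raw_capacity_pb(2))  # 6
```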
Sears now keeps all of its data down to individual transactions (rather than aggregates) and years of history (rather than imposing quarterly windows on certain data, as it did previously). That's raw data, which Shelley says Sears can refactor and combine quickly and efficiently within Hadoop as needed.
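Keeping transaction-level rows means aggregates can be recomputed on demand rather than fixed at load time. As a hedged illustration of that idea--the file layout and field names are invented, and the article doesn't say which Hadoop tools Sears uses for this--here is a classic Hadoop Streaming pair that rolls raw sales lines up into per-store daily totals:

```python
#!/usr/bin/env python
# mapper.py -- emit (store_id:date, amount) for each raw transaction line.
# Assumed tab-separated input columns: store_id, date, sku, amount.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        continue  # skip malformed rows
    store_id, date, _sku, amount = fields[:4]
    print("%s:%s\t%s" % (store_id, date, amount))
```

```python
#!/usr/bin/env python
# reducer.py -- sum amounts per (store_id, date) key; the streaming
# framework delivers all values for a key consecutively, so a running
# total plus a key-change check is enough.
import sys

current_key, total = None, 0.0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%.2f" % (current_key, total))
        current_key, total = key, 0.0
    total += float(value)
if current_key is not None:
    print("%s\t%.2f" % (current_key, total))
```

Because the raw rows stay in Hadoop, a different question--per-SKU totals, say, or weekly windows--is just a different mapper key, not a new ETL pipeline.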
Hadoop isn't a science project at Sears--critical reports run on the platform, including financial analyses; SEC reporting; logistics planning; and analyses of supply chains, products, and customer data. For ad hoc query and analysis, Sears uses Datameer, a spreadsheet-style tool that supports data exploration and visualization directly on Hadoop, without copying or moving data. Using Datameer, Sears can develop interactive reports in three days that used to take IT six to 12 weeks, Shelley says. The old approach required intensive IT support for ETL, data cubing, and associated testing. Now line-of-business power users are developing most of the new reports.