Extract, transform and load (ETL) processes have been the way to move and prepare data for analysis within data warehouses, but will the rise of Hadoop bring the end of ETL?
Many Hadoop advocates argue that this data-processing platform is an ideal place to handle data transformation, as it offers scalability and cost advantages over conventional ETL software and server infrastructure. Defenders of ETL argue that handling the transformation step on Hadoop does not do away with the need for extract and load; nor does it address data-quality and data-governance requirements information management professionals have been working on for decades.
In our debate, Phil Shelley, the chief technology officer at Sears Holdings and the CEO of its big data consulting and services offshoot, MetaScale, says we're witnessing the end of ETL. James Markarian, chief technology officer at information management vendor Informatica, says ETL is changing but will live on.
For The Motion
Phil Shelley, CTO, Sears Holdings; CEO, MetaScale
ETL's Days Are Numbered
The foundation of any IT system is the data. Nothing of value can be done without generating, manipulating and consuming data. When we lived in the world of monolithic mainframe systems that were disconnected, data mostly stayed within that system and was consumed via screen or paper printout. Since that time ended, we have lived in a world of separate systems and the interconnects between them. ETL (extract-transform-load) was born, and we began to copy and reuse data. Reuse of data rarely happens without some form of aggregation, transformation and re-loading into another system.
The growth of ETL has been alarming, as data volumes escalate year after year. Companies have made significant investments in people, skills, software and hardware to do nothing but ETL. Some consider ETL to be a bottleneck in IT operations: ETL takes time because, by definition, data has to be moved. Reading from one system, copying over a network and writing all take time -- ever-growing blocks of time, causing latency in the data before it can be used. ETL is expensive in terms of people, software licensing and hardware. ETL is a non-value-added activity too, as the data is unusable until it lands in the destination system.
So why do we still do ETL? Mostly because systems that generate data are not the ones that transform or consume data. How about changing all that, as it seems to make no sense that we spend time and money on non-value-added activities?
Well, historical systems were not large enough to cost-effectively store, transform, analyze and report or consume data in a single place. Times and technology change, of course, and since Hadoop came to the enterprise, we are beginning to see the end of ETL as we know it. This is not just an idea or a desire; it is really possible and the evolution is underway.
With Hadoop as a data hub in an enterprise data architecture, we now have a cost-effective, extreme-performance environment to store, transform and consume data, without traditional ETL.
Here is how it works:
- Systems generate data, just as they always have.
- As near to real-time as possible, data is loaded into Hadoop -- yes, this is still the "E" from traditional ETL, but that is where the similarity ends.
- Now we can aggregate, sort, transform and analyze the data inside Hadoop. This is the "T" and the "L" from traditional ETL.
- Data latency is reduced to minutes instead of hours because the data never leaves Hadoop. There is no network copying time, no licenses for ETL software and no additional ETL hardware.
- Now the data can be consumed in place without moving it. There are a number of graphical analytics and reporting options for consuming data without moving large amounts of it out of Hadoop.
- Some subsets of data do have to be moved out of Hadoop into other systems, for specific purposes. However, with a strong and coherent enterprise data architecture, this can be managed to be the exception.
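The steps above can be sketched in a few lines of Python. This is only an illustration of the load-first, transform-in-place pattern, not a description of the Sears implementation: an in-memory SQLite database stands in for the Hadoop data hub, and the table names (raw_sales, sales_by_store) and function names are hypothetical.

```python
import sqlite3


def load_raw(conn, rows):
    """The "E" step: land source records in the hub unchanged."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_sales (store TEXT, sku TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", rows)


def transform_in_place(conn):
    """The "T" and "L" steps: aggregate inside the hub; no data leaves it."""
    conn.execute("DROP TABLE IF EXISTS sales_by_store")
    conn.execute(
        "CREATE TABLE sales_by_store AS "
        "SELECT store, SUM(amount) AS total FROM raw_sales GROUP BY store"
    )


def consume(conn):
    """Reporting reads the transformed table where it lives."""
    return conn.execute(
        "SELECT store, total FROM sales_by_store ORDER BY store"
    ).fetchall()


# SQLite is an assumption standing in for the Hadoop hub.
conn = sqlite3.connect(":memory:")
load_raw(conn, [("A", "x1", 10.0), ("A", "x2", 5.0), ("B", "x1", 7.5)])
transform_in_place(conn)
report = consume(conn)
print(report)
```

The raw table and the aggregated table live in the same store, so the only data movement is the initial load; the transformation and the report both run where the data already sits.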
So, ETL as we know it is gradually fading to be the exception rather than the norm. This is a journey, not a binary change. But at Sears and at other companies, case by case, ETL is gradually but certainly becoming history.
Phil Shelley is CTO at Sears Holdings, leading IT operations. He is also CEO of MetaScale, a Sears Holdings subsidiary that designs, delivers and operates Hadoop-based solutions for analytics, mainframe migration and massive-scale processing.
Against The Motion
Don't Be Naive About Data Integrity
The stunning thing about the current buzz and questions heralding the end of ETL and even data warehousing is the lack of pushback and analysis of some of the outlandish comments made. The typical assertion is that "Hadoop eliminates the need for ETL."
What no one seems to question in response to these sorts of comments are the naive assumptions on which these statements are based. Is it realistic for most companies to move all of their data into Hadoop? Given the need to continue to use information that currently exists in legacy environments, probably not. Even if you did move everything into Hadoop, a path that will take years, if not decades, for most companies with existing databases, you still have to manipulate the data once it is there.
So is writing ETL scripts in MapReduce code still ETL? Sure it is. Is running ETL faster (in some cases, and slower in other cases) on Hadoop eliminating ETL? No. Or is the introduction of Hadoop changing when, where and how ETL happens? Here the answer is definitely yes.
So the question isn't really whether we are eliminating ETL, but rather where ETL takes place and how we are extending or changing its definition. The "E" represents the ability to consistently and reliably extract data with high performance and minimal impact on the source system. The "T" represents the ability to transform one or more data sets in batch or real-time into a consumable format. The "L" stands for loading data into a persistent or virtual data store.
Let's look at the fundamentals of enterprise data integration, partly manifested by ETL processes:
- Data needs to flow from source applications into analytic data stores in a controlled, reliable, secure manner.
- Information needs to be standardized, with regard to semantics, format and lexicon, for accurate analysis.
- Operational results need to be consistent and repeatable.
- Operational results need to be verifiable and transparent -- where did information come from, who touched it, who viewed it, what transformations and calculations were performed on it, what does it mean, etc.?
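To make the standardization and verifiability requirements above concrete, here is a minimal Python sketch of a dataset that records which transformation was applied, by whom, and how many rows went in and came out. It is not taken from any Informatica product; the class names (TrackedDataset, LineageEntry), the source name crm_export and the actor name etl_job are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, str]


@dataclass
class LineageEntry:
    step: str      # which transformation ran
    actor: str     # who or what ran it
    rows_in: int
    rows_out: int


@dataclass
class TrackedDataset:
    source: str    # where the data came from
    rows: List[Row]
    lineage: List[LineageEntry] = field(default_factory=list)

    def apply(self, step: str, actor: str,
              fn: Callable[[List[Row]], List[Row]]) -> "TrackedDataset":
        """Run fn and return a new dataset with one more lineage entry."""
        out = fn(self.rows)
        entry = LineageEntry(step, actor, len(self.rows), len(out))
        return TrackedDataset(self.source, out, self.lineage + [entry])


def standardize_country(rows: List[Row]) -> List[Row]:
    """Map spelling variants onto one code (standardized lexicon)."""
    mapping = {"usa": "US", "u.s.": "US", "us": "US"}
    return [
        {**r, "country": mapping.get(r["country"].strip().lower(), r["country"])}
        for r in rows
    ]


def drop_missing_id(rows: List[Row]) -> List[Row]:
    """Discard records that have no identifier."""
    return [r for r in rows if r.get("id")]


raw = TrackedDataset("crm_export", [
    {"id": "1", "country": "USA"},
    {"id": "", "country": "u.s."},
    {"id": "3", "country": "DE"},
])
clean = (
    raw.apply("standardize_country", "etl_job", standardize_country)
       .apply("drop_missing_id", "etl_job", drop_missing_id)
)
for entry in clean.lineage:
    print(entry)
```

Because each step is a plain function applied in a fixed order, the result is repeatable, and the lineage list answers where the data came from and what was done to it, regardless of which engine runs the steps.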
What we ordinarily hear regarding new big data environments is that the data appears by some form of osmosis. We want every last bit of it for new insights, and don't worry about semantics and terminology -- those discrepancies just make the results more interesting. These kinds of dreamy aspirations are seductive but deceptive. They are also just the start of a path toward relearning all the reasons why data practitioners developed best practices around accessing data, profiling data, discovering relationships, handling metadata, explaining context, transforming data, cleansing data, governing data for compliance and delivering information at various latencies using current-generation integration technologies.
Modern data integration tools and platforms ensure timely, trusted, relevant, secure and authoritative data. Modern integration technologies use optimizers to process information in both scale-up and scale-out architectures, push processing into database management systems, and push processing -- not just data -- into Hadoop. They broker and publish a data layer that abstracts processing such that multiple applications can consume and benefit from secure and curated datasets.
ETL no doubt needs to continue to evolve and adapt to developer preferences and the performance, scale and latency needs of modern applications. Hadoop is just another engine upon which ETL and its associated technologies (like data quality and data profiling) can run. Renaming what is commonly referred to as ETL, or worse, ignorantly dismissing data challenges and enterprise-wide data needs, is just irresponsible.
James Markarian is executive VP and CTO at Informatica with responsibility for the strategic direction of Informatica products and platforms. He also runs the corporate development group, including acquisitions.