Kimball University: Three ETL Compromises to Avoid
Why neglecting slowly changing dimensions, failing to capture metadata and overlooking scope creep can be the undoing of a dimensional data warehousing initiative.
Whether you are developing a new dimensional data warehouse or replacing an existing environment, the ETL (extract, transform, load) implementation effort is inevitably on the critical path. Difficult data sources, unclear requirements, data quality problems, changing scope, and other unforeseen problems often conspire to put the squeeze on the ETL development team. It simply may not be possible to fully deliver on the project team's original commitments; compromises will need to be made. In the end, these compromises, if not carefully considered, may create long-term headaches.
In my last article on "Six Key Decisions for ETL Architectures," I described the decisions ETL teams face when implementing a dimensional data warehouse. This article focuses on three common ETL development compromises that cause most of the long-term problems around dimensional data warehouses. Avoiding these compromises will not only improve the effectiveness of your ETL implementation, but will also increase the likelihood of overall DW/BI success.
Kimball Group has written extensively on slowly changing dimension (SCD) strategies and complementary implementation alternatives. It's important that the ETL team embrace SCDs as an important strategy early in the initial implementation process. A common compromise is to put off to the future the effort required to properly support SCDs, especially Type 2 SCDs where dimension changes are tracked by adding new rows to the dimension table. The result is often a total rework disaster.
Deferring the implementation of proper SCD strategies does save ETL development time in the immediate phase. But as a result, the implementation embraces only Type 1 SCDs, where all history in the data warehouse is associated with current dimension values. Initially, this seems to be a reasonable compromise. However, it's almost always more difficult to "do it right" when you have to circle back in a later phase. The unfortunate realities are that:
Following a successful initial implementation, the team faces pressure to roll out new capabilities and additional phases without time to revisit prior deliverables and add the required change-tracking capabilities. Thus, the rework ultimately required to support SCD requirements continues to expand.
Once the ETL team finally has the bandwidth to address SCD, the ugly truth becomes apparent. Adding SCD Type 2 capabilities into the historical data requires rebuilding every dimension that contains Type 2 attributes; each dimension will have to have its primary key rekeyed to reflect the new historically appropriate Type 2 rows. Rebuilding and rekeying even one core conformed dimension will unavoidably require reloading all impacted fact tables due to the new dimension key structures.
Facing a possible rebuild of much of the data warehouse environment, many organizations will back away from the effort. Rather than reworking the existing historical data to restate the dimension and fact tables in their correct historical context, they implement the proper SCD strategies from a point-in-time forward. By compromising the implementation of proper SCD techniques in the initial development process, the organization has lost possibly years of important historic context.
The Agile ArchiveWhen it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
2014 Analytics, BI, and Information Management SurveyITís tried for years to simplify data analytics and business intelligence efforts. Have visual analysis tools and Hadoop and NoSQL databases helped? Respondents to our 2014 InformationWeek Analytics, Business Intelligence, and Information Management Survey have a mixed outlook.
. We've got a management crisis right now, and we've also got an engagement crisis. Could the two be linked? Tune in for the next installment of IT Life Radio, Wednesday May 20th at 3PM ET to find out.