As the volume, variety and velocity of available data all continue to grow at astonishing rates, businesses face two urgent challenges: how to uncover actionable insights within this data, and how to protect it. Meeting both challenges depends directly on a high level of data governance.
The Hadoop ecosystem can provide that level of governance using a metadata approach, ideally on a single data platform.
A new approach to governance is needed for several reasons. In the age of big data, data is scattered throughout the enterprise. It’s in structured, unstructured, semi-structured and various other formats. Furthermore, the sources of the data are not under the control of the teams that need to manage it.
In this environment, data governance includes three important goals:
- Maintaining the quality of the data
- Implementing access control and other data security measures
- Capturing the metadata of datasets to support security efforts and facilitate end-user data consumption
Solutions within the Hadoop ecosystem
One way to approach big data governance in a Hadoop environment is through data tagging. In this approach, the metadata that will govern the data’s use is embedded with that data as it passes through various enterprise systems. Furthermore, this metadata is enhanced to include information beyond common attributes such as file size, permissions and modification dates. For example, it might include business metadata that would help a data scientist evaluate a dataset’s usefulness in a particular predictive model.
Finally, unlike enterprise data itself, metadata can be centralized on a single platform.
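To make the idea concrete, here is a minimal sketch of enriched metadata held in one central catalog and searched by tag. It is an illustration only: the `DatasetMetadata` and `MetadataCatalog` classes, the field names and the sample datasets are all hypothetical, and the catalog is a plain in-memory dictionary rather than any real metadata system.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetMetadata:
    """Metadata record for one dataset (hypothetical schema)."""
    name: str
    technical: dict = field(default_factory=dict)  # e.g. file size, permissions
    business: dict = field(default_factory=dict)   # e.g. owner, refresh cadence
    tags: set = field(default_factory=set)         # governance tags such as "PII"


class MetadataCatalog:
    """A single, central store of metadata records, kept in memory."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        # Add or replace the record for a dataset.
        self._records[record.name] = record

    def tag(self, name, tag):
        # Attach a governance or business tag to a registered dataset.
        self._records[name].tags.add(tag)

    def find_by_tag(self, tag):
        # Return the names of all datasets carrying the tag, sorted.
        return sorted(n for n, r in self._records.items() if tag in r.tags)


catalog = MetadataCatalog()
catalog.register(DatasetMetadata(
    name="sales.orders",
    technical={"size_bytes": 1048576, "permissions": "rw-r-----"},
    business={"owner": "sales-analytics", "refresh": "daily"},
))
catalog.register(DatasetMetadata(name="web.clickstream"))

catalog.tag("sales.orders", "PII")
catalog.tag("sales.orders", "forecasting")
catalog.tag("web.clickstream", "forecasting")

print(catalog.find_by_tag("forecasting"))
```

The data itself stays wherever it lives; only the small metadata records are centralized, which is what makes a single search point practical.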
The standard Hadoop Distributed File System (HDFS) has an extended attributes capability that allows enriched metadata, but it isn’t always adequate for big data. Fortunately, an alternative solution exists. The Apache Atlas metadata management system enables data tagging, and can also serve as a centralized metadata store, one that can offer “one-stop shopping” for data analysts who are searching for relevant datasets. Also, users of the popular Hadoop-friendly Hive and Spark SQL data retrieval systems can do the tagging themselves.
For security, Atlas can be integrated with Apache Ranger, a system that provides role-based access control for Hadoop platforms.
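As a rough illustration of how tags and roles can combine into an access decision, consider the sketch below. It is not Ranger’s or Atlas’s API: the `TAG_POLICIES` table, the `can_read` function, and the role and tag names are all assumptions made up for this example.

```python
# Hypothetical policy table: governance tag -> roles allowed to read tagged data.
TAG_POLICIES = {
    "PII": {"privacy-officer", "fraud-analyst"},
    "FINANCE": {"finance-analyst"},
}


def can_read(user_roles, dataset_tags, policies=TAG_POLICIES):
    """Return True if the user holds an allowed role for every restricted tag."""
    roles = set(user_roles)
    for tag in dataset_tags:
        allowed = policies.get(tag)
        if allowed is None:
            continue  # in this sketch, a tag without a policy is unrestricted
        if not allowed & roles:
            return False  # no role in common with the tag's allowed roles
    return True


print(can_read({"fraud-analyst"}, {"PII"}))
print(can_read({"finance-analyst"}, {"PII", "FINANCE"}))
```

The point of the design is that the policy is written once against a tag, so any dataset that later receives the tag is covered without a new rule.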
Platform loading challenges
The initial loading of metadata to the Atlas platform and the incremental loading that follows both present significant challenges. For large enterprises, the sheer volume of data will be the main problem in the initial phase, and it may be necessary to optimize some code in order to carry out this phase efficiently.
Incremental loading is a more complex issue, because tables, indexes and authorized users change all the time. If these changes aren’t quickly reflected in the available metadata, the ultimate result is a reduction in the quality of the data available to end users. To avoid this problem, event listeners should be included in the system’s building blocks so that changes can be captured and processed in near real time. A near-real-time solution means better data quality, and it also improves developer productivity because developers don’t have to wait for a batch process.
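The listener pattern described above can be sketched in a few lines. Everything here is hypothetical: the `MetadataEventBus` class, the event names and the `metadata_store` dictionary stand in for whatever messaging layer and metadata repository a real deployment would use, and events are delivered in-process rather than over a network.

```python
from collections import defaultdict


class MetadataEventBus:
    """Tiny publish/subscribe hub (hypothetical, in-process only)."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, event_type, listener):
        self._listeners[event_type].append(listener)

    def publish(self, event_type, payload):
        # Deliver the change to every listener as soon as it happens,
        # instead of waiting for a scheduled batch job.
        for listener in self._listeners[event_type]:
            listener(payload)


metadata_store = {}  # stand-in for the central metadata repository


def on_table_created(event):
    metadata_store[event["table"]] = {"columns": list(event["columns"])}


def on_column_added(event):
    metadata_store[event["table"]]["columns"].append(event["column"])


def on_table_dropped(event):
    metadata_store.pop(event["table"], None)


bus = MetadataEventBus()
bus.subscribe("table_created", on_table_created)
bus.subscribe("column_added", on_column_added)
bus.subscribe("table_dropped", on_table_dropped)

bus.publish("table_created", {"table": "sales.orders", "columns": ["id", "amount"]})
bus.publish("column_added", {"table": "sales.orders", "column": "currency"})

print(metadata_store)
```

Because each change updates the store the moment it is published, the metadata an analyst sees is never more than one event behind the source system.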
The foundation of digital transformation
As businesses pursue digital transformation and seek to be more data-driven, senior management needs to be aware that no results in this direction can be achieved without quality data, and that quality data in turn requires strong data governance. When big data is involved, governance based on enhanced metadata that resides in a central repository is a solution that works.
Aroop Maliakkal Padmanabhan is a Senior Manager on the Platform Engineering team at eBay. He leads the Hadoop team, which owns one of the biggest Hadoop clusters in the world. He has been actively working in the Hadoop space since 2008.
Tiffany Nguyen is a senior software engineer at eBay and has been a data enthusiast since 2015. She currently leads the data governance initiative on the big data platform at eBay.