GDPR Anniversary: Farewell to Global Data Lakes

Here’s where we’re at with the regulation and the data challenges organizations are facing today.

Sophie Stalla-Bourdillon, Chief Privacy Officer, Immuta

May 26, 2022


This month marks the fourth anniversary of the EU’s General Data Protection Regulation (GDPR). As we reflect on our world’s privacy journey, suffice it to say that the regulation is now a driving force behind organizations’ data management and analytics strategies.

Privacy is now a top concern for individuals, while organizations still struggle to balance data privacy with the data analytics demands of the modern economy. We’ve seen US states such as California pass their own privacy laws, making privacy by design, in practice, a must in order to navigate the complexity of the privacy regulatory landscape.

At the global level, it has become obvious that redirecting data movements from one location to another to achieve compliance after the fact is a real challenge. Many have chosen to ignore compliance and move on, even at the risk of fines. This strategy of negligence will expose those who fail to address the foundation of the problem: the data architecture.

The Demise of the Global Data Lake?

Recent developments triggered by data protection activism suggest that we may be close to a turning point with GDPR. Centralized stores of raw data, also known as global data lakes, are now an endangered species and could be relics of the past sooner than we think.

In a post-Schrems II world, international data transfer restrictions, sometimes called soft data localization requirements, have impacted organizations of all sizes. Recent decisions by Data Protection Agencies (DPAs), such as the Austrian DPA’s decision on Google Analytics, which has been described as one of the most impactful post-‘Schrems II’ enforcement decisions, have made it clear that international data transfers based upon standard contractual clauses are doomed without appropriate technical measures that reduce re-identification risks.

Take Facebook’s global data lake used for its ad platform as a prime example. A recently leaked company document written by Facebook’s own privacy engineers details and lays bare the company’s privacy and data protection challenges. The document illustrates the flaws of this type of data architecture, citing that Facebook engineers don’t have “an adequate level of control and explainability over how our systems use data” and thus, “can’t confidently make controlled policy changes or external commitments such as ‘we will not use X data for Y purpose.’” According to the engineers, addressing these challenges “will require additional multi-year investment in ads and our infrastructure teams to gain control over how our systems ingest, process and egest data.”

This makes it impossible for Facebook to meet basic data protection goals -- they simply can’t enumerate all the data that they have, where it is, where it goes, and how it’s used. Unfortunately, as Facebook privacy engineers explain it, once you let the ink (or data) out of the bottle, there is no way to put it back in without restructuring the entire data stack. As acknowledged within the company’s grid on readiness and uncertainty of solutions, global data lakes can’t accommodate data localization requirements and score low on purpose limitation, transparency, and controls, as well as data provenance.

Minimizing as a Strategy

Contrary to what many lobbyists and commentators argued 10 years ago during the negotiation of the GDPR, global (and often opaque) data lakes are not the only way to build analytics capabilities.

With the limitations of global data lakes making it difficult to navigate data localization laws, there’s little room to maneuver for global organizations, or for organizations outsourcing processing activities to third-party contractors located in different countries. This holds true even though the recent decisions by DPAs have accelerated the intensity of the negotiations between the US Department of Commerce and the European Commission.

As we’ve realized the limitations of global data lakes, the only viable strategy appears to be to systematically track and minimize data elements and movements, and to localize data storage and access, unless re-identification risks can be effectively mitigated for cross-border data access on a case-by-case basis.
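To make the strategy concrete, here is a minimal sketch of a localize-by-default policy gate. All names (`Dataset`, `AccessRequest`, `allow_access`) are invented for illustration and do not reflect any particular product’s API: local access is always in scope, while cross-border access is permitted only when re-identification risk has been mitigated, e.g. through masking or tokenization.

```python
from dataclasses import dataclass

@dataclass
class Dataset:
    name: str
    region: str          # where the data is stored
    deidentified: bool   # re-identification risk mitigated (masked, tokenized, etc.)

@dataclass
class AccessRequest:
    user_region: str
    dataset: Dataset

def allow_access(req: AccessRequest) -> bool:
    """Localize by default; permit cross-border access case by case."""
    if req.user_region == req.dataset.region:
        return True                     # local access: always in scope
    return req.dataset.deidentified     # cross-border: only if risk is mitigated

eu_raw = Dataset("clickstream", region="EU", deidentified=False)
eu_masked = Dataset("clickstream_masked", region="EU", deidentified=True)

print(allow_access(AccessRequest("US", eu_raw)))     # False: raw data stays local
print(allow_access(AccessRequest("US", eu_masked)))  # True: risk mitigated
```

In a real deployment the `deidentified` flag would be replaced by a case-by-case risk assessment, but the default direction is the same: data stays where it lives unless the transfer can be justified.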

The Way Forward

The good news is that soft data localization requirements are now converging with federated data architecture principles. New architectural paradigms, most notably the data mesh, have emerged as the industry is moving away from monolithic data lakes in favor of more distributed architectures.

Coined by Zhamak Dehghani in 2019, data mesh embeds purpose limitation requirements at the core of the data architecture, putting strong emphasis upon data quality and lineage, and therefore intervenability and accountability through federated data governance.

While more work is certainly needed to refine the mapping to data protection goals, this paradigm illustrates the convergence of data architecture design and data privacy. By reviving core but often denigrated data protection principles, such as purpose limitation and data minimization, alongside the recent take-off of purpose-based access control, new paradigms such as data mesh will be key to the way forward with GDPR and privacy by design overall.
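The idea behind purpose-based access control can be sketched in a few lines. This is a hypothetical illustration, not any vendor’s implementation: every query must declare a purpose, and data is released only if that purpose is among those the dataset was collected for, which is purpose limitation enforced at access time. The dataset names and purposes below are invented.

```python
# Map each dataset to the purposes it was collected for (invented examples).
ALLOWED_PURPOSES = {
    "ad_clicks": {"ad_measurement"},
    "support_tickets": {"customer_support", "quality_assurance"},
}

def query(dataset: str, purpose: str) -> str:
    """Release data only for a declared, permitted purpose."""
    granted = ALLOWED_PURPOSES.get(dataset, set())
    if purpose not in granted:
        raise PermissionError(
            f"'{dataset}' was not collected for purpose '{purpose}'"
        )
    return f"rows from {dataset} released for {purpose}"

print(query("ad_clicks", "ad_measurement"))   # allowed
# query("ad_clicks", "model_training")        # would raise PermissionError
```

Making the purpose an explicit, auditable parameter of every access is what lets an organization stand behind commitments like “we will not use X data for Y purpose.”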

About the Author(s)

Sophie Stalla-Bourdillon

Chief Privacy Officer, Immuta

  • Sophie Stalla-Bourdillon is Chief Privacy Officer at Immuta and Professor in Information Technology Law and Data Governance at the University of Southampton.
