Best Practices for Deploying Data Lakes - InformationWeek

8/9/2019
07:00 AM
John Gray, CTO, Infiniti

Best Practices for Deploying Data Lakes

With data changing and growing so rapidly, the need to get value out of your data is even more urgent. Here's some advice about how to approach data lakes.

Although still a burgeoning term, data lakes have recently gained more recognition among IT teams as data increasingly becomes a foundation of modern business. Conceived as a solution to reduce data sprawl and data silos, data lakes emerged from the data warehousing industry, which targeted the frustrations IT encountered when trying to create an organized repository of strategic datasets on which to base key business decisions. Uses for such a repository range from data analytics that helps a business better understand customer needs to artificial intelligence that addresses real-time challenges.

Data lakes, in many ways, are an evolution of data warehousing. Many data warehouse projects failed: They were too costly, took too long, and only achieved a small subset of the original goals. With data changing and growing so rapidly, the need for quickly getting value out of data has grown ever more pressing. Nobody can afford to spend months or years analyzing and modeling data for business use. By the time the data is usable in a data warehouse, the business needs have changed.

In a similar vein to data warehouses, data marts emerged to embrace data with a specific use or cataloged by a certain quality (marketing departmental data, for example). Data marts have been more successful because the usage of the data is better understood, and the results can be delivered more quickly. However, the compartmentalized nature of data marts has made them less useful to businesses that have massive amounts of data and that need to use that data cross-functionally and across several parties.

For this reason, data lakes have developed to meet business needs at scale. They are intended to speed things up, making data more readily usable for previously undefined needs. The emergence of truly large-scale cloud computing, with its massive, cheap compute power and almost infinite storage, has made this data lake approach viable.

Since data lakes are still a rather new concept, the market hasn’t yet fully adapted to them. Therefore, early adopters will see the most value from data lakes at this time, perhaps in using them to empower artificial intelligence within daily business. Beyond those who have already embraced data lakes, many IT teams are assessing them to find the right solution for their business. What can be done to properly deploy a data lake? Here are my suggestions for three best practices to follow:

1. Put data into a data lake with a strategy

The core reason behind keeping a data lake is using that data for a purpose. Although in theory a data lake should serve many yet-to-be-defined uses, it is better to start out knowing something about how the data will be used. Consider how you will gain value from a data lake beyond storage. As with any IT initiative, it’s important to first match a data lake’s deployment to a concrete strategy that aligns not only with IT goals but also with long-term business goals.

Ask yourself if keeping a data lake will assist the business in leveraging its data. Keeping data for use “down the road” is costly if down the road is years from now. If a business doesn’t intend to use its data for a specific purpose in the short term, storing that data wastes funds.
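To make the cost of waiting concrete, the arithmetic can be sketched in a few lines. This is a minimal illustration only: the function name, the 50 TB volume, and the price of 20 per terabyte per month are hypothetical placeholders, not real provider rates, and the sketch ignores data growth and retrieval charges.

```python
def cumulative_storage_cost(size_tb, price_per_tb_month, months):
    """Total spend on storing a fixed volume of data for a number of months.

    All inputs are hypothetical placeholders; substitute your own figures.
    """
    return size_tb * price_per_tb_month * months


# Hypothetical scenario: 50 TB held for three years before anyone uses it.
idle_cost = cumulative_storage_cost(size_tb=50, price_per_tb_month=20.0, months=36)
print(f"Spent before first use: {idle_cost:,.2f}")
```

Running the same calculation with your own volumes and rates gives a rough figure to weigh against the expected value of the data.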

2. Keep data at the lowest level of granularity -- and tag it

Storing data at the most detailed level allows the data to be assembled, aggregated, and otherwise manipulated for a myriad of purposes. Don’t aggregate or summarize the data prior to storing it in the data lake. Because the value of having a data lake will not be realized until a business can make use of the data within it, it is better to put data into the lake with tagging and cataloging, so that when needed, IT can sift through the repository to pull out assets. The use of tagging, which is needed for reporting, can help to enable analytics projects. Also, machine learning (ML) and AI can aid in the tagging process by sifting through existing data and creating tags.

Additionally, companies can use these data analytics, ML, and AI projects to drive overall improved competitiveness for the business. One tool can empower another.
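The granularity-and-tagging practice described in this section might look something like the following sketch. It is an in-memory toy, not a real data lake product: the `Catalog` class, the `total_by` helper, the dataset name, and the tags are all hypothetical names chosen for illustration. The point it shows is that records are stored as received, found later by tag, and aggregated only when read.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Catalog:
    """Toy in-memory catalog mapping tags to dataset names (hypothetical design)."""
    datasets: dict = field(default_factory=dict)  # name -> list of raw records
    tag_index: dict = field(default_factory=lambda: defaultdict(set))  # tag -> names

    def ingest(self, name, records, tags):
        # Store records exactly as received: no aggregation or summarization.
        self.datasets[name] = list(records)
        for tag in tags:
            self.tag_index[tag].add(name)

    def find(self, *tags):
        # Return the names of datasets that carry every requested tag.
        matches = [self.tag_index.get(tag, set()) for tag in tags]
        return sorted(set.intersection(*matches)) if matches else []


def total_by(records, key, value):
    # Aggregate at read time, so other groupings remain possible later.
    totals = defaultdict(float)
    for record in records:
        totals[record[key]] += record[value]
    return dict(totals)


catalog = Catalog()
catalog.ingest(
    "web_orders",
    [
        {"customer": "a", "region": "east", "amount": 20.0},
        {"customer": "b", "region": "west", "amount": 5.0},
        {"customer": "a", "region": "east", "amount": 7.5},
    ],
    tags={"sales", "customer-data"},
)

names = catalog.find("sales", "customer-data")
by_region = total_by(catalog.datasets[names[0]], "region", "amount")
by_customer = total_by(catalog.datasets[names[0]], "customer", "amount")
```

Because the order-level rows are kept, the same dataset answers both the per-region and the per-customer question. Had only regional totals been stored, the second question could not be answered at all.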

3. Have a data destruction plan

Too often companies accumulate large amounts of data without any plan in place to get rid of unnecessary assets. This is a particular problem when there is a compliance obligation to destroy information after a certain period (as GDPR tasks companies to do with EU citizen data): without a destruction plan, meeting that obligation becomes difficult.

Pairing a destruction plan with your data lake can help you retrieve what needs to be destroyed and when. It can also address scenarios in which businesses are required to track where all client data resides: having a single location reduces cost and saves time.
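One way a destruction plan could be expressed is as a retention rule per tag, checked against each asset's ingestion date. The sketch below assumes that approach. The `RETENTION_DAYS` table, its periods, the asset names, and the choice to apply the shortest matching period are all illustrative assumptions; actual retention periods must come from your legal and compliance requirements.

```python
from datetime import date, timedelta

# Hypothetical retention periods per tag, in days. Real values come from
# legal and compliance requirements, not from this sketch.
RETENTION_DAYS = {"customer-data": 365, "web-logs": 90}


def due_for_destruction(assets, today):
    """Return the names of assets whose retention period has expired."""
    due = []
    for asset in assets:
        limits = [RETENTION_DAYS[t] for t in asset["tags"] if t in RETENTION_DAYS]
        if not limits:
            continue  # No rule applies; leave the asset for manual review.
        # Assumption: when several rules match, the shortest period wins.
        expires = asset["ingested"] + timedelta(days=min(limits))
        if expires <= today:
            due.append(asset["name"])
    return due


assets = [
    {"name": "crm_export", "tags": {"customer-data"}, "ingested": date(2018, 1, 15)},
    {"name": "clickstream", "tags": {"web-logs"}, "ingested": date(2019, 7, 1)},
    {"name": "sensor_dump", "tags": {"iot"}, "ingested": date(2017, 3, 1)},
]

expired = due_for_destruction(assets, today=date(2019, 8, 9))
```

A check like this depends on assets having been tagged and dated when they were loaded, which is one more reason to catalog data on the way into the lake.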

Preparing for the future

As ever-larger amounts of data proliferate across the business landscape, there will continue to be a need to store and use that data in a strategic fashion. Data lakes are emerging as a great way to unlock the value of data for the business. In considering a data lake solution, first determine how you think your organization will use the data, then where you’ll put it. For example, the cloud has great appeal for data lakes due to the lowered storage costs. If the cloud makes sense for your company’s goals, look for a third-party provider that can meet your unique infrastructure needs. How will the cloud services provider or your own DevOps team build a process into your data lake so that data can be loaded and lifted from the lake according to objectives?

Since undoubtedly there will be a lot of processing to gain full value from having a data lake, consider where steps in the analytics process can be automated. You also need staff skilled in building the infrastructure to host a data lake, to load data into the data lake, and to transform the data for use. Establishing regular, open communications between IT and business leaders is a good first step to enable any IT transformation, such as a data lake solution.

John Gray is CTO of Infiniti Consulting group, an InterVision company that, as a leading strategic services provider (SSP), has assisted IT leaders in solving the most crucial business challenges they face. For 25 years, the company has helped IT leaders transform their business by solving for the right technology, deployed on the right premise, and managed through the right model to fit their unique demands and long-term goals.
