How Much Data Is Too Much for Organizations to Derive Value?
Data may be the new oil, but too much of a good thing can make a valuable resource an expensive risk.
Everyone wants more data. It can be used for any number of purposes that drive more revenue and efficiencies for enterprises. With AI now the buzzword on everyone’s lips, the hunger for data is only becoming more insatiable. Today, 64% of organizations manage at least one petabyte of data, and 41% of organizations surpass that with at least 500 petabytes of data, according to the AI & Information Management Report from data management and data governance company AvePoint.
As enterprises continue to amass data, it becomes increasingly difficult for their teams to understand what they have, how it equates to risk, and how to derive value from it. In a world where data is constantly compared to oil and gold, how much is too much?
The Costs and Risks of Data
Data is not free. Storage is a direct cost of holding massive amounts of data. “Unstructured data is an area that tends to get organizations in a lot more trouble because there has been this perception, which is incorrect, that storage is free or … almost cost free, particularly with organizations moving to the cloud,” says Dana Simberkoff, chief risk, privacy, and information security officer at AvePoint.
Organizations are increasingly moving to the cloud, but that doesn’t mean that all of their data is being corralled into a single location. The State of Cloud Security report from threat detection and response company MixMode found that 78% of organizations use a multi-cloud or hybrid environment.
“If data is in multiple places, that is increasing your cost,” points out Chris Pierson, founder and CEO of cybersecurity company BlackCloak. Enterprises must also consider the cost of maintenance, which could include engineering and program analyst time.
Beyond storage and maintenance costs, data also comes with the potential cost of risk. Threat actors constantly look for ways to access and leverage the data safeguarded by enterprises. If they are successful, and many are, enterprises face a cascade of potential costs. In 2023, the average cost of a data breach was $4.45 million, according to the Cost of a Data Breach Report 2023 from Ponemon Institute and IBM Security.
Organizations may not even know what data they have and where, which only compounds their risk. “If the data is just sitting out there, and you don't need it anymore, it's sitting there as a huge pile of risk waiting to be exploited through a cyber incident, whether that's a breach or through some privacy violation,” says Christopher Wall, data protection officer and special counsel, global privacy and forensics at HaystackID, an eDiscovery services firm.
Data Governance
Enterprise leaders cannot answer the question of how much is too much without knowing what data they have and where. They need data governance. But data governance is tricky; some enterprises set out to get it done only to find their initiative fails.
Effective data governance requires input and commitment across the entire enterprise. “It should be a discussion amongst the different business groups, business sectors, business leads on what data do they need and/or want,” Pierson explains. “It really is what data do we currently have and hold that actually drives business forward? Somewhere there should be a meeting in the middle because you're going to have and be collecting way more data than you actually need.”
Data governance helps organizations grasp what data they have, where it is stored, and how it is being used. Without that information, enterprises will struggle under the behemoth weight of their data. They could suffer breaches and struggle to meet regulatory reporting requirements. “How would organizations report [a] breach if they don't even know what they have?” Simberkoff asks.
Without effective data governance, enterprises often cannot put the data they are paying to store to work for any tangible benefit. “If a company doesn't have any way to mine or to use that data that they're storing … then the value of that data drops frankly to next to nothing,” says Wall.
Data Retention and Deletion
Keeping every data point ever collected is neither cost-effective nor a smart move for risk management. “People have a desire to hold on to all the data they've ever collected forever and ever, and at some point in time, [they] really should be deprecating … some of that data,” says Pierson.
Once an enterprise is able to wrap its arms around data governance, leaders can start to ask questions about what kind of data can be deleted and when. The simple answer to the question of how much is too much boils down to value versus risk. “Start with the fundamental question: What does the company get from the data? Does it cost more to store and protect that data than the data actually provides to the organization?” says Wall.
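Wall's value-versus-cost question can be written down as simple arithmetic. The sketch below is only an illustration: the function name, its parameters, and every figure in the example are hypothetical, and a real assessment would need estimates from finance, security, and legal teams.

```python
def net_annual_value(value, storage_cost, protection_cost,
                     breach_probability, breach_cost):
    """Return the yearly value of a dataset minus what it costs to keep.

    All inputs are hypothetical estimates in the same currency.
    The expected breach loss is the chance of a breach in a year
    multiplied by the estimated cost if one happens.
    """
    expected_breach_loss = breach_probability * breach_cost
    return value - (storage_cost + protection_cost + expected_breach_loss)


# Made-up figures for one dataset, for illustration only.
result = net_annual_value(
    value=10_000,
    storage_cost=4_000,
    protection_cost=3_000,
    breach_probability=0.01,
    breach_cost=500_000,
)

# A negative result suggests the data costs more to hold than it returns.
keep = result > 0
```

A negative number does not settle the question on its own, since some data must be kept for regulatory reasons regardless of its business value.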
When it comes to retention, consider why data is being collected and how long it is needed. “If you don't need the data, don't collect it. That should always be the first fundamental rule,” says Pierson. “If you actually do collect something and you do need it, use it for only the stated purpose on which it is being collected.”
For many organizations, duplicative information creates “data bloat,” according to Simberkoff. “Much data is duplicative, is repetitive … if you're able to eliminate that information, that is going to be a giant step forward [for] most organizations,” she says.
Having a snarl of duplicative data does more than increase an enterprise’s risk. “If you have a tremendous amount of redundant, obsolete, trivial information, that actually begins to degrade the value of the information that you have,” says Simberkoff.
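One common way to spot exact duplicates is to hash each file's contents and group files that share a digest. The following is a minimal sketch of that idea, assuming byte-for-byte copies are the target; the function name `find_duplicates` and the sample file names are hypothetical and not taken from any tool mentioned in this article.

```python
import hashlib
from collections import defaultdict


def find_duplicates(documents):
    """Group document names by the SHA-256 digest of their content.

    `documents` maps a name to its raw bytes (a hypothetical input shape).
    Returns only the groups that contain more than one name.
    """
    groups = defaultdict(list)
    for name, content in documents.items():
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Illustrative in-memory "files"; a real scan would read from storage.
docs = {
    "q1_report.docx": b"quarterly numbers",
    "q1_report_copy.docx": b"quarterly numbers",
    "roadmap.pptx": b"product roadmap",
}
duplicates = find_duplicates(docs)
```

Hashing only catches identical copies. Near-duplicates, such as the same document saved with minor edits, would need a different technique.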
Data retention and deletion policies should address questions about the usefulness of data. Has it served its purpose? Does retaining it only create risk and no value? For example, a company may collect individuals’ Social Security numbers for verification purposes. “Do you need to actually have the Social Security number? Do you need to keep it?” asks Pierson.
Enterprise leaders can evaluate whether their organizations actually need to gather specific types of PII. Then, they can determine what PII they actually need to keep and for what length of time. Some information could be collected, used for its intended purpose, and then promptly deleted. Pierson points to the Transportation Security Administration (TSA) as an example. It uses facial recognition technology for verification purposes, and it deletes images and personal data after verification is completed. (TSA does note that it may keep passenger data for up to 24 months during testing and development periods.)
“The data that should be removed or deleted is going to be highly sensitive personally identifiable information that is not needed beyond its stated purpose,” Pierson explains.
The retention of some data will be dictated by regulation. For example, many financial records must be retained for seven years under the Sarbanes-Oxley Act.
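A retention schedule like this can be expressed as a lookup table plus a date check. The sketch below is an assumption-laden illustration: the record types, the two-year period for marketing leads, and the function name are hypothetical, and actual retention periods should come from legal counsel and the applicable regulations.

```python
from datetime import date

# Hypothetical retention schedule, in years, keyed by record type.
RETENTION_YEARS = {
    "financial_record": 7,  # mirrors the seven-year example cited above
    "marketing_lead": 2,    # made-up period for illustration
}


def is_past_retention(record_type, created, today):
    """Return True once a record has reached the end of its retention period."""
    years = RETENTION_YEARS[record_type]
    try:
        expiry = created.replace(year=created.year + years)
    except ValueError:
        # Created on Feb 29 and the expiry year is not a leap year.
        expiry = created.replace(year=created.year + years, day=28)
    return today >= expiry


# A financial record created in March 2016, checked in June 2024.
eligible = is_past_retention("financial_record", date(2016, 3, 1), date(2024, 6, 1))
```

A check like this only flags records as eligible for deletion. Legal holds or overlapping regulations may still require keeping them.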
Enterprise leaders must also increasingly consider privacy regulation as it relates to their collection, use, and retention of data. The General Data Protection Regulation (GDPR) and privacy laws in multiple countries and US states protect various types of personal data and dictate how companies collect, store, and use that data. And the US is moving toward a federal data privacy law, the American Privacy Rights Act (APRA).
While no two regulations are identical, they aim to grant consumers rights around their data, such as the right to opt out of data sharing, the right to rectification, and the right to be forgotten. “What all of those articulated rights have in common is a need by the business to be able to know what that data is, where it [is] stored, and how it's being used,” says Wall.
As enterprises look to keep up with regulatory demands, teams need to develop and maintain policies. “Every organization should have a privacy or a data protection policy, both an internal one and the external facing one, which is effectively the company's contract with individuals whose data the company is using,” says Wall.
How can enterprises build these data deletion and retention policies? Who needs to have a seat at the table?
CIOs and CISOs are naturally going to be participants in the conversation, but enterprises will also need buy-in from the rest of the C-suite to understand where data is being collected and stored and how it is being used across the varied operations of an enterprise. Many different teams will need to be a part of understanding what data an organization has, what it needs to keep, and what it can delete.
“[This] requires input from … your IT security, privacy, and records management teams, as well as your legal teams, [and] the business users, the ones that actually are the consumers of that data,” says Simberkoff.
While leadership is at the helm of enterprise-wide initiatives like this, everyone at a company touches data and plays a role in maintaining the policies around it. Wall emphasizes the importance of employee training. “You have to make sure that every request, whether it's a request for deletion or access or rectification, whatever [it] might be, but especially with that deletion request, you have to make sure that every employee is aware of how to handle those requests,” he says.
The leaders who serve as the engines of change when it comes to effective data retention and deletion have their work cut out for them. “You don't get a magic wand as a CISO to be able to make everybody do what you want. You've got to influence them. You've got to explain where we're going, how we're getting there, how this is going to help,” Jason Rader, vice president and CISO at solutions and systems integration company Insight, tells InformationWeek.
Enterprises are going to continue amassing data. Regulations are going to evolve. The work here is never done.
“All of these processes that you put in place, the [policies] that you put in place, whether it's retention and disposition, whether it's data subject request responses, including responses to erasure requests, all of those things need auditing,” says Wall. “They need tracking to see where you can make continuous improvement.”
Cost and risk reduction are clear benefits of getting rid of unneeded data, but it can also give enterprises a competitive edge. “If you can limit that sensitive information and then have a regular purging process, deletion process, I think that's actually something that companies can market themselves on,” says Pierson.
AI and Data Proliferation
As enterprise leaders integrate AI into their operations, the perceived value of data is even higher. AI models are hungry for data, so why shouldn’t organizations stuff them with everything they’ve got?
AI is going to highlight that there is such a thing as too much data, particularly when it is bad data. “AI certainly amplifies and accelerates any privacy and security issues that you previously had,” says Simberkoff. “It can either be your best friend or your worst enemy, depending on what you do.”
For companies implementing retrieval-augmented generation (RAG), the quality of data they use is vital. RAG allows an enterprise to augment an existing LLM with its own information.
“You'll quickly learn that if you've got a bunch of garbage in that RAG repository, you're getting garbage out,” says Rader.
And if a company is feeding an AI system mountains of data -- some of it outdated, some of it duplicative, some of it completely unknown -- it is likely to discover the meaning of “garbage in, garbage out” in the form of poor results.
On the flip side, AI has the potential to help organizations tackle the problem of too much data. It can be the catalyst for enterprise teams to delete unneeded data. Rader shares that his organization recently deleted six of seven terabytes of data in Microsoft SharePoint. “AI … drove us to get rid of some of that data because it was out there and it was just kind of messing up our mojo for how we want to move forward,” he explains.
AI also has the potential to empower better data practices. “It drives good data governance, drives best practices. [It] drives the necessity for labeling, classifying, and putting good life cycle management around your information,” says Simberkoff.
The pressure is on for enterprises to answer the question of how much is too much data. From an innovation standpoint, enterprises need better data hygiene to realize the potential of AI. From a cybersecurity perspective, enterprises need to reduce the risk associated with safeguarding data. From a privacy perspective, enterprises need to understand their responsibilities to consumers and employees lest they run afoul of regulations.
“Companies, states, and the federal government have a chance and opportunity to really stand up and create more holistic, more omnibus privacy and cybersecurity laws, as well as in internal corporate policies, to protect consumer data in the United States,” says Pierson.