Beyond the Data Warehouse: What Are the Repository Options Today?
With the rise of unstructured big data, a new wave of data repositories has come into use that don’t always involve a data warehouse.
As the volume of worldwide data expands into hundreds of zettabytes, data management has become a dilemma for CIOs and companies, which now view data as a strategic asset.
To harness and manage data, IT is investing in data management tools and putting methodologies in place for importing, cleaning, and storing data. Central to this activity is determining how the data will be stored. The more IT can characterize storage for the type of data that it’s dealing with, the better IT will be able to manage the data.
With the rise of unstructured big data, which now comprises roughly 80% of all enterprise data under management, a new wave of data repositories has come into use that don’t always use a data warehouse. The new forms of data repositories have evolved because enterprise use of data has changed. This change has been a move away from structured data in neat, fixed record lengths to more unstructured data with no fixed record lengths at all.
Here is a breakdown of the data repository options that are in common use today:
1. Hierarchical and relational databases
Databases on mature enterprise platforms like mainframes continue to operate with hierarchical and relational database structures that are mature, robust, and proprietary. These databases work extraordinarily well. They are supported by an army of software utilities that ensure data integrity, security, monitoring, and access.
Enterprise CIOs keep these databases in place because the databases are proven and best of class. On the downside, it takes highly skilled personnel to run these databases, and IT budgets must support these salaries.
For the most part, proprietary databases contain structured system-of-record data, but they are also used in big data analytics because many of the keys and vectors into big data for analytics come from systems of record.
2. Data lakes
Data lakes are different. Their purpose is to store, secure, and provide access to aggregated combinations of structured and unstructured data that are tailored to a particular area of the business. An example is a marketing and customer demographics data lake that marketing uses to develop a targeted product marketing campaign. Another example is a medical information system that combines records and documentation on patient visits with patient MRIs, X-rays, and CT scans.
The data lake is an enclosed repository of data that isn’t as immense as a hierarchical database, but that is nonetheless fed by tributaries of data that can come from a hierarchical database, or from an outside data source such as social media, or an internal, unstructured data source, such as image and video files.
The intent is to make the data lake available to a specific community of users, and to refresh the data lake periodically from its incoming data tributaries so that the data remains fresh and relevant. CIOs charge their organizations with ensuring that the proper data practices are in place for each data lake that IT supports.
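To make the refresh idea concrete, here is a minimal Python sketch of a data lake that is repopulated from its tributaries once it goes stale. Everything in it is an assumption for illustration: the `DataLake` class, the `is_stale` and `refresh` methods, the one-day `max_age`, and the idea that each tributary is a simple function returning records. A real data lake would sit on top of storage and ingestion tooling rather than in-memory lists.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional


@dataclass
class DataLake:
    """Hypothetical in-memory stand-in for a business-area data lake."""
    name: str
    # Each tributary is a named function that returns a list of records.
    tributaries: Dict[str, Callable[[], List[dict]]]
    max_age: timedelta = timedelta(days=1)  # assumed refresh interval
    records: List[dict] = field(default_factory=list)
    last_refreshed: Optional[datetime] = None

    def is_stale(self, now: datetime) -> bool:
        # A lake that has never been refreshed counts as stale.
        if self.last_refreshed is None:
            return True
        return now - self.last_refreshed > self.max_age

    def refresh(self, now: datetime) -> None:
        # Pull every tributary again and tag each record with its source.
        self.records = [
            dict(record, source=source)
            for source, pull in self.tributaries.items()
            for record in pull()
        ]
        self.last_refreshed = now


lake = DataLake(
    name="marketing",
    tributaries={
        "crm": lambda: [{"customer": "a"}],
        "social": lambda: [{"post": "x"}, {"post": "y"}],
    },
)
start = datetime(2024, 1, 1)
if lake.is_stale(start):
    lake.refresh(start)
```

Tagging each record with its source is one way to keep track of which tributary fed the lake, which helps when one feed turns out to be unreliable.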
3. Data streams
While data lakes are stagnant pools of data that must be periodically refreshed by tributaries of incoming new data, data streams are quite the opposite. This is because the data in a data stream is continuously in motion, so it never gets old.
A good example is the IoT (Internet of Things) data that streams in from security cameras, robots, industrial equipment, drones, etc. Except for saving snapshot-in-time activity logs that are pertinent for system monitoring, debugging and security, most data stream data is transitory. It doesn't need to be stored long-term in a data repository, but it does require rapid point-to-point data transport for the business operations it supports, and IT must budget for that.
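As a rough illustration of data that is processed in motion rather than stored, the Python sketch below consumes readings one at a time and keeps only a small, bounded snapshot for monitoring. The function name `process_stream`, the `temp_c` field, the 80-degree threshold, and the snapshot size are all hypothetical choices made for this example.

```python
from collections import deque


def process_stream(readings, snapshot_size=3, alert_threshold_c=80):
    """Handle each reading as it arrives; retain only the most recent few."""
    # A deque with maxlen silently drops the oldest entry when full,
    # so the snapshot never grows beyond snapshot_size.
    snapshot = deque(maxlen=snapshot_size)
    alerts = 0
    for reading in readings:
        if reading["temp_c"] > alert_threshold_c:
            alerts += 1  # act on the data while it is in motion
        snapshot.append(reading)
    return alerts, list(snapshot)


# Simulated sensor feed; a generator stands in for a live stream.
feed = ({"sensor": "s1", "temp_c": t} for t in [70, 85, 90, 60, 75])
alerts, recent = process_stream(feed)
```

Nothing beyond the snapshot is retained, which mirrors the point that most stream data is transitory and only a snapshot-in-time log needs to be kept.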
4. Data oceans
Data oceans are pools of vast, uncharted, and unprocessed data that flow from and into the entire enterprise. Companies store this data because they think they could have a use for it in the future. Unfortunately, there’s also a high risk that the data never gets used.
Since data ocean data has never been cleaned or processed, it is highly polluted and unlikely to produce quality analytics. As the data ocean continues to expand, it costs more money to store, and it becomes more difficult to manage. The key to managing this data is determining how long you want to keep it. If it is a trove of emails, you might want to store it for purposes of legal discovery in case the company is ever engaged in a lawsuit. If it's a bunch of IoT jitter, or data castoffs from old test systems, it’s best to discard it. In all cases, clear IT policies and practices should be in place to manage data oceans.
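A retention policy like the one described above can be written down as a simple lookup. The Python sketch below is only an illustration: the category names, the `RETENTION_POLICY` table, and the `retention_action` function are assumptions, and defaulting unknown data to "review" is a design choice for the example rather than a recommendation from any standard.

```python
# Hypothetical policy table mapping a data category to an action.
RETENTION_POLICY = {
    "email": "retain",          # may be needed for legal discovery
    "iot_jitter": "discard",    # noise with no foreseeable use
    "test_castoff": "discard",  # leftovers from old test systems
}


def retention_action(category):
    """Return the policy action; unknown categories go to human review."""
    return RETENTION_POLICY.get(category, "review")


decisions = {
    name: retention_action(name)
    for name in ["email", "iot_jitter", "clickstream"]
}
```

Sending unrecognized categories to review, instead of silently keeping or deleting them, keeps the decision with the people who own the policy.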