No one would ever accuse Netflix of being a tech laggard. Netflix prides itself on seeing the future before it arrives and getting a jump on the competition. So Netflix recently moved to Apache Cassandra, a NoSQL database, and Hadoop, a classic big data play.
But because Cassandra could not easily be married to Netflix's existing analytics and reporting platforms, Netflix discovered it needed to develop an offline process to extract all that big data. Otherwise, its shiny new database would become a data vault.
That approach did not hold up at scale. "We soon discovered that while [the offline process] may be feasible for one or two clusters, maintaining the number of moving parts required to deploy this solution to all of our production clusters was going to quickly become unmaintainable," Charles Smith, Netflix's senior software engineer for big data, and Jeff Magnusson, the company's manager of data science platform architecture, wrote on the Netflix Tech Blog.
To solve the problem, Netflix engineers created an application to reduce the number of moving parts and increase the speed with which the data could be analyzed. It was not a trivial undertaking. It took time; it cost money. Now Netflix can scale in the cloud as the size of its data warehouse grows.
This is just one example of how a big data project can deliver unpleasant surprises downstream, and how difficult and expensive these challenges can be for organizations to overcome.
Look Before You Leap
Many companies are feeling competitive pressure to cope with fast-growing data varieties, volumes and velocities. They're making substantial investments to put that flood of data to work. But unless these investments are carefully planned, and the organizational impact of the technology changes is considered, the business results likely will be disappointing.
As Netflix and other organizations have already discovered, moving to NoSQL platforms can result in vast amounts of information that ends up locked in data vaults, formatted in ways that cannot easily be queried or analyzed.
Fortunately, this problem can be avoided through close communication among organizational functions and by setting clear, widely shared business and technical requirements. Most importantly, every big data initiative should begin by reaching out to all downstream information users.
Big Data, Big Risks
When a major communications provider switched from an older Oracle RDBMS to Apache Cassandra, it neglected to speak with the downstream stakeholders who would be using the information collected. As a result, after the system was built and implemented, the company discovered that key information could not be queried. As at Netflix, the company had to build a highly customized solution, which required additional time and funding.
Todd Homa is a Data Architect at CapTech Consulting with more than 17 years of experience helping clients design and implement complex data solutions.
Harlan Bennett is a Senior Consultant at CapTech Consulting with more than 10 years of experience in business systems analysis, enterprise architecture, and strategy.