According to analysts, the cloud revolution is well underway. Synergy Research says cloud services are eating into on-premises technology growth. Forrester says cloud computing is “coming of age as the foundation for enterprise digital transformation” in its Predictions 2019: Cloud Computing report.
However, while companies have spent the last few years shifting a wide variety of IT components to the cloud, they have been much slower to move big data services away from their internal infrastructures. Early adopters of Hadoop and other large-scale data analytics technologies had to keep things in-house because these were essentially still experimental technologies.
Now, companies just starting their analytics forays are finding that Hadoop is simply too damn hard, while cloud vendors have come a long way with their data services. Put it all together, and companies are finding that the cloud better suits their big data needs for the following reasons:
The physical implementation of a cluster is too much effort
Why buy a cluster of servers when AWS or Azure can spin up a bunch of them for you? As with all cloud services, you don’t have to order the hardware, power it, or even cable it up. Just constructing the physical environment is often hard enough, never mind getting the actual software up and running.
That complexity is a major problem that continues to plague big data, and cloud vendors are continually chipping away at it by providing more automation. By automatically spinning massive computing clusters up and down, cloud suppliers significantly reduce the need for people with deep expertise in running them, which matters because those specialists remain hard to find.
Reduced project risk
One huge advantage of the cloud, especially for big data implementations, is that it dramatically mitigates risk. You don’t know up front whether your data will contain great revelations. But with cloud vendors, you can spin up a cluster, do some work, and then spin it back down if you can’t unearth insights of any value, all without incurring much overall project risk. Better yet, if you do find something potentially game-changing in your data, you can quickly spin up more systems to scale your project, without spending time and money purchasing and implementing systems and software.
Of course, scaling up and down does not work for all use cases. Sometimes you must ramp up systems in the cloud and keep them running due to the nature of the project or the data. Nonetheless, it’s a lot easier to get that done in the cloud, which contributes greatly to risk reduction.
Incremental cost vs. big up-front investments
Directly related to the risk point above is the associated cost. Big data–related cloud deployments let you pay only for the services you use. If your experimental project yields little value, your losses are limited significantly, assuming you fail fast. Buy all the equipment up front, by contrast, and a project that gets shut down becomes an expensive failure.
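The pay-only-for-what-you-use argument can be made concrete with a little arithmetic. The sketch below compares a failed two-week pilot under per-node-hour billing against an up-front hardware purchase; every price and the `cloud_cost` helper are hypothetical illustrations, not real vendor rates.

```python
# Hypothetical cost comparison: up-front cluster purchase vs. pay-as-you-go.
# All figures are illustrative assumptions, not actual vendor pricing.

UPFRONT_CLUSTER_COST = 500_000   # buy, rack, and cable 100 nodes (hypothetical)
CLOUD_NODE_HOUR_RATE = 0.50      # per node, per hour (hypothetical)

def cloud_cost(nodes: int, hours: float) -> float:
    """Pay only for the node-hours actually consumed."""
    return nodes * hours * CLOUD_NODE_HOUR_RATE

# An experiment that fails fast: 100 nodes for a two-week pilot (336 hours).
pilot = cloud_cost(nodes=100, hours=336)
print(f"Failed pilot, cloud:   ${pilot:,.2f}")                 # $16,800.00
print(f"Failed pilot, on-prem: ${UPFRONT_CLUSTER_COST:,.2f}")  # $500,000.00
```

Under these assumed numbers, failing fast in the cloud costs a small fraction of an abandoned on-premises build; the real ratio depends entirely on your actual rates and timelines.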
Faster time to insight
The elasticity of the cloud also allows faster time to insight. When you build a physical cluster, you are limited in how much processing you can do. A massive analytics job might take 10 hours on a 100-node cluster. In the cloud, for the same price, you can spin up 1,000 nodes and run the job in an hour.
Elasticity is also key to helping organizations share massive data sets. Moving large data sets around is always a challenge. Even sharing them within an organization can be problematic because adding new users introduces load on a system. For example, if business unit A wants access to business unit B’s data, there might not be enough compute power to support more users. When the data is sitting in the cloud, it’s much easier to add capacity without having to duplicate the data. (Even if data needs to be duplicated, that process can happen quickly and easily in the cloud.)
Big data may have been late to the party, but the marketplace is finally flush with analytics-specific services that deliver on the cloud’s promise of reduced cost and complexity, and greater agility.
Alex Gorelik is author of O’Reilly Media's “The Enterprise Big Data Lake: Delivering the Promise of Big Data and Data Science”, and the founder and CTO of data cataloging company Waterline Data. Prior to Waterline Data, Gorelik served as senior vice president and general manager of Informatica’s Data Quality Business Unit, driving R&D, product marketing and product management for an $80 million business. He joined Informatica from IBM, where he was an IBM Distinguished Engineer for the InfoSphere team. IBM acquired Gorelik’s second startup, Exeros (now InfoSphere Discovery), where he was founder, CTO and vice president of engineering. Previously, he was cofounder, CTO and vice president of engineering at Acta Technology, a pioneering ETL and EII company, which was subsequently acquired by Business Objects.