Cloud services changed the IT ops game, but performance has been more iffy than many people realized. Google's Cloud Platform got to learn from its predecessors.
In December, Google announced general availability of its infrastructure-as-a-service (IaaS) Google Compute Engine. Compute Engine is one of the three pillars of the Google Cloud Platform. The other two are App Engine (Google's platform-as-a-service) and cloud storage (Google's SQL, NoSQL, and object storage).
These pillars comprise a very large-scale compute and storage infrastructure for a variety of scalable service offerings that Google developers use heavily internally and that Google makes available for public consumption via software-as-a-service (SaaS) and well-defined application programming interfaces (APIs).
Of everything Compute Engine offers, I find the promise of "performance stability" most enticing. When public cloud computing first emerged, we, the public, were happy to get our hands on compute and storage resources on demand, at low cost. Many of us also experienced great joy in our ability to avoid the burdens of traditional system procurement and IT processes. Over time, however, as cloud use has matured, the quality of resource performance, or its stability, has become increasingly important.
Cost of performance instability
The performance that we experience from a particular cloud resource (CPU, I/O, latency, bandwidth) can vary over time for many reasons, including:
Resource sharing with best-effort isolation,
Dependencies between resources, where the performance of one resource (e.g., remote storage) is affected by that of another (e.g., networking),
Hardware heterogeneity (similarly endowed devices can behave differently) and failures,
Bugs and inefficiencies in complex cloud software stacks and virtualization technologies, and
Placement, migration, fault tolerance, and adding and removing resources and virtual instances.
Inconsistent resource performance impacts cloud users negatively in multiple ways. First, fluctuations in performance can be significant (up to 5x) and limit users' ability to accurately measure, reproduce, and predict execution time and cost. This inability to predict cost, performance, and load can lead to ineffective scaling decisions, manual or automatic, and may preclude public cloud use for certain application domains.
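To make the measurement problem concrete, here is a minimal sketch of how a user might summarize run-to-run variation for a repeated workload. The timings are hypothetical values chosen for illustration, not measurements from any cloud, and the function name `variability` is my own.

```python
import statistics

def variability(samples):
    """Summarize run-to-run variation of repeated timings (in seconds)."""
    mean = statistics.mean(samples)
    return {
        "mean": mean,
        # Coefficient of variation: sample standard deviation relative to the mean.
        "cv": statistics.stdev(samples) / mean,
        # Ratio of slowest to fastest run; 5.0 corresponds to a "5x" fluctuation.
        "max_over_min": max(samples) / min(samples),
    }

# Hypothetical timings of the same job run five times on "identical" instances.
timings = [10.0, 12.0, 11.0, 50.0, 10.0]
summary = variability(timings)
print(summary)
```

With a single slow run in the sample, the mean no longer describes a typical run, which is why time and cost estimates based on a handful of runs can be badly off.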
Because public clouds are opaque, it is infeasible for ordinary users to measure this variance in order to account for it. Also, instability experienced by parallel workloads, such as MapReduce jobs, can have a compound effect on performance. One reason for this is the "straggler problem," in which some tasks take much longer than their otherwise similar counterparts, for no apparent reason.
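The compounding effect can be sketched with simple arithmetic. A parallel job finishes only when its slowest task does, so the chance that the job is delayed grows with the number of tasks. The sketch below assumes each task is slowed independently with the same probability, which is a simplification of my own; the function names and numbers are illustrative only.

```python
def job_time(task_times):
    """A parallel job completes when its slowest task completes."""
    return max(task_times)

def prob_job_delayed(p_slow, n_tasks):
    """Probability that at least one of n_tasks is a straggler,
    assuming each task is slowed independently with probability p_slow."""
    return 1.0 - (1.0 - p_slow) ** n_tasks

# One straggler among otherwise identical tasks sets the job's completion time.
print(job_time([10.0, 10.0, 10.0, 50.0]))

# A 1% per-task chance of slowdown across 100 tasks.
print(round(prob_job_delayed(0.01, 100), 3))
```

Under these assumptions, a slowdown that affects only one task in a hundred still delays roughly 63% of 100-task jobs.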
The cost of instability that is most interesting to me, however, is the human effort expended by developers, DevOps teams, and system administrators. Performance fluctuation in IaaS has given users an incentive to modify their workloads to compensate for, mask, or otherwise avoid instability. Examples of such modifications include using multiple network-attached block storage devices; adding complexity to distributed programming systems for straggler avoidance, such as killing off long-running jobs or executing multiple instances of the same job to see which one returns first; and gaming a cloud service's virtual instance placement.
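As an illustration of the "run several copies and take the first result" workaround, here is a minimal sketch using Python's standard thread pool. It is not how any particular distributed framework implements straggler avoidance; `first_result`, `attempt`, and the delays are hypothetical stand-ins for real remote tasks.

```python
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def first_result(make_attempt, copies):
    """Launch `copies` attempts of the same work and return the first result."""
    pool = ThreadPoolExecutor(max_workers=copies)
    try:
        futures = [pool.submit(make_attempt, i) for i in range(copies)]
        done, not_done = wait(futures, return_when=FIRST_COMPLETED)
        for f in not_done:
            # cancel() only stops attempts that have not started yet;
            # attempts already running keep consuming resources until they end.
            f.cancel()
        return next(iter(done)).result()
    finally:
        pool.shutdown(wait=False)

# Hypothetical per-attempt delays: attempt 0 is the straggler.
delays = [0.5, 0.05]

def attempt(i):
    time.sleep(delays[i])  # stands in for the real work
    return i

winner = first_result(attempt, copies=2)
print("fastest attempt:", winner)
```

The job finishes in about the time of the fastest copy, but every copy is paid for, and the extra coordination logic is code that someone has to write and maintain.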
This additional effort required of cloud users can be extremely costly in terms of both cloud resource use and human capital. Such work is unnecessary if the public cloud provider itself addresses even a subset of the causes behind performance instability. Enter Google Compute Engine.
Compute Engine's promise of stability
Although there is much debate in the blogosphere about whether Google has come to the IaaS table too late, I believe that by delaying it has gained a unique advantage over the competition: hindsight. Google has been able to observe the challenges and pitfalls of other IaaS offerings (not to mention gain tremendous experience with warehouse-scale service computing) and so identify an increasingly common pain point of IaaS users. Moreover, Google has used this experience to design an IaaS from scratch, whereas other vendors can only retroactively bandage their systems and attempt best-effort improvements.
As a result, Google Compute Engine is a next-generation IaaS system that provides resource performance stability via a set of novel engineering advances. These advances include: customized virtualization under KVM; advanced resource isolation technologies, such as specialized Linux control groups that shield one process from others; clever data replication/redundancy strategies; novel datacenter design and geographic placement; and dedicated high-speed fiber networks between well-thought-out and proven software services, such as App Engine, Cloud Storage, BigQuery, and YouTube.
By focusing on developing a scalable, performance-stable IaaS system, Google has the opportunity to offer virtual machine instances with consistent resource performance at very low cost. By doing so, not only will users save on VM instance use, but they will also save the time and effort that they are using today to re-architect their virtual instances and cloud applications to overcome performance instabilities of other IaaS systems. If this promise of performance stability from Compute Engine comes to fruition, I believe that we will see the needle of IaaS market share move quickly in Google's direction.
Chandra Krintz is a professor of computer science at the University of California at Santa Barbara and chief scientist of AppScale Systems Inc. AppScale is an open source cloud platform that is API-compatible with Google App Engine. She holds M.S. and Ph.D. degrees from ...