Google Cloud's Big Promise: Performance Stability
Cloud services changed the IT ops game, but performance has been more iffy than many people realized. Google's Cloud Platform got to learn from its predecessors.
In December, Google announced general availability of its infrastructure-as-a-service (IaaS) Google Compute Engine. Compute Engine is one of the three pillars of the Google Cloud Platform. The other two are App Engine (Google's platform-as-a-service) and cloud storage (Google's SQL, NoSQL, and object storage).
These pillars comprise a very large-scale compute and storage infrastructure for a diversity of scalable service offerings that Google developers use heavily internally and that Google makes available for public consumption via software-as-a-service (SaaS) and well-defined application programming interfaces (APIs).
Many in the blogosphere have discussed the price-performance and particular features of Compute Engine that set it apart from its competitors. Last March, GigaOm published a test drive of Compute Engine by Sebastian Stadil and his team at Scalr, a front-end cloud management firm, comparing Google Compute Engine to Amazon Web Services. Google's writes to disk were almost twice as fast as AWS's, the report said. Janakiram MSV, head of cloud infrastructure services at Aditi Technologies, published his take on "Ten Features That Make Google Compute Engine Better Than AWS," including 5x faster virtual machine boot times.
However, I find the promise of "performance stability" most enticing. When public cloud computing first emerged, we, the public, were happy to get our hands on compute and storage resources on demand, at low cost. Many of us also took great joy in avoiding the burdens of traditional system procurement and IT processes. Over time, however, as cloud use has matured, the quality of resource performance, and in particular its stability, has become increasingly important.
Cost of performance instability
The performance that we experience from a particular cloud resource (CPU, I/O, latency, bandwidth) can vary over time for many reasons, including:
Resource sharing with best-effort isolation,
Cross-resource dependencies, in which the performance of one resource (e.g., remote storage) is affected by that of another (e.g., networking),
Hardware heterogeneity (similarly endowed devices can behave differently) and failures,
Bugs and inefficiencies in complex cloud software stacks and virtualization technologies, and
Placement, migration, fault tolerance, and adding and removing resources and virtual instances.
Inconsistent resource performance impacts cloud users negatively in multiple ways. First, fluctuations in performance can be significant (up to 5x) and limit users' ability to accurately measure, reproduce, and predict execution time and cost. This inability to predict cost, performance, and load can lead to ineffective scaling decisions, manual or automatic, and may preclude public cloud use for certain application domains.
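To make the cost side of this concrete, here is a minimal sketch of how a slowdown factor widens the range of possible bills for a single job. The function name, the hourly price, and the baseline runtime are all hypothetical values chosen for illustration; only the 5x figure comes from the text above.

```python
def cost_range(base_hours, hourly_price, max_slowdown):
    """Return (best_case, worst_case) cost for one job.

    base_hours:   runtime when the instance performs at its best (assumed)
    hourly_price: price per instance-hour (hypothetical)
    max_slowdown: worst observed slowdown factor, e.g. 5.0 for "up to 5x"
    """
    best = base_hours * hourly_price
    worst = base_hours * max_slowdown * hourly_price
    return best, worst


# Hypothetical job: 2 hours at best, on an instance billed at $0.50/hour.
best, worst = cost_range(2.0, 0.5, 5.0)
print(f"expected bill: between ${best:.2f} and ${worst:.2f}")
```

With a spread this wide, any budget or autoscaling threshold derived from a single benchmark run is little more than a guess.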
Because public clouds are opaque, it is infeasible for ordinary users to measure this variance in order to account for it. Instability can also have a compound effect on the performance of parallel workloads, such as MapReduce jobs. One reason for this is the "straggler problem," in which some tasks take much longer than otherwise similar counterparts for no apparent reason, and the job as a whole must wait for the slowest of them.
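The compounding can be sketched with a little arithmetic. Assuming, purely for illustration, that each task independently has the same small chance of landing on a slow instance, the chance that a job is delayed by at least one straggler grows quickly with the number of parallel tasks. Both the independence assumption and the 1% figure below are hypothetical.

```python
def prob_any_straggler(p_slow, num_tasks):
    """Probability that at least one of num_tasks independent tasks is slow.

    Assumes every task has the same probability p_slow of straggling and
    that tasks are independent; real clouds need not satisfy either.
    """
    return 1 - (1 - p_slow) ** num_tasks


# Hypothetical: a 1% chance per task, across jobs of increasing width.
for n in (1, 10, 100):
    print(n, round(prob_any_straggler(0.01, n), 3))
```

Under these assumptions, a job split into 100 tasks is more likely than not to be held up by a straggler, even though any single task is slow only rarely.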
The cost of instability that is most interesting to me, however, is the human effort it demands of developers, DevOps engineers, and system administrators. Performance fluctuation in IaaS has given users an incentive to modify their workloads to compensate for, mask, or otherwise avoid instability. Examples of such modifications include using multiple network-attached block storage devices; adding straggler-avoidance logic to distributed programming systems, such as killing off long-running jobs or executing multiple instances of the same job to see which one returns first; and gaming a cloud service's virtual instance placement.
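As an illustration of the "run several copies and take the first result" workaround, here is a minimal sketch using Python's standard `concurrent.futures` module. The `run_task` function and its delays are stand-ins invented for this example; a real system would dispatch the copies to separate instances rather than local threads.

```python
import concurrent.futures as cf
import time


def run_task(delay, value):
    """Hypothetical stand-in for a job whose runtime varies by instance."""
    time.sleep(delay)
    return value


def first_result(fn, arg_sets):
    """Launch redundant copies of a task and return whichever finishes first."""
    pool = cf.ThreadPoolExecutor(max_workers=len(arg_sets))
    futures = [pool.submit(fn, *args) for args in arg_sets]
    done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
    # Do not block on the slower copies; they keep running in the background.
    pool.shutdown(wait=False)
    return next(iter(done)).result()


# Two copies of the same job: one lands on a slow instance, one on a fast one.
result = first_result(run_task, [(0.5, "slow copy"), (0.05, "fast copy")])
print(result)
```

Note that the slow copy still runs to completion and still consumes resources, which is exactly the kind of extra cost, in both compute and engineering effort, that this workaround imposes.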
This additional effort can be extremely costly for cloud users, in terms of both cloud resource use and human capital. Such work is unnecessary if the public cloud provider itself addresses even a subset of the causes behind performance instability. Enter Google Compute Engine.
Compute Engine's promise of stability
Although there is much debate in the blogosphere about whether Google has come to the IaaS table too late, I believe that by delaying, it has gained a unique advantage over the competition: hindsight. Google has been able to observe the challenges and pitfalls of other IaaS offerings (not to mention gain tremendous experience with warehouse-scale service computing) and to identify an increasingly common pain point of IaaS users. Moreover, Google has used this experience to design an IaaS from scratch, whereas other vendors can only retrofit best-effort improvements onto their existing systems.
As a result, Google Compute Engine is a next-generation IaaS system that provides resource performance stability via a set of novel engineering advances. These advances include: customized virtualization under KVM; advanced resource isolation technologies, such as specialized Linux control groups that shield one process from others; clever data replication/redundancy strategies; novel datacenter design and geographic placement; and dedicated high-speed fiber networks between well-thought-out and proven software services, such as App Engine, Cloud Storage, BigQuery, and YouTube.
By focusing on developing a scalable, performance-stable IaaS system, Google has the opportunity to offer virtual machine instances with consistent resource performance at very low cost. By doing so, not only will users save on VM instance use, but they will also save the time and effort that they are using today to re-architect their virtual instances and cloud applications to overcome performance instabilities of other IaaS systems. If this promise of performance stability from Compute Engine comes to fruition, I believe that we will see the needle of IaaS market share move quickly in Google's direction.