Cloud // Infrastructure as a Service
Commentary
2/6/2014
09:06 AM
Chandra Krintz

Google Cloud's Big Promise: Performance Stability

Cloud services changed the IT ops game, but performance has been more iffy than many people realized. Google's Cloud Platform got to learn from its predecessors.

In December, Google announced general availability of its infrastructure-as-a-service (IaaS) Google Compute Engine. Compute Engine is one of the three pillars of the Google Cloud Platform. The other two are App Engine (Google's platform-as-a-service) and cloud storage (Google's SQL, NoSQL, and object storage).

Together, these pillars form a very large-scale compute and storage infrastructure behind a diverse set of scalable services that Google developers use heavily internally and that Google makes available for public consumption via software-as-a-service (SaaS) offerings and well-defined application programming interfaces (APIs).

Many in the blogosphere have discussed the price-performance and particular features of Compute Engine that set it apart from its competitors. Last March, GigaOm published a test drive of Compute Engine by Sebastian Stadil and his team at Scalr, a front-end cloud management firm, comparing Google Compute Engine to Amazon Web Services. Google's writes to disk were almost twice as fast as AWS's, the report said. Janakiram MSV, head of cloud infrastructure services at Aditi Technologies, published his take on "Ten Features That Make Google Compute Engine Better Than AWS," including 5x faster virtual machine boot times.

[Want to learn more about how Google positions Compute Engine? See Google Compute Cloud Challenges Amazon.]

However, I find the promise of "performance stability" most enticing. When public cloud computing first emerged we, the public, were happy to get our hands on compute and storage resources on demand, at low cost. Many of us also experienced great joy in our ability to avoid the burdens of traditional system procurement and IT processes. Over time, however, as cloud use has matured, the quality of resource performance, or its stability, has become increasingly important.

Cost of performance instability
The performance that we experience from a particular cloud resource (CPU, I/O, latency, bandwidth) can vary over time for many reasons, including:

  • Resource sharing with only best-effort isolation,
  • Interdependence among resources, where the performance of one (e.g., remote storage) is impacted by another (e.g., the network),
  • Hardware heterogeneity (similarly endowed devices can behave differently) and hardware failures,
  • Bugs and inefficiencies in complex cloud software stacks and virtualization technologies, and
  • Placement, migration, fault tolerance, and the addition and removal of resources and virtual instances.

Inconsistent resource performance impacts cloud users negatively in multiple ways. First, fluctuations in performance can be significant (up to 5x) and limit users' ability to accurately measure, reproduce, and predict execution time and cost. This inability to predict cost, performance, and load can lead to ineffective scaling decisions, manual or automatic, and may preclude public cloud use for certain application domains.
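To make this concrete, here is a minimal sketch (my own illustration, not from the article) of how a cloud user might quantify instability from the outside: time a workload repeatedly and report the coefficient of variation (CV), i.e., the standard deviation divided by the mean. A CV near zero indicates stable performance; the multi-x swings described above show up as a large CV.

```python
import statistics
import time

def time_operation(op, runs=20):
    """Time repeated executions of `op` and summarize the variance.

    Returns (mean runtime in seconds, coefficient of variation).
    """
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        op()
        samples.append(time.perf_counter() - start)
    mean = statistics.mean(samples)
    cv = statistics.stdev(samples) / mean
    return mean, cv

# Example: a small in-memory workload standing in for a cloud operation.
mean, cv = time_operation(lambda: sorted(range(100_000), key=lambda x: -x))
print(f"mean={mean:.4f}s  CV={cv:.2f}")
```

In practice one would point `op` at the resource under test (a disk write, a network round trip) and compare CVs across instance types or providers.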

Because public clouds are opaque, it is infeasible for ordinary users to measure this variance in order to account for it. Moreover, when instability is experienced by parallel workloads, such as MapReduce jobs, it can have a compound effect on performance. One reason for this is the "straggler problem," in which some tasks take much longer than their otherwise identical counterparts, for no apparent reason.

The cost of instability that is most interesting to me, however, is the human effort expended by developers, DevOps engineers, and system administrators. Performance fluctuation in IaaS has given users an incentive to modify their workloads to compensate for, mask, or otherwise avoid instability. Examples of such modifications include using multiple network-attached block storage devices; adding complexity to distributed programming systems to avoid stragglers, such as by killing off long-running jobs or executing multiple instances of the same job and taking whichever returns first; and gaming a cloud service's virtual instance placement.
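The second workaround above, launching duplicate copies of the same job and taking the first to finish, can be sketched in a few lines. This is my own hedged illustration (the helper name `first_result` is invented, not any provider's API), trading extra resource use for insulation from stragglers:

```python
import concurrent.futures

def first_result(job, args_list):
    """Run duplicate copies of the same job; return the first to finish.

    `args_list` holds one argument tuple per duplicate (e.g., the same
    input pointed at different workers or replicas). Remaining copies are
    abandoned once a winner returns.
    """
    with concurrent.futures.ThreadPoolExecutor(len(args_list)) as pool:
        futures = [pool.submit(job, *args) for args in args_list]
        done, not_done = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        for f in not_done:
            f.cancel()  # best effort; already-running work is not interrupted
        return next(iter(done)).result()

# Example: issue the same request against three hypothetical replicas.
answer = first_result(lambda x: x * 2, [(21,), (21,), (21,)])
print(answer)
```

The cost is visible in the code: every duplicate consumes billable resources whether or not it wins, which is exactly the kind of overhead a stable IaaS would make unnecessary.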

This additional effort can be extremely costly, both in cloud resource use and in human capital, and it is unnecessary if the public cloud provider itself addresses even a subset of the causes of performance instability. Enter Google Compute Engine.

Compute Engine's promise of stability
Although there is much debate in the blogosphere about whether Google has come to the IaaS table too late, I believe that by delaying it has gained a unique advantage over the competition: hindsight. Google has been able to observe the challenges and pitfalls of other IaaS offerings (not to mention gain tremendous experience with warehouse-scale service computing) and to identify an increasingly common pain point of IaaS users. Moreover, Google has used this experience to design an IaaS system from scratch, where other vendors can only retroactively bandage their systems and attempt best-effort improvements.

As a result, Google Compute Engine is a next-generation IaaS system that provides resource performance stability via a set of novel engineering advances. These include customized virtualization under KVM; advanced resource isolation technologies, such as specialized Linux control groups that shield one process from others; clever data replication and redundancy strategies; novel datacenter design and geographic placement; and dedicated high-speed fiber networks linking it to well-thought-out and proven software services such as App Engine, Cloud Storage, BigQuery, and YouTube.
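Google's actual control-group configuration is not public, but the underlying Linux mechanism can be sketched. The following is a hypothetical illustration using the cgroup v1 cpu controller (era-appropriate for 2014): it caps a process at half a CPU so it cannot degrade its neighbors. The cgroup name is invented, and the code requires root on a host with the cpu controller mounted at /sys/fs/cgroup/cpu.

```python
import os

CGROUP = "/sys/fs/cgroup/cpu/noisy_neighbor"  # hypothetical group name

def confine(pid, quota_us=50_000, period_us=100_000):
    """Cap `pid` at quota/period of one CPU (50ms per 100ms = half a core).

    Writes to the cgroup v1 cpu controller's CFS bandwidth files; the
    kernel scheduler then throttles the process once it exhausts its quota
    in each period, shielding co-located processes from it.
    """
    os.makedirs(CGROUP, exist_ok=True)
    with open(os.path.join(CGROUP, "cpu.cfs_quota_us"), "w") as f:
        f.write(str(quota_us))
    with open(os.path.join(CGROUP, "cpu.cfs_period_us"), "w") as f:
        f.write(str(period_us))
    with open(os.path.join(CGROUP, "tasks"), "w") as f:
        f.write(str(pid))  # move the process into the confined group
```

This is the commodity building block; the "specialized" part of Google's approach presumably lies in how such limits are tuned and combined per tenant, which is not public.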

By focusing on developing a scalable, performance-stable IaaS system, Google has the opportunity to offer virtual machine instances with consistent resource performance at very low cost. By doing so, not only will users save on VM instance use, but they will also save the time and effort they spend today re-architecting their virtual instances and cloud applications to overcome the performance instabilities of other IaaS systems. If this promise of performance stability from Compute Engine comes to fruition, I believe we will see the needle of IaaS market share move quickly in Google's direction.


Chandra Krintz is a professor of computer science at the University of California, Santa Barbara, and chief scientist of AppScale Systems Inc. AppScale is an open source cloud platform that is API-compatible with Google App Engine. She holds M.S. and Ph.D. degrees from ...

Comments
Stratustician | 2/12/2014 11:43 AM
You're still at the mercy of the Admins
Bless you Google for trying to solve one of the biggest issues with cloud stability, but the reality is that since you are still relying on the skillsets of the admin folks running the processes, it's really hard to control stability.  Throw in the many variables from connectivity to location, to server loads, not to mention the process types themselves, can we really assume that Google can control so much as to affect the stability of an entire environment?
ckrintz | 2/7/2014 12:39 PM
Re: Cloud Heroes
Great point -- and something we should all consider carefully: i.e., where we are putting our engineering effort in our on-going pursuit of "productivity gains".
ckrintz | 2/7/2014 12:30 PM
Re: Biggest wild card
Yes, you are right!  Locality is an important factor and the Net Neutrality discussion will play a key role here (as will the argument for on-premises implementations of popular IaaS systems/APIs).  What I am describing in this article is the performance variance within the datacenter.  If we can gain access to more stable systems inside the IaaS, it will be a big step in the right direction!
Charlie Babcock | 2/6/2014 12:52 PM
Netflix adopted workarounds to performance issues
Performance is the sleeper in evaluating cloud services. It's hard to compare pricing schemes, and perhaps harder to compare performance. Netflix certainly did so and was surprised at the variations it found on Amazon EC2. It then figured out workarounds to avoid them. But not everyone is Netflix.
Lorna Garey | 2/6/2014 10:56 AM
Biggest wild card
Smart geographic distribution of datacenters and caching schemes aside, what about the connectivity wild card? Unless a customer buys dedicated WAN bandwidth, what's Google's plan to smooth out the bumps of delivering service over the public Internet? And, is it watching the Net neutrality debate with some trepidation?
Laurianne | 2/6/2014 9:23 AM
Cloud Heroes
"The impact of performance fluctuation in IaaS has given users an incentive to modify their workloads to compensate for, mask, or otherwise avoid instability." Interesting. This feeds into a discussion we've been having this week about IT's hero complex. You may be going to great lengths to modify the app processing workload -- which really isn't creating economic efficiency for your company.