Cloud // Platform as a Service
Commentary
3/25/2013
02:00 PM

How Netflix Is Ruining Cloud Computing

A laser focus on Amazon Web Services and seeming disregard for next-gen best practices could spell lock-in, and derail real IaaS competition.

On March 13, Netflix announced $100,000 in prize money for the developers who do the most to improve its open source tools for controlling and managing application deployments on cloud computing. Before spearheading this contest, Netflix's cloud architect, Adrian Cockcroft, released many internal Netflix tools as open source. Currently, eight cloud-architecture-specific tools are available from Netflix, and Cockcroft has been very open in sharing his and Netflix's knowledge in public forums.

In theory, all of this should be wonderful. In reality, however, it's likely to leave cloud computing with an enormous hangover of subpar practices and architectures for years to come. Netflix is the poster child for "Cloud Computing v1.0" and demonstrates both its enormous benefits and its troubling problems. Cloud Computing v1.0 is strictly an Amazon Web Services affair -- AWS was first, and no other provider had the core features necessary to build comparable applications (think multiple availability zones and EBS with snapshots and quick restores). So it makes sense that Netflix embraced AWS; it saw huge benefits in being able to deploy and scale its service using the interfaces and architectures that were possible when AWS launched.

But Netflix has also suffered repeatedly at the hands of Cloud Computing v1.0 with four outages in 2012 alone, which certainly points to the possibility for some improvement in the high availability of its service. Of note, the Christmas Eve outage is perhaps most troubling from a "v1.0" perspective, as it was solely the result of Netflix's reliance on a less-necessary AWS service for load balancing, which could have been handled in any number of other ways to increase server availability.

The Netflix contest is likely to leave organizations worse off because it thoroughly embraces this "Cloud Computing v1.0" mindset, both from an "AWS-is-the-only-vendor" standpoint and from an architectural standpoint. While it's arguable that there still isn't (quite yet) another infrastructure-as-a-service (IaaS) vendor that has a thoroughly tested core feature set, unless you just walked out of the tattoo parlor with "#AWS" on your shoulder, you know it won't be long. And all companies running on AWS should be looking forward to the rise of additional IaaS vendors, like those in our IaaS buyer's guide, for two reasons: higher availability and price competition.

Every cloud architect should know that it's only a matter of time before organizations have applications deployed across the world on many different IaaS providers in many different data centers, based on request volume and location in combination with a market for computing resources that changes price constantly. Locking yourself down to AWS today, for greenfield cloud architectures, would be the equivalent of deciding to develop an iPhone-only application when you know you'll have to support iPads, Android and others in the future.
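To make the multi-provider idea concrete, here is a minimal sketch of how a placement decision might combine location and a changing price. Everything in it is an assumption for illustration: the provider names, prices and latencies are invented, and `pick_offer` is a hypothetical helper, not part of any real tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    provider: str          # hypothetical provider name
    region: str
    price_per_hour: float  # current price in USD (invented figures)
    latency_ms: float      # latency from where the requests originate


def pick_offer(offers, max_latency_ms):
    """Return the cheapest offer that meets the latency bound, or None."""
    eligible = [o for o in offers if o.latency_ms <= max_latency_ms]
    if not eligible:
        return None
    # Cheapest first; break price ties by lower latency.
    return min(eligible, key=lambda o: (o.price_per_hour, o.latency_ms))


offers = [
    Offer("provider-a", "us-east", 0.12, 40.0),
    Offer("provider-b", "us-east", 0.10, 95.0),
    Offer("provider-c", "eu-west", 0.08, 180.0),
]
best = pick_offer(offers, max_latency_ms=100.0)
none_close_enough = pick_offer(offers, max_latency_ms=10.0)
```

A decision like this only works if the application can be deployed the same way on every provider in the list, which is the portability argument the paragraph above is making.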

In addition to the annoying AWS-centrism of the Netflix contest, there's a deeper problem: Some of Netflix's tools embrace a cloud architecture that was fine in the days of Cloud Computing v1.0 but that will look increasingly suspect as time goes on. I know that it's hard to throw out code and systems that are working fine, especially when they still look pretty good -- and often, squeaking out a bit more time is the right internal decision for an individual company. But instead of just wringing out the last bits of value, Netflix is throwing significant money at the rest of the world, asking everyone to embrace and extend their tools and code that are not particularly good practices for future cloud architectures.

Perhaps the best example of a bad-practice Netflix tool is Aminator. Aminator helps you build Amazon Machine Images (AMIs) easily, based on a "base" AMI and a package of code. "I must have produced about 25,000 Ubuntu AMIs," raved one excited early user. There's just one problem: It's hard to understand when this would ever be a good idea. Several years ago, spawning tons of images would have been a somewhat acceptable way to roll out a revised version of an application (due to application code, operating system, and/or server software). But today we have widespread adoption of configuration management tools like Chef and Puppet that make massive AMI creation a subpar practice at best. Amazon Web Services itself recently rolled out a service called OpsWorks, which would be a significantly better way to handle deploying applications -- it uses Chef.

There are other less-bad tools, but many bear the mark of having to architect around a number of issues that have since been more or less resolved; it's a bit like an open source project that relies heavily on SOAP instead of being RESTful. For example, Edda, which figures out what cloud resources you're using at AWS, just seems like something that had to be built because no one properly set up how resources should be requested and deployed. And Asgard, a very cool tool from 2010 for managing a variety of different applications across AWS, would be a hard sell as a best-of-breed tool today compared with other open source options, notably Scalr and Chef.

This is not to say that all of Netflix's open source cloud tools fit into this mold. Denominator is a great DNS manager (because it's multi-cloud), and Simian Army is a fabulous, ground-breaking idea for testing cloud architectures (it is, unfortunately, AWS-only).

There's a possibility that the Netflix contest will help lead the world toward Cloud Computing v2.0 and beyond by embracing multi-cloud architectures that use orchestration and configuration management in optimal ways. However, I am skeptical on both fronts. Cockcroft's public comments suggest little interest in using other cloud vendors. A good chunk of the prize money is in AWS credits, and Amazon's CTO is a judge; all this points to a very AWS-centric mindset. Moreover, the fact that Netflix just released Aminator last week indicates to me that Netflix is happy to roll out whatever tools they've built, regardless of whether they fit in with a best-practices modern cloud architecture.

But please, Netflix, prove me wrong. Embrace a less proprietary, more highly available, more standardized cloud -- and put Google's Urs Hölzle on the panel while you're at it. #UrsForNetflixJudge

Comments
adrianco,
User Rank: Apprentice
3/27/2013 | 5:02:38 PM
re: How Netflix Is Ruining Cloud Computing
Both ways of creating images are valid and tooling should be used to automate every step of the build process. There should be no hand crafted images. Every image should be traceable to the bits it was built from. That's the best practice of cloud.
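One way to picture "traceable to the bits it was built from" is a build manifest whose digest is derived from every input. This is a rough sketch under stated assumptions, not how Netflix's tooling does it: `image_manifest`, the base image name and the package versions are all hypothetical.

```python
import hashlib
import json


def image_manifest(base_image, packages):
    """Record every build input and derive a digest from those inputs."""
    inputs = {
        "base_image": base_image,
        # Sort so the digest does not depend on dictionary ordering.
        "packages": sorted(packages.items()),
    }
    canonical = json.dumps(inputs, sort_keys=True).encode("utf-8")
    return {"inputs": inputs, "digest": hashlib.sha256(canonical).hexdigest()}


# Same inputs in a different order yield the same digest.
m1 = image_manifest("base-image-example", {"myapp": "1.4.2", "nginx": "1.2.7"})
m2 = image_manifest("base-image-example", {"nginx": "1.2.7", "myapp": "1.4.2"})
# Changing any input yields a different digest.
m3 = image_manifest("base-image-example", {"myapp": "1.4.3", "nginx": "1.2.7"})
```

Tagging an image with such a digest lets anyone check which base image and package versions went into it, whichever of the two build approaches produced it.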

The chef at runtime approach works fine at small scale but breaks horribly at large scale. When you have lots of developers changing things at once you want to build using the latest bits and freeze that build for test and deployment. For availability you would need multiple distributed Chef servers, but you then have to guarantee that they are always in sync, which is one of the hard problems of distributed computing. Avoiding that problem has value.

Baking AMIs is wasting a cheap resource, and we have tooling to clean up the leftovers. These are implementation choices that should be made appropriately to the situation. The NetflixOSS PaaS is interesting to many enterprises who do have large scale problems, and who find that other PaaS solutions are currently optimized for startups and small scale applications.

If you only have 10s of instances, NetflixOSS is likely to be overkill. If you have 100s it becomes useful, with 1000s it's essential and 10,000s it's probably the only game in town at the moment. With 100,000s you are Facebook or Google anyway...
gregdek,
User Rank: Apprentice
3/27/2013 | 3:33:23 PM
re: How Netflix Is Ruining Cloud Computing
"Unfortunately, the only company with authorized rights to the AWS API (other than AWS) is Eucalyptus, so what you make sound so easy in your response is so fraught from a legal perspective that you're not going to find any other provider doing it."

Except, of course, that other providers *are* doing it right now, and have been doing it for years. Cloudstack has Cloudbridge. Red Hat has Deltacloud/Aeolus. And it's all open source. Sure, we're moving faster down this path at Eucalyptus -- but it's the exact same path.
jemison288,
User Rank: Moderator
3/27/2013 | 10:23:33 AM
re: How Netflix Is Ruining Cloud Computing
Thanks for the response. Unfortunately, the only company with authorized rights to the AWS API (other than AWS) is Eucalyptus, so what you make sound so easy in your response is so fraught from a legal perspective that you're not going to find any other provider doing it. Perhaps if AWS were less proprietary and more willing to contribute to the overall community, they would allow providers to implement it as their own APIs. Perhaps you could help put that pressure on them? Because once the AWS API can be used as an open standard, then Netflix's tools will instantly have a much bigger audience.

On AMIs: Your assertion that using Chef/Puppet for every launch "is not a good idea" assumes that you've got a lot of VM launches that will be identical (machine, cloud). Again, don't look at this from "what does Netflix do internally"; look at it from "general enterprise cloud adoption". By using an AMI-centric model of the world, you're (a) adding overhead to each release, (b) creating a management/cleanup/storage situation that you would not have otherwise, and (c) requiring yourself to treat launches in different regions/clouds differently--including verifying that you have the right images in the right places properly baked and ready to go. In contrast, using Chef/Puppet on every launch avoids every single one of those problems, and thus gives you much more flexibility. The cost of Chef/Puppet on each launch is that it adds (d) overhead (time, bandwidth) and (e) some additional level of fragility (how much depending upon where you're pulling files).

Your assertion that using Chef/Puppet on each launch is "not a good idea" shows how Netflix-centric your world is. For many, many people, Chef/Puppet on every launch is a much better business and technological decision than rolling AMIs for each release because the pain of (a), (b), and (c) is greater than the pain of (d) and (e). In fact, the fragility of (a) + (b) + (c) can be significantly greater than the fragility of (e).

Ultimately, this is not a referendum on "how Netflix should run its cloud architecture". This is a referendum on whether Netflix should have a responsibility to the cloud computing world to help novices understand best practices in running clouds versus running a contest that is more likely to promote sub-par use of the cloud.
jemison288,
User Rank: Moderator
3/27/2013 | 10:06:50 AM
re: How Netflix Is Ruining Cloud Computing
Again, the issue is not about whether Netflix's current business decisions work for Netflix; the issue is about whether Netflix's tools and contest are beneficial for the enterprise that is considering how to move to the cloud. Running "at scale" for Netflix is thoroughly unlike running "at scale" for the vast majority of enterprise cloud need.

I'm not advocating a "lowest common denominator" as much as I'm advocating a fundamental set of best practices that one should master before getting into questions like, "how to launch 5,000 VMs in six continents within 10 minutes." If someone came to you wanting to know good software development practices, wouldn't you want to start with the basics of using code repositories, code review, style guides, and a discussion of waterfall v. agile? Before you started talking about how to manage a team of 500 developers? Yet on the cloud side, you act like it's unimportant whether Netflix does the former. As someone who would like to see much more enterprise cloud adoption, I see it as very important.
Joe Sondow,
User Rank: Apprentice
3/27/2013 | 5:43:23 AM
re: How Netflix Is Ruining Cloud Computing
Auto Scaling Groups.

I'll put aside the inflammatory, hyperbolic headline of the editorial for a moment, and talk about Auto Scaling Groups. Let's see how many times I can mention Auto Scaling Groups. Somebody count for me please.

At the core of Asgard's functionality is the Auto Scaling Group.

When Eucalyptus asked what they need to do in order to run Asgard against a Eucalyptus server, I told them they need to implement Auto Scaling Groups, and stub out a few other unimportant Amazon services Asgard currently expects to call. A few months later, they came back and said they were done. I asked if they implemented scaling policies. Yep. CloudWatch metrics? Yessir. Scheduled actions? You bet. Great! Let's finish making this thing flexible enough to use a Eucalyptus server. Someone still needs to add configurability to Asgard for regions, endpoints, instance types, application provider, and cloud API authentication. Cloud prize, anyone?

When OpenStack support consultants ask me how they can run Asgard against OpenStack, I tell them that first OpenStack needs to support the concepts that make Asgard useful, specifically Auto Scaling Groups. If you want to use Asgard without Amazon and without a cloud that has Auto Scaling Groups, then I really have to ask why. That's like using a food processor to open an envelope; you might get it to work, but to what end? There's maybe one screen in Asgard that might be useful for launching an instance without an Auto Scaling Group, but we don't use that screen much. Instead, I recommend choosing some implementation of Auto Scaling Groups, either through Scalr, Amazon, Eucalyptus or RightScale. The Auto Scaling Group serves to name and version a cluster, while associating it with an owner, and guaranteeing that the instances are homogeneous. The important part is the named group of instances of a single immutable image. The dynamic scaling part is gravy, although it does save you a lot of money.
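The role described above (a named, versioned, owned group of instances that all run one immutable image) can be sketched as a small data structure. This is an illustrative model only, not the AWS Auto Scaling API or Asgard's code; the `ScalingGroup` class, its naming scheme and the image IDs are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class ScalingGroup:
    app: str
    version: int
    owner: str
    image_id: str  # the single immutable image every instance runs
    instances: list = field(default_factory=list)

    @property
    def name(self):
        # Hypothetical naming scheme: application name plus a version suffix.
        return f"{self.app}-v{self.version:03d}"

    def launch(self, count):
        """Add `count` instances, all from the group's one image."""
        start = len(self.instances)
        for i in range(start, start + count):
            self.instances.append(
                {"id": f"{self.name}-i{i}", "image_id": self.image_id}
            )
        return self.instances[start:]

    def is_homogeneous(self):
        return all(i["image_id"] == self.image_id for i in self.instances)

    def next_version(self, new_image_id):
        """A new image means a new group, never a change to this one."""
        return ScalingGroup(self.app, self.version + 1, self.owner, new_image_id)


group = ScalingGroup("api", 1, "team-a", "img-111")
group.launch(3)
successor = group.next_version("img-222")
```

Because a new image produces a new group rather than modifying the old one, the previous version stays intact alongside its successor. Dynamic scaling would sit on top of this by calling `launch` as load changes.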

As a partial substitute for the AWS Console, Asgard serves seven purposes for corporate Amazon customers, listed on the Netflix tech blog post where I first announced Asgard. (Google asgard tech blog). The purposes are: (1) Hide the Amazon keys, (2) Auto Scaling Groups, (3) Enforce conventions, (4) Logging, (5) Integrate systems, (6) Automate workflow, (7) Simplify REST API. When and if Amazon adequately addresses all seven of those issues in their own console, then I will gleefully recommend that Netflix deprecate Asgard and start using the AWS console instead. Then I'll go write some movie-related software instead. However, I'm not holding my breath. Amazon has a lot of other things to consider beyond supporting the cloud model Netflix has chosen. My prediction is that Asgard will remain a reasonable option for customers of cloud providers that have Auto Scaling Groups, starting with Amazon.

Is the publicity of Asgard putting pressure on cloud providers to implement both Auto Scaling Groups and usable graphic interfaces for configuring those Auto Scaling Groups? I hope so. That's one of the reasons I wanted to open source Asgard. If nobody can figure out how to use Auto Scaling Groups, then no one will use them. Then Amazon is less likely to add them to their console and less likely to augment them to be more useful, and Google is less likely to implement them. Auto Scaling Groups are great. Let's use them. Let's tell more cloud providers to provide them.

Will another company do as Eucalyptus did, and clone enough parts of the Amazon API to get free benefit from our tools? That would be good. Remember, Eucalyptus did most of that work before Amazon even talked to them. If cross-cloud-provider portability is your focus, my advice would be to add to Eucalyptus' open source implementation and make it plug into a dozen other cloud vendors the way it plugs into any data center. Personally I'm more interested in using so many isolated AWS regions that I don't need to worry about any one AWS system having a problem.

Now, let's talk a little more about AMIs.

Relying on a Chef/Puppet configurator for every production instance launch is not a good idea. It's a really bad idea. I don't know why anyone would regard deploy-time configuration as something new and good, while regarding pre-baked image launching as something old and bad. It's the other way around. You might be used to the idea of deploy-time configuration, but it's still a bad idea. It's an unnecessary risk. The point of Aminator is to give people a robust way to stop thinking in that old school way. I want people to start using Chef at build time, not deploy time. Use Chef with Aminator to create a complete image of the latest version of your application. Then know with certainty that every instance of that AMI will be identical in the development, test, staging, and production environments, in multiple redundant regions across four continents, even if the network fails during instance startup, even if the Chef server is getting upgraded or is falling over one day, even if a second deployment of the image happens months later. All the instances will be homogeneous within an Auto Scaling Group, all the time, even at large scale.
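The failure mode described here can be shown with a toy simulation. It is a sketch under loud assumptions: `fetch_config` stands in for a call to a configuration server, and none of these functions belong to Chef, Puppet or Aminator.

```python
class ConfigServerDown(Exception):
    pass


def fetch_config(server_up):
    """Stand-in for asking a configuration server what to install."""
    if not server_up:
        raise ConfigServerDown("config server unreachable")
    return {"app_version": "1.4.2"}


def bake_image(server_up):
    """Build time: resolve configuration once and freeze it into the image."""
    return {"image_id": "img-example", "frozen": fetch_config(server_up)}


def launch_baked(image):
    """Launch time: no network dependency; state comes from the image."""
    return dict(image["frozen"])


def launch_with_deploy_time_config(server_up):
    """Launch time: every new instance must reach the config server."""
    return fetch_config(server_up)


image = bake_image(server_up=True)     # server only has to be up at build time
baked_instance = launch_baked(image)   # launch does not consult the server

try:
    launch_with_deploy_time_config(server_up=False)
    deploy_time_ok = True
except ConfigServerDown:
    deploy_time_ok = False
```

The sketch covers only the availability side of the trade-off. It says nothing about the storage, cleanup and per-region image management costs that the deploy-time camp raises earlier in the thread.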

For the past 9 months, Aminator was the missing piece in the story of Asgard's ease of use. Now that there is a convenient way to produce a new AMI for each software build, it should be easier for people to use Asgard and Auto Scaling Groups for deployments without needing to rely on a highly available production deploy-time Chef server. If these resiliency concepts can be offered by more cloud providers, so much the better. I don't think that's ruining the cloud. I think that's promoting good patterns for tomorrow's cloud.
bmoyles,
User Rank: Apprentice
3/27/2013 | 3:27:20 AM
re: How Netflix Is Ruining Cloud Computing
This is bananas...

"And unfortunately, your first happy user of AMInator (on Twitter, at least) made over 25,000 Ubuntu AMIs with it--can you tell me why that would ever be a good architectural decision? AMInator strikes me as a tool like PHP or a GOTO statement--there are places where you should probably use them, but it's hard to argue that they should be part of any kind of "best practices" decision."

No, that was me. Not an aminator user, but one of the aminator *authors*. Feel free to verify that both my Twitter and Github accounts align and feel free to observe aminator's commit history. Heck, look at my Twitter profile, where it is very clear that I...work for Netflix. If you took *anyone's* offhand comment about creating *twenty five THOUSAND* AMIs (let alone one of the people working on the tooling to do so), all with the application 'cowsay' as their primary component as being anything other than a joke, I don't know what any of us can do for you to help you understand motivations or intentions (unless you are aware of any large-scale talking ASCII cow clusters, in which case I stand corrected).

"One of the reasons why Netflix is now choosing Python is because the generalized Python developer writes consistent and good code. (We chose Python for the same reasons you did). But to someone who has no idea what a good cloud deployment looks like, the way AMInator sits out there--you're going to see a lot more people like the guy super-psyched to have built 25,000 AMIs over Twitter."

We do not choose technologies based on what prospective developers *might* do with it, like write "consistent and good" code (and the assertion that Python, by some magical virtue, makes good programmers is hogwash. Good developers write consistent and good code regardless of the language, and many people write bad Python with ease.) We choose technologies that fit the job and the situation. Given that a) aminator was intended to be run ad-hoc or from some other automation, b) there is a fair amount of Python experience amongst Netflix employees, and c) languages such as Python (and Ruby and Perl and ...) lend themselves to rapid iterative development, we felt it was the right tool for the job. While I personally enjoy using Python, there is nothing about aminator that couldn't have been done with Ruby, Perl, TCL, PHP, Java, Groovy, Scala, or heck, bash (which is what aminator's ancestor was developed with).

Aminator was built to be modular, and any of its 5 major components (at this point) can be replaced to work with any system you can conceive of. There's no reason a set of plugins couldn't be developed that produced images for Windows on Azure, or local disk images for use with VirtualBox. What we provided was a framework and *our implementation* which *naturally* services our needs. Folks are free to use it as-is, or they can take what we have and replace parts with what works for them. I really hope they do, too. I'd love to see it produce Windows images, FreeBSD images, and so on.

More documentation on how to use and extend aminator is on its way, and Netflix staff is in #netflixoss on irc.freenode.net fielding questions as they come in. You too are welcome to join and ask questions before posting articles, in case that wasn't clear :)
cbabcock,
User Rank: Strategist
3/27/2013 | 1:26:58 AM
re: How Netflix Is Ruining Cloud Computing
I agree with Joe Emison when he upholds cross-cloud mobility and multi-cloud tools as the ultimate goal. I agree with Adrian Cockcroft when he pursues innovation and fresh ideas for the AWS context in which Netflix currently, and for the foreseeable future, operates. Each has a different purpose behind his argument, so their points are to some extent sailing past each other without registering and certainly without scoring. I don't like to see too much judgment passed on other people's architecture. We should apply that judgment to our own efforts and let the other fellow pursue his initiative to the max, even if it initially seems to fail to meet one or more of our ultimate, far-sighted standards. The cloud is young, and it's impossible to say how some seemingly small or narrow-minded effort might grow legs and lead to long-term gains for everybody. With that said, both sides have presented their cases well and I've learned from this debate. Charlie Babcock, senior writer, InformationWeek
adrianco,
User Rank: Apprentice
3/27/2013 | 12:41:59 AM
re: How Netflix Is Ruining Cloud Computing
Correction: The only DynamoDB support in NetflixOSS is contributed code by someone who was not working at Netflix at the time (a former employee). Netflix mostly uses Cassandra.
gregdek,
User Rank: Apprentice
3/27/2013 | 12:25:32 AM
re: How Netflix Is Ruining Cloud Computing
"The fact that only one out of ten prizes involves portability, and the fact that you take such an expansive view of portability to include adding language support to an existing tool (which has NOTHING to do with cloud portability!), shows that you really think that cloud portability unimportant to Netflix."

The fact that Adrian encourages us at Eucalyptus and our friends at Cloudstack to tackle the portability problem head-on shows otherwise.
adrianco,
User Rank: Apprentice
3/27/2013 | 12:11:57 AM
re: How Netflix Is Ruining Cloud Computing
You're using some emotive language here: "as long as you continue to force Netflix to use new and expanded Amazon-provided services over other options".

What actually happens is that Netflix has various problems to solve, and we do the usual make/buy evaluations, and sometimes we make it ourselves (e.g. Asgard over the AWS Console or other options) and sometimes we get vendors to build them. Part of that process is to work with AWS to build things we would like to use, but are of general use, so we don't want to build them ourselves. Part of the reason for releasing NetflixOSS is to make explicit to other cloud vendors the feature set and options that we have found useful, to see if they are also interested in responding.

You said "Perhaps AWS will release their API to the world and allow all businesses to use it openly, but they haven't yet, and so it's a very risky move to bet an architecture on AWS and any vendors (e.g., Eucalyptus) that AWS will bless."

For a start there are very many vendors who have implemented parts of the AWS API, and Eucalyptus have licensed the API so that they have access to the test suites. Nothing stops other vendors from doing the same.

It would only be a risky move to bet on an architecture that might go away. Betting on the industry leader with the dominant ecosystem looks like the lowest risk option to me.

Still, everyone is free to ignore NetflixOSS, there are plenty of other cloud architectures. But Netflix is innovating and scaling faster than the competition because we are also leveraging the innovation and scale of AWS.