Where Agile Development Fails: IT Operations

Agile developers want fast and frequent deployments. IT operations teams want stability. A growing movement is trying to bridge the gap.
Alternatives From The Web

Computer science courses seldom mention operations, and freshly minted developers naturally gravitate toward "the cool stuff they're going to build" as opposed to the more staid disciplines of keeping systems running, says Scott Ambler, chief methodologist for agile and lean development in IBM's Rational division and columnist for Dr. Dobb's. "It's really a blind spot," Ambler says. Even experienced enterprise IT developers may lack the breadth of experience needed to share part of the IT operational responsibility.

Yet there might be hope sprouting in the very "cool" companies where young, ambitious programmers want to work: Web companies like Google, Facebook, and Amazon that don't tolerate gaps between development and operations. Amazon CTO Werner Vogels has famously said that developers should also be operators. Amazon's e-commerce operation is now organized around discrete application services that are called through an API. Developers of a service at Amazon bear the primary responsibility for its operation throughout its life cycle. "You build it, you own it," Vogels said in the May 2006 issue of the Association for Computing Machinery's Queue magazine.

Google and Facebook build new releases and put them in production on a weekly basis, says Ambler. These frequent releases increase risk of failure from an operations point of view, but they also reduce risk by keeping the number of changes included in an update small and manageable. That way, they know where to look if there's a problem.

Flickr goes a step further and issues multiple releases of production systems daily, says Humble. That pace recently led to four outages at Flickr, but each lasted only about six minutes, he says--the amount of time that Flickr developers needed to isolate the most recent changes and identify what was wrong. Outages in a new release of a typical enterprise system are much harder to track down because of the volume of changes.
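To make the point about release size concrete, here is a minimal sketch, assuming the changes in a release are applied in order and that exactly one of them breaks the system. It is an illustration only, not a description of how Flickr, Google, or Facebook actually troubleshoot. The function name and the `is_healthy_after` probe are hypothetical, and in practice each probe could mean a rollback and a retest.

```python
from typing import Callable, Sequence, Tuple


def first_bad_change(
    changes: Sequence[str],
    is_healthy_after: Callable[[int], bool],
) -> Tuple[str, int]:
    """Bisect an ordered list of changes to find the one that broke the system.

    is_healthy_after(i) is a hypothetical probe that returns True when the
    system is healthy with changes[0] through changes[i] applied. The sketch
    assumes exactly one breaking change, after which the system stays broken.
    Returns the offending change and the number of probes that were needed.
    """
    if not changes:
        raise ValueError("a release must contain at least one change")
    lo, hi = 0, len(changes) - 1
    checks = 0
    while lo < hi:
        mid = (lo + hi) // 2
        checks += 1
        if is_healthy_after(mid):
            lo = mid + 1  # the breakage was introduced later
        else:
            hi = mid  # the breakage is at mid or earlier
    return changes[lo], checks


# A small, frequent release: four changes, the third one is faulty.
small_release = ["c1", "c2", "c3", "c4"]
small_culprit, small_checks = first_bad_change(small_release, lambda i: i < 2)

# A large, infrequent release: 400 changes, number 250 (zero-based) is faulty.
large_release = [f"change-{n}" for n in range(400)]
large_culprit, large_checks = first_bad_change(large_release, lambda i: i < 250)
```

Even with an efficient search, the large release needs several times as many probes as the small one, and each probe against a complex enterprise system is likely to take far longer than it would for a small Web release.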

But large, uniform Web applications such as Facebook or Google are very different from the enterprise data center, with its complex mix of heterogeneous applications. When an outage occurs in that environment, it usually results in the dreaded "bridge call," pulling every expert on the infrastructure together for a lengthy troubleshooting session.

Nationwide Embraces Agile

Tim Heller has lived through that more than once at Nationwide, in his former role as associate VP of IT for applications at the insurance and financial services company. "I remember one time I was having a cookout at my home for 40 people, and I was inside on a bridge call," he says. "I understand operational problems."

Heller is now associate VP of IT for applications in Nationwide's Development Center, where he leads 26 development teams that use agile methods to partner with 7,000 IT staffers distributed throughout Nationwide's 23 business units. Ideas for new services filter up from the business areas; when a project gets the OK, the Development Center provides teams of about a dozen people to work through a project, bringing project management and development techniques, including a heavy emphasis on agile tenets of frequent software builds and daily interaction with business sponsors. An operations staffer who understands the business value of the project is recruited to the team to provide documented input, such as whether the app will need to scale up for peak demand at the end of each quarter.

Heller calls the process they follow "acceptance test-driven development." The code that's turned over to production generally deploys without mishap because it has been written and tested with operations' concerns in mind. The test environment "mimics the production environment," says Heller (see story, p. 30). Automated testing kicks off each time developers complete a build, and it takes minutes, compared with the days previously spent in manual testing of a completed project. "We know almost immediately if it doesn't perform as expected," says Heller.
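As a rough illustration of what an automated acceptance test with an operations concern built in might look like, here is a minimal sketch using Python's standard unittest module. It is not Nationwide's code. The traffic figures, the per-instance capacity, and the names `instances_needed` and `run_acceptance_suite` are all assumptions made up for the example; the quarter-end peak stands in for the kind of documented input an operations staffer might supply.

```python
import unittest

# Hypothetical requirement from the operations staffer on the team:
# the app must absorb a quarter-end peak of three times normal traffic.
NORMAL_REQUESTS_PER_MIN = 1_000
PEAK_MULTIPLIER = 3


def instances_needed(requests_per_min: int, capacity_per_instance: int = 500) -> int:
    """Hypothetical capacity rule: instances required, rounded up, minimum one."""
    if requests_per_min < 0:
        raise ValueError("requests_per_min must be non-negative")
    # -(-a // b) is integer ceiling division.
    return max(1, -(-requests_per_min // capacity_per_instance))


class QuarterEndAcceptanceTest(unittest.TestCase):
    """Acceptance criteria written before the feature is considered done."""

    def test_normal_load(self):
        self.assertEqual(instances_needed(NORMAL_REQUESTS_PER_MIN), 2)

    def test_quarter_end_peak(self):
        peak = NORMAL_REQUESTS_PER_MIN * PEAK_MULTIPLIER
        self.assertEqual(instances_needed(peak), 6)


def run_acceptance_suite() -> bool:
    """Run the suite as a build step would; True means the build may proceed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(QuarterEndAcceptanceTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

A build server could call `run_acceptance_suite()` after every build and stop the pipeline when it returns False, which is one way to get the near-immediate feedback Heller describes.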

Development Center teams have produced 100 applications, and 70% have been defect-free, he says. Before using the short build-and-test cycles, deployment was "manually intensive--an up-all-night event. Now we run a script. … We deploy in a fraction of the time and know within an hour or a few hours if the deployment has succeeded," Heller says.

Nationwide doesn't exactly live the Amazon dictum--Nationwide's version is more like "you build it, you run it (for just a little while)."

"We try to keep pure project development and operations somewhat segregated," Heller explains. A development team is responsible for how the software runs in production for a short time after deployment. Then that development team pivots back to producing new features.

This approach to development has let Nationwide cut the share of each team devoted to testing from between 25% and 30% to between 10% and 15%. The process is working well enough that Nationwide plans to increase its agile Development Center to 60 teams, from 26, by 2014.

The complications of integrating agile development and IT operations will vary with each industry and company. For example, it can be very difficult when software is being created for deployment in a third-party enterprise customer's data center, says Todd Little, an agile leader in the Landmark Software and Services unit of Halliburton. Landmark creates software for oil and gas companies, and those customers prefer infrequent updates. "We've introduced barriers to slow down the ability to push out changes," Little says. One of those is to check the intellectual property of the code often during development--to spot patent violations or help determine whether Landmark should apply for a patent.

Little is torn about how much to use agile. As the former chairman of the Agile2011 conference for the Agile Alliance, he's a true believer in agile methods. But he admits that there are challenges on the enterprise level that haven't been sorted out, such as the rate of change that's best for operations and how to handle IP and compliance issues.

The big-picture goal now is to build more quality into the agile process from project start to finish. The auto industry went through this evolution, where designers had to learn to create cars while considering whether the models were practical to manufacture. There's no one right way to meet the goal of creating software that meets the needs of business units and IT operations. The dev ops goal isn't to build precise, highly defined contributions into every project. Rather, it's "for each side to help the other," says IBM's Ambler.
