Computer, Heal Thyself

Software increasingly will be able to adapt without human intervention. Will managers be able to let go?
Too bad computers aren't more like people. When we work harder, our hearts beat faster. When we're hot, we sweat. But in the 54 years since British mathematician Alan Turing introduced the notion of artificial intelligence, computer scientists haven't delivered anything close to a self-aware and self-healing computer.

That may change soon enough. Researchers in business and government labs are building systems that will challenge what it means to be an IT worker by automating many of the monitoring and maintenance tasks done today by hand. Scientists in labs from IBM to the Department of Defense are developing adaptive systems able to manage, heal, defend, configure, and optimize themselves without human intervention, just as our bodies can combat an infection without conscious effort. While that ultimate goal is still years away, the first generation of business-ready, self-adapting software tools is within reach.

Biology is more than an analogy here. Researchers have turned to the natural world for inspiration in developing adaptive technologies flexible enough to cope with the increasingly complex computer and business systems that drive our world. Whether through cell technology or the social interactions of ants, biology provides ideas for researchers who want computer users to be able to focus on business goals without having to tell a server the optimum way to do so. It's similar to how a person focuses on a goal--say, climbing a mountain--without consciously telling the body everything it must do to get there.

The motivating factor behind it all: to wage war on complexity. The interlocking pieces of software that make up business computer networks will soon be beyond the comprehension of most IT workers. Plus, these complex systems tend to be fragile, breaking down when even minor changes are made. The complexity results in cost overruns, implementation delays, staffing problems, productivity losses, and missed business opportunities. "As we talked to our customers, we kept hearing the same refrain," says Greg Burke, director of IBM's eLiza project. "Technology is too complex and IT departments are having trouble keeping up with the maintenance requirements in their multivendor, multiarchitecture environments."

The eLiza project team at IBM is developing hardware, software, and networks that will be able to allocate computing resources as needed, safeguard data, and ensure business continuity in case of a disaster. It's already brought some of that effort to the market, such as the eLiza E-business-management services that match IT resource availability with business requirements to make sure business-performance levels are met.

One major challenge to implementing self-managing technology will be prying the fingers of managers from the controls of their IT systems. As with any new automated technology, the transition will be difficult, as the IT equivalent of John Henry tries to prove he can still optimize a network better than a computer can. George Vrabel, a Jacksonville, Fla., senior audit director for Bank of America Corp., is mindful that fixing a bug often introduces new errors. "Self-healing and -configuring systems will be good--as long as they don't create other problems," he says.

The majority of the auditing work to make sure computer architectures and IT security meet bank policies is done manually today, Vrabel says. He'd be comfortable using a proven system, he says, though he would still want to be informed when something breaks, even if the system can fix itself.

Vendors are starting to deliver some early examples of self-healing software features to the marketplace, such as Windows XP's ability to automatically grab software updates over the Internet, notes Charles Nettles, chief technology executive at McKesson Corp., a pharmaceutical and medical-supply company in San Francisco. Nettles expects such tools to greatly boost worker productivity by reducing downtime, but his experience has taught him to keep McKesson from being too early an adopter. "We all know that vendors are capable of making extraordinary claims," he says. "How many times have we heard a vendor say its products were fully integrated only to find they weren't?"

Creating software that's aware of its own behavior and that of interacting components has been a largely unrealized dream in the industry. The most successful attempts are in the area of hardware configuration, such as servers that automatically switch among redundant disks to prevent a total system shutdown. The computer switches that run telecommunications networks also use advanced routing algorithms to move traffic around system outages.

Microsoft has had mixed results trying to make its systems easier to administer. In 1997, it brought out its Zero Administration initiative for automating the administration of client PCs running Windows. It included automatic updates of the client operating system and applications, centralized administration and system lockdown, and persistent caching of data and configuration information. But IT managers complained the utilities often caused software incompatibilities and system failures, says Craig Mundie, Microsoft's chief technology officer. "Going forward, we don't want to force automated configuration on IT managers," he says. "Now, we send them notice of upgrades and let them decide how and when to implement them."

Microsoft has been somewhat more successful with self-healing features in its SQL Server database, introduced in 1998. The software that makes it possible is based on control theory, a discipline used by electrical engineers for years and applied to software design by the Oregon Graduate Institute, says David Campbell, a SQL Server architect. Microsoft adapted that research work, along with index-tuning capabilities developed by Microsoft Research, to give SQL Server the ability to adjust to the shifting demands of a live database environment. For example, SQL Server 7 and SQL Server 2000 can allocate memory automatically, where and when it's needed, to address things such as input/output demands or the size of a buffer pool. The self-adjusting capabilities are widely used by SQL Server customers, Campbell says.
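The control-theory idea behind that kind of self-tuning can be pictured as a feedback loop: measure something, compare it with a target, and nudge a setting in proportion to the gap. The Python sketch below is only an illustration of that principle, not SQL Server's actual algorithm. The function name, the cache hit-ratio target, the gain, and the size limits are all hypothetical.

```python
def adjust_buffer_pool(current_mb, hit_ratio, target_ratio=0.95,
                       gain_mb=400.0, min_mb=64, max_mb=4096):
    """Return a new buffer-pool size (MB) using proportional feedback.

    Every name and number here is an illustrative assumption.
    """
    # Positive error: the cache misses too often, so it should grow.
    # Negative error: the cache is larger than it needs to be, so it can shrink.
    error = target_ratio - hit_ratio
    proposed = current_mb + gain_mb * error
    # Keep the result inside the limits the administrator allows.
    return round(max(min_mb, min(max_mb, proposed)))
```

Called once per monitoring interval, a loop like this grows the pool while the hit ratio lags the target and gives memory back once the target is exceeded, with no administrator in the loop.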

Adoption of self-managing systems will play out in three phases, Mundie predicts. First, there will be automatic online updates such as what Windows XP does. Next will come policy-based IT systems, created by adding software to existing hardware and operating systems, which do automatic configurations and software deployments, based on corporate IT policies. These two phases will happen in the next three years, Mundie says. But it will take 10 to 20 years--and require new software, operating systems, and architectures--for true self-healing systems to be adopted by business. The struggle to create those systems is playing out now in government and private labs.

The technology behind IBM's eLiza project is being developed by a team of computer scientists, physicists, and mathematicians at IBM Research. During the past decade, the team has developed learning algorithms that can detect whether a computer system is healthy or in decline. The algorithms can automatically sense when resources need to be reallocated and make the appropriate fixes without human intervention. The technology has been used in the MVS mainframe operating system since 1994--optimizing the thousands of processes, CPU configurations, and other features of the system that became too complex for system administrators to handle manually. Now, IBM's challenge is to deploy the technology in the distributed world.

Enter IBM's Heterogeneous Workload Management software, which uses self-teaching algorithms to automate much of the system configuration work, such as allocating CPU capacity, now done by IT workers. The software takes a snapshot of the system every 10 seconds to monitor changes in performance. It's being tested by IBM customers in the insurance and financial-services fields, and the first iteration will be integrated into IBM server operating systems later this year. It should provide much-needed help: A server can be configured in about 500 ways, so an administrator may take days or weeks to figure out a good configuration for optimizing the company's environment.

Still, the goal is to expand beyond servers to manage all pieces of a computing infrastructure, from operating systems and networks to applications and middleware. "There's no other software today that performs automatic system configuration," suggests Donna Dillenberger, IBM Research's senior technical staff member in Hawthorne, N.Y. "Humans have to hand tune every application, operating system, and every interface between the various parts of a system." The Workload Manager will eventually be integrated into other IBM products, including the Tivoli system-management suite.

If adaptive systems sound too good to be true, IBM's ultimate vision for autonomic systems is even more grand: Managers set business goals, and computers automatically set the IT actions needed to deliver them. For example, in a financial-trading environment, a manager might decide that trades have to be completed in less than a second to realize service and profitability goals. It would be up to tools such as Workload Manager to configure the computer systems to meet those metrics.

Frederic Lalonde, CTO of NewTrade Technologies Inc. in Montreal, is putting self-adapting systems to work in the real world. He's been using Sun Microsystems' Jini networking architecture and development tools for two years to build Web sites, including a travel portal linking 15,000 businesses. Lalonde uses the Jini network technology to connect small hotels to the travel portal and other reservation systems. Jini handles network errors, distributes code over the network, remotely updates software, and has self-diagnostic capabilities that help it handle IP modifications and network outages without intervention, which is critical because most small hotels don't have IT staffs. The drawback: The system isn't good for high volumes of traffic, Lalonde says.

Yet high volume is driving the need for self-managing technology. As a growing number of devices connect through the Internet, network technologies must deal with the resulting scale and complexity. "We're no longer talking about one person accessing one application on a local PC or on a network server," Sun CTO Greg Papadopoulos says. "We're talking about millions of users accessing a service over the Internet. Given that scope, these services and the underlying infrastructure have to be much more robust than current-day software." There's no alternative to creating support infrastructure that's self-organizing, he says.

Individual components might break, but Jini's sensing ability reallocates system resources to components that are up and running. Jini is built on the idea that everything on a network is or will soon be a service--be it hardware or software resources--and that each service will have a discovery mechanism. When a service goes down, the network heals itself by no longer making that problematic service accessible to other services and devices. Jini is one of the more mature adaptive technologies, with 80,000 licensed developers and 75 commercial implementations, including the U.S. Navy's DD 21 Destroyer's self-healing network.
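One common way to get that self-healing behavior is to make every registration temporary, so a service that stops renewing its entry simply drops out of sight. The Python sketch below illustrates the idea with a toy registry. It is not Jini's API (Jini is a Java technology), and the class name, method names, and 30-second lease are assumptions made for the example.

```python
class ServiceRegistry:
    """Toy lookup service: entries vanish unless their lease is renewed."""

    def __init__(self, lease_seconds=30.0):
        self.lease_seconds = lease_seconds
        self._expiry = {}  # service name -> time at which its lease runs out

    def register(self, name, now):
        """Advertise a service, or extend the lease of one already listed."""
        self._expiry[name] = now + self.lease_seconds

    renew = register  # renewing a lease is the same as registering again

    def lookup(self, now):
        """Return the names of services whose leases are still valid."""
        # Drop expired entries so clients never discover a dead service.
        self._expiry = {n: t for n, t in self._expiry.items() if t > now}
        return sorted(self._expiry)
```

A healthy service calls `renew` on a timer. A crashed one cannot, so it disappears from `lookup` results after at most one lease period, without anyone having to notice the failure and remove it by hand.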

The concept of making a reliable system out of unreliable parts, essentially conceding that software code will always be flawed, provides a major theme of research being done by the Defense Advanced Research Projects Agency (Darpa), an arm of the Defense Department that funds many of the country's cutting-edge IT research efforts. Darpa is looking to make greater use of commercial software. Most of the software that runs jet fighters, logistics operations, and other vital functions in the military was developed from scratch and put through rigorous quality testing. That's slow and expensive, so the military wants to take advantage of commercial components.

But commercial software developers are famous for a ship-then-fix mentality, and the military needs a way to insulate its systems from potential failures. That's why Darpa is guiding a project called Dynamic Assembly for System Adaptability, Dependability, and Assurance. The project has many facets, such as one under way at Darpa partner SRI International Inc., a private research lab in Menlo Park, Calif., to build software dependability gauges.

Building an application by combining components hasn't worked well because it's so difficult to get pieces to fit together, says Victoria Stavridou, SRI's director of system design. SRI is building these software gauges so developers and the systems themselves can spot problems and allow for greater tolerance of component differences. Stavridou contrasts software interfaces with the physical world of machines, where if one brand of replacement part isn't available, chances are something else will work, even if not exactly as the manufacturer envisioned. "We need to monitor that change and have systems adapt accordingly without requiring human attention," she says.

Though the primary consumer of SRI's work is the Defense Department, Stavridou expects the technology to make its way into commercial software in the next couple of years. Much of the effort is aimed at correcting past wrongs. Systems were not designed to be configurable, secure, and self-healing. "You don't design a jet aircraft and then think about gravity after the fact," says Stavridou, who developed some of the software that runs the F-14 Tomcat fighter (see story, p. 34). "The same should be true with system security and reliability."

Researchers at Hewlett-Packard Laboratories in Palo Alto, Calif., are working to create an adaptive, self-managing computing infrastructure called Planetary Computing. The Planetary Computing project is envisioned as a 50,000-node computing fabric that can be built, reconfigured, and managed automatically. All system changes will occur in the software.

One element of the ambitious effort, Self-Organized Services, attempts to bring adaptive abilities to the computing infrastructure. The group is trying to develop tools to measure overall system behavior, then evolve to the point that the tools can automatically control behavior of system components in distributed environments, says Rich Friedrich, a principal architect at HP Labs. The final goal is to develop systems that can perform some form of reasoning to assess how a particular device or software component should be operating.

Another HP research initiative, called SmartFrog (short for Smart Framework for Object Groups), is a framework for automatic system configuration. In December, HP released an open-source version upon which commercial products eventually will be based. It's designed to help developers create templates that remember which objects need to be linked and compiled in a particular order to have an application run properly. Today, programmers do this messy work by hand, and it can be difficult for a new programmer just joining a project to retrace a predecessor's footsteps.
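The core of such a template, a record of what depends on what from which a valid order can be derived, can be sketched in a few lines of Python. This is not SmartFrog's description language or API. The component names are invented, and the example leans on the standard library's `graphlib` module (Python 3.9 or later) for the ordering.

```python
from graphlib import TopologicalSorter


def start_order(depends_on):
    """Return an order in which every component follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


# Hypothetical template: each component maps to the components it needs first.
template = {
    "web_app": {"app_server", "database"},
    "app_server": {"os_config"},
    "database": {"os_config"},
    "os_config": set(),
}

order = start_order(template)
print(order)  # os_config comes first and web_app last
```

Once the dependencies are written down, the order no longer lives in one programmer's head. A newcomer can regenerate it, and a circular dependency is reported as an error rather than discovered at run time.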

The ultimate goal of adaptive computing isn't just to have smart, self-healing systems, but to have smart business processes. That's the prize researchers at Sun are aiming for with a product-forecasting system that constantly monitors its own performance and tests assumptions about business execution. When something out of the ordinary occurs, such as a dip in one product's sales or a spike in a component cost, the software alerts managers and learns to adapt to changes in business data.

The software, developed by a team of engineers at Sun Labs, relies chiefly on the mathematical ideas of Bayesian statistics, which assign probabilities based on past events. "This approach lets the software update assumptions that managers have made about how much of a certain product Sun is going to sell and change those assumptions as new sales data comes into the system," says Phillip Yelland, who leads the Sun Labs effort.

Sun has been using the system in a six-month in-house pilot involving seven product lines and 15 business managers. The system can keep several scenarios in play at once, weighing each one against actual performance data and changing the weighting of certain business assumptions as business conditions change to make projections about future sales.
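The scenario weighting described here maps naturally onto Bayes' rule: each scenario's weight is multiplied by how well it predicted the latest figure, and the weights are then rescaled to sum to one. The Python sketch below shows that mechanism under stated assumptions. It is not Sun's system: the scenario names, the sales numbers, and the bell-curve error model with a fixed spread are all invented for illustration.

```python
import math


def update_weights(weights, forecasts, observed, sd):
    """Bayes' rule: reweight each scenario by how well it predicted `observed`.

    Assumes forecast errors follow a bell curve with standard deviation `sd`.
    """
    unnormalized = {}
    for name, prior in weights.items():
        miss = (observed - forecasts[name]) / sd
        likelihood = math.exp(-0.5 * miss ** 2)
        unnormalized[name] = prior * likelihood
    total = sum(unnormalized.values())
    return {name: value / total for name, value in unnormalized.items()}


def blended_forecast(weights, forecasts):
    """Projection that mixes the scenarios according to their current weights."""
    return sum(weights[name] * forecasts[name] for name in weights)


# Hypothetical example: two scenarios start out equally plausible.
forecasts = {"optimistic": 1200, "pessimistic": 800}
weights = {"optimistic": 0.5, "pessimistic": 0.5}
weights = update_weights(weights, forecasts, observed=850, sd=200)
```

After a single observation of 850 units, most of the weight shifts to the pessimistic scenario, and the blended projection moves toward it. The optimistic scenario stays in play with a smaller weight, so it can recover if later data favor it.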

So, is it time to fire the product managers and computer programmers? Certainly not. Managers use the system to keep track of more business factors than they could keep in their heads, but they still need to mix the data and intuition to decide which products to manufacture and when. Says Yelland, "We're not trying to supplant human intelligence, just help it along."