The road to data center automation has been long and arduous, and even innovative companies like eBay haven't perfected it. But eBay's closer than most.
The company is a third of the way through a major three-year grid computing initiative. It is developing software and employing technologies that can describe the relationships among the hardware and software in its data centers, with the eventual goal of making eBay easier to manage, quicker to upgrade, and scalable beyond imagination.
At eBay's scale, commercial software is often out of the question. After all, there are more than 3 petabytes of data to manage across six data centers. Even with such a complicated environment, the company's able to manage all that storage with fewer than a dozen administrators. As more management tools and methods fall in line, eBay is hoping to take a step toward automatic service-level management at scale.
There's plenty of information to sort through at eBay. The company's site lists 106 million active auctions at any given point in time, which can be accessed by any of its 243 million registered users. All told, the company has to manage more than 3 petabytes of data spread across about 600 production database instances, many of which run inside virtual machines, sitting on more than 100 computing clusters in six different data centers. One of the reasons this complexity -- the storage piece, anyway -- can be managed with so few people comes down to limiting the number of ways data is stored.
Many companies already do this on a lesser scale. For example, a company might store business-critical information on redundant, high-performance SANs and older information in backup. For eBay, that means fewer vendors and products to manage overall, and fewer simultaneous storage-related processes to manage.
"Management is made easier by having fewer things to manage," eBay distinguished research scientist Paul Strong, who helps to design and manage the company's massive IT infrastructure, said in an interview. "By having patterns and fixing processes around them, you minimize variability, risk, and cost, and you maximize efficiency and to some degree agility."
Beyond simplifying the IT infrastructure, management software is key for eBay, especially since the Web site is constantly changing. "Whenever we deploy new code or change something, we have to examine which other sets of services inside eBay's infrastructure they interact with," Strong said.
In some cases, eBay's infrastructure is so large that commercial products just won't cut it. "We would prefer commercial off-the-shelf software if it could meet our needs," Strong said. "However, historically we have tended to break most that we have used." Though distributed resource management tools -- software to manage grids -- from companies including Gemstone, GigaSpaces, and Oracle's Tangosol have evolved over the last few years, Strong said the functionality wasn't available when eBay first required it, which forced eBay to develop its own custom management software.
EBay uses commercial tools when it can. It plans to adopt those types of distributed resource management tools over the medium term. "After all, our business is about running a virtual economy, not designing and building systems and enterprise management tools," Strong said.
For the time being, new semantic and modeling technologies help eBay describe its systems, understand how they relate to one another, and even discover systems it didn't previously know were there. For example, eBay is beginning to use the Resource Description Framework and the Web Ontology Language, two Semantic Web technologies, to "store and query relationships" between and among the software and hardware in its networks.
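To make "store and query relationships" concrete, here is a minimal sketch of the subject-predicate-object idea behind RDF, written as a tiny in-memory triple store in plain Python. It is not eBay's implementation and does not use a real RDF library; every resource name and predicate (search-service, runsOn, dependsOn, and so on) is invented for illustration.

```python
# Hypothetical resources and predicates, invented for this illustration only.
TRIPLES = [
    ("search-service", "runsOn", "cluster-17"),
    ("search-service", "dependsOn", "listing-db"),
    ("checkout-service", "dependsOn", "search-service"),
    ("listing-db", "runsOn", "cluster-42"),
    ("cluster-17", "locatedIn", "datacenter-a"),
    ("cluster-42", "locatedIn", "datacenter-b"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def dependents_of(triples, resource):
    """Collect everything that directly or transitively dependsOn resource."""
    found = set()
    frontier = [resource]
    while frontier:
        current = frontier.pop()
        for s, _, _ in match(triples, predicate="dependsOn", obj=current):
            if s not in found:
                found.add(s)
                frontier.append(s)
    return found


if __name__ == "__main__":
    # Where does the (hypothetical) search service run?
    print(match(TRIPLES, subject="search-service", predicate="runsOn"))
    # What would be affected by a change to the listing database?
    print(sorted(dependents_of(TRIPLES, "listing-db")))
```

A real deployment would presumably rely on an RDF store and a query language such as SPARQL rather than list scans, but the shape of the question is the same: state each relationship as a triple, then ask which resources are connected to the one being changed.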
EBay also is working with others to create standard ways to describe how software and devices in a network relate to one another, known as modeling. It monitors many of the emerging modeling standards groups, but the company chairs the Open Grid Forum's Reference Model working group, because, according to Strong, the OGF is "the only place specifically focused on large distributed systems." Strong also acts as chair of the OGF itself.
The OGF's Reference Model group's focus is to develop a common modeling language to unify other standards, such as the Information Technology Infrastructure Library (ITIL) and those from the Distributed Management Task Force (DMTF). "No one tool can manage the modern data center, so interoperability is absolutely critical," Strong said. The work eBay has done with the OGF has informed its own ontology, which could provide a starting point for implementing future technologies that take advantage of these emerging standards to simplify distributed management.
With these and other technologies in place, eBay is already able to automatically provision and monitor its systems. The company reprovisions its entire auction platform, including more than 16,000 application instances on more than 8,000 systems, every two weeks. Eventually, however, the company would like to be able to dynamically shift system resources to meet real business requirements.
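As a rough illustration of what shifting resources to meet business requirements could look like, the sketch below splits a fixed pool of systems across workflow steps in proportion to how far each step's observed response time sits from its target. The step names, timings, and the proportional rule itself are assumptions made for this example; the article does not describe the algorithms eBay has in mind.

```python
def apportion(total_systems, steps):
    """Split total_systems across steps, weighted by observed/target time.

    steps maps a step name to a dict with "observed_ms" and "target_ms".
    Steps running slower than their target get a larger share. Rounding
    uses the largest-remainder method so the shares sum to total_systems.
    Assumes at least one step and positive timings.
    """
    weights = {
        name: s["observed_ms"] / s["target_ms"] for name, s in steps.items()
    }
    total_weight = sum(weights.values())
    exact = {n: total_systems * w / total_weight for n, w in weights.items()}

    # Start from the rounded-down shares, then hand out what is left
    # to the steps with the largest fractional remainders.
    alloc = {n: int(x) for n, x in exact.items()}
    leftover = total_systems - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


# Hypothetical workflow steps and timings, for illustration only.
EXAMPLE_STEPS = {
    "search": {"observed_ms": 300, "target_ms": 200},    # 1.5x over target
    "bid": {"observed_ms": 100, "target_ms": 100},       # on target
    "checkout": {"observed_ms": 250, "target_ms": 500},  # under target
}

if __name__ == "__main__":
    print(apportion(100, EXAMPLE_STEPS))
```

A production system would also have to account for load, the cost of moving work between systems, and minimum capacity per service; the point here is only that a service-level target can be turned into an allocation rule.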
"In an ideal world," Strong said, "the goal would be for eBay's auction site to be able to say, it should take users this long to do this part of the workflow and therefore we should apply some algorithms and automatically apportion systems."
Partially in order to make the systems based on these management technologies more powerful, eBay is also adopting SOA-style software development, componentizing applications into their building-block pieces. "The real art form is not just understanding the infrastructure, but understanding the application that runs on it," Strong said.
EBay already has taken plenty of details into consideration to keep performance up. For example, its six data centers are all located in the western United States. With so much data, all of them have to be active, and locating them far away from one another would lead to unacceptable latency, even though eBay also tries to minimize the number of "cross calls," or messages sent between databases.