Last September the Brookhaven National Laboratory discovered a way to expand its compute power for particle research without blowing up its research budget. As its needs outgrew its own facilities, it opted to use Amazon Spot Instances -- the virtual servers that customers can use for as long as their low bid isn't topped by someone else's.
It was a choice that seemed risky at the time. Scientists were lined up to run their research systems against mountains of data generated by the CERN Large Hadron Collider in Geneva, but neither Brookhaven nor participating university departments had enough compute capacity to satisfy their demands.
Furthermore, "science is highly competitive," observed Michael Ernst, lead computer scientist for Brookhaven's ATLAS team, which is tied into CERN, in an interview with InformationWeek.
Ernst and the ATLAS team decided to test Spot Instance use in the cloud over a five-day period last September. Such a move might not sound like rocket science to major enterprises already liberally tapping AWS virtual servers. But the needs of particle researchers are extremely large-scale, 1,500 of them were waiting in line, and Spot Instances have known drawbacks.
A given particle research system might need to run continuously for 24 hours. Just because the research team lined up the Spot Instances it needed at the outset didn't mean they'd still be available as the research ground into its 24th hour. Spot Instances are a bargain in the middle of the night, but workloads can be pushed onto higher-priced Spot Instances or even On-Demand instances once the business day begins.
"If a system has run 23 hours and 57 minutes, and the Spot Instance goes away, you lose everything," Ernst noted in an interview. That was one of the hazards of selecting what was, by definition, a temporary resource. Spot Instances are unused compute power in the Amazon cloud that is available at whatever price a customer cares to bid for them. They attract the low bidders and typically cost one-quarter to one-tenth of the AWS On-Demand class of servers, Ernst said.
But Brookhaven needed large numbers of them in one location to deal with the terabytes of data being generated by the Large Hadron Collider. For its first major test, Ernst sought the equivalent of 50,000 physical cores to power the Spot Instances needed. The rub was that 99% of them would need to remain available throughout the five-day test period.
Not all 50,000 would need to be continuously available. Ernst could afford to have 1% shifted to higher bidders at any one time by pre-arranging for jobs to fail over to other virtual servers. But if demand for Spot Instances surged during his trial, too many servers would be lost for many of the running computations to finish.
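The 1% margin described above can be put into numbers with a minimal sketch. The core count and the tolerance come from the article; the function names are hypothetical, and the check is a simplification that counts cores rather than modeling how jobs actually fail over.

```python
# Figures from the article: 50,000 cores sought, up to 1% may be lost at any one time.
TOTAL_CORES = 50_000
TOLERANCE_PERCENT = 1


def max_lost_cores(total_cores: int, tolerance_percent: int) -> int:
    """Largest number of cores that can be reclaimed while staying within tolerance."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return total_cores * tolerance_percent // 100


def within_tolerance(
    lost_cores: int,
    total_cores: int = TOTAL_CORES,
    tolerance_percent: int = TOLERANCE_PERCENT,
) -> bool:
    """True if the number of cores lost at one time is still absorbable."""
    return lost_cores <= max_lost_cores(total_cores, tolerance_percent)


if __name__ == "__main__":
    print(max_lost_cores(TOTAL_CORES, TOLERANCE_PERCENT))
```

Under these assumptions the run can absorb the loss of up to 500 cores at a time; anything beyond that cuts into running computations.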
"Nodes acquired on the Spot market can be terminated at any time, meaning applications need to tolerate disruptions," said Ernst. If the disruptions exceeded the ability of the applications to failover, there were going to be many disappointed researchers, he said.
As Brookhaven prepared its test run on Amazon, it was a rare event to have sufficient data from Hadron/ATLAS loaded into the cloud to host hundreds of research explorations at one time. It takes a trillion proton collisions in the collider to produce evidence of a single Higgs boson particle's decay. Nevertheless, understanding the Higgs boson -- the goal of many ATLAS research workloads -- promises to provide the next refinements in our understanding of the universe, possibly unlocking the secrets to gravity.
Brookhaven was able to load the data into Amazon over the Energy Sciences Network, which the US Department of Energy operates at 100 Gbps. Moving vast amounts of data -- 50 PB -- at the slower speeds available over the Internet would not have been tolerable to Amazon, Ernst said.
In some cases, the workloads use vast amounts of data to simulate what should happen in the proton collisions, then search through mountains of ATLAS detector data looking for evidence that the theories are correct. It's a compute-intensive task, Ernst explained.
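To see why link speed matters at this scale, here is a back-of-the-envelope sketch. It takes the article's two figures (50 PB, 100 Gbps) at face value and assumes decimal units, a fully saturated link, and no protocol overhead, so the result is an idealized lower bound. The 1 Gbps comparison link is a hypothetical stand-in for a slower Internet connection, not a figure from the article.

```python
PETABYTE = 10**15  # bytes, decimal units (assumption)
GBPS = 10**9       # bits per second


def transfer_days(data_bytes: int, link_bits_per_second: int) -> float:
    """Idealized transfer time in days at full line rate, ignoring overhead."""
    seconds = data_bytes * 8 / link_bits_per_second
    return seconds / 86_400  # seconds per day


# Figures from the article: 50 PB over a 100 Gbps link.
fast_link_days = transfer_days(50 * PETABYTE, 100 * GBPS)

# Hypothetical 1 Gbps link for comparison (not from the article).
slow_link_days = transfer_days(50 * PETABYTE, 1 * GBPS)

if __name__ == "__main__":
    print(f"100 Gbps: {fast_link_days:.1f} days")
    print(f"1 Gbps:   {slow_link_days / 365:.1f} years")
```

Even at the full 100 Gbps the transfer works out to roughly 46 days of continuous streaming; at the assumed 1 Gbps it would take 100 times longer.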
When everything was ready, Brookhaven launched the five-day Spot Instance run. "Less than 1% of the instances were terminated," Ernst said, leaving operations with a margin of safety. Afterward, Ernst's view of Spot Instances changed from that of a risky experiment to "an ideal resource for deploying our peak demand."
Instead of investing in new data center capacity, Brookhaven was able to gain capacity for its peak demand for $45,000 for the five-day run.
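For a rough sense of scale, the run's price can be converted to a per-core-hour rate. This sketch assumes all 50,000 cores were in use for the entire five days, which the article does not state; the implied On-Demand range simply applies Ernst's one-quarter to one-tenth ratio and is not a quoted AWS price.

```python
# Figures from the article.
RUN_COST_USD = 45_000
CORES = 50_000
RUN_HOURS = 5 * 24  # five-day run


def cost_per_core_hour(total_cost: float, cores: int, hours: int) -> float:
    """Average cost of one core for one hour, assuming full utilization."""
    return total_cost / (cores * hours)


spot_rate = cost_per_core_hour(RUN_COST_USD, CORES, RUN_HOURS)

# Ernst's ratio: Spot typically costs one-quarter to one-tenth of On-Demand.
implied_on_demand_low = spot_rate * 4
implied_on_demand_high = spot_rate * 10

if __name__ == "__main__":
    print(f"Spot: ${spot_rate:.4f} per core-hour")
    print(f"Implied On-Demand: ${implied_on_demand_low:.3f} to ${implied_on_demand_high:.3f}")
```

Under the full-utilization assumption, the run comes to three-quarters of a cent per core-hour.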
"AWS has superb availability," Ernst said. "It appears to have unlimited capacity at competitive prices."
Even if that was true last September, it's not guaranteed for all future large-scale users of Spot Instances. With AWS's rapid 71.7% revenue growth in 2015, compute capacity that's available now might not be in the future.
Nevertheless, Ernst is getting ready for a second experiment on Amazon this month, relying once again on Spot Instances. He's seeking to establish once and for all that the cloud can serve as "a practical, production-grade, 100,000-core compute platform for doing science." It will be conducted over Amazon's three major North American regions: US East in Northern Virginia, US West in Northern California, and US West in Oregon.
Brookhaven has conducted a smaller, 4,000-core, month-long experiment on Google Compute Engine, but hasn't done any yet on Microsoft Azure. Ernst doesn't rule out use of any cloud site in the future.