When HealthCare.gov choked on the volume of uninsured citizens who flocked to it on Oct. 1, it was following an all-too-common pattern that occurs on other websites with big ambitions, according to two men who have built their reputations on expertise in web scalability.
"We think it's important to understand, first and foremost, that to say HealthCare.gov is unique in its problems is simply incorrect," Martin Abbott and Michael Fisher wrote in an email leading up to an interview with InformationWeek. "Many successful product-focused companies have had similar crises, including difficulties in scaling their solution or providing high availability, but have overcome these challenges and ultimately delivered great products and services to their customers. We have witnessed and helped overcome scalability challenges that go as far back as eBay's outage of 1999."
The scalability crisis at eBay occurred shortly before Abbott arrived at the company, where he rose to serve as senior vice president of technology and CTO. Fisher is a former PayPal vice president of engineering and architecture. They worked together at the advertising technology startup Quigo before co-founding AKF Partners, a consulting firm focused on scalability issues. They are the co-authors of two books on the topic, The Art of Scalability and Scalability Rules: 50 Principles for Scaling Web Sites.
"We have worked with well over 300 companies, many of which have grown from startups to successful Fortune 1000 companies by experiencing similar problems and successfully overcoming them," they wrote. "These companies prove that technology turnarounds can in fact happen."
With the Obama administration declaring victory in its push to get the website working by the end of November for the majority of citizens who try to enroll, InformationWeek sought advice from a variety of leaders with relevant experience on how they would recover from an IT flop on the scale of HealthCare.gov. Although much improved, the federal website serving the 36 states that opted not to create their own health insurance portal still faces scalability issues. The Obama administration says it's now capable of handling 50,000 concurrent users but warns the site will periodically experience larger volumes, meaning not everyone will have a good experience with the site even after two months of retooling.
HealthCare.gov is more than an ordinary website in that it must make back-end connections to many other government and insurance company systems -- but even in that it is not so different from large financial services websites that have their own complex back-end connectivity requirements to other institutions, Abbott and Fisher say.
In a phone interview, Abbott and Fisher agreed that it might have been better for the Obama administration to rely on a team of Silicon Valley pros with experience building websites, rather than a bunch of government contractors -- except that was probably never in the cards. "I don't know where the government would have gone to get the right talent," Abbott said. "Companies that do this well aren't in the business of doing it for other companies." AKF Partners is the rare exception, and it's not geared to bidding on government contracts.
When the Obama administration announced a "tech surge" aimed at fixing the website's problems, there were probably a lot of technology professionals who immediately thought of Brooks's law, the principle taken from Fred Brooks's technology management classic, The Mythical Man-Month, that says "adding manpower to a late software project makes it later." That's a useful warning, but in this case it's irrelevant because the system was delivered on time -- it just didn't function correctly, Abbott said. "The team just needs to manage their way through it," he said, and having a large team could be helpful as long as resources are allocated properly.
The only way to tackle the issues with an underperforming site -- including fundamental issues of site architecture -- is to prioritize them and chip away at them with a disciplined system of issue tracking and management, Abbott and Fisher said.
Most organizations learn the lessons of scalability the hard way, when traffic to their sites grows unexpectedly quickly or just becomes more than they know how to deal with. "It's typical -- and very fixable," Fisher said.
Scalability problems of this sort are typically the product of a monolithic architecture, meaning that the software developers have centralized too much functionality in software modules and databases that then become overloaded and fail. One common mistake is to rely on the promises of database and middleware software vendors who might claim that their clustering or their NoSQL innovations can distribute workloads across multiple servers, thereby providing scalability in a nice, neat package. But these technologies, too, tend to be overwhelmed at very large scales.
What Abbott and Fisher recommend is a strategy of partitioning site functionality into "swim lanes" -- also referred to as "pods" in cloud computing lingo -- which are complete units of software functionality that are "fault isolated," meaning they function with minimal dependencies on each other and can fail without taking down the rest of the system. Very large websites use a variety of strategies for partitioning the workload on a website. For example, an online store that looks like one huge product catalog might actually have one subsystem managing hardware while another is responsible for apparel. Social media websites might assign blocks of users to a particular cluster of servers, rather than having every server responsible for modeling the relationships between all members.
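To make the user-block idea concrete, here is a minimal Python sketch of assigning each user to one cluster ("pod") with a stable hash, so that no single server has to know about every member. The function name, the pod count, and the choice of SHA-256 are assumptions made for illustration; the article does not describe how any particular site does this.

```python
import hashlib

POD_COUNT = 4  # hypothetical number of server clusters


def pod_for_user(user_id: str, pod_count: int = POD_COUNT) -> int:
    """Map a user ID to a pod index using a stable hash.

    The same user always lands on the same pod, so that pod can hold
    all of the user's data without consulting the others.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % pod_count


if __name__ == "__main__":
    for uid in ("alice", "bob", "carol"):
        print(uid, "-> pod", pod_for_user(uid))
```

A plain modulo like this reshuffles most users whenever the pod count changes; larger sites often keep an explicit lookup table or use consistent hashing instead, which this sketch leaves out.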
If Abbott and Fisher were in charge of retooling HealthCare.gov, one strategy they would consider is creating swim lanes for each state the federal website serves, rather than having one system to address the needs of 36 states. Since one of the first steps in the enrollment process is for people to say where they live, the core website would not have to serve up much beyond the home page; users would complete their enrollment on a cluster of servers devoted to their state and be redirected there on subsequent visits.
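A rough Python sketch of that per-state routing might look like the following. Everything named here is hypothetical: the `SwimLane` class, the `example.org` hostnames, the fallback pages, and the two sample state codes are illustrative assumptions, not details of the actual HealthCare.gov system.

```python
from dataclasses import dataclass


@dataclass
class SwimLane:
    """One fault-isolated cluster serving a single state (hypothetical)."""
    state: str
    host: str
    healthy: bool = True

    def enroll_url(self) -> str:
        return f"https://{self.host}/enroll"


# Sample lanes only; a real deployment would define one per served state.
LANES = {
    "TX": SwimLane("TX", "tx.lanes.example.org"),
    "FL": SwimLane("FL", "fl.lanes.example.org"),
}

NOT_SERVED_URL = "https://www.example.org/find-your-state-exchange"
RETRY_URL = "https://www.example.org/try-again-later"


def route(state: str, lanes: dict = LANES) -> str:
    """Return the URL the core site should redirect a visitor to."""
    lane = lanes.get(state.strip().upper())
    if lane is None:
        return NOT_SERVED_URL  # state is not handled by this site
    if not lane.healthy:
        return RETRY_URL       # only this state's visitors are affected
    return lane.enroll_url()


if __name__ == "__main__":
    LANES["FL"].healthy = False  # simulate an outage in one lane
    print(route("tx"))           # Texas visitors are unaffected
    print(route("fl"))           # Florida visitors see the retry page
```

The point of the design is the last two lines: marking one lane unhealthy changes the outcome only for visitors from that state.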
Another advantage of creating multiple semi-independent systems is that any new software code can be deployed first to a subset of the total user base. That way, any software errors that got past the initial quality assurance checks will be exposed to just a fraction of the total audience. This strategy is important because even with thorough simulation and testing, not every problem will be caught in advance; live production systems always behave a little differently in practice than in theory.
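As an illustration of that staged approach, the Python sketch below picks a small subset of lanes to receive a new release while the rest stay on the current one. The function names, lane names, version labels, and the 25 percent fraction are all assumptions chosen for the example rather than anything Abbott and Fisher prescribe.

```python
from __future__ import annotations


def plan_rollout(lanes: list[str], fraction: float) -> set[str]:
    """Choose which lanes get the new release first.

    Always selects at least one lane; sorting keeps the choice
    deterministic so repeated runs pick the same subset.
    """
    count = max(1, int(len(lanes) * fraction))
    return set(sorted(lanes)[:count])


def version_for_lane(lane: str, early: set[str],
                     stable: str, candidate: str) -> str:
    """Return the release a given lane should be running."""
    return candidate if lane in early else stable


if __name__ == "__main__":
    lanes = ["lane-a", "lane-b", "lane-c", "lane-d"]  # hypothetical names
    early = plan_rollout(lanes, fraction=0.25)
    for lane in lanes:
        print(lane, version_for_lane(lane, early, "v1", "v2"))
```

If the new release misbehaves in the early lanes, only those lanes need to be rolled back, and the remaining users never see the defect.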
The need to learn from how a live production system actually behaves is one reason Abbott and Fisher disagree with the recommendation made by computer security experts, in both interviews with InformationWeek and congressional testimony, that HealthCare.gov be shut down to avoid risk of a breach while repairs were underway.
"I don't know why those experts say that," Abbott said. "The way you fix these things is you make a change and identify whether that works." Although it might be easy to criticize the Obama administration for being unwilling to take the site offline for political reasons, no business in the same situation would stop doing business while trying to retool, he noted. Target.com would never shut down for days or weeks and let all its customers go to Wal-Mart, he said. While working on repairs, you just have to keep the current site working as best you can.
Part of the problem with the initial launch was that it was a "big bang," an attempt to implement a large, complex system all at once and at scale. "The problem is big bangs cause big bangs," as in big problems, Fisher said. "Going live all at once introduces a lot more risk than small iterative changes [made] day-by-day to the system." Taking the site offline and retooling it according to hypothetical ideas about what would make it work better would just introduce a second big bang when the site was relaunched, he said.
"That's another example of why they shouldn't take this down, because they should be learning from their users," Fisher said. "If you take it down and reimplement it with another big bang, you don't get those [lessons]."