How We'd Fix HealthCare.gov: Scalability Experts Speak - InformationWeek

Martin Abbott and Michael Fisher, consultants on building large-scale websites, say one answer to retooling HealthCare.gov is creating "swim lanes" for each state the federal website serves.

When HealthCare.gov choked on the volume of uninsured citizens who flocked to it on Oct. 1, it was following an all-too-common pattern that occurs on other websites with big ambitions, according to two men who have built their reputations on expertise in web scalability.

"We think it's important to understand, first and foremost, that to say is unique in its problems is simply incorrect," Martin Abbott and Michael Fisher wrote in an email leading up to an interview with InformationWeek. "Many successful product-focused companies have had similar crises, including difficulties in scaling their solution or providing high availability, but have overcome these challenges and ultimately delivered great products and services to their customers. We have witnessed and helped overcome scalability challenges that go as far back as eBay's outage of 1999."

The scalability crisis at eBay occurred shortly before Abbott arrived at the company, where he rose to serve as senior vice president of technology and CTO. Fisher is a former PayPal vice president of engineering and architecture. They worked together at the advertising technology startup Quigo before co-founding AKF Partners, a consulting firm focused on scalability issues. They are the co-authors of two books on the topic, The Art of Scalability and Scalability Rules: 50 Principles for Scaling Web Sites.

"We have worked with well over 300 companies, many of which have grown from startups to successful Fortune 1000 companies by experiencing similar problems and successfully overcoming them," they wrote. "These companies prove that technology turnarounds can in fact happen."

With the Obama administration declaring victory in its push to get the website working by the end of November for the majority of citizens who try to enroll, InformationWeek sought advice from a variety of leaders with relevant experience on how they would recover from an IT flop on the scale of HealthCare.gov.

Although much improved, the federal website serving the 36 states that opted not to create their own health insurance portal still faces scalability issues. The Obama administration says it's now capable of handling 50,000 concurrent users but warns the site will periodically experience larger volumes, meaning not everyone will have a good experience with the site even after two months of retooling.

HealthCare.gov is more than an ordinary website in that it must make back-end connections to many other government and insurance company systems -- but even in that it is not so different from large financial services websites that have their own complex back-end connectivity requirements to other institutions, Abbott and Fisher say.

In a phone interview, Abbott and Fisher agreed that it might have been better for the Obama administration to rely on a team of Silicon Valley pros with experience building websites, rather than a bunch of government contractors -- except that was probably never in the cards. "I don't know where the government would have gone to get the right talent," Abbott said. "Companies that do this well aren't in the business of doing it for other companies." AKF Partners is the rare exception, and it's not geared to bidding on government contracts.

When the Obama administration announced a "tech surge" aimed at fixing the website's problems, there were probably a lot of technology professionals who immediately thought of Brooks' law, the principle taken from Fred Brooks' technology management classic, The Mythical Man-Month, that says "adding manpower to a late software project makes it later." That's a useful warning, but in this case it's irrelevant because the system was delivered on time -- it just didn't function correctly, Abbott said. "The team just needs to manage their way through it," he said, and having a large team could be helpful as long as resources are allocated properly.

The only way to tackle the issues with an underperforming site -- including fundamental issues of site architecture -- is to prioritize them and chip away at them with a disciplined system of issue tracking and management, Abbott and Fisher said.

Most organizations learn the lessons of scalability the hard way, when traffic to their sites grows unexpectedly quickly or just becomes more than they know how to deal with. "It's typical -- and very fixable," Fisher said.

Scalability problems of this sort are typically symptoms of a monolithic architecture, meaning that the software developers have centralized too much functionality in software modules and databases that then become overloaded and fail. One common mistake is to rely on the promises of database and middleware software vendors who might claim that their clustering or their NoSQL innovations can distribute workloads across multiple servers, thereby providing scalability in a nice, neat package. But these technologies, too, tend to be overwhelmed at very large scales.

What Abbott and Fisher recommend is a strategy of partitioning site functionality into "swim lanes" -- also referred to as "pods" in cloud computing lingo -- which are complete units of software functionality that are "fault isolated," meaning they function with minimal dependencies on each other and can fail without taking down the rest of the system. Very large websites use a variety of strategies for partitioning the workload on a website. For example, an online store that looks like one huge product catalog might actually have one subsystem managing hardware while another is responsible for apparel. Social media websites might assign blocks of users to a particular cluster of servers, rather than having every server responsible for modeling the relationships between all members.
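The idea of fault-isolated lanes that each own a block of users can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how any site named here is built: the lane count, the `Lane` class, and the `lane_for_user` function are hypothetical names invented for the example, and a stable CRC32 hash stands in for whatever assignment rule a real site would use.

```python
import zlib

NUM_LANES = 4  # hypothetical lane count, chosen only for illustration


def lane_for_user(user_id: str, num_lanes: int = NUM_LANES) -> int:
    """Map a user ID to a lane index with a stable hash (CRC32),
    so the same user always lands in the same lane."""
    return zlib.crc32(user_id.encode("utf-8")) % num_lanes


class Lane:
    """One swim lane: its own health flag and its own data store."""

    def __init__(self, lane_id: int) -> None:
        self.lane_id = lane_id
        self.healthy = True
        self.profiles = {}  # data owned by this lane only, never shared

    def save_profile(self, user_id: str, profile: dict) -> None:
        if not self.healthy:
            raise RuntimeError(f"lane {self.lane_id} is down")
        self.profiles[user_id] = profile


lanes = [Lane(i) for i in range(NUM_LANES)]


def save_profile(user_id: str, profile: dict) -> None:
    """Route the write to the one lane that owns this user."""
    lanes[lane_for_user(user_id)].save_profile(user_id, profile)
```

Because each lane keeps its own state, marking one lane unhealthy causes errors only for the users assigned to it; writes for users in the other lanes continue to succeed.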

If Abbott and Fisher were in charge of retooling HealthCare.gov, one strategy they would consider is creating swim lanes for each state the federal website serves, rather than having one system to address the needs of 36 states. Since one of the first steps in the enrollment process is for people to say where they live, the core website would not have to serve up much beyond the home page; users would complete their enrollment on a cluster of servers devoted to their state and be redirected there on subsequent visits.
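A per-state routing rule of this kind might look roughly like the following Python sketch. Everything in it is an assumption made for illustration: the `.example` host names, the `route` and `lane_url` functions, and the two-state `SERVED_STATES` set (a stand-in for the 36 states) are hypothetical, and the `remembered_state` argument stands in for whatever a real site would store in a cookie or account record.

```python
from typing import Optional

HOME_URL = "https://www.enroll.example/"  # hypothetical core site

# Illustrative subset only; the article says the federal site served 36 states.
SERVED_STATES = frozenset({"TX", "FL"})


def lane_url(state_code: str) -> Optional[str]:
    """Return the URL of the state's lane, or None if the state is not served."""
    code = state_code.strip().upper()
    if code not in SERVED_STATES:
        return None
    return f"https://{code.lower()}.enroll.example/"


def route(selected_state: Optional[str],
          remembered_state: Optional[str] = None) -> str:
    """Decide where to send a visitor.

    A remembered state (from an earlier visit) wins, so returning users go
    straight to their lane; otherwise the state picked on the home page is used.
    """
    state = remembered_state or selected_state
    if state is None:
        return HOME_URL  # first visit, no state chosen yet
    return lane_url(state) or HOME_URL


print(route(None))        # first-time visitor stays on the home page
print(route("tx"))        # visitor who picked Texas goes to the TX lane
print(route(None, "FL"))  # returning visitor is sent back to the FL lane
```

The core site in this sketch does nothing except choose a destination, which is the point of the design: the heavy enrollment work happens inside a lane, and the shared entry point stays small.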

Another advantage of creating multiple semi-independent systems is that any new software code can be deployed first to a subset of the total user base. That way, any software errors that got past the initial quality assurance checks will be exposed to just a fraction of the total audience. This strategy is important because even with thorough simulation and testing, not every problem will be caught in advance; live production systems always behave a little differently in practice than in theory.
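Deploying to a subset first can be expressed as a simple staged rollout loop. The Python below is a sketch under stated assumptions rather than a real deployment tool: `staged_rollout`, the `deploy` and `error_rate` callbacks, the lane names, and the 1% error threshold are all hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def staged_rollout(
    lanes: List[str],
    deploy: Callable[[str], None],
    error_rate: Callable[[str], float],
    max_error_rate: float = 0.01,  # hypothetical threshold
) -> Tuple[List[str], Optional[str]]:
    """Deploy new code one lane at a time.

    After each deployment, check that lane's observed error rate. If it
    exceeds the threshold, stop so the remaining lanes keep the old code.
    Returns (lanes that passed the check, lane that halted the rollout or None).
    """
    passed: List[str] = []
    for lane in lanes:
        deploy(lane)
        if error_rate(lane) > max_error_rate:
            return passed, lane
        passed.append(lane)
    return passed, None


# Usage with stand-in callbacks: the second lane shows a high error rate.
deployed: List[str] = []
observed = {"lane-a": 0.001, "lane-b": 0.2, "lane-c": 0.0}
passed, halted = staged_rollout(
    ["lane-a", "lane-b", "lane-c"], deployed.append, observed.__getitem__
)
print(passed, halted, deployed)
```

In this run the rollout stops at the second lane, so the third lane never receives the faulty release and only the users in one lane were exposed to it.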

That's one reason Abbott and Fisher disagree with the recommendation made by computer security experts, both in interviews with InformationWeek and in congressional testimony, that HealthCare.gov be shut down to avoid risk of a breach while repairs were underway.

"I don't know why those experts say that," Abbott said. "The way you fix these things is you make a change and identify whether that works." Although it might be easy to criticize the Obama administration for being unwilling to take the site offline for political reasons, no business in the same situation would stop doing business while trying to retool, he noted. would never shut down for days or weeks and let all its customers go to Wal-Mart, he said. While working on repairs, you just have to keep the current site working as best you can.

Part of the problem with the initial launch was that it was a "big bang," an attempt to implement a large, complex system all at once and at scale. "The problem is big bangs cause big bangs," as in big problems, Fisher said. "Going live all at once introduces a lot more risk than small iterative changes [made] day-by-day to the system." Taking the site offline and retooling it according to hypothetical ideas about what would make it work better would just introduce a second big bang when the site was relaunched, he said.

"That's another example of why they shouldn't take this down, because they should be learning from their users," Fisher said. "If you take it down and reimplement it with another big bang, you don't get those [lessons]."

Follow David F. Carr on Twitter @davidfcarr or Google+. He is the author of Social Collaboration For Dummies (October 2013).

David F. Carr
User Rank: Author
12/3/2013 | 7:16:13 PM
Re: Take it offline?
To be clear, I should say I approached them as scalability experts to talk about scalability -- which was clearly one of the big issues with HealthCare.gov, but not the only one. They didn't claim any insider knowledge of the bugs to be fixed with this system, but I still think their analysis of how this falls into a common pattern is useful.

It's easier to dream big dreams about creating a world-beating website than it is to be ready when the world really does come beating down your door.
User Rank: Author
12/3/2013 | 6:45:14 PM
Program Management
Some good points here about scaling the website. But it's worth noting that streamlining web operations is only part of the larger picture, especially in working with CMS, IRS, Social Security and other agency data.

Richard Spires, former DHS CIO, makes the additional point elsewhere in this series about the importance for organizations to understand the need for strong program management capabilities to succeed with big IT projects.   Read more at:
User Rank: Apprentice
12/3/2013 | 5:29:28 PM
Re: Take it offline?
As a Project Management professional I do not know the technical fix but I can tell you when project failure occurred - it was right in the startup phase. Inadequate definition of the product followed by lack of effective change control and quality assurance plan.
User Rank: Strategist
12/3/2013 | 5:15:49 PM
The Way Forward
Those pretending that the problems being encountered are glitches, kinks, or simply bugs to be fixed and hoping that the problems will simply dissipate with the relaunch need to cease their wishful thinking. It is time to insist on the professional management steps needed to get to the bottom of this and right the ship.  

The following ten steps are called for immediately:

1. Use of ACA website by customers seeking healthcare insurance should be terminated. 

2. Existing customer profile data, personal data, and decision data should be quarantined. 

3. The ACA website requirements foundation and technical architecture should be reviewed, assessed, and audited by a team of experienced industry experts.

4. The management, engineering, and process practices employed on the project should be reviewed, assessed, and audited by a team of experienced industry experts.

5. The accumulated Technical Debt on the project should be reviewed, assessed, and audited by a team of experienced industry experts.

6. A professional team should be charged with assembling factual analytics associated with assurance metrics, compliance metrics, noncompliance metrics, product engineering metrics, project management metrics, and process metrics. 

7. A full scale program review should be conducted to assess requirements, architecture, practices, and metrics. The review team should record its findings and consequences and provide recommendations and rationale for carrying the project forward.

8. A professional team should be charged with assessing Cyber Security vulnerabilities in accordance with the NIST Cyber Framework.

9. A professional team should be charged with assessing privacy and civil liberties vulnerabilities in accordance with the NIST Cyber Framework.

10. The completion date for these activities should be established as January 15, 2014.
User Rank: Apprentice
12/3/2013 | 5:12:18 PM
Re: Take it offline?
Abbott and Fisher, being scalability experts, make it sound like scalability is the only Healthcare.Gov problem. I am not sure that deploying poor quality code on different servers would have made the difference they seem to suggest.

The big problems with Healthcare.Gov are yet to come. What we now have is a patched up barely functioning product. We will be spending more than 80% of the O&M spend (My estimate $60 million per year) for fixing bugs. And a lot of the bugs will be cybersecurity vulnerabilities leading to cyber attacks.
Shane M. O'Neill
User Rank: Author
12/3/2013 | 4:02:00 PM
How to get back on track?
"The only way to tackle the issues with an underperforming site -- including fundamental issues of site architecture -- is to prioritize them and chip away at them with a disciplined system of issue tracking and management."

This seems to be the most logical approach right now for HealthCare.gov, rather than adding more manpower (more chefs in the kitchen). Have any other website architects out there been in a similarly hellish position as the HealthCare.gov team? How did you turn it around?
David F. Carr
User Rank: Author
12/3/2013 | 11:25:44 AM
Take it offline?
One big point of debate seems to be whether it would have made sense to take the site offline while repairs were under way. Abbott and Fisher make a good case about why it was better to keep it operating. On the other hand, some of the deeper problems with the system beyond scalability and availability may have long-term consequences. The Washington Post is reporting that as many as one third of the enrollments processed contained data errors, meaning that people who enrolled for insurance may not be signed up for the plan they expected at the price they expected.