Report Blames Northrop Grumman For Virginia Outages
A review of the week-long statewide network outage said it was caused by a combination of human error, faulty hardware, and a failure to follow best practices.
Faulty hardware and Northrop Grumman's failure to follow best practices were responsible for a statewide IT system failure in Virginia last summer that affected online services and network operations for a week, according to a report on the incident released by Virginia Gov. Robert McDonnell.
The independent review -- prepared by Agilysys, an IT services firm -- found that the combination of the failure of a data storage system and then human error during an attempt to replace one of the failed memory boards caused the unprecedented outage, which affected more than 20 government agencies.
The report also faulted Northrop Grumman, which has a $2.3 billion contract to work with the Virginia Information Technologies Agency (VITA) to look after communications and computer services for the state, for not adhering to industry best practices following the incident. VITA was created in 2003 to maintain and modernize the state's IT operations.
The commonwealth's trouble began Aug. 25 when two memory boards that were meant to back each other up both failed. Analysis by EMC, the manufacturer of the boards, said a so-called "electrical over stress condition at the component level" caused the dual failure, which resulted in a loss of data.
Following that, "human error during the memory board replacement process resulted in the incurred extended outage," according to the report.
The outage also was exacerbated by a gap in the Information Technology Service Continuity Management (ITSCM) processes, which resulted in the spread of corrupt data. Lack of a continuity procedure also was one of the reasons it took 18 hours to get the system back up and running, according to the report. Full service to all affected operations and agencies did not return until about a week later.
Specifically, parties responsible for responding to the incident did not suspend what's called Symmetrix Remote Data Facility (SRDF) before the memory board replacement process, which "negatively impacted the data recovery procedures" and allowed corrupt data to be replicated.
SRDF is a process used to replicate data from a local storage array to a remote storage array. The report cites Northrop Grumman as the responsible party for managing risk during the SRDF process.
Northrop Grumman spokeswoman Christy Whitman said the company has been "working hard" since the outage to "make the appropriate improvements to help avoid or mitigate similar disruptions."
The company also is ready to talk with Virginia officials about how best to implement report recommendations, she added.
It's still not known how much the outage will cost the commonwealth, or whether and how Northrop's relationship with VITA will be affected. State officials long have criticized the partnership, which has had its troubles over the years.