available by breaking it into pieces and encoding each piece with redundant metadata. That redundancy increases availability, Leung said, and "the metadata is tiny," so it doesn't use up much storage space. "It takes up far less than 1% of the capacity."
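To make the idea concrete, the sketch below shows a toy single-parity erasure code in Java: data is split into pieces and one extra XOR piece is added, so any single lost piece can be rebuilt from the survivors. The lab's actual scheme is not detailed here, and production systems use stronger codes such as Reed-Solomon that tolerate multiple losses, so treat this purely as an illustration of the principle; the small bookkeeping that tracks where pieces live is the kind of metadata Leung describes.

    import java.util.Arrays;

    /** Toy illustration of erasure coding: split data into k pieces and add
     *  one XOR parity piece, so any single lost piece can be reconstructed.
     *  Real systems use stronger codes (e.g., Reed-Solomon) that survive
     *  multiple simultaneous losses. */
    public class ToyErasureCode {

        /** Split data into k equal-size pieces and append one parity piece. */
        static byte[][] encode(byte[] data, int k) {
            int pieceLen = (data.length + k - 1) / k;        // round up
            byte[][] pieces = new byte[k + 1][pieceLen];     // k data pieces + 1 parity
            for (int i = 0; i < data.length; i++) {
                pieces[i / pieceLen][i % pieceLen] = data[i];
            }
            for (int p = 0; p < k; p++) {                    // parity = XOR of all data pieces
                for (int j = 0; j < pieceLen; j++) {
                    pieces[k][j] ^= pieces[p][j];
                }
            }
            return pieces;
        }

        /** Rebuild one missing piece (index 'lost') by XOR-ing the survivors. */
        static byte[] rebuild(byte[][] pieces, int lost) {
            byte[] recovered = new byte[pieces[0].length];
            for (int p = 0; p < pieces.length; p++) {
                if (p == lost) continue;
                for (int j = 0; j < recovered.length; j++) {
                    recovered[j] ^= pieces[p][j];
                }
            }
            return recovered;
        }

        public static void main(String[] args) {
            byte[][] pieces = encode("simulation checkpoint data".getBytes(), 4);
            byte[] original = Arrays.copyOf(pieces[2], pieces[2].length);
            byte[] restored = rebuild(pieces, 2);            // pretend piece 2 was lost
            System.out.println("recovered ok: " + Arrays.equals(original, restored));
        }
    }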
Erasure coding lets the lab take advantage of commercial technology developed for cloud computing that DOE didn't have to develop itself, Grider said.
The lab will use disks as the storage medium, which can keep a significant portion of a project's data online for months while work is being done with it. Flash memory is fast and well suited to this kind of storage, but it is expensive. "We can't afford to have a lot of that," Grider said. Tape is cheap, but a parallel tape system is infeasible at petabyte scales; it can take a hundred tape drives days to write out a petabyte. Disks are not quite as cheap as tape, but they are a good compromise between cost and speed.
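A rough back-of-the-envelope calculation shows why a parallel tape write stretches into days. The figures below are illustrative assumptions, not numbers from the lab: real sustained throughput per drive varies widely and often falls well under a drive's native specification.

    /** Back-of-the-envelope timing for streaming one petabyte to a tape farm.
     *  The per-drive rate is an assumed, illustrative figure; the article
     *  does not give the lab's actual numbers. */
    public class TapeTime {
        public static void main(String[] args) {
            double petabyte = 1e15;          // bytes
            int drives = 100;                // parallel tape drives
            double perDriveBps = 50e6;       // assumed effective ~50 MB/s per drive
            double seconds = petabyte / (drives * perDriveBps);
            System.out.printf("~%.1f days to stream 1 PB%n", seconds / 86400.0);
        }
    }

Under those assumptions the answer works out to a little over two days, and any slowdown in effective drive throughput pushes it out further.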
"That is not to say that tape will leave our environment entirely," Grider said. It still is a good medium for long-term storage when data is not being actively used, he noted.
We've only begun
A federal initiative to make big data available and put it to use has focused agencies' attention on the topic, but to date the effort has produced more interest than results, said former NGA exec Wallach.
"It's relatively new," he said. Best practices and tools still are being developed. "There is progress and people are working hard at it," but it is too early to see many mature products.
If anyone has had practical experience in using big data it probably is the intelligence community, which has for years been looking for correlations and anomalies in large amounts of data. As the volume of data grows, these tasks become like trying to spot a particular tree in a forest or a needle in a haystack. The National Security Agency has for some time needed tools to spot trees and needles, said Jasper Graham, former NSA technical director.
"You don't want to wait for a massive event to occur" before looking for the indicators, Graham said.
When looking at a massive data stream, such as the .mil networks that NSA is responsible for, the right tools can accumulate small occurrences to the point that they become significant long before they could be spotted manually. "When you have big data, the small trickle starts to look like a big hole in the dam," Graham said. "It really starts to stick out."
This allows network defenders to move beyond traditional security techniques, such as signature-based malware detection, and look for activity that indicates the initial stages of an attack or intrusion. Doing this requires first normalizing the data so that analysts can spot outliers that could indicate an adversary conducting reconnaissance and preparing the ground for malicious activity. Agencies and industry are cooperating to develop the algorithms that allow critical indicators to "bubble up" and become visible within big data, Graham said. "Big data tools are getting much better."
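The article does not describe any agency's actual algorithms, but a bare-bones version of "normalize, then flag outliers" might look like the Java sketch below. It converts a hypothetical per-host metric, outbound DNS queries per hour, into z-scores against a baseline and surfaces readings that deviate sharply; the numbers are invented for illustration.

    /** Generic sketch of "normalize, then flag outliers": score new readings
     *  against a learned baseline and surface large deviations. A textbook
     *  technique, not a description of any agency's production tooling. */
    public class OutlierSketch {
        public static void main(String[] args) {
            // Baseline window: recent hourly outbound-DNS counts for one host (illustrative).
            double[] baseline = {96, 104, 110, 95, 102, 99, 107, 101, 93, 108};
            double mean = 0;
            for (double v : baseline) mean += v;
            mean /= baseline.length;
            double var = 0;
            for (double v : baseline) var += (v - mean) * (v - mean);
            double std = Math.sqrt(var / baseline.length);

            // Today's observations: normalize each to a z-score and flag big deviations.
            double[] today = {98, 103, 480, 101};
            for (double v : today) {
                double z = (v - mean) / std;
                if (Math.abs(z) > 3.0) {
                    System.out.printf("reading %.0f looks anomalous (z = %.1f)%n", v, z);
                }
            }
        }
    }

On its own one noisy hour means little; aggregated across an enterprise and over time, the same small signals are what turn the "trickle" Graham describes into something that sticks out.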
One example of this cooperative effort is the Accumulo data storage and retrieval system, originally developed by the NSA in 2008 and based on Google's Bigtable design. NSA released it as an open-source tool in 2011, and it is now managed by the Apache Software Foundation. Apache Accumulo is built on top of Apache Hadoop, ZooKeeper, and Thrift.
Accumulo enables the secure storage and retrieval of large amounts of data across many servers, with access controls based on the classification level of each piece of information. It is a type of NoSQL database that operates at big data scales in a distributed environment. What distinguishes Accumulo from other big data storage tools is the ability to tag each data object to control access, so that data with different classification levels can be stored in the same table and still be logically segregated. This cell-level access control is finer-grained than what most databases offer.
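A minimal sketch of what that cell-level tagging looks like with the classic (1.x-era) Accumulo Java client API is shown below. The instance, table, user, and visibility labels are placeholders, the table is assumed to already exist, and real code would add error handling; the point is simply that each cell carries its own visibility expression and a scan returns only the cells the reader's authorizations satisfy.

    import java.util.Map.Entry;
    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;
    import org.apache.accumulo.core.security.ColumnVisibility;
    import org.apache.hadoop.io.Text;

    public class VisibilityDemo {
        public static void main(String[] args) throws Exception {
            // Placeholder instance, table, and credentials -- not from the article.
            Connector conn = new ZooKeeperInstance("demo", "zk1:2181")
                    .getConnector("analyst", new PasswordToken("secret"));

            // Write two cells in the SAME row and table, each tagged with its own
            // visibility expression (the cell-level access control described above).
            BatchWriter writer = conn.createBatchWriter("events", new BatchWriterConfig());
            Mutation m = new Mutation(new Text("event-42"));
            m.put(new Text("summary"), new Text("title"),
                  new ColumnVisibility("UNCLASS"), new Value("port scan observed".getBytes()));
            m.put(new Text("detail"), new Text("source"),
                  new ColumnVisibility("SECRET&NOFORN"), new Value("sensitive attribution".getBytes()));
            writer.addMutation(m);
            writer.close();

            // A scan returns only cells whose visibility expression is satisfied
            // by the authorizations presented; here, only the UNCLASS cell.
            Scanner scan = conn.createScanner("events", new Authorizations("UNCLASS"));
            for (Entry<Key, Value> e : scan) {
                System.out.println(e.getKey().getRow() + " -> " + e.getValue());
            }
        }
    }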
This collaboration between government and the private sector is making the use of big data practical in other arenas, such as medical research and fraud detection in government benefit programs.
But we have yet to see the full potential of it, Graham said. "I think it's getting there, but there is huge room for improvement," he said. "We're at the tip of the iceberg."