Government Toils To Create Big Data Infrastructure
Government is slowly puzzling out how to extract knowledge from more data than it has ever had available to it.
Climate researchers at the Energy Department's Lawrence Berkeley National Laboratory are using the power of the Edison and Hopper supercomputers to run global weather simulations at levels of granularity never before possible.
"We can do calculations I have been waiting my entire career to do," said Michael Wehner, senior staff scientist in the lab's Computational Research Division. "This brings a leap forward. The simulations are much more realistic; the storms are much more interesting."
The supercomputer simulations have produced more than 400 terabytes of modeling data in the last two years, and this creates a challenge of its own, Wehner said. "How to extract something meaningful from this data?"
There are two big challenges to making use of big data. The first is that the ability of supercomputers to create these large datasets has outstripped their ability to use them. It is a problem of input/output, said Wehner. High-performance computers are good at generating data, but not as good at writing it out, and they are not designed to read it back in for analysis. "The input is killing us," Wehner said.
"This is not necessarily a new problem," said Steve Wallach, former technical executive at the National Geospatial-Intelligence Agency (NGA). As long as 30 years ago computers were producing more data than could be practically used, and the ability to produce it has outpaced our ability to manage it since then, he noted. "We are moving into a new area," said Wallach.
The other major challenge is making the data available to other researchers who can add value to it. "I spend a lot of the taxpayers' money producing this data with the big machines," Wehner said. "We bend over backwards trying to get this out to other collaborators."
Making big data accessible requires more than large volumes of storage and high-bandwidth network links. It requires metadata so that data can be searched and located, and it requires new storage and retrieval techniques so that data spread across distributed systems can be found quickly and delivered efficiently to users.
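As a rough illustration of the role metadata plays, the short Python sketch below shows a hypothetical catalog record and a naive keyword search over it; every field name and URL is invented for the example rather than drawn from any agency's actual schema.

```python
# Hypothetical metadata record for a large simulation dataset. The fields and
# URLs are illustrative only; real catalogs carry far richer information.
dataset_record = {
    "id": "climate-sim-0042",
    "title": "Global 25 km atmospheric simulation, 1979-2005",
    "variables": ["surface_temperature", "precipitation_rate"],
    "size_bytes": 12_000_000_000_000,   # roughly 12 TB
    "locations": [                      # replicas spread across distributed storage
        "https://data.example.gov/archive/climate-sim-0042",
        "https://mirror.example.edu/climate/0042",
    ],
}

def matches(record, keyword):
    """Naive keyword search over a record's title and variable names."""
    keyword = keyword.lower()
    return keyword in record["title"].lower() or any(
        keyword in variable for variable in record["variables"]
    )

print(matches(dataset_record, "precipitation"))   # True
```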
In many cases, "it's pretty much roll your own" in developing tools for more efficient use of big data, Wehner said. Now, the experience of the government's early big data adopters -- especially the Energy Department's national labs and big players in the intelligence community, such as the NGA and NSA -- is trickling into the private sector, which has begun producing commercial tools for big data that can be used by government.
In the past, "industry learned how to do it by working with us," said Gary Grider, who leads the High Performance Computing division at Los Alamos National Lab. "That no longer is entirely true. Some commercial entities are catching up. We have some brethren, which is good for us."
Big data and big bucks
The federal government has been producing and using big data almost from its beginning. The first national census in 1790 gathered information on nearly 4 million people, which Commerce Department Under Secretary for Economic Affairs Mark Doms called "a huge dataset for its day, and not too shabby by today's standards, as well."
The scale has increased dramatically since then. The United States now has a population of more than 300 million with an annual economy of $17 trillion, and the government's principal statistical agencies spend about $3.7 billion a year gathering, processing, and disseminating data. A recent Commerce Department report estimates that this data contributes from $24 billion to $221 billion annually to private sector revenues.
But big data differs qualitatively as well as quantitatively from merely large amounts of data. It is characterized not only by its volume, but by its complexity, the fact that it consists of both structured and unstructured data, and that it often is distributed among a variety of storage facilities, sometimes located far apart. This makes it difficult to use traditional data processing on big data. It requires not only more computing power, but new ways to store, search, access, transport, analyze, and visualize it.
Brand Niemann, founder of the Federal Big Data Working Group and former senior enterprise architect and data scientist at the Environmental Protection Agency, advises agencies, "The best way to deal with big data is to start small." Learn how to use distributed structured and unstructured data, and then scale up, he says.
Given the potential value of the big data being assembled in government and the private sector, and the challenges of using it, the Obama administration in 2012 announced a Big Data Initiative, with more than $200 million committed from six agencies to foster research and development on how to better extract useful information from these large masses of data. Within 18 months of launching the initiative, the National Science Foundation announced about $150 million worth of grants for projects ranging from cancer genomics and human language processing to data storage. Also investing in big data programs were the Defense Advanced Research Projects Agency (DARPA), NASA, the U.S. Geological Survey, the National Institutes of Health, and the Energy Department.
Exciting times
The DOE projects are focusing on data management and indexing techniques for large and complex datasets. One effort is the Federated Earth System Grid (ESG), created to provide the climate research community -- both in and out of government -- with access to hundreds of petabytes of simulation data. It is a federated architecture with multiple portals, means of access, and delivery mechanisms. The framework has three tiers: Metadata services for search and discovery, data gateways that act as brokers handling data requests, and ESG nodes with the data holdings and metadata accessing services.
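ESG's actual interfaces are not reproduced here, but the three-tier flow can be sketched in a few lines of Python; the endpoint URLs, query parameters, and response format below are all stand-ins invented for illustration.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoints standing in for the three tiers: federated metadata
# (search) services, a gateway that brokers requests, and the data nodes that
# actually hold the files. All URLs are placeholders.
SEARCH_PORTALS = [
    "https://esg-portal-a.example.org/search",
    "https://esg-portal-b.example.org/search",
]

def federated_search(variable, experiment):
    """Query each portal's metadata service and merge the hits."""
    query = urllib.parse.urlencode({"variable": variable, "experiment": experiment})
    hits = []
    for portal in SEARCH_PORTALS:
        with urllib.request.urlopen(f"{portal}?{query}") as resp:
            hits.extend(resp.read().decode().splitlines())  # one dataset id per line (assumed)
    return hits

def resolve_download(dataset_id, gateway="https://esg-gateway.example.org/broker"):
    """Ask the gateway to broker the request and return a URL on some data node."""
    with urllib.request.urlopen(f"{gateway}?id={urllib.parse.quote(dataset_id)}") as resp:
        return resp.read().decode().strip()

# With real endpoints, a client would chain the tiers like this:
# for dataset_id in federated_search("tas", "historical"):
#     print(resolve_download(dataset_id))
```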
Another Energy Department effort is the Toolkit for Extreme Climate Analysis (TECA), for visualizing multi-terabyte climate simulations intended to further understanding of global climate change. The sheer size of the datasets being analyzed presents challenges, and those datasets are growing exponentially with improvements in the models and the speed of the computers running the simulations. Conventional visualization and analytical tools use a serial execution model, limiting their usefulness with very large datasets. TECA is a step toward a parallel model for analysis.
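TECA's own code is not shown here; the sketch below only illustrates, with invented data and a made-up threshold analysis, why a parallel execution model helps: each simulated time step can be analyzed independently, so the work can be spread across processes instead of running serially.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def count_extremes(time_slice, threshold=35.0):
    """Count grid cells exceeding a temperature threshold in one time step."""
    return int((time_slice > threshold).sum())

def analyze_parallel(data, workers=4):
    """Map the per-time-step analysis over all steps in parallel.

    `data` is a (time, lat, lon) array; each time step is independent, which
    is what lets the work scale out instead of running through a serial loop.
    """
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(count_extremes, data))

if __name__ == "__main__":
    # Invented stand-in for simulation output: one year on a 1-degree grid.
    fake_simulation = 20 + 20 * np.random.rand(365, 180, 360)
    print(sum(analyze_parallel(fake_simulation)))
```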
These efforts are advancing the specialized area of climate research, Wehner said. Today there are only five groups in the world running these types of simulations; in five years there will be 12. "It's an extremely exciting time for this kind of science," he said.
One of the primary responsibilities of the Energy Department labs is overseeing the nation's nuclear arsenal, which has relied on computer simulations since nuclear testing was halted. Grider, who "owns" the high-performance computing center at Los Alamos, has been working with big data since 1971, "building datasets that were far bigger than would fit in anybody's memory."
Over the years, the fidelity required to understand what is going on in nuclear weapons has grown to the point that new storage solutions are required, and the lab will be installing two to three petabytes of memory in the next year, part of a system that eventually will store from 200 to 500 petabytes.
But "it's bigger than just storage," Grider said. Dealing with these volumes of data requires not only capacity, but the bandwidth to access the data and tools to manage it.
Los Alamos will be using Scality's Ring software-defined storage system. The system provides massively scalable object storage that is hardware agnostic, so the lab can use any medium it wants in the system. That was one of the attractions of the Scality Ring system, said Leo Leung, Scality's head of corporate marketing. "They wanted to choose the hardware later. They wanted to separate that decision."
Scality's Ring software provides centralized management of distributed storage, without bottlenecks or a central point of failure. It uses erasure coding, a technique that is used in cloud storage to protect data and make it available by breaking it into pieces and encoding each piece with redundant metadata. It provides increased availability, said Leung. "The metadata is tiny," so it doesn't use up much storage space. "It takes up far less than 1% of the capacity."
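Scality's internals are not public here, and production erasure codes are more sophisticated, but the idea can be shown with a deliberately simplified Python sketch that uses a single XOR parity piece (real systems use Reed-Solomon-style codes that survive the loss of several pieces at once).

```python
from functools import reduce

def encode(data: bytes, k: int = 4):
    """Split data into k equal-size pieces and add one XOR parity piece."""
    data = data.ljust(-(-len(data) // k) * k, b"\0")            # pad to a multiple of k
    size = len(data) // k
    pieces = [data[i * size:(i + 1) * size] for i in range(k)]
    parity = bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*pieces))
    return pieces + [parity]

def reconstruct(pieces, missing_index):
    """Rebuild one missing piece by XOR-ing together all the surviving pieces."""
    survivors = [p for i, p in enumerate(pieces) if i != missing_index]
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*survivors))

chunks = encode(b"climate simulation output", k=4)
assert reconstruct(chunks, 2) == chunks[2]   # piece 2 is lost, yet fully recovered
```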
Erasure coding lets the lab take advantage of commercial technology developed for cloud computing that DOE didn't have to develop itself, Grider said.
The lab will use disk as its storage medium, keeping a significant portion of a project's data online for months while work is being done with it. Flash memory is fast and well suited to this type of storage, but it is expensive. "We can't afford to have a lot of that," Grider said. Tape is cheap, but a parallel tape system is infeasible at petabyte scales; it can take a hundred tape drives days to write out a petabyte. Disk is not quite as cheap as tape, but it is a good compromise between cost and speed.
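A back-of-the-envelope calculation makes the tape bottleneck concrete; the per-drive rate below is an assumed effective figure, not a vendor spec, since loads, seeks, and verification keep real-world throughput well under a drive's nominal speed.

```python
# Rough arithmetic only; the effective per-drive rate is an assumption.
petabyte = 1e15                       # bytes
drives = 100
effective_rate = 50e6                 # bytes/second per drive, assumed
seconds = petabyte / (drives * effective_rate)
print(f"{seconds / 86400:.1f} days")  # about 2.3 days for a single petabyte
```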
"That is not to say that tape will leave our environment entirely," Grider said. It still is a good medium for long-term storage when data is not being actively used, he noted.
We've only begun
With a federal initiative to make big data available and to make use of it, the topic is getting a lot of attention in agencies, but to date it has produced more interest than results, said former NGA exec Wallach.
"It's relatively new," he said. Best practices and tools still are being developed. "There is progress and people are working hard at it," but it is too early to see many mature products.
If anyone has had practical experience in using big data, it probably is the intelligence community, which has for years been looking for correlations and anomalies in large amounts of data. As the volume of data grows, these tasks become like trying to pick out a single tree in a forest or a needle in a haystack. The National Security Agency has for some time needed tools to spot those trees and needles, said Jasper Graham, former NSA technical director.
"You don't want to wait for a massive event to occur" before looking for the indicators, Graham said.
When looking at a massive data stream, such as the .mil networks that NSA is responsible for, the right tools can accumulate small occurrences to the point that they become significant long before they could be spotted manually. "When you have big data, the small trickle starts to look like a big hole in the dam," Graham said. "It really starts to stick out."
This allows network defenders to move beyond traditional security techniques, such as signature and malware detection, and look for activity that indicates the initial stages of an attack or intrusion. Doing this requires first normalizing the data so that analysts can spot outliers that could indicate an adversary doing reconnaissance and preparing the ground for malicious activity. Agencies and industry are cooperating to develop the algorithms that allow critical indicators to "bubble up" and become visible within big data, Graham said. "Big data tools are getting much better."
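As a toy illustration of the baselining idea Graham describes, the Python sketch below builds a baseline from presumed-normal history and flags hosts whose counts sit far above it; the hostnames, counts, and threshold are all invented, and real network-defense analytics are far more sophisticated.

```python
import statistics

def build_baseline(history):
    """Summarize presumed-normal past observations as a mean and a spread."""
    return statistics.mean(history), statistics.stdev(history) or 1.0

def flag_outliers(baseline, todays_counts, threshold=4.0):
    """Return hosts whose counts sit more than `threshold` deviations above baseline."""
    mean, spread = baseline
    return {host: round((count - mean) / spread, 1)
            for host, count in todays_counts.items()
            if (count - mean) / spread > threshold}

# Invented data: failed logins per host per day.
history = [3, 5, 4, 2, 6, 3, 4, 5, 2, 4]              # typical days across the fleet
today = {"hostA": 4, "hostB": 5, "hostC": 70}         # hostC's trickle has become a flood
print(flag_outliers(build_baseline(history), today))  # {'hostC': 50.3}
```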
One example of this cooperative effort is the Accumulo data storage and retrieval system, developed originally by the NSA in 2008 and based on the design of Google's BigTable. NSA released it as an open source tool in 2011, and it now is managed by the Apache Software Foundation. Apache Accumulo is built on top of Apache Hadoop, ZooKeeper, and Thrift.
Accumulo enables the secure storage and retrieval of large amounts of data across many servers, with access controls based on the classification level of each piece of information. It is a type of NoSQL database that operates at big data scales in a distributed environment. What distinguishes Accumulo from other big data storage tools is the ability to tag each data object to control access, so that different kinds of data with different classification levels can be stored in the same table yet remain logically segregated. This cell-based access control is more fine-grained than what most databases offer. This collaboration between government and the private sector is making the use of big data practical in other arenas, such as medical research and fraud detection in government benefit programs.
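Accumulo itself is written in Java, so what follows is only a conceptual Python sketch of cell-level visibility: each cell carries a visibility expression, and a scan returns only the cells whose expression is satisfied by the reader's authorizations. The expression handling here covers just simple "&" and "|" combinations and is not Accumulo's real parser.

```python
def visible(expression, authorizations):
    """True if the reader's authorizations satisfy the visibility expression.

    Simplified on purpose: '|' separates alternatives, '&' joins required
    labels, and parentheses are not handled.
    """
    return any(
        all(term in authorizations for term in clause.split("&"))
        for clause in expression.split("|")
    )

# (row, column, visibility expression, value) -- cells with different
# classification levels living side by side in the same logical table.
table = [
    ("patient-17", "diagnosis", "MEDICAL&SECRET", "..."),
    ("patient-17", "zip_code", "MEDICAL|RESEARCH", "20001"),
]

def scan(table, authorizations):
    """Return only the cells the reader is allowed to see."""
    return [cell for cell in table if visible(cell[2], set(authorizations))]

print(scan(table, {"RESEARCH"}))           # zip_code cell only
print(scan(table, {"MEDICAL", "SECRET"}))  # both cells
```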
But we have yet to see the full potential of it, Graham said. "I think it's getting there, but there is huge room for improvement," he said. "We're at the tip of the iceberg."