That's presenting new challenges for budget-constrained government agencies that already process large volumes of data for predictive analysis, scientific visualization and forecasting. The concern about managing an exponential rise in data is compounded by the Obama administration's push to release more data to the public under its Open Data Policy -- especially for agencies working in the scientific and health fields.
"The challenge is not so much processing what data we have, but what's coming," said Roger Baker, chief strategy officer for Agilex and a former CIO for the Department of Veterans Affairs. Baker spoke during a panel discussion Monday at the annual American Council for Technology -- Industry Advisory Council Executive Leadership Conference.
The National Oceanic and Atmospheric Administration (NOAA) offers a glimpse of the volume of data that some agencies already handle. NOAA collects more than 2 billion observations from 17 satellites, and another 1.5 billion observations from sensors around the world each day, according to Joe Klimavicz, CIO for NOAA, who spoke at the conference. The agency relies on supercomputers capable of processing 2 million billion calculations per second to analyze that data and produce the 15 million weather and related reports NOAA issues throughout each day, Klimavicz added.
"Our data associated with this is growing 30 petabytes a year," he said, noting that the agency's dependency on big data tools, and on the infrastructure to carry, store and process that data, continues to grow along with the data itself.
Klimavicz outlined a number of additional concerns his office is now confronting as big data volumes grow even bigger. One is having the controls in place to validate the origin and the accuracy of the information streaming into NOAA. "Metadata quality is incredibly important to us," he said. So is ensuring data integrity. "If enough bits randomly flip from zero to one, that can begin to impact our climate models."
The sheer volume and diversity of data sources that NOAA deals with introduces another layer of concern for CIOs like Klimavicz. He explained how a program begun with the FAA more than two decades ago to collect in-flight weather data from US airlines generates more than 100,000 automated reports a day, but took nearly two years to fully assimilate into NOAA's weather models so that the data dovetailed correctly with other data sources.
Getting agreement within communities of interest on how to define data, and on how that data is eventually used, remains yet another challenge.
Department of Energy CTO Robert Bectel said developing the semantic architecture and a system for federating data can get costly. "The real win for me is if I can get a community of scientists to make sure the data coming in is accurate and consistent."
Creating data standards proved to be a major undertaking in the health field, according to Baker, solved in part by the development of a health data dictionary agreed to by the VA and Defense Department. Data standards play a critical role as data gets aggregated, Baker said. "We found 240 ways to represent the way penicillin was prescribed," he said, explaining the level of work that is often involved in making seemingly identical data make sense in aggregate form.
Perhaps just as important as defining data is defining how data customers use the data. "Real-time depends on what you're talking about," Baker pointed out. "We thought getting information (to users) in 24 hours was okay." But Defense Department doctors expected to see data minutes after a patient's labs were completed.
"Health data is moving same way as science data," he added, explaining that the Centers for Disease Control and Prevention is using big data to model the movement of diseases the same way NOAA models climates.