The Obama Administration last week unveiled a "Big Data Research and Development Initiative" that will see at least six government agencies making $200 million in additional investments to "greatly improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data."
The big data initiative sounds good in theory, and I'm all for promoting U.S. competitiveness in math and science. But after sitting through nearly two hours of presentations on the feds' big data initiative, I fear those investments will be spread too thinly among too many agencies that aren't collaborating.
It's encouraging that the White House is at least aware of all the agencies involved in data- and compute-intensive research. The administration released a fact sheet that listed at least 80 projects and initiatives across a dozen federal agencies, including the Department of Defense, the Department of Homeland Security, the Department of Energy, the Department of Health and Human Services, and the Food and Drug Administration.
Who knew the government was funding so much data-driven research? The White House issued this fact sheet as if to say, "Look how much we're doing already!" But when you start reading about all the separate initiatives and all of the high-performance computing labs and research facilities already in place, it makes your head spin. As a taxpayer, it pains me to see so many examples of apparently duplicative research, staff, and infrastructure.
The big data initiative was prompted in part by a December 2010 report by the President's Council of Advisors on Science and Technology (PCAST) on "Designing a Digital Future," which found the U.S. is investing too little in networking and IT research. Part of the reason we're not spending "enough" is that we're spreading investments among agencies conducting R&D for their respective fields rather than on networking and IT that could benefit everyone.
It was a good sign that last week's presentation kicked off with the announcement of an initiative between the National Science Foundation and the National Institutes of Health to fund 15 to 20 research projects to the tune of $25 million. The idea behind this Big Data Solicitation is to seed and provide direction for initiatives that will speed data-driven scientific discoveries related to health and disease. What's more, it's an invitation to academia, non-governmental organizations, and the private sector to participate. This is exactly the kind of collaborative effort I think we need.
But after a promising start, the four speakers who followed--from the U.S. Geological Survey, the Department of Defense, the Defense Advanced Research Projects Agency, and the Department of Energy--seemed more intent on talking about their unique initiatives and less focused on how they could collaborate with other agencies. Amid the din of acronyms and price-tag-unknown projects, the same terms kept coming up: data volume, data variety, modeling and algorithms, data visualization, making information actionable, and so on.
It all reminded me of a conversation I had with Don Burke a couple of years ago on the topic of the lack of cooperation, collaboration, and consolidation among government agencies involved in national security. "Every agency says, 'I have unique needs.' Then their IT providers say, 'I will give you the 100% solution for that need, but you have to give us all this money to create a unique solution,'" explained Burke, "doyen" of Intellipedia, an intelligence-community-wide wiki started in 2006 by the Office of the Director of National Intelligence.
Intellipedia aims to help the intelligence community connect the dots on threats by collapsing the walls between data silos. Reading through all the big data projects and initiatives the government already has on the table, I think there's an opportunity to do more shared big-data research and create shared big-data platforms.
Yes, the U.S. Geological Survey, NASA, the Department of Defense, and the National Institutes of Health are doing very different types of data-driven research and analyses, but they're all grappling with the use of unstructured data and large-scale machine data, they're all pushing the envelope on data mining, and they're all looking for better data visualization and reporting techniques.
Johns Hopkins, for one, believes in big data collaboration across disciplines. Dr. Peter Greene, Johns Hopkins' chief medical information officer, tells me that the institution's oncology researchers are collaborating with the university's Department of Astronomy. The cancer researchers face the big data challenge of studying the human genome, which consists of 3 billion base pairs of DNA. Johns Hopkins' Department of Astronomy, meanwhile, has a data center with rack upon rack of compute power applied to large-scale computational astronomy calculations. Why build a separate data center when one can handle both astronomy and healthcare calculations?
The government's hugely important data center consolidation plan didn't come up at all during last week's announcements. So what about assessments of compute-power requirements and staffing needs? Are our current labs anywhere near maximum utilization? It strikes me that consolidating high-performance computing centers and relying on cloud delivery of services to multiple agencies could go a long way toward cutting the big cost of big-data analysis.
If we're to avoid the problem identified in the original PCAST report--spreading budgets too thinly across too many agencies studying parochial requirements--these departments and agencies must recognize that there's a huge opportunity for their research dollars to go further. If they will only give up a bit of control and a bit of their "unique" agendas and a bit of their precious budgets, we could be creating big data research and systems for the common good.