Hope you're not tired of buzzwords. After "the network is the computer" and "the cloud", welcome to "data-intensive computing". This time, however, there's far more at work here than a clever turn of phrase.
On Monday, The New York Times devoted a page of its Science section to Dr. Jim Gray, a software engineer and Microsoft researcher. Two years ago he vanished off the coast of California in his yacht and was presumed dead, but he left behind a major body of work in the field of mass information analysis.
His view was that scientists in general, not just computer scientists, are best served by systems designed from the ground up to process massive amounts of data efficiently and help people visualize them. We live in a world where terabytes, petabytes and now exabytes of data are routinely generated -- not only by scientific research, though that is one of the best fields for applying these insights, and certainly one of the most fruitful.
Gray's work is about to get a major boost in the form of a collection of essays that expand greatly on his insights and premises. The book is named The Fourth Paradigm: Data-Intensive Scientific Discovery, and it's free. Not just in the sense that you can download the full PDF-format text of the book at the link, but free in its licensing. It's been published under a Creative Commons license, which allows people to re-use it in any number of contexts as long as they provide proper attribution to the original work.
I can see this being used in any number of statistical or computer-science courses because of that. Who's going to say no to an insightful book about a timely subject that doesn't cost a dime to use, and is packed with things that would be more than worth paying for in a print edition?
The book's main focus is how we acquire and deal with the volumes of data around which crucial research now revolves:
The essays [in the book] focus on research on the earth and environment, health and well-being, scientific infrastructure and the way in which computers and networks are transforming scholarly communication. The essays also chronicle a new generation of scientific instruments that are increasingly part sensor, part computer, and which are capable of producing and capturing vast floods of data. For example, the Australian Square Kilometre Array of radio telescopes, CERN's Large Hadron Collider and the Pan-Starrs array of telescopes are each capable of generating several petabytes of digital information each day, although their research plans call for the generation of much smaller amounts of data, for financial and technical reasons.
Good scientific work requires that your results be reproducible. At least part of that is the sharing of data: if your number-crunching is suspect, other people can take your raw data and attempt to achieve the same results. Gray's work -- and that of his colleagues -- points towards a future where scientific work involving such data sets is not only more automated, but far more of an open process than it is now.
This isn't a matter of convenience, either. Our future survival as a species may depend on it. I'd wager it already does. The more of this work that's done out in the open, and the more tools we have to enhance that process, the better.