Yahoo announced Thursday that it would work with the University of California at Berkeley, Cornell University, the University of Massachusetts Amherst, and Carnegie Mellon University on new applications for analyzing Internet-scale data sets and on other large-scale systems software research. The company will give the universities access to its cloud-computing cluster, known as M45.
The cluster has about 4,000 processor cores and 1.5 PB of disk space. It runs Hadoop, the Apache Software Foundation's open source distributed computing framework, which lets users process massive amounts of data. Yahoo's engineers have been the primary contributors to Hadoop, which powers Yahoo's Web search, content optimization, and other systems.
Shelton Shugar, senior VP of cloud computing at Yahoo, said investment in open source technologies helps Yahoo achieve breakthroughs in Internet-scale computing and improve user experience.
"By partnering with these top educational institutions to share our M45 cluster and our technical expertise, we hope to further key insights into the next generation of systems software research and development," he said.
In July, Yahoo joined Hewlett-Packard, Intel, the University of Illinois at Urbana-Champaign, Singapore's Infocomm Development Authority, and Germany's Karlsruhe Institute of Technology to create Open Cirrus, a multi-data center, open source test bed for cloud computing research and education. The partnership promotes collaboration among the private sector, academia, and governments.
The partnership with Illinois also involves the creation of a cloud-computing cluster for the entire National Science Foundation academic community. The four universities that recently joined Yahoo's cloud-computing research efforts will become part of the Open Cirrus community.
Shankar Sastry, dean of the College of Engineering at UC Berkeley, said in a statement released Thursday that gaining access to the cluster would allow researchers to analyze "vast amounts of societal-scale information on the Web." That includes voting records, online news sources, polling data, and economic statistics.
Cornell researchers hope to use access to the cluster to promote wildlife preservation and biodiversity, balance socioeconomic needs and the environment, and encourage large-scale deployment and management of renewable energy sources, said Bob Constable, dean of the faculty of Computing and Information Science at Cornell University.
"We recently established the Institute of Computational Sustainability at Cornell to focus on computational problems in these areas, and Yahoo's cluster will help us solve large-scale optimization and machine learning problems to find better ways to manage our natural resources," he said in a statement.
Jim Kurose, dean of the College of Natural Sciences and Mathematics at UMass Amherst, said the cluster would allow researchers to study a large set of scanned books drawn from the Internet Archive's million-book collection, which includes 8.5 TB of text and half a petabyte of scanned images.
"Research on such large datasets would not be possible without the use of clusters like the one Yahoo is offering us access to," he said.
Carnegie Mellon has been using the cluster for more than a year, and researchers there have published more than two dozen academic papers as a result, said Randal E. Bryant, dean of the School of Computer Science.
"We were also able to conduct research over a corpus of 200 million Web pages, processing two orders of magnitude more data," Bryant said.