The data warehouse is an ambitious effort to pull clinical and scientific data into a single system, giving researchers an unprecedented opportunity to study the relationship between genes, proteins, and disease. The database will collect as much as 50 terabytes of data every nine months and over time could become the largest data warehouse in the world.
"No one has put all this information onto a single database platform," says Dr. Richard Somiari, chief operating officer and chief scientific officer at the institute, based in Windber, Pa. The system is based on data-warehouse hardware and software from NCR Corp.'s Teradata division. Details about the project are being disclosed this week at Teradata's user conference in Seattle.
Clinical data from patients at Windber Medical Center, with which the research institute is affiliated, has already been loaded into the data warehouse. That includes tissue-biopsy data (each biopsy adds 166 Mbytes to the system), family histories, radiology data (including X-ray images), histopathology data, and patient DNA, RNA, and protein information.
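To put the two quoted figures in perspective, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration, not anything the institute has published: it assumes decimal units (1 terabyte = 10^12 bytes, 1 Mbyte = 10^6 bytes), treats "as much as 50 terabytes" as an upper bound, and uses function names invented for this example.

```python
# Back-of-the-envelope arithmetic using the two figures quoted in the article.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TB = 10**12
MB = 10**6

MAX_INTAKE_BYTES = 50 * TB   # "as much as 50 terabytes" per period (upper bound)
PERIOD_MONTHS = 9            # "every nine months"
BIOPSY_BYTES = 166 * MB      # "166 Mbytes" added per tissue biopsy


def annualized_intake_tb(intake_bytes=MAX_INTAKE_BYTES, period_months=PERIOD_MONTHS):
    """Scale the per-period upper bound to a 12-month rate, in terabytes."""
    return intake_bytes / period_months * 12 / TB


def biopsy_equivalents(intake_bytes=MAX_INTAKE_BYTES, biopsy_bytes=BIOPSY_BYTES):
    """Whole biopsies' worth of data that would fit in one period's intake."""
    return intake_bytes // biopsy_bytes


if __name__ == "__main__":
    print(f"Upper-bound intake: {annualized_intake_tb():.1f} TB per year")
    print(f"Roughly {biopsy_equivalents():,} biopsies' worth per nine months")
```

Under those assumptions, the upper bound works out to about 66.7 terabytes a year, or the equivalent of roughly 300,000 biopsies every nine months. Biopsy data is only one of several data types being loaded, so the second number is a sense of scale rather than a forecast of biopsy volume.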
The next step will be to add data from other research databases, including DNA data from GenBank, protein data from the Swiss-Prot database in Europe, metabolic pathway data from Kyoto University's KEGG (Kyoto Encyclopedia of Genes and Genomes) database, and protein-protein interaction data from the DIP (Database of Interacting Proteins) database at UCLA.
Linking this basic research data with clinical information will allow researchers at Windber to examine multiple variables when investigating the causes of disease, Somiari says. The goals are to develop new strategies for managing patient conditions, discover new "markers" that help doctors diagnose diseases much earlier, and ultimately develop cures for those diseases.
Windber chose the Teradata system because of its scalability and parallel processing capabilities, says Nick Jacobs, the institute's president and CEO. He adds that Windber sought the same kind of technology that Wal-Mart and other commercial companies use to build their own massive data warehouses. The system uses analysis tools from Amersham Biosciences, Genomax Technologies, and Spotfire. Partners in the project include the U.S. Army's Walter Reed Army Medical Center, universities such as the University of Pennsylvania and Creighton University, and research institutes in the U.S., Europe, and Japan.