Web marketers probe clickstreams to find best prospects and most-profitable customers. Telcos examine call data records and social-network comments to identify influencers and potential defectors. And then there's Harvard Medical School, a big-data practitioner that is probing more than 20 years' worth of medical records to study the effectiveness and risks of various drugs.
Not to take anything away from the innovation and value of strictly commercial uses of big data, but Harvard's work holds lives in the balance. For example, Harvard's drug research spotted risks of teen suicide tied to certain antidepressants. Another study led to the withdrawal of a drug used in cardiac surgeries because it was shown to have higher risks than two safer alternatives. And yet another study done by the medical school led to new FDA warnings about a risk of violence among older patients using certain psychoactive drugs.
Harvard Medical research teams have been involved in pharmacoepidemiology and pharmacoeconomic research for more than 13 years, and it has always been a data-intensive endeavor. With medical data now experiencing the same sort of explosive growth seen in other industries, Harvard recently stepped up to a state-of-the-art data warehousing appliance from IBM Netezza.
Most of Harvard's research centers on data from Medicare, Medicaid, and commercial insurance claims. Personally identifiable information is removed from this data before it gets into Harvard's hands, but there's still plenty of research-relevant detail on age, gender, race, previous conditions, indications, treatments, outcomes, and so on.
The medical school's databases have steadily grown over the years, and the school has kept pace with a succession of technology deployments. As of last year, Harvard had multiple conventional relational database deployments running on big, proprietary Unix servers. The total store covered more than 10 million patients and exceeded 15 terabytes of data, but with growing data and new techniques, the school knew it needed a faster and more capacious data warehousing and analytics platform.
"We were at breaking point, but it wasn't so much the quantity of the information as the richness of the data and the need to apply iterative algorithms," says Dr. Sebastian Schneeweiss, associate professor of medicine at Harvard Medical School and vice chief of the Brigham & Women’s Hospital.
The number of patient records needed for a particular drug test might be the same as required 10 years ago, but by richness, Schneeweiss means that for each patient record there are many new measures, such as body-mass index and LDL cholesterol level, that weren't tracked a decade ago. The iterative algorithms applied by Harvard's researchers are used to develop what are known as high-dimensional propensity scores, which cancel out biases in the sample by identifying relevant risk factors.
"If a patient has a high lipid test [LDL] level, for example, it's more likely they will take a lipid-lowering medication, and at the same time it's more likely that they will have a heart attack," Schneeweiss explains.
LDL levels are just one risk factor out of hundreds that are identified and prioritized by high-dimensional propensity scores. It takes time to develop and run the algorithms, and that gets back to the capacity and speed of the analytics platform. Without elaborate and time-consuming database tuning and optimization work, researchers found that many of their iterative algorithms took as long as overnight or a weekend to run.
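The confounding Schneeweiss describes, and the way a propensity score corrects for it, can be sketched with a toy example. The counts below are made up for illustration and are not Harvard data. The sketch uses a single yes/no risk factor (high or low LDL) and simple inverse-probability weighting, which is one common way to apply a propensity score; Harvard's high-dimensional algorithms weigh hundreds of factors and are not reproduced here. In this invented cohort the drug has no effect on heart-attack risk within either LDL group.

```python
# Toy cohort with made-up counts (illustrative only, not Harvard data).
# Each stratum: (treated_n, treated_events, untreated_n, untreated_events)
cohort = {
    "high_ldl": (800, 80, 200, 20),  # 10% heart-attack risk, treated or not
    "low_ldl": (200, 4, 800, 16),    # 2% heart-attack risk, treated or not
}


def naive_risk_difference(cohort):
    """Compare raw heart-attack risk in treated vs. untreated patients."""
    t_n = sum(s[0] for s in cohort.values())
    t_ev = sum(s[1] for s in cohort.values())
    u_n = sum(s[2] for s in cohort.values())
    u_ev = sum(s[3] for s in cohort.values())
    return t_ev / t_n - u_ev / u_n


def weighted_risk_difference(cohort):
    """Weight each patient by the inverse probability of the treatment received."""
    t_w = t_ev_w = u_w = u_ev_w = 0.0
    for t_n, t_ev, u_n, u_ev in cohort.values():
        ps = t_n / (t_n + u_n)  # propensity score: P(treated | LDL stratum)
        t_w += t_n / ps
        t_ev_w += t_ev / ps
        u_w += u_n / (1 - ps)
        u_ev_w += u_ev / (1 - ps)
    return t_ev_w / t_w - u_ev_w / u_w


naive = naive_risk_difference(cohort)
adjusted = weighted_risk_difference(cohort)
print(f"naive risk difference:    {naive:.3f}")
print(f"adjusted risk difference: {abs(adjusted):.3f}")
```

The naive comparison makes the drug look harmful, with heart-attack risk about 4.8 percentage points higher among treated patients, simply because high-LDL patients are both more likely to be treated and more likely to have a heart attack. Once patients are weighted by their propensity scores, the difference falls to roughly zero, which matches how the toy data was built.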
"By 2009 we recognized that we needed a fundamentally different approach," says Schneeweiss.
The different approach embraced by the commercial world for big-data processing has been massively parallel processing appliances built on commodity (mostly Intel X86) servers rather than clusters of expensive proprietary symmetric multiprocessor servers. Harvard didn't have to look far to find such an appliance as it was approached by IBM Netezza, headquartered in nearby Marlborough, Mass., in 2010 to explore the possibility of a research partnership.
(Competitors will undoubtedly point out that Netezza still uses proprietary Field Programmable Gate Arrays for data filtering, but the company switched to commodity X86 processors and storage in 2009 with the move to its TwinFin architecture.)
Appliances are typically a seven-figure investment, but through the partnership, Harvard did not have to pay for its appliance. "That explains why we didn't shop around -- it was a Godsend that came at the right moment," says Schneeweiss.
The transition to IBM Netezza happened quickly early this year: IBM Netezza had a TwinFin appliance up and running at a Harvard research data center within two days. Once data was migrated to the new environment, Schneeweiss says, the school's six programmers were able to do analyses at least ten times faster without any optimization.
"We have one analysis of data on 150,000 patients that took 20 minutes, with optimization, in the old environment, and it now takes two seconds without any special tuning," he says.
Given the faster analysis speeds and minimal tuning now required, researchers routinely apply high-dimensional propensity scoring techniques to improve the accuracy of their research. "That gets us that much closer to causal conclusions, and researchers can act upon that insight," Schneeweiss says.
The faster Harvard's researchers can develop conclusive research, the sooner they will be able to help drug companies, the FDA, and other regulatory agencies take risky drugs off the market and steer practitioners toward the safest and most effective medications available.
For IBM Netezza, promoting the use of the company's technology among prestigious researchers helps open doors at other research facilities and at commercial firms, such as pharmaceutical giants. "We at Netezza are excited that our collaboration with these notable Harvard Medical School faculty and researchers has already led to leveraging IBM research development efforts and existing products toward revolutionizing computational pharmacoepidemiology," wrote Shawn Dolley, vice president and general manager of the Healthcare & Life Sciences practice at IBM Netezza.
It's the kind of good-will gesture that has always paid off for IBM, even if it means giving away a million-dollar-plus appliance.