Web marketers probe clickstreams to find their best prospects and most profitable customers. Telcos examine call data records and social-network comments to identify influencers and potential defectors. And then there's Harvard Medical School, a big-data practitioner that is probing more than 20 years' worth of medical records to study the effectiveness and risks of various drugs.
Not to take anything away from the innovation and value of strictly commercial uses of big data, but Harvard's work holds lives in the balance. For example, Harvard's drug research spotted risks of teen suicide tied to certain antidepressants. Another study led to the withdrawal of a drug used in cardiac surgeries because it was shown to carry higher risks than two safer alternatives. And yet another study by the medical school led to new FDA warnings about a risk of violence among older patients using certain psychoactive drugs.
Harvard Medical research teams have been involved in pharmacoepidemiology and pharmacoeconomic research for more than 13 years, and the work has always been a data-intensive endeavor. With medical data now experiencing the same sort of explosive growth seen in other industries, Harvard recently stepped up to a state-of-the-art data warehousing appliance from IBM Netezza.
Most of Harvard's research centers on data from Medicare, Medicaid, and commercial insurance claims. Personally identifiable information is removed from this data before it gets into Harvard's hands, but there's still plenty of research-relevant detail on age, gender, race, previous conditions, indications, treatments, outcomes, and so on.
The medical school's databases have steadily grown over the years, and the school has kept pace with a succession of technology deployments. As of last year, Harvard had multiple conventional relational databases deployed on big, proprietary Unix servers. The total store covered more than 10 million patients and exceeded 15 terabytes of data, but with growing data volumes and new analysis techniques, the school knew it needed a faster and more capacious data warehousing and analytics platform.
"We were at breaking point, but it wasn't so much the quantity of the information as the richness of the data and the need to apply iterative algorithms," says Dr. Sebastian Schneeweiss, associate professor of medicine at Harvard Medical School and vice chief of the Brigham & Women’s Hospital.
The number of patient records needed for a particular drug test might be the same as required 10 years ago, but by richness, Schneeweiss means that for each patient record there are many new measures, such as body-mass index and LDL cholesterol level, that weren't tracked a decade ago. The iterative algorithms applied by Harvard's researchers are used to develop what are known as high-dimensional propensity scores, which cancel out biases in the sample by identifying relevant risk factors.
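To make the idea concrete, here is a minimal sketch of how propensity scores work in general. This is not Harvard's actual code or its high-dimensional algorithm; it is a simplified, hypothetical example that simulates a cohort whose treatment assignment is biased by two risk factors, then fits a plain logistic regression so each patient's estimated probability of treatment (the propensity score) can be used to adjust for that bias.

```python
# Hypothetical illustration of propensity scoring -- simulated data,
# not Harvard's method or any real patient records.
import math
import random

random.seed(0)

# Simulated cohort: each patient has two binary risk factors
# (stand-ins for measures like elevated BMI or high LDL cholesterol).
patients = [(random.random() < 0.5, random.random() < 0.3) for _ in range(2000)]

def true_treat_prob(x1, x2):
    # Treatment assignment depends on the risk factors -- this
    # dependence is exactly the bias a propensity score corrects for.
    return 1.0 / (1.0 + math.exp(-(-0.5 + 1.2 * x1 + 0.8 * x2)))

data = []
for x1, x2 in patients:
    treated = random.random() < true_treat_prob(x1, x2)
    data.append((float(x1), float(x2), float(treated)))

# Fit logistic regression for P(treated | covariates) by gradient
# descent; the fitted probability is the patient's propensity score.
w = [0.0, 0.0, 0.0]  # intercept, weight for x1, weight for x2
lr = 0.5
for _ in range(500):
    grad = [0.0, 0.0, 0.0]
    for x1, x2, t in data:
        p = 1.0 / (1.0 + math.exp(-(w[0] + w[1] * x1 + w[2] * x2)))
        err = p - t
        grad[0] += err
        grad[1] += err * x1
        grad[2] += err * x2
    for j in range(3):
        w[j] -= lr * grad[j] / len(data)

def propensity(x1, x2):
    return 1.0 / (1.0 + math.exp(-(w[0] + w[1] * x1 + w[2] * x2)))

scores = [propensity(x1, x2) for x1, x2, _ in data]
```

In practice, patients with similar scores are then matched or weighted against each other, so outcome differences between treated and untreated groups reflect the drug rather than the underlying risk factors. The "high-dimensional" versions Schneeweiss describes iterate this kind of fitting over hundreds of candidate covariates to discover which ones matter, which is what makes the computation so demanding.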