The point of big data is to extract usable information -- knowledge -- from large volumes of data that have no immediately apparent relationships. Even with advances in computing power, searching for correlations can be daunting, even impractical, if the datasets are large enough.
The National Institutes of Health has enabled semantic searches of the data in its Medline database, allowing researchers to find correlations in published medical data between therapies and outcomes that had not been noticed before. In one case, cancer researchers using graph analysis were able to see that for some types of cancer immunotherapy produced better results than chemotherapy.
"It's a real discovery," said Brand Niemann, founder of the Federal Big Data Working Group and former senior enterprise architect and data scientist at the Environmental Protection Agency. "It's like finding a needle in the haystack of medical literature."
The haystack is Medline, the bibliographical database of the National Library of Medicine, which contains more than 21 million references to medical journal articles dating back to 1946. The database holds an embarrassment of riches: in 2013 alone, 2,000 to 4,000 new references were added each day, five days a week. These entries have been enhanced with 65 million semantic predications -- entries using semantic markup standards -- resulting in 2.2 billion Resource Description Framework (RDF) statements.
To make the search practical, researchers used the Urika graph analytics appliance from YarcData. Urika works with existing data warehouses to handle graph workloads, which allow relationships within the data to be plotted graphically. All resources to be searched are held in the appliance's shared memory, so the data does not first have to be partitioned or fitted to a data model. The team was able to identify connections between outcomes of therapies for different types of cancers from the 10 million semantic predications.
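For readers unfamiliar with the idea, a semantic predication can be pictured as a subject-predicate-object statement, and a graph search as following those statements from one node to the next. The sketch below is a toy illustration in plain Python, not the Urika appliance or the Medline data: the therapy and disease names, the predicate labels and the function names are all hypothetical placeholders, and real RDF statements would use URIs and a query language such as SPARQL.

```python
from collections import defaultdict

# Hypothetical predications as (subject, predicate, object) triples.
# Placeholder values only; these are not real Medline statements.
TRIPLES = [
    ("therapy_A", "TREATS", "cancer_X"),
    ("therapy_B", "TREATS", "cancer_X"),
    ("therapy_B", "TREATS", "cancer_Y"),
    ("therapy_A", "IMPROVES", "survival"),
    ("therapy_B", "WORSENS", "survival"),
]


def build_index(triples):
    """Group outgoing (predicate, object) edges by subject node."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def therapies_with_outcomes(triples, disease):
    """Map each therapy that TREATS `disease` to its other recorded edges.

    Simplification: outcome edges hang off the therapy alone, so they are
    not specific to the disease being queried.
    """
    index = build_index(triples)
    found = {}
    for subject, predicate, obj in triples:
        if predicate == "TREATS" and obj == disease:
            found[subject] = sorted(
                edge for edge in index[subject] if edge[0] != "TREATS"
            )
    return found


if __name__ == "__main__":
    for therapy, edges in therapies_with_outcomes(TRIPLES, "cancer_X").items():
        print(therapy, edges)
```

Nothing in the query names a particular outcome in advance; it simply walks the edges that exist, which is the sense in which a graph search can surface a relationship without a prior hypothesis.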
By creating a practical way to extract visual relationships from the data, the researchers were able to find the correlations quickly and without first developing a hypothesis about them. Making the data semantically searchable enables analysis that can make better use of existing data to drive future research, Niemann said.