Creating a new drug can take a pharmaceutical company anywhere from 8 to 20 years. To put that in perspective, at the short end of that timeline, the iPad was introduced 8 years ago, in 2010. At the long end, Google didn't exist 20 years ago -- it was founded in September 1998. Technology is moving much faster than pharmaceutical drug development.
GlaxoSmithKline (GSK) would like to make that drug discovery and development timeline a lot shorter. The London-based company is among the largest pharmaceutical companies in the world. Mark Ramsey joined GSK in 2015 as the pharmaceutical giant's first chief data officer (CDO), reporting to the president of research and development and focused on creating a team and establishing a platform to support data and analytics for the company. Ultimately, the goal is to shorten the drug development process to 2 years.
So where do you begin to address the data and analytics challenges presented by a centuries-old company and leapfrog ahead to a place where efficiency can accelerate drug development?
"What we didn't want to do was to build a single use case," Ramsey told InformationWeek in an interview. Ramsey said that he's seen organizations have trouble expanding the initial solution when they start too small, even though many analysts recommend starting small when implementing an initial analytics project.
Instead, Ramsey began by taking an inventory across R&D and the portfolio of use cases to get a sense of everything that his program would touch. The approach was an important first step in building the program that would break down the data-flow barriers among the company's many siloed operations. For instance, the clinical trial area is a silo. Experiments by scientists are a silo. Lessons from other organizations are a silo. The inventory project became the foundation for designing an architecture and approach for the entire organization.
"There's a lot of discussion around machine learning, artificial intelligence, and deep learning," Ramsey said. "But you need the data in order to be able to feed those technologies."
Ramsey's group focused on bringing that siloed data together. "We are now delivering collections of data and collections of use cases," he said.
Ramsey's data and analytics stack includes multiple technologies, with the foundation based on Cloudera's Hadoop.
"That's our primary data and information platform -- the source where we store our curated data and our analytics processes." The stack also includes Kafka and Spark. Other technologies include StreamSets for data ingestion (which has been completely automated with bots), Tamr for machine learning data curation, Trifacta for data wrangling, and AtScale for virtualization across environments. AtScale lets users leverage familiar BI tools for insights from the Hadoop environment. GSK also uses Zoomdata for data visualization, Docker for some of its containerization, Kinetica for GPU-based analytics, and Waterline Data for storage and search. The total solution amounts to more than 5 petabytes of data, all on-premises.
"It's still a way too complex environment," he told me. "We are really working with each of those organizations so that metadata and interoperability come together." The goal, of course, is to "really bring them together as a well-integrated ecosystem."
Users across the enterprise are consuming all this different data in different ways. Ramsey said that a large number of people access the data through guided analytics in the form of a structured query or dashboard.
GSK also has about 500 to 600 "bench chemists" who have been using an Excel plugin for many years to get data about experiments.
Another 20% of the organization works in computational notebooks, using Python, R, or other analytics tools. They are focused less on visualization and more on developing routines that run against the data. Another 10% of staff are using the platform for machine learning and deep learning -- running simulations and algorithms.
"One of the big challenges is that even though Hadoop and related technologies have been in the market for a while, bringing them all together is more difficult than what I think it could be," Ramsey said. "I think that's one of the reasons we don't see a lot of production-level Hadoop on a larger scale. It's more difficult to make it happen than it should be, and that is putting a constraint on the industry."
GSK hasn't achieved 2-year drug development yet, but the data and analytics platform environment created by Ramsey and his team has brought the company closer to realizing that goal.