How Synthetic Data Accelerates Coronavirus Research

When your research could save COVID-19 patients, you don't want to wait around for institutional approval to use patient data in research. Here's an alternative.

Jessica Davis, Senior Editor

August 7, 2020

In the midst of a crisis, quick action is often necessary to prevent greater damage. But when you operate in an environment or industry governed by many rules and regulations, quick action can be pretty difficult.

Such is the case with healthcare research. Plenty of data is gathered every day about patients -- their age, gender, ethnicity, underlying health conditions, and more. But the data is sensitive and protected. After all, it's some of the most personal data there is about people.

Now imagine you are a healthcare researcher working on issues around the COVID-19 pandemic. That data is valuable and being able to work with it quickly means finding answers faster and potentially saving more lives.

"If you look at the traditional way that we access patient data for research and innovation purposes, it tends to be quite cumbersome and not particularly timely," said Philip Payne, chief data scientist and associate dean for health and data science at Washington University School of Medicine in St. Louis. "That's because there's a very complex set of regulatory hurdles as well as technical hurdles."

Those barriers include the need to maintain the privacy and confidentiality of patients. But modern data analytics is iterative, and each iteration can mean requesting more data and waiting for it. Researchers may have to go back to governing bodies to get access to additional data, and that can take weeks or months. The protected status of patient data makes it hard to do analytic research quickly enough to affect a rapidly evolving crisis like the coronavirus pandemic.

Speed matters in a pandemic. Rules designed to protect patient privacy slow it all down to a crawl. But you can't throw those rules out the window, either.

To access data at the speed required while also respecting the privacy and governance needs of patient data, Washington University in St. Louis, Jefferson Health in Philadelphia, and other healthcare organizations have opted for an alternative: synthetic data.

Gartner defines synthetic data as data that is "generated by applying a sampling technique to real-world data or by creating simulation scenarios where models and processes interact to create completely new data not directly taken from the real world."

Here's how Payne describes it: "We can take a set of data from real world patients but then produce a synthetic derivative that statistically is identical to those patients' data. You can drill down to the individual row level and it will look like the data extracted from the EHR (electronic health record), but there's no mutual information that connects that data to the source data from which it is derived."

Why is that so important?

"From the legal and regulatory and technical standpoint, this is no longer potentially identifiable human subjects' data, so now our investigators can literally watch a training video and get access to the system," Payne said. "They can sign a data use agreement and immediately start iterating through their analysis."

In the case of Washington University and Jefferson Health, researchers are using a platform for synthetic data called MDClone that specializes in synthetic data in healthcare. This platform takes real patient data and examines the statistical distribution of things that define those patients. The statistics about real patients are carried forward into the synthetic data set. The platform essentially creates a simulated set of patients. Researchers are able to begin data analysis work using the synthetic data after an hour-long training session and signing a data use agreement. That compares to weeks or months required when researchers need to get approval from an institutional review board to use actual patient data.
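
To make the idea of carrying statistics forward concrete, here is a minimal Python sketch. It is not MDClone's method, which the article does not describe in detail. It assumes a made-up toy cohort with two hypothetical fields (age and an ICU flag), reduces it to a few summary statistics, and then draws simulated patients from those statistics alone. The names toy_cohort, fit_summary, and sample_synthetic are invented for illustration.

```python
import random
import statistics

# Hypothetical toy cohort: made-up records for illustration, not real patient data.
toy_cohort = [
    {"age": 34, "icu": False}, {"age": 51, "icu": False},
    {"age": 67, "icu": True},  {"age": 72, "icu": True},
    {"age": 45, "icu": False}, {"age": 58, "icu": False},
    {"age": 80, "icu": True},  {"age": 29, "icu": False},
    {"age": 63, "icu": False}, {"age": 55, "icu": False},
]


def fit_summary(rows):
    """Reduce the source records to a few summary statistics."""
    ages = [r["age"] for r in rows]
    return {
        "age_mean": statistics.mean(ages),
        "age_stdev": statistics.stdev(ages),
        "icu_rate": sum(r["icu"] for r in rows) / len(rows),
    }


def sample_synthetic(summary, n, seed=0):
    """Draw n simulated patients using only the summary statistics."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    patients = []
    for _ in range(n):
        # Assumption: age is roughly normal; clamp at 0 and round to whole years.
        age = max(0, round(rng.gauss(summary["age_mean"], summary["age_stdev"])))
        # Assumption: the ICU flag is drawn independently of age.
        icu = rng.random() < summary["icu_rate"]
        patients.append({"age": age, "icu": icu})
    return patients


summary = fit_summary(toy_cohort)
synthetic = sample_synthetic(summary, n=1000)

synthetic_ages = [p["age"] for p in synthetic]
synthetic_icu_rate = sum(p["icu"] for p in synthetic) / len(synthetic)
print("source mean age:", summary["age_mean"])
print("synthetic mean age:", statistics.mean(synthetic_ages))
print("source ICU rate:", summary["icu_rate"])
print("synthetic ICU rate:", synthetic_icu_rate)
```

The sampling step never looks at an individual source row, only at the summary. The sketch is deliberately naive, though: it treats each field independently, so any relationship between age and ICU admission is lost, and it makes no formal privacy guarantee. A production platform would need to handle both.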

That speed is essential when you are racing for new insights about a novel coronavirus that has already killed more than 150,000 people in the United States and more than 700,000 people around the world. Researchers are racing for a vaccine and treatments.

For Washington University in St. Louis, the data team was able to recognize another important trend about patients in the health system's network of 15 hospitals and two physician groups. The team was looking at the anticipated maximum patient load, how many patients would require the ICU, how many would require ventilators, how many would require dialysis, and the personnel required for all this.

The team was able to quickly realize that its hospitals in north St. Louis were seeing greater rates of admissions and ICU admissions among COVID-19 patients. A data analysis revealed that African Americans were about 2.5 times more likely to be admitted to the hospital than any other patient group, Payne said. Once admitted, Black patients' odds of ending up in the ICU were four times greater than those of other patient populations.
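
The two figures above are different kinds of measures: "times more likely" is a ratio of probabilities (relative risk), while "odds ... four times greater" is a ratio of odds. The short Python sketch below shows how each is computed from group counts. The counts are hypothetical, chosen only to make the arithmetic easy to follow, and are not the Washington University data; the function names are likewise invented for illustration.

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Ratio of the event probability in group A to that in group B."""
    return (events_a / total_a) / (events_b / total_b)


def odds_ratio(events_a, total_a, events_b, total_b):
    """Ratio of the event odds (events : non-events) in group A to group B."""
    odds_a = events_a / (total_a - events_a)
    odds_b = events_b / (total_b - events_b)
    return odds_a / odds_b


# Hypothetical counts, NOT figures from the article:
# group A: 50 of 200 patients admitted; group B: 40 of 400 admitted.
rr = relative_risk(50, 200, 40, 400)
oratio = odds_ratio(50, 200, 40, 400)
print(f"relative risk: {rr:.2f}")
print(f"odds ratio: {oratio:.2f}")
```

With the same counts, the odds ratio comes out larger than the relative risk, which is why the two measures should not be compared directly.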

Payne said that insight led to working with public health groups to better support communities at risk.

Washington University is using MDClone in its cloud-first Microsoft Azure implementation, but MDClone can also be deployed on-premises.

To further COVID-19 research and other advanced health work, last month MDClone announced The Global Network, a research and knowledge-sharing collaborative that protects patient privacy through the use of synthetic data. The Global Network will focus on three pillars of research in its first year -- health services, clinical medicine, and precision medicine. At launch, members included Washington University, Jefferson Health, and Intermountain Healthcare in the western states, among several others. The network enables collaboration across these medical organizations, which can accelerate and improve research.

"Synthetic data can remove restrictions to sharing data externally so you can innovate faster," said Josh Rubel, chief commercial officer at MDClone.
