OpenFDA Backstory: Breaking The Paperwork Backlog

The startup Captricity uses a combination of crowdsourcing and OCR to digitize mountains of paper records, particularly for government agencies and healthcare.

David F Carr, Editor, InformationWeek Government/Healthcare

June 5, 2014

Kuang Chen can't claim credit for the OpenFDA cloud API announced this week, but his company, Captricity, played a role in making it more interesting.

"The thing about an open data set is, if it's not totally complete, it's not as useful," the Captricity CEO and founder said in an interview. To make it complete, the FDA needed to be able to provide structured access to data, regardless of whether it was submitted online in a structured format such as XML.

The backlog of paper records and scanned forms is a common problem in government and healthcare, Chen said, particularly where IT budgets are tight. The FDA can put demands on drug companies to submit data online, but the adverse drug event reporting database that OpenFDA is starting with includes data from sources at all economic levels and degrees of technical sophistication, including physicians and physician assistants.

Captricity also contributed to a recent project digitizing campaign finance reports for the state of Georgia.

OpenFDA is one of several open data initiatives announced this week by branches of the US Department of Health and Human Services, including updated and expanded Medicaid data. The innovative cloud service, which opens up data previously accessible only to select contractors, is already attracting attention from mobile app and other developers.

The first OpenFDA service provides access to reports of adverse drug events. Roughly 10% of these are submitted on paper, by fax, or as scanned images, meaning they require manual processing. Captricity's role was to whittle down the backlog of those reports using a combination of crowdsourced human intelligence and optical character recognition (OCR) software. What the FDA gets back is structured data, organized according to the XML standard for adverse event reporting.
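For developers curious about what access to those reports looks like, here is a minimal Python sketch that builds a query URL for the public openFDA drug adverse event endpoint. The helper name `build_event_query` and the choice of search field are illustrative assumptions, not part of the FDA's or Captricity's published material; the sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Public openFDA endpoint for drug adverse event reports.
OPENFDA_DRUG_EVENT_URL = "https://api.fda.gov/drug/event.json"


def build_event_query(drug_name: str, limit: int = 5) -> str:
    """Build a search URL for adverse event reports that mention a drug.

    Hypothetical helper for illustration: it searches the
    patient.drug.medicinalproduct field and caps the number of results.
    """
    params = {
        "search": f'patient.drug.medicinalproduct:"{drug_name}"',
        "limit": limit,
    }
    # urlencode percent-escapes the colon and quotes in the search expression.
    return f"{OPENFDA_DRUG_EVENT_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_event_query("aspirin", limit=3))
```

Fetching the resulting URL with any HTTP client should return JSON, so a mobile app could consume the reports without handling the underlying XML submissions.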

FDA chief health informatics officer Taha Kass-Hout was impressed enough to participate in a marketing case study for Captricity. He attested that the service "allows us to upload scans of reports received via mail, fax or PDF and get back structured, machine-readable data that is remarkably accurate, even for free-form handwriting."

Chen said OCR alone is not sufficient to produce those results, particularly for data scribbled onto a form by hand. Employing Amazon Mechanical Turk, a marketplace for farming out small "human intelligence tasks" to online workers at a low cost, Captricity is able to quality check OCR results and read in data that OCR can't handle at all. To stay in compliance with HIPAA in healthcare and other requirements for private or otherwise sensitive data, the service "shreds" the images, so that crowdsourced human workers see only isolated fields from a form, rather than the whole thing -- just the first name, just the last name, or just the middle two digits of a Social Security number, for example.

Figure 1: Shreddr breaks form images into individual fields for processing.
(Source: Captricity)
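The shredding idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Captricity's implementation: the page image is modeled as a plain grid of pixel values, and `FieldBox`, `crop`, and `shred` are hypothetical names. Each field is cut out, given an opaque task ID, and shuffled, so a worker sees one snippet with no field name and no neighboring fields.

```python
import random
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldBox:
    """Location of one form field on the page (hypothetical template format)."""
    name: str
    top: int
    left: int
    height: int
    width: int


def crop(image, box):
    """Return the rectangle of pixels covered by one field."""
    rows = image[box.top:box.top + box.height]
    return [row[box.left:box.left + box.width] for row in rows]


def shred(image, boxes, rng=None):
    """Split a page into anonymous per-field tasks.

    Returns (tasks, key). Each task carries only an opaque ID and pixels;
    the key maps task IDs back to field names and stays on the server.
    """
    rng = rng or random.Random()
    tasks, key = [], {}
    for box in boxes:
        task_id = uuid.uuid4().hex
        tasks.append({"task_id": task_id, "pixels": crop(image, box)})
        key[task_id] = box.name
    # Shuffle so task order does not reveal the layout of the form.
    rng.shuffle(tasks)
    return tasks, key
```

Keeping the ID-to-field mapping out of the tasks is the design point: answers can be reassembled into a full record only by the party that holds the key.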

Another tactic is to use multiple OCR engines, including one of Captricity's own design, and "vote them together" to find the best translation from image to data. The Shreddr API allows application developers to automate the submission of images and get back data in XML or other structured formats. Captricity also works with third parties who will collect boxes of paper for scanning. For the Georgia campaign finance project, the reports were submitted to the cloud service through an e-fax gateway. The FDA already had an internal team to scan documents, but instead of storing images alone, that team began submitting them to the Captricity service.
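One simple way to "vote together" the output of several OCR engines is a majority vote per field, with fields that lack agreement routed to human workers. The Python sketch below assumes that scheme; the article does not describe Captricity's actual voting method, and `vote`, `route_fields`, and the `min_agreement` threshold are hypothetical.

```python
from collections import Counter


def vote(readings, min_agreement=2):
    """Return the reading most engines agree on, or None if agreement is too weak."""
    # Ignore engines that returned nothing for this field.
    cleaned = [r.strip() for r in readings if r and r.strip()]
    if not cleaned:
        return None
    value, count = Counter(cleaned).most_common(1)[0]
    return value if count >= min_agreement else None


def route_fields(field_readings, min_agreement=2):
    """Split fields into machine-accepted values and ones needing human review.

    field_readings maps a field name to the list of strings produced by
    each OCR engine for that field.
    """
    accepted, needs_human = {}, []
    for name, readings in field_readings.items():
        result = vote(readings, min_agreement)
        if result is None:
            needs_human.append(name)
        else:
            accepted[name] = result
    return accepted, needs_human
```

In this sketch a field such as a clearly printed surname would be accepted automatically when two of three engines agree, while a handwritten field that every engine reads differently would be sent to the crowd.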

The technology is not limited to applications in healthcare, Chen said, but "that's where our heart is." The company spun off from a series of academic research projects Chen completed on his way to a PhD in computer science from UC Berkeley. One of his projects was a video titled "Data in the First Mile," which investigated how community health workers in Africa automated the collection of health data even though they used paper forms, rather than a direct interface to an online system.

Another of Chen's goals, in the slums of Kenya, was to improve reporting on a public health project to replace open-air latrines with more hygienic portable toilets. Providing computers or laptops to the workers responsible for checking that the toilets had been cleaned and serviced would have been cost-prohibitive, so the workers recorded their reports on paper and used a camera phone to submit them.

"When I lifted my head out of my dissertation and the creation of the [software] engine, I realized this problem was everywhere," he said, particularly in government. "There are paper backlogs in many agencies. In many cases, it's not being talked about, because they're handling it, but there are also a lot of instances where they're barely hanging on."

Yet the standard way of handling such problems is still to hire an army of temporary workers to type in the information. When agency heads see the technology, they tend to find it immediately applicable to a range of problems they address, but getting on their radar is the big challenge. "They have no idea this is possible."

About the Author

David F. Carr oversees InformationWeek's coverage of government and healthcare IT. He previously led coverage of social business and education technologies and continues to contribute in those areas. He is the editor of Social Collaboration for Dummies (Wiley, Oct. 2013) and was the social business track chair for UBM's E2 conference in 2012 and 2013. He is a frequent speaker and panel moderator at industry events. David is a former Technology Editor of Baseline Magazine and Internet World magazine and has freelanced for publications including CIO Magazine, CIO Insight, and Defense Systems. He has also worked as a web consultant and is the author of several WordPress plugins, including Facebook Tab Manager and RSVPMaker. David works from a home office in Coral Springs, Florida. Follow him at @davidfcarr.
