Faced with a deadline for electronically processing financial disclosure information and making it publicly available, officials at the Georgia Government Transparency and Campaign Finance Commission needed to deploy a system for digitizing a wide variety of forms -- some of which were even handwritten in crayon. They turned to a data-capture system from Captricity that combines human intelligence and machine learning.
The commission is responsible for making public all of the financial disclosure and campaign contribution forms for elected officials and individuals running for office. It collects the reports and makes them available online in a searchable database.
In 2013, the Georgia legislature mandated that disclosure forms had to be transmitted to the commission from local offices by electronic filing or faxing and that the new processing system had to be in place by the end of the year. Under major budget constraints, commission officials decided that e-faxing was the easiest and most cost-effective method to implement.
But they needed a way to extract the data from the faxed forms and convert it to structured output that could be digitized and integrated with their existing online filing system and database. The most formidable obstacle was the haphazard way people filled in the forms.
"Some people use Adobe and print the forms. Some people handwrite them. We've even received forms written in crayon," said Joel Perkins, CEO of Inserv360, an Atlanta-based firm that manages the commission's IT infrastructure with another Georgia company, Jaxified LLC.
The IT team tested several optical character recognition (OCR) systems, but they weren't nearly accurate enough, particularly with handwritten forms. That's when they found Captricity of Berkeley, Calif. The data-capture specialist firm uses crowdsourcing to turn difficult-to-read paper documents into actionable data within hours. When the Georgia team tested the system, the structured data returned by Captricity was 99% accurate, even for handwritten forms.
Captricity's cloud-based system leverages OCR scanning technology and manual data-entry workers from Amazon Mechanical Turk (AMT), a crowdsourcing Internet marketplace that lets "requesters" such as Captricity coordinate the use of human intelligence to perform tasks that computers are currently unable to do. The system isolates a form's individual fields, or "shreds," into distinct images. AMT workers transcribe the content of the shreds, and their entries feed the OCR algorithms that teach the computer to "read" the data. As a result, the output of the OCR engines grows steadily more accurate.
"The more they do, the better those engines get over time," Kuang Chen, CEO and co-founder of Captricity, told InformationWeek Government. "Our customers get almost perfect data from the get-go because we use humans to also verify the output of these predictions and make sure that every single piece gets up to high-level accuracy," he said. "The verification is also crowdsourced."
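The flow Chen describes can be pictured with a small sketch. It is an illustration only, not Captricity's code: the page is modeled as lines of text rather than a scanned image, and the names FieldBox, shred, needs_human_review, and consensus, along with the 0.9 confidence threshold and the two-worker agreement rule, are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldBox:
    """Location of one form field on the page (hypothetical layout)."""
    name: str
    top: int
    left: int
    height: int
    width: int


def shred(page, boxes):
    """Cut a page (a list of equal-length strings) into one crop per field."""
    shreds = {}
    for box in boxes:
        rows = page[box.top:box.top + box.height]
        shreds[box.name] = [row[box.left:box.left + box.width] for row in rows]
    return shreds


def needs_human_review(confidence, threshold=0.9):
    """Send a machine reading to human workers when its confidence is low.

    The threshold is an assumed value, not one taken from the article.
    """
    return confidence < threshold


def consensus(entries, min_agree=2):
    """Return the value at least `min_agree` workers typed, else None."""
    if not entries:
        return None
    value, count = Counter(e.strip() for e in entries).most_common(1)[0]
    return value if count >= min_agree else None


# Toy "scanned" page: two fields at known positions.
page = [
    "NAME: Jane Doe      ",
    "AMOUNT: 250.00      ",
]
boxes = [
    FieldBox("name", top=0, left=6, height=1, width=14),
    FieldBox("amount", top=1, left=8, height=1, width=12),
]
shreds = shred(page, boxes)
```

Each entry in shreds holds a single field and nothing else from the page, so a worker handed the "amount" shred never sees the name next to it. A reading that falls below the threshold would go to several workers, and consensus accepts a value only when enough of them agree.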
Since the beginning of the year, the Georgia commission has received about 7,000 e-faxes, many of them 10 or more pages in length. Perkins estimates that the commission will process about 40,000 pages a month during the seven annual filing periods this year. All of the forms, even those filled out in crayon, flow from fax, to Captricity, to the commission's e-filing system in one smooth pass, he said.
Last November, the Food and Drug Administration announced a contract with Captricity to digitize handwritten HIPAA complaint forms using the OCR and Amazon Mechanical Turk process. About 10% of tens of thousands of HIPAA reports are submitted on paper. Previously, the forms were digitized manually by data-entry staff at FDA, a process that created a huge backlog in paperwork. Unlike the Georgia campaign forms, which are all a matter of public record, FDA documents involve security and privacy issues.
However, Captricity's shredded approach to processing documents also ensures the privacy and security of a document's content, Chen said. Crowdsourced workers see only a fragment of an entire document. "No single one of these verifiers gets to see anything outside the context of the one little shred," he said. "They don't know who it is, who it's for, or what it's about. The same trick that makes it go fast makes it secure as well."