Using optical character-recognition technology, Google will make the converted text of scanned PDFs available on its search results pages via the "View as HTML" link.

Thomas Claburn, Editor at Large, Enterprise Mobility

October 31, 2008

Google on Thursday said that it has begun turning electronic copies of printed documents -- PDF files generated from scanned paper -- back into digital text using optical character-recognition (OCR) technology.

"In the past, scanned documents were rarely included in search results as we couldn't be sure of their content," said Google product manager Evin Levey in a blog post. "We had occasional clues from references to the document -- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format."

Google is making the converted text of scanned PDFs available on its search results pages via the "View as HTML" link. As an example, this scan of a Consumer Product Safety Commission (CPSC) document about aluminum wiring repair from 2004 is viewable as HTML.

The same search, "repairing aluminum wiring," on Yahoo Search also returned the CPSC PDF as the top result, but Yahoo's "View as HTML" link showed only blank pages. Microsoft's Live Search and Ask.com also returned the CPSC PDF as the top result, but neither offered a "View as HTML" link.

By turning images of text into text, Google expands its already massive index. As Levey points out, Google's OCR system converts pictures into thousands of words.

"This is a small but important step forward in our mission of making all the world's information accessible and useful," said Levey.

Google's approach doesn't obviate the need to consult the scanned file if it contains images or diagrams, however. While Google appears to do a good job of converting text, its HTML conversions omit graphics. Perhaps in time its engineers will be able to isolate graphic elements in scanned PDFs and insert them into its HTML conversions.

One unfortunate consequence is that personal information, such as Social Security numbers, that might have gone unnoticed in scans of court documents may now be discoverable through a Google search. Public.Resource.org, a project that aims to make public government documents publicly accessible, recently found about 1,700 documents with Social Security numbers or alien identification numbers out of a corpus of 2.5 million court documents that go back decades.

But that's the sort of problem that crops up when you make all the world's information accessible.
