Google on Thursday said that it has begun turning electronic copies of printed documents -- PDF files generated from scanned paper -- back into digital text using optical character-recognition (OCR) technology.
"In the past, scanned documents were rarely included in search results as we couldn't be sure of their content," said Google product manager Evin Levey in a blog post. "We had occasional clues from references to the document -- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format."
Google is making the converted text of scanned PDFs available on its search results pages via the "View as HTML" link. As an example, this scan of a Consumer Product Safety Commission (CPSC) document about aluminum wiring repair from 2004 is viewable as HTML.
The same search, "repairing aluminum wiring," on Yahoo Search also returned the CPSC PDF as the top result, but Yahoo's "View as HTML" link showed only blank pages. Microsoft's Live Search and Ask.com also returned the CPSC PDF as the top result, but neither offered a "View as HTML" link.
By turning images of text into text, Google expands its already massive index. As Levey points out, Google's OCR system converts pictures into thousands of words.
"This is a small but important step forward in our mission of making all the world's information accessible and useful," said Levey.
Google's approach doesn't obviate the need to consult the scanned file if it contains images or diagrams, however. While Google appears to do a good job of converting text, its HTML conversions omit graphics. Perhaps in time its engineers will be able to isolate graphic elements in scanned PDFs and insert them into the HTML versions.
One unfortunate consequence is that personal information such as Social Security numbers, which might have gone unnoticed in scans of court documents, may now be discoverable through a Google search. Public.Resource.org, a project that aims to make public government documents publicly accessible, recently found about 1,700 documents containing Social Security numbers or alien identification numbers in a corpus of 2.5 million court documents that go back decades.
But that's the sort of problem that crops up when you make all the world's information accessible.
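To illustrate how identifiers like these become easy to find once scanned pages are converted to text, here is a minimal Python sketch of a pattern scan over already-extracted document text. It is not the method Public.Resource.org used, which the report does not describe. The pattern, the function names, and the sample input are all assumptions made for illustration, and a simple pattern like this will both miss real numbers and flag strings that merely look like them.

```python
import re

# Assumed pattern (hypothetical): three digits, two digits, four digits,
# separated by hyphens or spaces. Real identifiers may be formatted differently.
SSN_PATTERN = re.compile(r"\b\d{3}[- ]\d{2}[- ]\d{4}\b")


def find_ssn_candidates(text):
    """Return every substring of text that matches the assumed SSN pattern."""
    return SSN_PATTERN.findall(text)


def flag_documents(documents):
    """Return the names of documents whose text has at least one candidate.

    documents maps a document name to its extracted text.
    """
    return [name for name, text in documents.items() if find_ssn_candidates(text)]
```

Anything such a scan flags is only a candidate and would still need human review before being treated as a real Social Security number.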