Google Makes Scanned Documents Searchable - InformationWeek

10/31/2008
02:20 PM

Google Makes Scanned Documents Searchable

Using optical character-recognition technology, Google will make the converted text of scanned PDFs available on its search results pages via the "View as HTML" link.

Google on Thursday said that it has begun turning electronic copies of printed documents -- PDF files generated from scanned paper -- back into digital text using optical character-recognition (OCR) technology.

"In the past, scanned documents were rarely included in search results as we couldn't be sure of their content," said Google product manager Evin Levey in a blog post. "We had occasional clues from references to the document -- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format."

Google is making the converted text of scanned PDFs available on its search results pages via the "View as HTML" link. As an example, this scan of a Consumer Product Safety Commission (CPSC) document about aluminum wiring repair from 2004 is viewable as HTML.

The same search, "repairing aluminum wiring," on Yahoo Search also returned the CPSC PDF as the top result, but Yahoo's "View as HTML" link showed only blank pages. Microsoft's Live Search and Ask.com also returned the CPSC PDF as the top result, but neither offered a "View as HTML" link.

By turning images of text into text, Google expands its already massive index. As Levey points out, Google's OCR system converts pictures into thousands of words.

"This is a small but important step forward in our mission of making all the world's information accessible and useful," said Levey.

Google's approach doesn't obviate the need to consult the scanned file, however, when it contains images or diagrams. While Google appears to do a good job of converting text, its HTML conversions omit graphics. Perhaps in time its engineers will be able to isolate graphic elements in scanned PDFs and insert them into the converted pages.

One unfortunate consequence of this is that personal information, such as Social Security numbers, that might have gone unnoticed in scans of court documents may now be discoverable through a Google search. Public.Resource.org, a project that aims to make public government documents publicly accessible, recently found about 1,700 documents with Social Security numbers or alien identification numbers in a corpus of 2.5 million court documents going back decades.

But that's the sort of problem that crops up when you make all the world's information accessible.
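
To illustrate why OCR makes this kind of exposure easier to find, here is a minimal Python sketch of how converted text could be scanned for strings shaped like Social Security numbers. It is not the method used by Google or Public.Resource.org, which the article does not describe. The pattern, the function names, and the sample file names are all hypothetical, and a real audit would need far more care to avoid false positives.

```python
import re
from typing import Dict, List

# Assumption: an SSN-like string is 3 digits, 2 digits, and 4 digits,
# separated by hyphens or spaces. This only checks the shape of the
# string; it does not verify that the number is a real SSN.
SSN_LIKE = re.compile(r"(?<!\d)\d{3}[- ]\d{2}[- ]\d{4}(?!\d)")


def find_ssn_like(text: str) -> List[str]:
    """Return every SSN-shaped string found in the given OCR text."""
    return SSN_LIKE.findall(text)


def flag_documents(docs: Dict[str, str]) -> List[str]:
    """Return the sorted names of documents whose text contains a match.

    `docs` maps a (hypothetical) file name to its OCR-converted text.
    """
    return sorted(name for name, text in docs.items() if SSN_LIKE.search(text))
```

Once a scanned page has been converted to text, a check like `flag_documents({"filing.pdf": converted_text})` takes only a few lines, which is why information that was effectively hidden inside an image becomes easy to locate.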
