Google Makes Scanned Documents Searchable - InformationWeek

10/31/2008
02:20 PM

Google Makes Scanned Documents Searchable

Using optical character-recognition technology, Google will make the converted text of scanned PDFs available on its search results pages via the "View as HTML" link.

Google on Thursday said that it has begun turning electronic copies of printed documents -- PDF files generated from scanned paper -- back into digital text using optical character-recognition (OCR) technology.

"In the past, scanned documents were rarely included in search results as we couldn't be sure of their content," said Google product manager Evin Levey in a blog post. "We had occasional clues from references to the document -- so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe's PDF format."

Google is making the converted text of scanned PDFs available on its search results pages via the "View as HTML" link. As an example, this scan of a 2004 Consumer Product Safety Commission (CPSC) document about aluminum wiring repair is viewable as HTML.

The same search, "repairing aluminum wiring," on Yahoo Search also returned the CPSC PDF as the top result, but Yahoo's "View as HTML" link showed only blank pages. Microsoft's Live Search and Ask.com also returned the CPSC PDF as the top result, but neither offered a "View as HTML" link.

By turning images of text into text, Google expands its already massive index. As Levey points out, Google's OCR system converts pictures into thousands of words.

"This is a small but important step forward in our mission of making all the world's information accessible and useful," said Levey.

Google's approach doesn't obviate the need to consult the scanned file, however, if it contains images or diagrams. While Google appears to do a good job of converting text, its HTML conversions omit graphics. Perhaps in time its engineers will be able to isolate graphic elements in scanned PDFs and insert them into the HTML conversions.

One unfortunate consequence is that personal information such as Social Security numbers, which might have gone unnoticed in scans of court documents, may now be discoverable through a Google search. Public.Resource.org, a project that aims to make public government documents publicly accessible, recently found about 1,700 documents containing Social Security numbers or alien identification numbers in a corpus of 2.5 million court documents going back decades.

But that's the sort of problem that crops up when you make all the world's information accessible.
