Commentary

Thomas Claburn
 

Google Books Metadata Includes Millions Of Errors

The Google Books database is riddled with errors, millions, of them by Google's count.

The Google Books database is riddled with errors, millions, of them by Google's count.In a blog post ruminating about the impact of the Google Books lawsuit settlement, the subject of much controversy of late, Geoffrey Nunberg, professor at the School of Information at UC Berkeley, wryly highlights the inaccuracy of the metadata used in the Google Books database by noting what a miraculous year 1899 was for literature.

That year, by Google's reckoning, was the publication date for "Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, André Malraux' La Condition Humaine, Stephen King's Christine, The Complete Shorter Fiction of Virginia Woolf, Raymond Williams' Culture and Society, Robert Shelton's biography of Bob Dylan, Fodor's Guide to Nova Scotia, and the Portuguese edition of the book version of Yellow Submarine, to name just a few," Nunberg observes.


More Internet Insights

White Papers

More >>

Reports

More >>

Webcasts

More >>

1899, it turns out, is a placeholder number. A metadata provider gave Google a large number of book records from Brazil that list 1899 as a default publication date, resulting in about 250,000 misdated books from this one source.

"Our providers have millions of errors like these, and we do what we can to eliminate them," acknowledges Google' engineering manager Jon Orwant in a comment on Nunberg's blog. "We have made substantial improvements over the past year, but I'm sure we can all agree there's a great deal more to do."

Many of those participating in the discussion on Nunberg's blog suggest that some of that "great deal more" could be Wikipedia-style crowdsourcing. It would be a cost-effective way -- free labor! -- to hunt down and correct errors in the Google Book database. But would anyone go for it?

As a commenter identifying himself as Nick Lamb puts it, "Volunteers have transcribed Britain's census (100+ year old census paperwork is released to the public on the basis that most people mentioned in it are long dead) and other public records which are every bit as dull as the phone book. BUT to make it happen Google need to reassure people that they're not being taken advantage of, the facts collected must be irrevocably put into the public domain."

Whether that fits with Google's long-term goals for Google Books remains to be seen. But Google-style crowdsourcing -- Knol -- hasn't exactly given Wikipedia a run for its money.

More broadly, the state of the Google Books metadata suggests that other databases that have an even greater impact on people's lives may also be rife with errors.

If Congress ever gets around to passing comprehensive online data regulation, here's to hoping that it includes a right to review and correct the data that describes who we are and what we do.


Related Reading




Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

InformationWeek encourages readers to engage in spirited, healthy debate, including taking us to task. However, InformationWeek moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing/SPAM. InformationWeek further reserves the right to disable the profile of any commenter participating in said activities.

Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.
T-Shirt Giveaway T-Shirt Giveaway: Each week we're selecting one great comment from our readers. The author of the comment will receive an InformaitonWeek Community t-shirt. So get posting!
Subscribe to RSS

Resource Links