
IBM Aids British Library With Web Archive


Using a technology called BigSheets, IBM is assisting in the British Library's effort to preserve the U.K.'s online culture.



The British Library on Thursday plans to announce its official UK Web Archive, a project that uses an IBM analytics technology prototype called BigSheets to expand the preservation of Web pages in the .uk domain and make them more accessible.

According to the British Library, the average life expectancy of a Web site is between 44 and 75 days, and every six months 10% of .uk Web pages vanish or are replaced by new material.


"With so much material now published online, and considering the growing influence of the Internet on British culture and society, the Web is now a key part of the nation's memory," said Margaret Hodge, the U.K.'s Minister of Culture and Tourism, in a statement. "A failure to record and preserve the UK domain would not just be detrimental to future research but leave a significant gap in our digital heritage."

The .uk Internet domain currently consists of about 8 million Web pages and is expected to reach 11 million by 2011. The British Library currently has 10 people manually archiving the 5 terabytes of U.K. Web page data.

IBM's contribution to the archiving project, BigSheets, is built atop the Apache Hadoop framework, a system for distributed data processing inspired by Google's MapReduce and Google File System, and developed in recent years by Yahoo and others.
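For readers unfamiliar with the MapReduce model that Hadoop implements, the sketch below shows the general idea in plain Python: a map step emits key/value pairs for each page, the pairs are grouped by key, and a reduce step combines each group. It is an illustration only, not BigSheets or Hadoop code; the function names, sample URLs and page text are all hypothetical.

```python
import re
from collections import defaultdict


def map_page(url, text):
    """Map step: emit a (term, 1) pair for every word on one archived page."""
    for term in re.findall(r"[a-z]+", text.lower()):
        yield term, 1


def shuffle(pairs):
    """Group emitted values by key, as a framework would between the two steps."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_term(term, counts):
    """Reduce step: sum the counts collected for a single term."""
    return term, sum(counts)


def term_counts(pages):
    """Run map, shuffle and reduce over a dict of {url: page text}."""
    pairs = (pair for url, text in pages.items() for pair in map_page(url, text))
    return dict(reduce_term(term, counts) for term, counts in shuffle(pairs).items())


# Hypothetical sample data standing in for archived .uk pages.
pages = {
    "http://example.co.uk/a": "Library archive of the web",
    "http://example.co.uk/b": "The web archive grows",
}
counts = term_counts(pages)
```

Per-term totals of this kind are the sort of data that could feed a tag cloud. In a real cluster the map and reduce steps would run in parallel across many machines, which this single-process sketch does not attempt.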

"We think of these as big worksheets," said Rod Smith, VP of emerging Internet technologies at IBM, who stresses that the project goes beyond archiving. "You'd like to be more valuable to people than just an archive. In the British Library's case, you'd like to be known as the accurate holder of historical information."

BigSheets will allow British Library researchers, and eventually library patrons, to access Web archive data, run queries, and visualize the results in forms such as a tag cloud or pie chart.

It's about ways to explore and sift data, says Smith.

Smith says it's still too early in the project's evolution to determine whether BigSheets will be adopted by other archiving organizations, like the Internet Archive.


