On the Evolution of Clusters of Near-Duplicate Web Pages
Overview: This paper describes a large-scale study of the prevalence and evolution of clusters of very similar ("near-duplicate") web pages. It confirms Broder et al.'s observation that duplication of web pages is widespread. In particular, the study found that about 28% of all web pages are duplicates of some page in the remaining 72%, and 22% are virtually identical to other pages. The paper examines and categorizes the documents in the 20 largest clusters. It also examines the rate at which documents exit clusters, finding that clusters are fairly stable over time and that clusters of intermediate size are generally the least stable.