Spam blogs give Taylor Bayouth a big headache.
He wants the social network and blog publishing site he founded, tBlog.com, to parse the words of its 200,000 members every time they post a blog entry and use that analysis to update their profiles. Bayouth believes such a "thought matching" system would be unique. But one of the biggest problems he faces--besides competing against much bigger rivals, such as MySpace.com--is the amount of spam disguised as blogs that hits his site.
"This is what we battle with on a daily basis," Bayouth says. "Spam could literally just kill this thing."
It's a battle that likely won't end soon. There are millions of spam blogs, or splogs, with more added every day. "It's not getting any better, and it's probably getting worse," says Tim Finin, a computer science professor at the University of Maryland, Baltimore County, who co-wrote a paper about detecting splogs that was presented at an American Association for Artificial Intelligence conference in March.
Search engines designed specifically to sift through blogs, such as BlogPulse and Technorati, claim to be getting better at separating out the garbage. "Identifying spam isn't all that hard," says Natalie Glance, senior research scientist with Nielsen BuzzMetrics, which runs BlogPulse and tracks and analyzes what consumers say online about companies. "It's a game of escalation."
Who's Splogging Whom?
This blog looks legit, but click a link and you'll be looking at ads for golf vacations, hard drive repairs, and divorce lawyers.
Here's one scenario: You want to test out a new programming language, so you run a blog search on it, hoping to find out about others' experiences with it. You end up at a site that looks like a blog--including a supposed blogger's name, photo, and archive of postings--but click on a posting, and you end up at a site advertising hard drive repair.
In a daily report run last month, BlogPulse identified more than 26 million blogs, with nearly 87,000 new ones within the previous 24 hours. The company indexed 828,890 posts in the same time period. Technorati reports an even bigger blogosphere: It tracks more than 35 million blogs and 1.2 million new posts each day, an average of 50,000 per hour. About 9% of new blogs are spam, reports Technorati, and 60% of pings--the messages blogs send to a centralized network service to announce a newly published post--are from known spam sources. Technorati says it blocks these spam pings, known as spings. "Spam blogs and their cousins, spings, continue to present infrastructure providers like Technorati a challenge," founder and CEO David Sifry wrote on the site's blog.
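To make the idea of blocking spings concrete, here is a minimal sketch in Python. It assumes the simplest possible approach, a blocklist of hosts already known to send spam pings. The article does not describe how Technorati actually does this, and the host names and function names below are hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical blocklist of hosts known to send spam pings ("spings").
KNOWN_SPAM_HOSTS = {"cheap-golf-trips.example", "drive-repair.example"}


def is_sping(ping_url, spam_hosts=KNOWN_SPAM_HOSTS):
    """Return True if the pinging blog's host is on the blocklist."""
    # hostname is None for URLs without a scheme; treat those as unknown hosts.
    host = (urlparse(ping_url).hostname or "").lower()
    return host in spam_hosts


def filter_pings(ping_urls, spam_hosts=KNOWN_SPAM_HOSTS):
    """Keep only the pings that do not come from a known spam host."""
    return [url for url in ping_urls if not is_sping(url, spam_hosts)]
```

A blocklist like this only catches sources that have already been identified, which is one reason filtering becomes a game of escalation.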
Finin, who helps run eBiquity at the university, says Technorati is as good as any search engine at picking out splogs, but that about one out of every five new blog posts Technorati indexes is fake.
One of sploggers' newer strategies is to plagiarize material from other online sources. Then they insert a generic sentence that points to the site they're promoting. "It's not easy for even a human to tell" if the blog is real, Finin says. "It takes a minute or two."
Finin recently noticed, through Technorati, that a blog had copied content he had written about the OWL Web Ontology Language. At the site, he found links to other stories that had to do with owls--not only the winged creatures, but the Temple University basketball team, a bar in Baltimore, and a street in Houston.
"The person who set this up also set up hundreds of others, focused on different keywords or phrases," he says. Finin believes the site is a "splog farm" that may look legitimate now but eventually will carry ads and links to target sites.
Part of the problem is that blog search is still in its infancy and the companies doing it are small, unlike the huge companies that dominate Web search and have teams of people dedicated to researching data quality.
On top of that, results for Web search engines are ranked by relevance, which means splog sites generally don't show up on the first few pages of results. But blog search engines rank results differently. "On blog search, what people are interested in seeing is not the most relevant but the most recent," Glance says. "If there's a spam attack on a particular topic on a given day, it will be on the first page unless we filter them out."
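The difference between the two ranking approaches can be sketched in a few lines of Python. This is an illustration only, not BlogPulse's or any search engine's actual code. The `Post` fields, including the `relevance` score and the `is_spam` flag, are hypothetical and assumed to come from some upstream scoring and spam-classification step.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Post:
    title: str
    published: datetime
    relevance: float  # hypothetical relevance score, higher is better
    is_spam: bool     # hypothetical flag from an upstream spam classifier


def rank_by_relevance(posts):
    """Web-search style: most relevant first, so weak splogs sink."""
    return sorted(posts, key=lambda p: p.relevance, reverse=True)


def rank_by_recency(posts, filter_spam=True):
    """Blog-search style: newest first, optionally dropping flagged spam."""
    kept = [p for p in posts if not (filter_spam and p.is_spam)]
    return sorted(kept, key=lambda p: p.published, reverse=True)
```

With `filter_spam=False`, a freshly posted splog lands at the top of the recency ranking no matter how low its relevance score is, which is the first-page problem Glance describes.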
As with e-mail spam, splogs will likely never be eliminated. The hope is that they can be suppressed to the point where they won't ruin the Web experience.
Illustration by Michael Klein