Commentary
11/14/2002
04:00 PM
Fred Langa

Langa Letter: Real-Life Spam Solutions

A new generation of anti-spam tools is just around the corner. But until then, these spam blockers and handlers may be the next best thing.

Pattern Matching
You might be thinking, "Why bother with blacklists? A spell-checker can examine every word in a document, so why not build a spam-checker that works the same way, looking at the actual word-by-word content of an E-mail? That way, a spam filter can react to the actual content of an E-mail, using pattern-matching to scan for spamlike words or phrases. If the program finds these telltale signs of spamishness, it could dump the suspect E-mail into the trash or into a spam folder."

It does sound simple, and many anti-spam programs have tried to do just that. But pattern matching has its own problems:

As a very simple example, consider the word "click." Something like 80% of spam has an exhortation to "click" somewhere in it, so a simple spam filter could claim an 80% success rate simply by declaring every E-mail containing that word to be spam. But think of how many totally innocent uses there are for the word "click." All those "false positive" E-mails would also be trapped by this kind of too-simple filtering.
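
To make the "click" problem concrete, here is a minimal sketch in Python of that kind of too-simple filter. The function name and both sample messages are made up for illustration; they are not taken from any real anti-spam product.

```python
def naive_click_filter(message: str) -> bool:
    """Flag a message as spam if the text 'click' appears anywhere in it."""
    return "click" in message.lower()


# Made-up sample messages, for illustration only.
spam = "CLICK HERE to claim your free prize!"
legit = "The mouse won't click properly; can you order a replacement?"

print(naive_click_filter(spam))   # True: the spam is caught
print(naive_click_filter(legit))  # True: an innocent E-mail is caught, too
```

The rule does catch the spam, but it has no way to tell the sales pitch from the complaint about a broken mouse.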

In fact, pattern-matching is notorious for its false positives. One reader told me he can't sign his E-mails with his preferred nickname, Dick, because crude filters will mistake his name for a naughty word. Writer Esther Schindler speaks of a friend in California's wine country who can't send E-mails about a favorite chardonnay because content filters ignore the name of the wine and see only the embedded six-letter string that starts with "h" within "chardonnay." Famously, the Essex County government in England has enormous trouble getting its E-mails delivered because of the word "sex" embedded within "Essex." There are many, many other examples, too. (See Silent Censorship.)
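
The Essex case comes down to matching a string anywhere inside a longer word. The sketch below, again in Python, compares a plain substring test with a whole-word test built on the standard `re` module. The one-word `BLOCKED` list and both function names are hypothetical, chosen only to mirror the example above.

```python
import re

BLOCKED = ["sex"]  # hypothetical one-word blocklist, for illustration only


def substring_filter(text: str) -> bool:
    """Crude approach: flag the text if a blocked string appears anywhere."""
    lowered = text.lower()
    return any(word in lowered for word in BLOCKED)


def whole_word_filter(text: str) -> bool:
    """Slightly better: flag only when a blocked string stands alone as a word."""
    return any(
        re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)
        for word in BLOCKED
    )


subject = "Essex County meeting agenda"
print(substring_filter(subject))   # True: "sex" is found inside "Essex"
print(whole_word_filter(subject))  # False: "Essex" is left alone
```

Whole-word matching is only a partial fix. It rescues "Essex" and "chardonnay," but a filter that blocks a naughty word outright will still trap a reader who signs with a nickname that happens to be that word.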

These pattern-matching errors may seem funny when related this way, but it's no joke when you find that some vital (and totally innocuous) E-mail has been eaten by a content-oriented filter. Once again, a too-crude tool often ends up doing more harm than good.

Better Alternatives
Fortunately, some very talented people are engaged in deep thinking about the problem of accurate spam detection. For example, Paul Graham has developed a method of statistical filtering using formal Bayesian mathematical analysis to predict how likely a given E-mail is--or, just as importantly, is not--to be spam.

A Bayesian analysis statistically evaluates all the words in an E-mail, not just a few trigger words. In a rough sense, you can think of a Bayesian analysis as weighing the evidence from every word together rather than reacting to any single word in isolation. This approach avoids many false positives. "Because it is measuring probabilities, the Bayesian approach considers all the evidence in the E-mail, both good and bad," Graham says. "Words that occur disproportionately rarely in spam ... contribute as much to decreasing the probability [of the mail being spam] as bad words ... do to increasing it. So an otherwise innocent E-mail that happens to include the word 'sex' is not going to get tagged as spam."
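
Here is a rough sketch in Python of how that weighing can work. It combines per-word probabilities with the naive Bayes formula, the product of the spam probabilities divided by that product plus the product of their complements. Every number in the table is invented for illustration, as is the 0.4 default for unfamiliar words; a real filter would learn its values from large collections of spam and legitimate mail. Treat this as a simplified illustration of the idea, not as Graham's actual tool.

```python
from math import prod

# Hypothetical per-word spam probabilities, invented for illustration.
# A real filter would learn these from collections of spam and good mail.
WORD_SPAM_PROB = {
    "sex": 0.95,
    "click": 0.80,
    "free": 0.90,
    "now": 0.70,
    "council": 0.02,
    "meeting": 0.05,
    "agenda": 0.03,
}
UNKNOWN_WORD_PROB = 0.4  # assumed default for words not in the table


def spam_probability(text: str) -> float:
    """Combine the evidence from every word into one spam probability."""
    words = text.lower().split()  # naive tokenizer: splits on whitespace only
    probs = [WORD_SPAM_PROB.get(word, UNKNOWN_WORD_PROB) for word in words]
    spam_evidence = prod(probs)                  # all words pointing at spam
    good_evidence = prod(1 - p for p in probs)   # all words pointing away
    return spam_evidence / (spam_evidence + good_evidence)


innocent = spam_probability("sex council meeting agenda")
pitch = spam_probability("click now free sex")
print(innocent < 0.5)  # True: the innocent words outweigh the one bad word
print(pitch > 0.5)     # True: every word points toward spam
```

Both messages contain the same "bad" word, but only the second is scored as spam, because the innocent words in the first pull the combined probability down.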

Graham lays out the details in his full--and excellent--article on Bayesian spam filtering, "A Plan for Spam," which also discusses a spam tool he's working on that employs this technique. His site also includes a thoughtful analysis of why spam is different from other forms of advertising and a comparison of Filters Vs. Blacklists that should be required reading for anyone who still thinks blacklists are useful tools.
