It does sound simple, and many anti-spam programs have tried to do just that. But pattern matching has its own problems.
In fact, pattern-matching is notorious for its false positives. One reader told me he can't sign his E-mails with his preferred nickname, Dick, because crude filters will mistake his name for a naughty word. Writer Esther Schindler speaks of a friend who lives in California's wine country who can't send E-mails about a favorite chardonnay because content filters ignore the name of the wine, and see only the embedded six-letter string that starts with "h" within "chardonnay." Famously, the Essex County government in England has enormous trouble getting its E-mails delivered because of the word "sex" embedded within "Essex." There are many, many other examples, too. (See Silent Censorship.)
These pattern-matching errors may seem funny when related this way, but it's no joke when you find that some vital (and totally innocuous) E-mail has been eaten by a content-oriented filter. Once again, a too-crude tool often ends up doing more harm than good.
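To see how easily this kind of false positive arises, here is a minimal sketch of a crude content filter. It is not taken from any real anti-spam product; the one-word blocklist and the function name `crude_filter` are made up for illustration. It flags a message whenever a blocked string appears anywhere in the text, even inside a longer, innocent word.

```python
# Hypothetical blocklist, for illustration only.
BLOCKED_SUBSTRINGS = ["sex"]


def crude_filter(message: str) -> bool:
    """Return True if any blocked string appears anywhere in the message."""
    text = message.lower()
    # A plain substring test: it cannot tell "sex" from "Essex".
    return any(bad in text for bad in BLOCKED_SUBSTRINGS)


# An innocent message is flagged because "Essex" contains "sex".
flagged = crude_filter("Minutes of the Essex County council meeting")
print(flagged)
```

Matching on whole words instead of substrings would rescue "Essex," but it would not help the reader who signs his E-mails "Dick." The filter still has no sense of context.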
Better Alternatives
A Bayesian analysis statistically evaluates all the words in an E-mail, not just a few trigger words. In a rough sense, you can think of a Bayesian analysis as looking at words in context rather than in isolation. This approach avoids many false positives. "Because it is measuring probabilities, the Bayesian approach considers all the evidence in the E-mail, both good and bad," Graham says. "Words that occur disproportionately rarely in spam ... contribute as much to decreasing the probability [of the mail being spam] as bad words ... do to increasing it. So an otherwise innocent E-mail that happens to include the word 'sex' is not going to get tagged as spam."
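The idea of weighing good evidence against bad can be sketched in a few lines. This is only a rough illustration of a naive Bayes-style combination, not Graham's actual tool. The per-word probabilities below are invented, the value used for unknown words is an assumption, and a real filter would learn its numbers from large collections of spam and legitimate mail.

```python
from math import prod

# Hypothetical per-word spam probabilities, invented for illustration.
# A real filter would derive these from a corpus of spam and good mail.
WORD_SPAM_PROB = {
    "sex": 0.95,
    "free": 0.90,
    "click": 0.92,
    "council": 0.03,
    "minutes": 0.10,
}
UNKNOWN_WORD_PROB = 0.4  # assumed value for words not in the table


def spam_probability(message: str) -> float:
    """Combine every word's probability into one score for the message."""
    words = message.lower().split()
    probs = [WORD_SPAM_PROB.get(w, UNKNOWN_WORD_PROB) for w in words]
    spam_evidence = prod(probs)                  # all words, treated as spam signs
    good_evidence = prod(1 - p for p in probs)   # all words, treated as innocent signs
    return spam_evidence / (spam_evidence + good_evidence)


# One "bad" word is outweighed by the innocent words around it.
innocent = spam_probability("sex education council minutes")
# A message made up only of spamlike words scores high.
suspect = spam_probability("click free sex")
print(round(innocent, 3), round(suspect, 3))
```

With these made-up numbers, the first message scores well below 0.5 even though it contains "sex," while the second scores close to 1. That is the behavior Graham describes: words that rarely appear in spam pull the score down just as hard as bad words push it up.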
Graham's full--and excellent--article on Bayesian spam filtering appears here, and it also discusses a spam tool he's working on that employs this technique. His site also includes a thoughtful analysis of why spam is different from other forms of advertising and a comparison of Filters Vs. Blacklists that should be required reading for anyone who still thinks blacklists are useful tools.
You might be thinking, "Why bother with blacklists? A spell-checker can examine every word in a document, so why not build a spam-checker that works the same way, looking at the actual word-by-word content of an E-mail? That way, a spam filter can react to the actual content of an E-mail, using pattern-matching to scan for spamlike words or phrases. If the program finds these telltale signs of spamishness, it could dump the suspect E-mail into the trash or into a spam folder."
Fortunately, some very talented people are engaged in deep thinking about the problem of accurate spam detection. For example, Paul Graham has developed a method of statistical filtering using formal Bayesian mathematical analysis to predict how likely a given E-mail is--or, just as importantly, is not--to be spam.
Langa Letter: Real-Life Spam Solutions