Langa Letter: Real-Life Spam Solutions - InformationWeek
04:00 PM
Fred Langa

Langa Letter: Real-Life Spam Solutions

A new generation of anti-spam tools is just around the corner. But until then, these spam blockers and handlers may be the next best thing.

Pattern Matching
You might be thinking, "Why bother with blacklists? A spell-checker can examine every word in a document, so why not build a spam-checker that works the same way, looking at the actual word-by-word content of an E-mail? That way, a spam filter can react to the actual content of an E-mail, using pattern-matching to scan for spamlike words or phrases. If the program finds these telltale signs of spamishness, it could dump the suspect E-mail into the trash or into a spam folder."

It does sound simple, and many anti-spam programs have tried to do just that. But pattern matching has its own problems:

As a very simple example, consider the word "click." Something like 80% of spam has an exhortation to "click" somewhere in it, so a simple spam filter could claim an 80% success rate just by declaring every E-mail containing that word to be spam. But think of how many totally innocent uses there are for the word "click." All those legitimate E-mails would be trapped too, as "false positives," by this kind of too-simple filtering.

In fact, pattern-matching is notorious for its false positives. One reader told me he can't sign his E-mails with his preferred nickname, Dick, because crude filters mistake his name for a naughty word. Writer Esther Schindler speaks of a friend in California's wine country who can't send E-mails about a favorite chardonnay because content filters ignore the name of the wine and see only the embedded six-letter string that starts with "h" within "chardonnay." Famously, the Essex County government in England has had enormous trouble getting its E-mails delivered because of the word "sex" embedded within "Essex." There are many, many other examples, too. (See Silent Censorship.)

These pattern-matching errors may seem funny when related this way, but it's no joke when you find that some vital (and totally innocuous) E-mail has been eaten by a content-oriented filter. Once again, a too-crude tool often ends up doing more harm than good.
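
The false positives described above are easy to reproduce. The short Python sketch below is purely illustrative: the blocklist, the function names, and the sample messages are all hypothetical, and no real anti-spam product is claimed to work exactly this way. It compares a filter that looks for blocked strings anywhere in the text with one that matches whole words only.

```python
import re

# Hypothetical blocklist, chosen only to mirror the examples in the article.
BLOCKED = ["sex", "click"]

def substring_filter(text):
    """Crude filter: flag the message if a blocked string appears anywhere."""
    lowered = text.lower()
    return any(term in lowered for term in BLOCKED)

def whole_word_filter(text):
    """Less crude: flag the message only if a blocked term is a whole word."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return any(term in words for term in BLOCKED)

# Invented, perfectly innocent sample messages.
messages = [
    "Essex County budget meeting moved to Tuesday",
    "Click the attached agenda to review it before the meeting",
]

for msg in messages:
    print(substring_filter(msg), whole_word_filter(msg), msg)
```

Whole-word matching rescues the Essex message, but the second message is still flagged by both filters because "click" is a legitimate word used legitimately. Tightening the pattern does not remove the underlying problem: a single word, taken out of context, says little about whether an E-mail is spam.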

Better Alternatives
Fortunately, some very talented people are engaged in deep thinking about the problem of accurate spam detection. For example, Paul Graham has developed a method of statistical filtering using formal Bayesian mathematical analysis to predict how likely a given E-mail is--or, just as importantly, is not--to be spam.

A Bayesian analysis statistically evaluates all the words in an E-mail, not just a few trigger words. In a rough sense, you can think of a Bayesian analysis as looking at words in context rather than in isolation. This approach avoids many false positives. "Because it is measuring probabilities, the Bayesian approach considers all the evidence in the E-mail, both good and bad," Graham says. "Words that occur disproportionately rarely in spam ... contribute as much to decreasing the probability [of the mail being spam] as bad words ... do to increasing it. So an otherwise innocent E-mail that happens to include the word 'sex' is not going to get tagged as spam."
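
To make the idea concrete, here is a minimal Python sketch of statistical filtering in this spirit. It is a simplified illustration, not Graham's actual algorithm: the tiny training sets, the add-one smoothing, the assumption of equal prior odds, and every function name are assumptions made for this example. Each word gets a spam probability from how often it appears in known spam versus known good mail, and all of the words in a message are combined into one overall score.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def train(spam_msgs, ham_msgs):
    """Count how many messages of each kind contain each word."""
    spam_counts = Counter(w for m in spam_msgs for w in set(tokenize(m)))
    ham_counts = Counter(w for m in ham_msgs for w in set(tokenize(m)))
    return spam_counts, ham_counts, len(spam_msgs), len(ham_msgs)

def word_spam_prob(word, model):
    """Probability that a message containing this word is spam (equal priors)."""
    spam_counts, ham_counts, n_spam, n_ham = model
    # Add-one smoothing keeps every probability strictly between 0 and 1.
    p_word_given_spam = (spam_counts[word] + 1) / (n_spam + 2)
    p_word_given_ham = (ham_counts[word] + 1) / (n_ham + 2)
    return p_word_given_spam / (p_word_given_spam + p_word_given_ham)

def spam_probability(text, model):
    """Combine the evidence from every word, good and bad, via log-odds."""
    log_odds = 0.0
    for word in set(tokenize(text)):
        p = word_spam_prob(word, model)
        log_odds += math.log(p) - math.log(1 - p)
    return 1 / (1 + math.exp(-log_odds))

# Invented toy training data, far too small for real use.
spam = [
    "click here for free sex pills",
    "free pills click now",
    "hot sex click here now",
]
ham = [
    "essex county council meeting agenda",
    "sex education curriculum meeting for county schools",
    "please review the meeting agenda",
]
model = train(spam, ham)

innocent = "county meeting about sex education agenda"
junk = "click here for free pills"
print(spam_probability(innocent, model))
print(spam_probability(junk, model))
```

With this toy data, the innocent message scores well below 0.5 even though it contains the word "sex," because words such as "county," "meeting," and "agenda" pull the probability down. The junk message scores well above 0.5. A real filter would train on thousands of messages and would likely weight and select words more carefully.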

Graham's full--and excellent--article on Bayesian spam filtering, "A Plan for Spam," is available on his Web site, and it also discusses a spam tool he's working on that employs this technique. His site also includes a thoughtful analysis of why spam is different from other forms of advertising and a comparison of Filters Vs. Blacklists that should be required reading for anyone who still thinks blacklists are useful tools.
