Earlier this year, one of our online columnists, Fred Langa, wrote about an experiment he'd conducted to see how much E-mail is lost due to spam and spam filters. Read his original, thought-provoking column "Langa Letter: E-Mail--Hideously Unreliable," Jan. 12, 2004. There followed much discussion about whether the experiment itself was flawed. I'll leave it to others to argue that.
A perhaps more relevant thread deals with one category of anti-spam tool called a Bayesian filter. Such a filter picks out words and numbers from E-mail text and compares how often each one appears in good mail versus spam. From those ratios, the filter calculates the probability that a new E-mail is spam.
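For readers who want to see the idea in miniature, here is a rough Python sketch of that word-ratio calculation. It is an illustration only, not the code of any product mentioned in the thread: the class name `TinyBayesFilter`, the add-one smoothing, the equal weighting of spam and good mail, and the four training messages are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Return the set of lowercase words and numbers in a message."""
    return set(re.findall(r"[a-z0-9$']+", text.lower()))


class TinyBayesFilter:
    """Hypothetical toy filter; real products differ in many details."""

    def __init__(self):
        self.spam_words = Counter()   # messages containing each word, spam
        self.good_words = Counter()   # messages containing each word, good
        self.spam_count = 0
        self.good_count = 0

    def train(self, text, is_spam):
        words = tokenize(text)
        if is_spam:
            self.spam_words.update(words)
            self.spam_count += 1
        else:
            self.good_words.update(words)
            self.good_count += 1

    def word_spam_prob(self, word):
        # Add-one smoothing (an assumption) keeps unseen words near 0.5
        # instead of producing a 0 or 1 probability.
        p_in_spam = (self.spam_words[word] + 1) / (self.spam_count + 2)
        p_in_good = (self.good_words[word] + 1) / (self.good_count + 2)
        return p_in_spam / (p_in_spam + p_in_good)

    def spam_probability(self, text):
        # Combine per-word probabilities in log space, treating words
        # as independent (the "naive" Bayes assumption).
        log_spam = log_good = 0.0
        for word in tokenize(text):
            p = self.word_spam_prob(word)
            log_spam += math.log(p)
            log_good += math.log(1.0 - p)
        diff = max(-700.0, min(700.0, log_good - log_spam))  # avoid overflow
        return 1.0 / (1.0 + math.exp(diff))


# Made-up training mail, purely for illustration.
demo = TinyBayesFilter()
demo.train("cheap pills buy now", is_spam=True)
demo.train("buy cheap watches now", is_spam=True)
demo.train("meeting notes attached", is_spam=False)
demo.train("lunch tomorrow at noon", is_spam=False)

spammy = demo.spam_probability("buy cheap pills")
friendly = demo.spam_probability("lunch meeting notes")
```

With this toy training set, `spammy` comes out well above 0.5 and `friendly` well below it. A message made up entirely of words the filter has never seen lands at exactly 0.5, which is the filter's way of saying it has no opinion yet.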
In a thread titled "Bayesian the Best?," "James Becker" took issue with Langa's assertion that Bayesian filters are the best choice available. "They're only as good as the messages you give them. Pick the wrong messages [to analyze] and you create a false pattern. Suppose that on the morning of Feb. 1, I find 10 pieces of spam. All my good E-mail at that point had been delivered over the previous month."
Becker says a Bayesian filter might very well incorrectly deduce that E-mail with February somewhere in it is spam.
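Becker's scenario is easy to reproduce with a few lines of Python. The sketch below is a hypothetical reconstruction, not anything Becker posted: the function `word_spam_score`, the add-one smoothing, and the invented message texts are all assumptions chosen to mirror his example of 10 spam messages dated in February against a month of good mail dated in January.

```python
def word_spam_score(word, spam_msgs, good_msgs):
    """Estimate how strongly one word points to spam (0.5 means neutral).

    Uses add-one smoothing, an assumption made for this illustration.
    """
    spam_hits = sum(word in msg.lower().split() for msg in spam_msgs)
    good_hits = sum(word in msg.lower().split() for msg in good_msgs)
    p_in_spam = (spam_hits + 1) / (len(spam_msgs) + 2)
    p_in_good = (good_hits + 1) / (len(good_msgs) + 2)
    return p_in_spam / (p_in_spam + p_in_good)


# Invented sample mail: 10 spam messages that happen to mention February,
# and 30 good messages that all arrived in January.
spam_msgs = [f"February 1 special offer number {n}" for n in range(10)]
good_msgs = [f"January {day} status report" for day in range(1, 31)]

# The date word itself now looks like a strong spam signal.
biased_score = word_spam_score("february", spam_msgs, good_msgs)

# Once some good February mail is added to the sample, the score falls.
more_good = good_msgs + [f"February {day} status report" for day in range(2, 12)]
corrected_score = word_spam_score("february", spam_msgs, more_good)
```

In this made-up sample, `biased_score` is above 0.9 even though the month has nothing to do with whether a message is junk. Feeding the filter ten good February messages pulls `corrected_score` down, but not all the way back to neutral, which is a fair picture of how long a false pattern can linger.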
And "Bayesian filters [don't] do well with small samples," he says. "I have colleagues who complain when they get multiple pieces of spam within the same week. A Bayesian filter won't spot a strong or true pattern when the sample size is small." Actually, that sounds more like a human-patience problem to me.
So let's assume that you've given your filter the optimal variety and number of E-mail samples to analyze. You're done, right? No. There's the "evolutionary flaw" scenario.
"Spammers ... try to devise messages that don't quite look like previous messages because they know the filters are out there. Therefore, once you've given your Bayesian filter lots of examples, it has a strong idea of what spam looked like a few months ago. If recent spam content has evolved from old spam content, your Bayesian filter is behind the times. It might even have some unlearning to do before it starts to learn the new stuff."
Good advice from Becker, but he ends with the oldest cop-out in the book--that technology can't solve all of our problems. Filters just guess, so they'll always be inaccurate. "Unwelcomeness is an individual, subjective judgment on the part of the recipient." I was with you until you blasphemed, Becker. Your 15 minutes are up.
I was worried about "Cindy Harris" when she wrote, "The human brain is still the best pattern-recognition tool we know."
But she goes on to say Becker sells Bayesian filters short. "The whole point of such a filter is to feed it all of the stuff that regularly comes through your box. You feed it everything and correct every error. The more diligent you are, the more accurate your filter. At the beginning, the filter will tend to make wrong guesses, and you'll have to reclassify a lot of mail, but a Bayesian filter learns."
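The feed-it-and-correct-it loop Harris describes can be sketched in Python as follows. This is a toy model under stated assumptions, not how any particular mail client works: the `LearningFilter` class, the `process` helper, the whitespace tokenizing, the add-one smoothing, and the four sample messages are all hypothetical.

```python
import math
from collections import Counter


class LearningFilter:
    """Hypothetical toy filter that learns from every corrected message."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}  # keyed by is_spam
        self.totals = {True: 0, False: 0}

    def learn(self, text, is_spam):
        self.counts[is_spam].update(set(text.lower().split()))
        self.totals[is_spam] += 1

    def looks_like_spam(self, text):
        # Sum log-likelihood ratios per word; a positive total leans spam.
        score = 0.0
        for word in set(text.lower().split()):
            p_spam = (self.counts[True][word] + 1) / (self.totals[True] + 2)
            p_good = (self.counts[False][word] + 1) / (self.totals[False] + 2)
            score += math.log(p_spam / p_good)
        return score > 0


def process(mail_filter, text, user_says_spam):
    """Guess first, then learn from the user's verdict. Returns True if right."""
    guess = mail_filter.looks_like_spam(text)
    mail_filter.learn(text, user_says_spam)
    return guess == user_says_spam


# Invented inbox traffic: (message text, the user's verdict).
inbox = [
    ("cheap pills online", True),
    ("project status meeting", False),
    ("cheap watches online", True),
    ("meeting agenda attached", False),
]

mail_filter = LearningFilter()
first_pass_errors = sum(not process(mail_filter, t, s) for t, s in inbox)
second_pass_errors = sum(not process(mail_filter, t, s) for t, s in inbox)
```

On this tiny sample the untrained filter gets the very first spam message wrong, then makes no mistakes the second time the same mail comes through. Real inboxes are messier, but the shape of the curve is the one Harris describes: wrong guesses up front, fewer as the corrections pile up.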
Harris says that even evolutionary changes in spam can largely be addressed by Bayesian filters. "A new strategy will slip through the filter at first, but the more widespread it becomes, the more quickly a Bayesian filter will learn to recognize it and dump similar messages."
"Spam? What's that? Oh, right, that's the unwanted E-mail I used to get before I switched to a Mac running Panther and MacMail. Apple's filter is damn near perfect (I get well under 1% false positives or false negatives) after a couple of weeks. Click on 'junk' or 'not junk,' and its Bayesian filter is updated."
Wow, Mac folks are a plucky (if foul-mouthed) group, huh? Be plucky, but be civil in the Listening Post.