Viewing junk tokens

This week I implemented an interface for viewing the details of the Bayesian filter calculation (bug 451405). It will eventually be part of JunQuilla and TaQuilla. Although my main motivation was the interest users might have in understanding the categorization that TaQuilla will do, I’ve also found viewing junk tokens very interesting in its own right.

Here’s a typical result for a spam email, using a modified version of JunQuilla:

[Screenshot: Analysis of a junk email]

The columns may need a little explaining. “Token” is the string that was detected as a token. If it is shown as a word followed by a colon (for example “subject:melissa”), then it was found in the header field named to the left of the colon. “skip” is an exception: it means that a long token was found, with the length given by the number.

“Token %” is the percent likelihood that the particular token is found in a spam email. “Total %” is a running calculation of the likelihood that the entire message is junk, using only the token on that row and the stronger tokens above it. The entire list is sorted in order of decreasing token strength, so that you can see the importance of each token to the overall calculation.
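The running “Total %” column can be sketched roughly as follows. This is only an illustration under stated assumptions: token strength is taken to be the distance of the token probability from 50%, and the probabilities are combined with a plain naive-Bayes product, which is not necessarily the formula TB uses. The function names and the sample probabilities are made up for the example.

```python
from math import prod


def strength(p):
    """Assumed measure of token strength: distance from the neutral 0.5."""
    return abs(p - 0.5)


def combine(probs):
    """Naive-Bayes style combination (an assumption, not TB's formula)."""
    spam = prod(probs)
    good = prod(1.0 - p for p in probs)
    return spam / (spam + good)


def running_totals(token_probs):
    """Return (token, token %, total %) rows, strongest token first.

    Each row's total uses only that token and the stronger ones above it.
    """
    ordered = sorted(token_probs.items(),
                     key=lambda item: strength(item[1]),
                     reverse=True)
    rows = []
    seen = []
    for token, p in ordered:
        seen.append(p)
        rows.append((token, 100 * p, 100 * combine(seen)))
    return rows


# Hypothetical probabilities, not read from the screenshot.
example = {
    "skip:* 40": 0.02,
    "subject:melissa": 0.97,
    "barnes": 0.95,
    "hello": 0.55,
}
rows = running_totals(example)
for token, token_pct, total_pct in rows:
    print(f"{token:20s} {token_pct:6.1f} {total_pct:6.1f}")
```

Reading the printed rows from top to bottom shows how each additional, weaker token moves the overall estimate.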

Just in this one email, several interesting things come out. First, the strongest token was “skip:* 40”, which means that a long token of length 40 was found in the body. Instead of storing such long tokens, TB creates an artificial token that just records the length. Here it was a misleading indicator: the email is junk, yet that token was a strong measure of “goodness”.
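A minimal sketch of how such an artificial token might be produced is shown below. The length threshold is hypothetical, and the label format is only a guess modeled on the “skip:* 40” example (it assumes the character after the colon is the first character of the long token); TB’s real tokenizer may differ in both respects.

```python
MAX_TOKEN_LENGTH = 12  # hypothetical threshold, chosen only for illustration


def tokenize_body(text, max_len=MAX_TOKEN_LENGTH):
    """Split text on whitespace, replacing over-long words with a skip token."""
    tokens = []
    for word in text.lower().split():
        if len(word) > max_len:
            # Assumed format: first character of the word, then its length.
            tokens.append(f"skip:{word[0]} {len(word)}")
        else:
            tokens.append(word)
    return tokens


tokens = tokenize_body("Buy now " + "*" * 40)
print(tokens)
```

With this scheme, every 40-character run of asterisks maps to the same token regardless of which message it appears in, which is why it can pick up a strong “good” or “junk” score from training.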

The next three indicators include “Melissa”. For some strange reason, I get lots of spam addressed to someone named Melissa Barnes, who I think lives in Florida, but that email is delivered to my main caspia.com address. So you can see the effect of lots of training that “Melissa Barnes” is junk.

Looking down some more, there is a token for the “x-spam-status” header, which is added by an upstream SpamAssassin provider. The x-spam-status header often has a lot of interesting information in it, representing tests done by SpamAssassin in its attempts to categorize the email. But those separate tests are not listed as tokens that can be analyzed separately by TB’s Bayes filter. Currently, only the “subject” header is broken up into separate tokens; every other header is listed as a single complete token. I have a bug pending (bug 476389) which will allow me to break specific headers into separate tokens, with a custom set of delimiters, which should allow me to get at that valuable information.
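The difference between the two behaviours can be sketched like this. The function name, its parameters, and the sample header value are all made up for illustration, and this is not the code proposed in bug 476389; it only shows what “one complete token” versus “split on custom delimiters” would mean in practice.

```python
import re


def tokenize_header(name, value, delimiters=None):
    """Return tokens of the form 'header:token'.

    With no delimiters the whole value becomes a single token; otherwise
    the value is split on any run of the given delimiter characters.
    """
    name = name.lower()
    value = value.strip().lower()
    if delimiters is None:
        return [f"{name}:{value}"]
    pattern = "[" + re.escape(delimiters) + "]+"
    parts = [part for part in re.split(pattern, value) if part]
    return [f"{name}:{part}" for part in parts]


# Hypothetical header value, written for this example only.
status = "Yes, score=7.2 tests=BAYES_99,HTML_MESSAGE"

whole = tokenize_header("X-Spam-Status", status)
split = tokenize_header("X-Spam-Status", status, delimiters=" ,=")
print(whole)
print(split)
```

As a single token, the header only helps if exactly the same value shows up again. Split into parts, each test name becomes its own token that can be trained separately.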

I’ve also learned about some difficulties that the Bayes filter is having. The x-mozilla headers that TB adds to keep track of message information, such as unread status, are being added to the token list. This can have some unexpected negative results. If, for example, in the past you trained “good” messages that were read, but “junk” messages that were unread, then “unread” becomes a strong indicator of a message being spam. Yet all new emails are unread! This is not a good thing. I’m going to stop adding those headers to the token list in bug 472005.
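To see how quickly such a status token can become a strong spam indicator, here is a small worked example. It uses a simple frequency-ratio estimate of the token probability, which is an assumption for illustration and not TB’s exact formula, and the training counts are invented.

```python
def token_spam_probability(junk_with, junk_total, good_with, good_total):
    """Estimate P(spam | token) from training counts.

    Simple frequency-ratio estimate (an assumption, not TB's exact formula).
    Returns the neutral 0.5 when there is no usable evidence.
    """
    if junk_total == 0 or good_total == 0:
        return 0.5
    junk_rate = junk_with / junk_total
    good_rate = good_with / good_total
    if junk_rate + good_rate == 0:
        return 0.5
    return junk_rate / (junk_rate + good_rate)


# Invented training history: all 200 junk messages were trained while
# unread, but only 3 of 300 good messages were.
p_unread = token_spam_probability(200, 200, 3, 300)
print(f"unread token: {100 * p_unread:.1f}% spam")
```

Under these invented counts the “unread” token scores about 99%, so every newly arrived message starts out with a strong vote for junk before its content is even considered.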

Another issue is how the Bayes filter deals with tokens that are neutral, that is, tokens with probabilities near 50%. Although the filter does not use tokens with a probability between 40% and 60%, a long spam email often contains a lot of random words with probabilities in the 30s and 60s. These words, appearing randomly, seem to pull the calculation closer to 50%, which will not be detected as spam (by default TB uses 90% as the spam detection limit; I use 75%). I need to investigate this further, as you should not be able to defeat the spam filter by filling your message with a lot of common words.
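The dilution effect can be reproduced in a small experiment. The sketch below uses Gary Robinson’s geometric-mean combining as a stand-in, since it makes the effect easy to see; whether TB combines probabilities this way is an assumption, and the token probabilities are invented. The 40% to 60% neutral band is the one described above.

```python
from math import exp, log


def is_neutral(p, low=0.40, high=0.60):
    """True for tokens inside the band that the filter ignores."""
    return low < p < high


def robinson_score(probs):
    """Combine token probabilities with Robinson's geometric-mean method.

    Stand-in combining rule (an assumption). Probabilities must lie
    strictly between 0 and 1.
    """
    used = [p for p in probs if not is_neutral(p)]
    if not used:
        return 0.5
    n = len(used)
    spamminess = 1.0 - exp(sum(log(1.0 - p) for p in used) / n)
    goodness = 1.0 - exp(sum(log(p) for p in used) / n)
    s = (spamminess - goodness) / (spamminess + goodness)
    return (1.0 + s) / 2.0


strong = [0.99, 0.98, 0.97]    # invented strong spam tokens
filler = [0.35, 0.65] * 15     # weak words just outside the neutral band
neutral = [0.50] * 50          # words inside the band, which are ignored

print(f"strong only:      {robinson_score(strong):.3f}")
print(f"strong + neutral: {robinson_score(strong + neutral):.3f}")
print(f"strong + filler:  {robinson_score(strong + filler):.3f}")
```

In this sketch the truly neutral words change nothing, but thirty weak words just outside the band drag the score from well above the 90% limit to below 75%, which matches the behaviour described above.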
