Managing spam with “after classification” filters

Nightly builds after 2009-08-19 of Thunderbird (or the upcoming 3.0 beta 4) and SeaMonkey (or the upcoming 2.0 beta 2) include a new ability to apply message filters after the internal spam filter has classified the message. Previously, filtering was always done before spam classification, which meant that you could not use any results of the spam classification in a filter.

The default spam processing that is available without using filters (whitelisting, and moving or deleting messages that score above a threshold) should be sufficient for most users. But if you have special requirements, you can now implement them in a customized filter. Let me give some examples in this posting.

Using the “after classification” filters

Proper care and feeding of spam really needs messages to be classified in three ways. Some messages can be easily detected as spam, and should never be looked at. Others are clearly ham, and should be treated as real. But those in the middle need some handling, which may be either training, or perhaps a weekly review to make sure there are no false positives. Default Mozilla mailnews junk management (mailnews is the generic term for features that are available in any of the applications created from this codebase, including Thunderbird and SeaMonkey) doesn’t provide any capability to manage these uncertain emails. My JunQuilla extension provides an Uncertain folder which is focused on the training issue, but with the new filter features you can have more precise control of this. (Currently you can’t install JunQuilla in SeaMonkey, but I will fix that eventually.)

First, let’s see what is new and how it can be enabled.

Create a new filter by selecting Tools/Message Filters … then New. Open up the search attributes menu, and you’ll see something like this:

[Screenshot: NoJunkOptions]

No junk options! To get those, you’ll need to first select one of the “after classification” contexts from “Apply Filter When”. Then you’ll see something like this:

[Screenshot: WithJunkOptions]

If “Checking mail (after classification)” is disabled, then you probably are trying to set an after-classification filter on a POP3 account that is actually sending its email to another location (the so-called deferred-to server). You need instead to set the “after classification” filter on the “deferred-to” server, which is typically Local Folders.

Let me explain each of these search attributes.

Junk Percent is the score returned from the bayes filter when classifying the message, with 100 being the most likely to be junk, and 0 the least likely. The default setting in Thunderbird classifies a message as junk when this score is 90 or greater. It is sort of a probability, but not really, because the Naive Bayes classifier makes too many false assumptions for this to really be a probability. Just treat it like a score. Unfortunately, default installs of Thunderbird and SeaMonkey do not provide you with any way to see this value on typical messages. JunQuilla, though, provides a custom column that shows it on each message, so you can get a feel for what typical values are.

Junk Status is pretty simple: it either Is or Isn’t Junk. In the normal case where the internal bayes filter is used to classify the message, Is Junk means it had a junk percent of 90 or greater.

Junk Score Origin shows you who classified the message. Its values are:

Plugin: the bayes filter.

User: you manually classified this message as junk or good (not useful in an incoming filter, but maybe in a manual filter or search).

Filter: a previous filter action set the junk status.

Whitelist: the spam processing decided this message was good because it was from someone in your address book.

IMAP Flag: this message was classified by another system, so we know it is junk or good, but don’t know why the other system classified it this way. You might see this if you access mail from more than one computer.

Default Thunderbird does not support any way to see the junk score origin on individual messages, though JunQuilla provides a Junk Status + column which uses different icons for each junk score origin.

Classifying messages as uncertain

So let’s design a filter that will move messages to an Uncertain folder if we want to examine them, but not have them clutter the inbox. That’s pretty easy: we’ll just move messages with a junk percent in a certain range to that folder:

[Screenshot: UncertainJunk]
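As a rough illustration of the range test this filter performs, here is a minimal Python sketch. The bounds are assumptions: the exact range used in the filter is not stated in the text, so `UNCERTAIN_LOW` is a hypothetical value, and `UNCERTAIN_HIGH` simply reuses the default junk threshold of 90 mentioned earlier. The function and folder names are also illustrative, not anything from the mailnews code.

```python
# Hypothetical bounds for the "uncertain" band; the post does not give them.
UNCERTAIN_LOW = 30
UNCERTAIN_HIGH = 90  # the default junk threshold mentioned in the post


def target_folder(junk_percent: int,
                  low: int = UNCERTAIN_LOW,
                  high: int = UNCERTAIN_HIGH) -> str:
    """Return the folder a message would land in for a given junk percent."""
    if not 0 <= junk_percent <= 100:
        raise ValueError("junk percent must be between 0 and 100")
    if junk_percent >= high:
        return "Junk"       # left to the default junk processing
    if junk_percent > low:
        return "Uncertain"  # matched by the "after classification" filter
    return "Inbox"          # clearly good, stays where it is
```

With these assumed bounds, a message scoring 60 would go to Uncertain, while one scoring 95 would be left for the default junk handling.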

The order of message processing in Mozilla mailnews is:

  1. Run normal filters (on each message as it is received)
  2. Check whitelisting (on a message batch, this and subsequent steps)
  3. Run bayes classifier on non-whitelisted messages, and mark messages as junk or good.
  4. Apply “after classification” filters.
  5. Apply junk message moves using default junk processing.

So at least in theory, you can apply the “after classification” filter to the Uncertain messages, and still let the default junk processing move junk messages to a Junk folder. (Testing of this is welcomed!)
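The processing order above can be sketched as a small simulation. This is only a model of the five steps as described here, not the actual mailnews implementation: the `Message` class, `process_batch`, the `score` callback, and the rule that the default junk move only touches messages still in the Inbox are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class Message:
    sender: str
    junk_percent: Optional[int] = None  # stays None if the classifier never ran
    is_junk: bool = False
    folder: str = "Inbox"


Filter = Callable[[Message], None]


def process_batch(messages: Iterable[Message],
                  normal_filters: List[Filter],
                  after_filters: List[Filter],
                  whitelist: Set[str],
                  score: Callable[[Message], int],
                  threshold: int = 90) -> List[Message]:
    batch = list(messages)
    # 1. Normal filters run on each message as it is received.
    for msg in batch:
        for apply_filter in normal_filters:
            apply_filter(msg)
    # 2 and 3. Whitelisted messages skip the classifier and get no score;
    # everything else is scored and marked as junk or good.
    for msg in batch:
        if msg.sender not in whitelist:
            msg.junk_percent = score(msg)
            msg.is_junk = msg.junk_percent >= threshold
    # 4. "After classification" filters can now see the junk results.
    for msg in batch:
        for apply_filter in after_filters:
            apply_filter(msg)
    # 5. Default junk processing moves junk (assumed: only if not already moved).
    for msg in batch:
        if msg.is_junk and msg.folder == "Inbox":
            msg.folder = "Junk"
    return batch
```

In this model, an after-classification filter that moves messages scoring below the junk threshold never competes with step 5, which is the behavior hoped for above.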

Weak Whitelisting

As a more complex example, spammers are starting to send out emails with spoofed From addresses that match the domain of your own email address, figuring that there is a chance that you have other addresses in that domain whitelisted, so you’ll get the spam. To fight this, we’ll set up a filter that applies a whitelist that is slightly weaker than the usual all-or-nothing whitelist to those easily spoofed addresses. Because whitelisting occurs before spam processing, and no score will be assigned if the message is whitelisted, you will need to disable the default whitelisting functionality and rely entirely on message filters for this to work.

We’ll add the following search terms, all of which must match to apply our weak whitelist:

  1. From address appears in an address book (this is normal whitelisting)
  2. My domain appears in the address (because that is easily spoofed)
  3. Junk Score Origin is Plugin (this prevents the filter from running on messages that we classified, in case we run it manually on existing folders).
  4. Junk Status is Junk (we’ll only whitelist if the bayes filter thought it was junk. I only do this so that I can see that the filter decided to override the decision of the bayes processor; seeing that requires the Junk Status + column from JunQuilla.)
  5. Junk Percent is less than 95 (since the bayes filter only marks messages as junk when the percent is 90 or greater, this means that we will override the bayes decision for messages scoring from 90 up to, but not including, 95).

Putting this all together, you get a filter that looks like this:

[Screenshot: WeakWhitelist]
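For readers who think better in code, the five search terms can be written as one predicate. This is a sketch under stated assumptions, not the filter engine’s logic: the `Origin` enum values, the `Classified` record, and `weak_whitelist_matches` are hypothetical names, and “my domain appears in the address” is approximated here by checking the part after the `@`. What the filter does on a match is not modeled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set


class Origin(Enum):
    # Labels mirror the Junk Score Origin values listed earlier;
    # the string values are illustrative only.
    PLUGIN = "plugin"
    USER = "user"
    FILTER = "filter"
    WHITELIST = "whitelist"
    IMAP_FLAG = "imapflag"


@dataclass
class Classified:
    sender: str
    junk_percent: int
    is_junk: bool
    origin: Origin


def weak_whitelist_matches(msg: Classified,
                           address_book: Set[str],
                           my_domain: str,
                           max_percent: int = 95) -> bool:
    """True when all five search terms of the weak whitelist match."""
    sender = msg.sender.lower()
    return (
        sender in address_book                          # 1. normal whitelisting
        and sender.endswith("@" + my_domain.lower())    # 2. easily spoofed domain
        and msg.origin is Origin.PLUGIN                 # 3. classified by the bayes filter
        and msg.is_junk                                 # 4. bayes filter said junk
        and msg.junk_percent < max_percent              # 5. but not overwhelmingly so
    )
```

Assuming an address book holding lowercase addresses, a message from a listed address in your own domain with a score of 92 would match, while the same message with a score of 97 would not.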

You would also need to define a filter that is applied after this one, that whitelists any messages that meet the normal whitelist criteria, but were marked as junk by the bayes filter.

I’m not necessarily recommending this filter; it was meant as a demonstration. But I hope you can see that the new ability to use the bayes filter in combination with other message criteria in a filter provides lots of new possibilities for more precise handling of possible spam messages.