JunQuilla

Download

JunQuilla adds capabilities to Mozilla Thunderbird that assist in training the junk mail classifier. JunQuilla is available here at Mozilla’s AMO website. JunQuilla requires Thunderbird 3.0 or SeaMonkey version 2.0. Questions, comments, or bug reports can be made to the JunQuilla forum on this site. The feature description below applies to version 1.0.0. Changes in this revision can be seen on the JunQuilla revisions page.

Features

JunQuilla adds the following features to the base Thunderbird product:

  • Junk percent column. The Bayes statistical classifier that determines if a message is junk or not calculates a numerical score, similar to a probability, to determine if a message is junk or good. JunQuilla displays this score, when available, for each message.
  • Junk status plus column. The standard junk status column in Thunderbird only displays two icons, one when the message is junk, another when the message is either good or unclassified. JunQuilla adds a junk status plus column, which shows these three states separately, as well as additional marks that show why the message was classified that way (for example, by the user, by the Bayesian classifier, or by the address book whitelist).
  • Uncertain search folders. For each inbox, JunQuilla creates a search subfolder that shows messages that the Bayesian filter had difficulty classifying. This lets users check only a small number of messages for misclassifications, and provides a convenient display of messages that should be trained.
  • User interface for critical junk parameters. Two hidden preferences with a large effect on junk are displayed for user adjustment in the addon option box. In addition, the default value of one of these parameters (the maximum number of bayes tokens) is increased to improve junk performance.
  • Display of junk analysis details. In the message context menu (and the main message menu), a new item, Junk Analysis Detail, pops up a screen showing a list of all tokens used to analyze the junk status of the selected email.
  • Folder property to disable junk processing: Junk processing on a folder and its subfolders may be selectively disabled.
  • Junk processing of RSS and News folders: Junk processing may be enabled on RSS and news folders.
  • Toolbar buttons for Is Junk and Is Good: The standard junk toolbar button allows you to toggle the junk status of a message. But that makes it difficult to mark a message as good, unless that message was previously marked as junk. Yet training of good messages is critical to the success of the junk filter, so we add training buttons that are not toggles, but act separately to train a message as junk, or as good.
  • Display of message and token counts in the training file: You can now see how many total tokens you have in the training file, as well as how many good and junk messages were used in that training.
  • Headers that are always unique are not tokenized: Certain headers (Message-ID and Date, for example) are virtually always either unique or unrelated to junk status, so those are no longer tokenized, both to save space and to improve accuracy.

Motivation

JunQuilla is designed to solve several practical problems that limit the effectiveness of the standard junk mail processing in the Mozilla mailnews applications (Thunderbird and SeaMonkey).

Users understandably are quite concerned with preventing the junk filter from falsely classifying an important email as junk. As a result, the threshold for detecting a message as junk is set very high by default, and ideally no good messages are ever classified as junk. Yet for the junk filter to work at all, users must train both junk and good messages so that it can tell the difference. There is no easy, standard way to view messages that are near-junk and should be confirmed by the user. JunQuilla addresses this by adding by default an “Uncertain” search folder, which represents requests for training, and includes both good and junk messages. When users train messages in this folder, they give the classifier a balanced diet of both good and junk training. That folder’s icon changes as a hint to the user that they need to train messages.

There is a critical parameter, the junk cutoff threshold, that must be set so that the program can properly mark messages as junk. But that parameter is hidden in the standard program. Also, the user never gets any feedback on the performance of their junk filtering that might be used as an aid to set that parameter. So JunQuilla provides a user interface to change the junk threshold, as well as displays the existing junk percentage calculations in a column so that users get that feedback. By seeing the calculation done by the junk filter, users can learn what is a reasonable, safe cutoff point for junk classification, and set it appropriately. (I use a 75% cutoff rather than the default 90% for example).

In addition to the issues above that are focused on improving the performance of junk filtering, JunQuilla also provides interfaces to various parameters and calculations that can help advanced users optimize their junk filter usage, or diagnose problems.

Usage

First, you must set up Thunderbird’s junk filtering. I’ll assume you have done that, trained some good and junk email, and are getting some effective results from that.

Uncertain Folders

The heart of JunQuilla is its Uncertain search folders, where Thunderbird displays emails whose junk status is uncertain and which are therefore good candidates to train. When you first install JunQuilla, it should create an Uncertain search folder under the Inbox folder for each email account that you have. You can also manually remove or add these Uncertain folders by going to Tools/Addons/JunQuilla Options. The Uncertain folders can be renamed and edited like any other search folder, but they cannot currently be moved without losing the special handling that changes their display icon when messages need training.

The idea behind these folders is that you, the user, will decide whether each of the emails in the Uncertain folders is junk or good, and train Thunderbird accordingly. After an email is trained, it will no longer appear in the Uncertain folder (though you may need to leave and reenter the folder to see the change, due to a current limitation of Thunderbird). The icon for the folder changes when it contains messages that need training, as an indicator that you need to act.

The Uncertain folder looks like this when there are no messages needing training:

An Uncertain folder with no training required.

When there are messages that need training, the icon changes and it appears like this:

When there are messages that need training, you should go to the Uncertain folder and train all messages that appear there. The easiest way to do this is with the keyboard shortcuts: J for junk, and Shift-J for good. You can also use the “Is Junk” and “Is Good” toolbar buttons. If you have a lot of messages to train, sort the folder by good/junk probability by clicking on the Junk Percent column header, then select many messages to train at once.

The Uncertain folders assume that all messages are arriving in the Inbox for each account. If you have filters that move messages to other folders, you may want to edit the properties of the Uncertain folder to add those extra folders to its search scope.
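As a rough illustration of the idea, the sketch below picks out the messages that would count as "uncertain" from their Junk Percent scores. The band limits `UNCERTAIN_LOW` and `UNCERTAIN_HIGH`, the function name, and the sample messages are all hypothetical values chosen for the example; they are not JunQuilla's actual search criteria.

```python
from typing import Optional

# Hypothetical band limits, for illustration only.
UNCERTAIN_LOW = 10
UNCERTAIN_HIGH = 90


def needs_training(score: Optional[int], trained_by_user: bool) -> bool:
    """Return True if a message would be worth showing for training."""
    # No Bayes score (for example, a whitelisted sender) or already trained:
    # nothing to ask the user about.
    if score is None or trained_by_user:
        return False
    return UNCERTAIN_LOW <= score <= UNCERTAIN_HIGH


# (name, junk percent, already trained by the user)
messages = [
    ("newsletter", 3, False),
    ("invoice", 42, False),
    ("pills", 99, False),
    ("friend", 55, True),
    ("whitelisted", None, False),
]
uncertain = [name for name, score, trained in messages
             if needs_training(score, trained)]
print(uncertain)
```

Only the untrained message with a middling score is selected; once it is trained it drops out of the list, which mirrors how a trained email leaves the Uncertain folder.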

Junk Columns

JunQuilla adds two new columns to the message pane: Junk Percent and Junk Status Plus.

Junk Percent shows the junk indicator from the Bayes filter, similar to a probability, with 0 meaning certainly good, and 100 meaning certainly junk. This column only shows the result from the Bayes statistical calculations, so if those were bypassed for some reason (for example, because the message sender was in your address book and was whitelisted) then nothing appears here. Also, if the status is changed after the Bayes calculations (for example, if you manually specify that the message is good or junk) then the value does not change.

Junk Status Plus combines a display of the determined junk status (good, junk, or unclassified) with markers for the source of the classification. The images used, and their meanings, are:

Good Bayes, Junk Bayes: Classification set by the Bayes statistical filter based on message content.

Good User, Junk User: Classification set by the user. This also means that the message has been used to train the Bayes filter.

Whitelisted: Message was whitelisted because the sender appeared in an address book.

Good IMAP flag, Junk IMAP flag: Classification was determined by a flag on an IMAP message. When Thunderbird classifies a message as junk or good, and the IMAP server supports custom flags (most do), then that status is stored on the IMAP server. If for some reason Thunderbird does not have a local copy of the junk history of the email, then it will revert to the value stored by IMAP. Reasons for this might be that the status was set by another copy of Thunderbird, or perhaps the local message database was corrupt and rebuilt.

Good filter, Junk filter: Classification was set by a message filter defined by the user.

The main point of these is to assist the user in diagnosing why emails were classified in a particular way.

Junk Analysis Detail

When you have selected a particular email in the message pane, you can use a new menu action, “Junk Analysis Detail”, from either the main message menu or the message context menu. This shows a screen with a listing of the tokens used in analyzing that email, along with the calculated percent probability that an email with that token is junk, and a running total of the overall probability calculation using all of the tokens. The screen looks like:

Sample junk analysis window
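To give a feel for what a per-token listing with a running total means, here is a minimal sketch that combines per-token junk percentages one token at a time. It uses a simplified naive-Bayes combination in log-odds form, which is an assumption made for illustration and not necessarily the formula Thunderbird's classifier uses; the function names and the sample tokens are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def running_junk_percent(token_percents):
    """Return rows of (token, token percent, running combined percent)."""
    log_odds = 0.0
    rows = []
    for token, percent in token_percents:
        # Clamp so that a 0% or 100% token cannot produce log(0).
        p = min(max(percent / 100.0, 0.01), 0.99)
        log_odds += math.log(p / (1.0 - p))
        rows.append((token, percent, round(sigmoid(log_odds) * 100)))
    return rows


rows = running_junk_percent([("viagra", 99), ("meeting", 10), ("free", 80)])
for token, percent, total in rows:
    print(f"{token:10s} {percent:3d} {total:3d}")
```

In this sketch a strongly "good" token such as the hypothetical `meeting` pulls the running total down, and later junk-leaning tokens pull it back up, which is the kind of pattern to look for when diagnosing a surprising classification.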

Junk Options

JunQuilla adds critical overall junk options to the standard junk options screen. In Thunderbird, select Tools/Options/Security/Junk. In SeaMonkey, select Edit/Preferences/Mail & Newsgroups/JunQuilla. There you will see a display like this:

JunQuilla Options

“Junk Threshold” is the percentage value as calculated by the bayes classifier for each message, above which a message will be classified as junk. This should be set as low as possible, though always high enough to avoid having any real messages classified as junk. The default value of 90 is much too high for a well-trained bayes classifier.
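One way to think about "as low as possible, but high enough" is sketched below: look at the Junk Percent scores of messages you know are good, and place the threshold a little above the highest of them. The helper names, the safety margin, and the sample scores are hypothetical and only illustrate the reasoning; JunQuilla itself leaves the choice to you.

```python
def is_junk(score: int, threshold: int = 90) -> bool:
    """A message is classified as junk when its score is above the threshold."""
    return score > threshold


def lowest_safe_threshold(good_scores, margin: int = 5, ceiling: int = 99) -> int:
    """Suggest a threshold just above the highest score seen on good mail."""
    if not good_scores:
        return ceiling  # no evidence yet, so stay conservative
    return min(max(good_scores) + margin, ceiling)


# Junk Percent values observed on messages known to be good (made-up data).
good_scores = [0, 2, 15, 40, 68]
suggested = lowest_safe_threshold(good_scores)
print(suggested)
print(is_junk(80, suggested), is_junk(80))
```

With these made-up scores, a message scoring 80 would be caught at the suggested threshold but missed at the default of 90, while none of the known-good messages would be marked as junk.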

The “Maximum token count” is a measure of the resources that the junk classifier will use. The higher it is set, the more accurate your classifications will be. The default value of 100,000 is probably too low for good classification performance. I’ve had good results with 300,000 – and JunQuilla will set your value to this when first installed. If this value is too high, and you have trained a lot of messages, then memory usage may be excessive.

The other parameters are read only, and are displays of values from your training file. The “Current token count” shows how many junk training tokens (which are like words) are currently in use. You probably won’t get good performance until this number is over 10,000, and it really should be more like 100,000. “Good” and “Junk” messages trained shows how many messages have been used to train the junk filter. Ideally the number of junk and good messages should be more or less equal. If they are not, then pick some previously untrained messages and train them.

When the number of tokens exceeds the maximum value, Mozilla mailnews will prune the training file in one large chunk, typically reducing both the number of trained messages and the number of tokens by about half.
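The sketch below shows one plausible way such a pruning step could work: once the token table grows past the maximum, every count is halved and tokens whose counts fall to zero are dropped. This strategy is an assumption made for illustration, and the function name and data layout are hypothetical; the actual mailnews pruning code may differ in its details.

```python
def prune(tokens, good_messages, junk_messages, max_tokens):
    """Halve all counts once the token table exceeds max_tokens.

    tokens maps each token to a (good count, junk count) pair.
    """
    if len(tokens) <= max_tokens:
        return tokens, good_messages, junk_messages
    pruned = {}
    for token, (good, junk) in tokens.items():
        good, junk = good // 2, junk // 2
        if good or junk:  # rarely seen tokens disappear entirely
            pruned[token] = (good, junk)
    return pruned, good_messages // 2, junk_messages // 2


tokens = {"free": (0, 8), "meeting": (6, 0), "xyzzy": (0, 1), "plugh": (1, 0)}
tokens, good, junk = prune(tokens, good_messages=10, junk_messages=20, max_tokens=3)
print(tokens, good, junk)
```

Under this assumed scheme, tokens that were only seen once vanish, which is why a prune removes a large share of the table in one step.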

Disable/Enable Junk Processing for a Folder

You can set an “inherited folder property” to allow you to selectively enable or disable junk processing for folders. This has two main uses.

  • If you have server-side filters that process email in IMAP, then you may already know that certain folders contain either junk mail or good mail, and don’t want to waste time processing them locally – or take the risk that they will be processed incorrectly.
  • Mozilla mailnews core code now supports junk processing of RSS and News folders. You can select certain RSS or News folders, and then junk processing will run on new posts to those folders. This will also enable the standard user interface features that allow you to train messages as good or junk in those folders.

To set this, right click on a folder in the folder tree, and select Properties, then the “General Information” tab. At the bottom, you will see this:

JunQuillaFolderProperties

“Analyze Junk” is an inherited folder property. What that means is that each folder can either get its value from its parent, or have it set locally. The default value depends on the characteristics of the folder itself. So for example, this would be disabled in News by default, but enabled in IMAP. To change the value, first clear the “Inherit” checkbox, then set the value that you want in “Enabled”. If you change a value for a folder, then the value will also change for the children of that folder (assuming that they have the default “Inherit” checked).
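The lookup behind an inherited property can be sketched as follows: a folder with a local value uses it, otherwise it asks its ancestors, and if nobody has set a value the default for the folder type applies. The `Folder` class and its field names are hypothetical and are not the Mozilla folder API; this is only a model of the behavior described above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Folder:
    name: str
    default_enabled: bool               # default for this kind of folder
    parent: Optional["Folder"] = None
    local_value: Optional[bool] = None  # None means "Inherit" is checked

    def analyze_junk(self) -> bool:
        """Resolve the effective "Analyze Junk" value for this folder."""
        folder: Optional[Folder] = self
        while folder is not None:
            if folder.local_value is not None:
                return folder.local_value
            folder = folder.parent
        return self.default_enabled     # nothing set anywhere up the tree


news = Folder("news.example", default_enabled=False)
group = Folder("some.group", default_enabled=False, parent=news)
print(group.analyze_junk())   # inherits the News default
news.local_value = True       # clear "Inherit" on the parent and enable it
print(group.analyze_junk())   # the child follows its parent
```

Setting `local_value` on a child afterwards would override the parent again, which corresponds to clearing "Inherit" on that child folder.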

Toolbar “Is Junk” and “Is Good” Buttons

You can add two new buttons to your toolbar – “Is Junk” and “Is Good”. Here’s what they look like, next to a standard “Junk/Not Junk” icon:

JunQuilla Is Good or Is Junk toolbar button

To add these buttons, right click on a toolbar, select “Customize”, then drag the buttons to the desired location.

The standard Junk button will always show as “Junk” when it thinks a message is good, and “Good” when it thinks that a message is junk. But that means that we can only classify a message as “Good” when it has been falsely classified as junk, and we never want our junk filter to do that. The “Is Good” button is meant to be used in the “Uncertain” folders to give you a means to train a message as “Good” there.