How can the Token % be modified manually
April 12, 2011
8:45 pm
Hobart
New Member
Forum Posts: 2
Member Since:
April 12, 2011
Offline

It would certainly rock if I could edit the Token % in the Junk Analysis Detail window and have that saved in the training.dat file. This would enhance Thunderbird's junk email filter enormously. Mine is just so radically dumb, and I see so many innocuous tokens rated 90%+, that it's not funny. I'd like to drag them down so it doesn't gobble up my mails as spam …

 

I wonder if it would be simple to add such a feature to JunQuilla. At present the details dialog seems a little impotent: I can see why the spam detection is broken, but I can't teach it …

April 13, 2011
9:19 am
Admin
Forum Posts: 323
Member Since:
July 12, 2008
Offline

My experience at the moment with the Bayes algorithm in Thunderbird is that, once you add the capability to see which messages are uncertain and regularly train them as junk or not junk, the algorithm actually works pretty well in spite of its difficulties with individual tokens. I sincerely doubt that you would be able to significantly influence the success rate by manually adjusting individual tokens.

For example, on April 12 I received 181 spam emails locally (probably another 100 or more had already been rejected by upstream SpamAssassin filters). My cutoff for automatic spam is 75%, which allows me to never have false positives. I had to mark 4 messages as junk manually. Of my 16 good messages (not counting mailing lists, which I filter directly), I marked 3 as good manually. All three were mass mailings of some sort that were not marked as junk but had junk percent scores > 10%. This is quite typical for me: an error rate of 4 / (181 + 16) ≈ 2%. So with proper training, the Bayes filter works well. I now have over 10,000 trained messages accumulated over years, but that is what it takes. (I have also modified the header tokenization to analyze the individual SpamAssassin tokens, as described in http://mesquilla.com/2010/02/1…..massassin/ )
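To make the numbers above concrete, here is a minimal Python sketch of the arithmetic and of the three-way split between junk, uncertain, and good. The function name `triage` and its parameters are hypothetical, and treating 10% as the lower edge of the "uncertain" band is an assumption drawn from the figures in this post, not a description of how Thunderbird or JunQuilla implements it.

```python
# Figures quoted in the post for April 12.
spam_received = 181
good_received = 16
manually_marked_junk = 4

total_messages = spam_received + good_received
error_rate = manually_marked_junk / total_messages
print(f"error rate: {error_rate:.1%}")


def triage(junk_percent, junk_cutoff=75.0, uncertain_floor=10.0):
    """Sort a message by its junk percent score.

    junk_cutoff is the 75% automatic-spam cutoff from the post;
    uncertain_floor (10%) is an assumed lower bound for scores
    worth reviewing and training by hand.
    """
    if junk_percent >= junk_cutoff:
        return "junk"
    if junk_percent > uncertain_floor:
        return "uncertain"
    return "good"


for score in (92.0, 40.0, 3.0):
    print(score, triage(score))
```

The messages that land in the "uncertain" band are the ones worth training regularly, which is the workflow described above.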

Tokens with odd values generally fall into two camps. One is rare tokens: the filter does not have much data for them, so fixing them is not really important. Ironically, at the moment the token "tokens" in my system has a 95% spam probability, which is probably ridiculous but not important.
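A rough illustration of why rare tokens end up with extreme values: with only a handful of observations, a raw ratio jumps straight to 0% or 100%. The sketch below compares that raw ratio with a smoothed estimate that pulls low-count tokens toward a neutral prior. The smoothing shown is a commonly used style of correction, included only as an assumption for illustration; it is not claimed to be the formula Thunderbird uses, and both function names are hypothetical. It also assumes roughly equal numbers of trained junk and good messages.

```python
def naive_spam_probability(spam_count, good_count):
    """Raw fraction of a token's occurrences that were in junk mail."""
    total = spam_count + good_count
    if total == 0:
        return 0.5  # never seen: no evidence either way
    return spam_count / total


def smoothed_spam_probability(spam_count, good_count, strength=1.0, prior=0.5):
    """Blend the raw ratio with a neutral prior.

    strength controls how many observations the prior is "worth";
    tokens with few observations stay close to the prior.
    """
    n = spam_count + good_count
    raw = naive_spam_probability(spam_count, good_count)
    return (strength * prior + n * raw) / (strength + n)


# A token seen only twice, both times in junk, versus one seen 200 times.
rare_raw = naive_spam_probability(2, 0)
rare_smoothed = smoothed_spam_probability(2, 0)
common_smoothed = smoothed_spam_probability(200, 0)

print(f"rare token, raw ratio:      {rare_raw:.3f}")
print(f"rare token, smoothed:       {rare_smoothed:.3f}")
print(f"common token, smoothed:     {common_smoothed:.3f}")
```

Either way, a token backed by two observations carries little weight in practice, which is why hand-correcting it is unlikely to change results much.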

More significant are the tokens that appear in every email. If you are mostly training junk, for example, and your email provider suddenly changes the headers (say by updating the version of their SpamAssassin filter), then all of the emails containing the new token (which is all of them) start accumulating a pro-spam result. This is only corrected when you train a few messages as good that also contain the new header. For that reason, it is important to always train good and junk emails at about the same rate.
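The drift described above can be sketched with a simple frequency-ratio estimate. This is a simplified model under stated assumptions, not Thunderbird's actual scoring code: the function name `token_spam_probability` is hypothetical, and the message counts are made up purely for illustration.

```python
def token_spam_probability(junk_with, junk_total, good_with, good_total):
    """Compare how often a token appears in trained junk vs. trained good mail.

    junk_with / good_with: trained messages of each kind containing the token.
    junk_total / good_total: all trained messages of each kind.
    """
    junk_freq = junk_with / junk_total if junk_total else 0.0
    good_freq = good_with / good_total if good_total else 0.0
    if junk_freq + good_freq == 0:
        return 0.5  # token never seen in training
    return junk_freq / (junk_freq + good_freq)


# Illustrative scenario: 1000 junk and 1000 good messages were trained
# before the provider introduced a new header token.

# After the change, only junk is trained (50 messages, all with the token).
junk_only = token_spam_probability(
    junk_with=50, junk_total=1050, good_with=0, good_total=1000
)

# Same period, but good mail is trained at the same rate as junk.
balanced = token_spam_probability(
    junk_with=50, junk_total=1050, good_with=50, good_total=1050
)

print(f"junk-only training: {junk_only:.2f}")
print(f"balanced training:  {balanced:.2f}")
```

In this model the new header looks like pure spam evidence when only junk is trained, and becomes neutral once good messages carrying the same header are trained at a matching rate.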

So in short, I don't think that adjustment of individual scores would be a worthwhile use of time. It would not be hard to program it, but I have no interest in doing so. Accurate, regular training is the key, and if I were to put more effort into this, I would probably investigate various methods of automatic training.

rkent
