8:45 pm
April 12, 2011
It would certainly rock if, on the Junk Analysis Detail window, I could edit the Token % and have that saved in the training.dat file. This would enhance Thunderbird's junk email filter enormously. Mine is just so radically dumb, and I see so many innocuous tokens rated 90%+, that it's not funny. I'd like to drag them down so it doesn't gobble up my mails as spam …
I wonder if it would be simple to add such a feature to JunQuilla. Presently the details dialog seems a little impotent: I can see why the spam detection is broken, but I can't teach it …
My experience at the moment with the Bayes algorithm in Thunderbird is that, once you add the capability to see which messages are uncertain and regularly train them to be junk or not junk, the algorithm actually works pretty well in spite of its difficulties with individual tokens. I sincerely doubt that you will be able to have a significant influence on the success rate by manually adjusting individual tokens.
For example, on April 12 I received 181 spam emails locally (probably another 100 or more had already been rejected by upstream SpamAssassin filters). My cutoff for automatic spam is 75%, which allows me to never have false positives. I had to mark 4 messages as junk manually. Of my 16 good messages (not counting mailing lists, which I filter directly), I marked 3 as good manually; all three were mass mailings of some sort that were not marked as junk, but just had junk percent scores > 10%. This is quite typical for me: an error rate of 4 / (181 + 16) = 2%. So with proper training, the Bayes filter works well. I now have over 10,000 trained messages accumulated over years, but that is what it takes. (I have also modified the header tokenization to analyze the individual SpamAssassin tokens, as described in http://mesquilla.com/2010/02/1…..massassin/ )
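To make the numbers above concrete, here is a minimal sketch in Python of how those thresholds and the error rate fit together. The names `triage`, `JUNK_CUTOFF` and `UNCERTAIN_FLOOR` are made up for illustration, and the choice of `>=` versus `>` at the boundaries is an assumption; this is not how Thunderbird or any add-on actually implements it.

```python
# Hypothetical thresholds taken from the figures quoted in the post.
JUNK_CUTOFF = 75.0      # at or above this score, treat as junk automatically
UNCERTAIN_FLOOR = 10.0  # above this score (but below the cutoff), review by hand


def triage(junk_percent: float) -> str:
    """Sort a message into junk / uncertain / good by its junk score."""
    if junk_percent >= JUNK_CUTOFF:
        return "junk"
    if junk_percent > UNCERTAIN_FLOOR:
        return "uncertain"
    return "good"


def error_rate(manual_corrections: int, total_messages: int) -> float:
    """Fraction of messages that had to be corrected by hand."""
    return manual_corrections / total_messages


# 4 junk messages marked manually out of 181 spam + 16 good messages.
rate = error_rate(4, 181 + 16)
print(f"{rate:.1%}")  # prints 2.0%
```

The "uncertain" band is the part that matters for training: those are the messages worth classifying by hand on a regular basis.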
Tokens with odd values generally fall into two camps. One camp is rare tokens: the filter does not have much data on them, so fixing them is not really important. Ironically, at the moment the token "tokens" in my system has a 95% spam probability, which is probably ridiculous but not important.
More significant are the tokens that appear in every email. If you are mostly training junk, for example, and your email provider suddenly changes the headers (say by updating the version of their SpamAssassin filter), then all of the emails containing that token (which is all of them) start accumulating a pro-spam result. This is only corrected when you train a few messages as good that also contain this new header. For that reason, it is important to always train good and junk emails at about the same rate.
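The drift described above can be illustrated with a small Python sketch. It uses a Robinson-style smoothed token probability, which is a common textbook formulation and only an assumption here; Thunderbird's actual formula and constants may differ. The function name, the `strength` and `prior` parameters, and all the counts are hypothetical.

```python
def token_spam_probability(junk_count: int, good_count: int,
                           junk_total: int, good_total: int,
                           strength: float = 1.0, prior: float = 0.5) -> float:
    """Smoothed estimate that a message containing this token is junk.

    junk_count / good_count: trained junk / good messages containing the token.
    junk_total / good_total: all trained junk / good messages.
    strength, prior: assumed smoothing constants (not Thunderbird's values).
    """
    junk_ratio = junk_count / junk_total if junk_total else 0.0
    good_ratio = good_count / good_total if good_total else 0.0
    if junk_ratio + good_ratio == 0:
        return prior  # token never seen in training: fall back to the prior
    raw = junk_ratio / (junk_ratio + good_ratio)
    seen = junk_count + good_count
    # Pull the raw estimate toward the prior when there is little data.
    return (strength * prior + seen * raw) / (strength + seen)


# A new header token shows up, and only junk gets trained for a while.
only_junk = token_spam_probability(20, 0, junk_total=100, good_total=100)

# The same token after good mail carrying it is trained at the same rate.
balanced = token_spam_probability(20, 20, junk_total=100, good_total=100)

# A rare token seen once in junk: little data, so it stays nearer the prior.
rare = token_spam_probability(1, 0, junk_total=100, good_total=100)

print(round(only_junk, 3), round(balanced, 3), round(rare, 3))
```

With these made-up counts the junk-only header token ends up scored as strongly pro-spam, and it returns to neutral once good messages containing it are trained at the same rate, which is the reason for training both kinds evenly.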
So in short, I don't think that adjustment of individual scores would be a worthwhile use of time. It would not be hard to program it, but I have no interest in doing so. Accurate, regular training is the key, and if I were to put more effort into this, I would probably investigate various methods of automatic training.
rkent