Usenet News and Token Count
November 25, 2009
9:11 pm
Denmark
New Member
Forum Posts: 2
Member Since:
November 25, 2009
Offline

Hi, I use JunQuilla for email as well as about a dozen Usenet newsgroups. Would it be a good idea to increase the "Maximum Token Count" from 300,000 to a higher number now that I use it for news as well?

November 25, 2009
10:11 pm
Admin
Forum Posts: 323
Member Since:
July 12, 2008
Offline

There is very little experience with newsgroup management of junk mail in Thunderbird, so I would be happy to hear your experiences. In general though, the more tokens the better, so if you are not having performance problems (that is, speed issues), then you could increase that. But it takes roughly 10x more tokens to double accuracy, so there is clearly a point of diminishing returns here.

Currently, there is a core bug that prevents the tokenization of newsgroup messages from separating headers from the body, so it would be less accurate than email (plus it looks very different). But the newsgroup spammers are not as clever as those in email (they haven't had to be!), so overall my guess is it works OK. But I have little actual experience with it.

But to directly answer your question, yes it is probably a good idea, but don't expect dramatic changes.
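
To put a rough number on "don't expect dramatic changes", here is a minimal Python sketch of the 10x rule of thumb mentioned above. It assumes that "double accuracy" means the error rate halves for every tenfold increase in tokens, and that this holds smoothly between powers of ten; both are assumptions made for illustration, and `relative_error` is a hypothetical helper, not part of JunQuilla or Thunderbird.

```python
import math

def relative_error(old_limit: int, new_limit: int) -> float:
    """Error rate at new_limit relative to old_limit.

    Assumption: every 10x increase in tokens halves the error rate,
    i.e. error is proportional to tokens ** -log10(2).
    """
    exponent = math.log10(2)  # about 0.301
    return (old_limit / new_limit) ** exponent

# Raising the limit from 300,000 to 1,000,000 tokens:
print(round(relative_error(300_000, 1_000_000), 2))
```

Under these assumptions, going from 300,000 to 1,000,000 tokens would leave roughly 70% of the original errors once the larger database is actually filled, which is a noticeable but not dramatic gain.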

November 26, 2009
5:43 pm
Denmark
New Member
Forum Posts: 2
Member Since:
November 25, 2009
Offline

Thanks. I tried raising it to 1 million, and I see no slowdown on my system.

The problem with usenet is not only spam but also usenet-"trolls". I hope this filter can automatically sort it out eventually, but I am still unsure if it will be 100% effective against that as well.

I am subscribed to a group with lots of messages, much useful information but also lots of "trolls" and noise. I tried marking everything good/bad (about 500 messages). Already today I think it's pretty good at spotting the junk, but of course there are a few false positives/negatives.

November 26, 2009
9:41 pm
Admin
Forum Posts: 323
Member Since:
July 12, 2008
Offline

"Thanks. I tried raising it to 1 million, and I see no slowdown on my system."

You would not see it right away. As you train, the number of tokens in use goes up. Eventually you will hit the limit, then the algorithm to prune the database kicks in, which will reduce the number of tokens by about half all at once. So it might take months or years for you to increase the number of tokens up to the point that such a large limit starts to affect your system. (You can see the number of tokens currently in use with JunQuilla, by the way, under tools/options/security.)
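
The grow-then-prune behaviour described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the real pruning algorithm is not shown in this thread, so the sketch simply keeps the most frequently seen half of the tokens once the limit is exceeded. The `train` function and the token names are hypothetical.

```python
from collections import Counter

def train(db: Counter, tokens, max_tokens: int) -> bool:
    """Add one message's tokens to db and prune if the limit is exceeded.

    Assumption: pruning keeps the most frequently seen half of the
    distinct tokens. Returns True if a prune happened.
    """
    db.update(tokens)
    if len(db) <= max_tokens:
        return False
    keep = db.most_common(len(db) // 2)  # roughly half the tokens survive
    db.clear()
    db.update(dict(keep))
    return True

# Five "messages" of three new tokens each, with a tiny limit of 10.
db = Counter()
for m in range(5):
    message = [f"tok{i}" for i in range(m * 3, m * 3 + 3)]
    pruned = train(db, message, max_tokens=10)
    print(m, len(db), pruned)
```

Nothing happens until the fourth message pushes the count past the limit; then the database drops to about half its size in one step and starts growing again. With a limit of 1 million and real training volumes, that first prune could be a long way off.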

"The problem with usenet is not only spam but also usenet-"trolls"."

What you are describing is closely related to junk, but is a little different. You would get a better result if you used a Bayes filter trained specifically for that purpose. In TB3, the internal Bayes filter now supports multiple such characteristics (which I call "traits"), but a user interface still needs to be provided for it. My extension TaQuilla was a first attempt to do that, but that extension is now effectively obsolete. I hope to get back to it soon to see if perhaps we could support use cases such as the one you are describing.

Forum Timezone: UTC -8
