I wondered if you considered using MaxEnt instead of what I presume is Naive Bayes?
Although this would probably make training a lot more time-intensive, it could be done in the background (as most users' resource usage is minimal) and in batches, once a sufficient number of new e-mails have been classified.
Then features like number of links in an e-mail, average length of sentences, etc. could be incorporated, which would probably improve the result.
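To make the idea concrete, here is a rough sketch of how such features could feed a MaxEnt model. For the two-class junk/not-junk case, MaxEnt is the same thing as logistic regression, so that is what is trained here with plain stochastic gradient ascent. This is not TaQuilla or Thunderbird code: the function names, the simplified URL pattern, the scaling of sentence length by 10, and the toy e-mails are all assumptions made up for illustration.

```python
import math
import re

URL_PATTERN = re.compile(r"https?://\S+")  # simplified; real mail needs more care


def extract_features(text):
    """Return [bias, number of links, average sentence length in words / 10]."""
    links = len(URL_PATTERN.findall(text))
    # Remove URLs first so the dots inside them are not counted as sentence ends.
    stripped = URL_PATTERN.sub(" ", text)
    sentences = [s for s in re.split(r"[.!?]+", stripped) if s.strip()]
    words = sum(len(s.split()) for s in sentences)
    avg_len = words / len(sentences) if sentences else 0.0
    # Dividing by 10 is an arbitrary scaling to keep the feature ranges comparable.
    return [1.0, float(links), avg_len / 10.0]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_maxent(samples, labels, epochs=500, learning_rate=0.1):
    """Fit binary MaxEnt (logistic regression) weights by stochastic gradient ascent."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            # Gradient of the log-likelihood for one example is (y - p) * x.
            for i, xi in enumerate(x):
                weights[i] += learning_rate * (y - p) * xi
    return weights


def junk_probability(weights, text):
    """Estimated probability that the given e-mail text is junk."""
    x = extract_features(text)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))


# Toy training batch (made up): label 1 = junk, 0 = not junk.
training_mails = [
    ("Buy now http://a.example http://b.example Win big http://c.example", 1),
    ("Cheap pills! Click http://x.example or http://y.example", 1),
    ("Hi Tom, the meeting has been moved to Thursday afternoon because "
     "the room is booked. Let me know if that works for you.", 0),
    ("Thanks for sending the report yesterday. I read it on the train "
     "and have a few comments.", 0),
]
weights = train_maxent(
    [extract_features(text) for text, _ in training_mails],
    [label for _, label in training_mails],
)
```

In a real filter these numeric features would sit alongside the usual word features, and the training call would be the part that runs in the background on each new batch.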
Otherwise thumbs up for this. I almost never use betas, but this pretty much compels me to go for it.
"I wondered if you considered using MaxEnt instead of what I presume is Naive Bayes?"
TaQuilla uses the same code that the internal Thunderbird junk processor currently uses, which is Naive Bayes. At the moment I do not have any plans to change that or to allow other options.
"I almost never use betas, but this pretty much compels me to go for it."
I have not updated TaQuilla for a couple of betas, and at the moment it is a little out of date. If you are at all risk-averse, I do not recommend using it until I have done some updates on it.