Soft tags (TaQuilla) are starting to look exciting

By | January 1, 2009

Today, for the first time, I am able to run TaQuilla on my main email profile. This extension takes the same statistical Bayesian filters used for spam detection and uses them to apply tags to emails automatically. My initial trials are very encouraging.

I enabled soft tags on a “Personal” tag, to track emails that are associated with my personal life. I trained it on about 10 emails: a few from a current thread with my family, a message from my church, and a Netflix notice. I also picked a few bugmails and trained them as not personal. Then I applied the soft tags to another 20 emails, and it correctly decided whether to apply the Personal tag on every one of them! I have no idea what information it is using to do this, though I suspect it is just the names of the family members. But it is quite exciting that it works!
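To illustrate how a token-based filter could pick up on something like family names, here is a minimal sketch of a two-class naive Bayes scorer in Python. It is not TaQuilla's code and not the Thunderbird filter: the class name `TraitClassifier`, the whitespace tokenizer, and the add-one smoothing are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; a real filter tokenizes far more carefully.
    return text.lower().split()


class TraitClassifier:
    """Toy two-class naive Bayes scorer with "pro" and "anti" training sets (hypothetical)."""

    def __init__(self):
        self.counts = {"pro": Counter(), "anti": Counter()}
        self.docs = {"pro": 0, "anti": 0}

    def train(self, text, trait):
        # Record the tokens of one message under the given trait.
        self.counts[trait].update(tokenize(text))
        self.docs[trait] += 1

    def log_odds(self, text):
        # Positive values favor "pro", negative values favor "anti".
        vocab_size = len(set(self.counts["pro"]) | set(self.counts["anti"])) or 1
        total_pro = sum(self.counts["pro"].values())
        total_anti = sum(self.counts["anti"].values())
        # Smoothed prior from the number of trained messages on each side.
        score = math.log((self.docs["pro"] + 1) / (self.docs["anti"] + 1))
        for token in tokenize(text):
            # Add-one smoothing keeps unseen tokens from producing zero probabilities.
            p_pro = (self.counts["pro"][token] + 1) / (total_pro + vocab_size)
            p_anti = (self.counts["anti"][token] + 1) / (total_anti + vocab_size)
            score += math.log(p_pro / p_anti)
        return score

    def is_pro(self, text):
        return self.log_odds(text) > 0
```

With only a handful of training messages, a token such as a family member's name that appears on the pro side and never on the anti side is enough to tip the score, which fits my guess about what the filter is keying on.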

Unfortunately, I had to apply a post-TB beta1 patch for this to work, so TaQuilla will not work with beta1. My goal right now is to keep figuring out which hooks I need in the base code for TaQuilla to work well, and to try to get them all in before beta2.

One interesting observation is that the training of the Bayesian filter, which is a major pain to maintain for the case of junk mail, is much easier with soft tags. The reason is simple. For junk mail, the goal is to put junk messages into a “junk” folder, which ideally you never look at. For that reason, it is critical to have a very low chance of falsely classifying a message as junk. But for soft tags, you can accept a few false positives. So I can train messages that match the category “Personal” (which I call the pro trait) simply when you tag the message. I can then train messages that do not match the tag (which I call the anti trait) when you untag a message that was falsely tagged by the algorithm. This is very simple and unobtrusive. What a difference it makes to accept a few false positives!
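The tag and untag training described above can be sketched roughly as follows. This is only an illustration in Python, not the extension's actual hooks: `SoftTagTrainer` and its method names are hypothetical, and I am assuming that untagging a message the algorithm never tagged produces no training at all.

```python
class SoftTagTrainer:
    """Collects pro/anti training examples from ordinary tagging actions (hypothetical sketch)."""

    def __init__(self, tag):
        self.tag = tag
        self.auto_tagged = set()  # ids of messages the algorithm tagged on its own
        self.training = []        # (message_id, trait) pairs, in the order they were recorded

    def on_auto_tag(self, message_id):
        # The algorithm applied the soft tag; remember it so a later untag counts as a correction.
        self.auto_tagged.add(message_id)

    def on_user_tag(self, message_id):
        # Tagging a message by hand is treated as a pro-trait training example.
        self.training.append((message_id, "pro"))

    def on_user_untag(self, message_id):
        # Removing a tag the algorithm applied is treated as an anti-trait training example.
        # Assumption: untagging anything else is ignored.
        if message_id in self.auto_tagged:
            self.auto_tagged.discard(message_id)
            self.training.append((message_id, "anti"))
```

The point of this shape is that there is no separate training step for the user: both kinds of examples fall out of actions they would take anyway.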