Bad effects on the junk training corpus from a tokenization change

December 2, 2009

I’ve been tracking down some difficulties in my junk analysis recently, which were caused when I enabled some experimental changes to tokenization (I added full tokenization of the Received: and x-spam-status: headers). At the same time, I started some experiments in which I automatically train certain incoming emails as good.

What I am seeing is that the common, unchanging words in the Received: header, like “received:from” and “received:(exim”, persistently show up with a moderate “good” score, such as 36, even after I train junk messages that contain those headers. No single one of these tokens means much, but there are a lot of them per message, and together they drag the score of junk messages down into the Uncertain category.

I think what is happening is this, and it could be caused by any change in the environment that is common to all of your mail. I started adding new tokens such as “received:from” without starting training over from scratch. Because I also temporarily started training many more good messages than junk, these new tokens show up disproportionately as good.

Suppose, for example, I start with 1000 good messages and 1000 junk messages in my corpus, then suddenly add a new token to all incoming emails. Then I train 100 good messages with that new token, and 10 junk messages. The corpus will now claim that the new token appears in about 9% of good emails (100 of 1100), but only about 1% of junk emails (10 of 1010), so the presence of that token looks like a marker that the message is more likely to be good. That is not true, since now ALL emails have that new token!
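
To make the arithmetic concrete, here is a minimal sketch in Python. The function names and the simple ratio used for the score are assumptions for illustration only; the real junk filter's formula may differ, but any scheme based on per-class token frequencies would be skewed in the same direction.

```python
def token_frequencies(good_with_token, good_total, junk_with_token, junk_total):
    """Fraction of trained good and junk messages that contain the token."""
    return good_with_token / good_total, junk_with_token / junk_total


def junk_probability(good_freq, junk_freq):
    """Illustrative score only (an assumption, not the filter's real formula):
    0.5 is neutral, values below 0.5 lean toward good."""
    return junk_freq / (good_freq + junk_freq)


# Corpus before the change: 1000 good and 1000 junk, none with the new token.
# After the change: 100 good and 10 junk trained, all containing the token.
good_total = 1000 + 100
junk_total = 1000 + 10

good_freq, junk_freq = token_frequencies(100, good_total, 10, junk_total)
skewed_score = junk_probability(good_freq, junk_freq)

# Same corpus, but with equal numbers of good and junk trained after the change.
balanced_good, balanced_junk = token_frequencies(100, 1100, 100, 1100)
balanced_score = junk_probability(balanced_good, balanced_junk)

print(f"good: {good_freq:.3f}  junk: {junk_freq:.3f}  score: {skewed_score:.3f}")
print(f"balanced score: {balanced_score:.3f}")
```

With the unbalanced training the token scores far below neutral (strongly "good"), while equal training after the change leaves it at exactly 0.5, which is what a token present in every message should get.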

I suppose that one defense against this would be to make sure that the proportion of good and junk emails trained always stays about the same. That is not easy to do, however.
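
One way such a defense might look, as a rough sketch only: a gate that refuses automatic training for whichever class is getting too far ahead of the other. The class name `BalancedTrainingGate`, the `max_ratio` parameter, and its value of 1.5 are all hypothetical and are not part of any existing filter.

```python
class BalancedTrainingGate:
    """Hypothetical helper: allow training only while the two classes stay
    within max_ratio of each other (counted since the gate was created)."""

    def __init__(self, max_ratio=1.5):
        self.max_ratio = max_ratio
        self.counts = {"good": 0, "junk": 0}

    def should_train(self, label):
        other = "junk" if label == "good" else "good"
        # Compare the count this class would reach against the other class.
        # The +1 on the other side keeps the gate from blocking everything
        # before any message has been trained.
        return self.counts[label] + 1 <= self.max_ratio * (self.counts[other] + 1)

    def record(self, label):
        self.counts[label] += 1


gate = BalancedTrainingGate(max_ratio=1.5)

# A stream dominated by good messages, as in my auto-training experiment.
stream = ["good"] * 10 + ["junk"] * 2 + ["good"] * 10
for label in stream:
    if gate.should_train(label):
        gate.record(label)

print(gate.counts)
```

The cost is obvious from the example: most of the good messages are simply never trained, which is part of why keeping the proportions steady is not easy in practice.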