Fixing an ugly email situation

xrayspx's picture

I've been running IMAP services on my mailserver for many years, previously using Courier. I always had a pretty basic but solid-running system: Postfix does a SpamAssassin check and delivers to the user folders, and Courier serves IMAP.

I always had one client which I considered the "main" client, which held any filtering rules and did Bayesian spam filtering, since vanilla SpamAssassin does an "ok" job by default, but not a "great" one. Over time I wound up moving that master mail client to an always-on Linux laptop which just sat closed on my desk, running Thunderbird in an X11VNC session that I could reach by opening an SSH tunnel and VNCing in to edit rules or move spam around. Ugly, but I never had time to attack the problem at the core until a week ago, when I finally got sick of the ugliness of running a whole machine just for SSH access and mail sorting.

A couple of weekends ago I decided to Fix Every Problem. My main beef was that I need filters. The simplest way to attack that seems to be Sieve, and Courier seems not to support it. So I compiled the latest Dovecot and brought that up. To use Sieve, you must deliver through the Dovecot MDA rather than the default Postfix MDA, so that's an easy change. MailSieve was pretty straightforward after figuring out the Dovecot configs.

RoundCube has a nice built-in editor for Sieve rules, which is convenient for my other users.

My next thing was "Well, SpamAssassin is OK, but how do I teach it and train its internal Bayesian filter?" I found this script at CrazySquirrel and tweaked it to meet my needs. Basically you just create two folders, one to train for false negatives and one to train for false positives. Then run the script every day (or hour, whenever), and it will feed your training messages through sa-learn to train SA's Bayesian filtering. One important thing to note is that the filter won't do anything until you've fed it at least 200 messages each of spam and not-spam (ham). I fed it a corpus of about 28,000 known spam messages which had gotten through the default SpamAssassin filtering, most of which had been caught by my Thunderbird instance.
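To give a rough idea of what a training run like that involves, here is a minimal Python sketch. It is not the CrazySquirrel script, just an illustration under some assumptions: mail is stored in Maildir format, the two training folders are called `.TrainSpam` and `.TrainHam` (hypothetical names), and `sa-learn` is on the PATH. The `build_commands` and `train` helpers are made up for this example.

```python
import subprocess
from pathlib import Path

# Hypothetical folder names; use whatever your training folders are called.
SPAM_FOLDER = ".TrainSpam"  # spam that got through (false negatives)
HAM_FOLDER = ".TrainHam"    # good mail that was tagged (false positives)


def build_commands(maildir):
    """Return one sa-learn command per non-empty training directory."""
    maildir = Path(maildir)
    commands = []
    for flag, folder in (("--spam", SPAM_FOLDER), ("--ham", HAM_FOLDER)):
        # Assumes Maildir layout: messages live in the cur/ and new/ subdirs.
        for sub in ("cur", "new"):
            target = maildir / folder / sub
            if target.is_dir() and any(target.iterdir()):
                commands.append(["sa-learn", flag, str(target)])
    return commands


def train(maildir, runner=subprocess.run):
    """Run every sa-learn command for this Maildir; return how many ran."""
    commands = build_commands(maildir)
    for cmd in commands:
        runner(cmd, check=True)
    return len(commands)


if __name__ == "__main__":
    # Assumed location of the user's Maildir.
    train(Path.home() / "Maildir")
```

Run from cron at whatever interval suits you. This sketch leaves the messages where they are after learning, so clearing out or refiling the training folders is up to you.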

For the first week or so, I had the script run every hour and just moved mail into the training folders as I saw it, and watched as the filtering seemed to improve. One week in, I've reduced the frequency to daily and started looking at stats.

Today I received 231 spam messages which SpamAssassin correctly identified and 50 false negatives, and I've yet to receive a false positive, so that's about an 82% hit rate (231 of 281). This is at SpamAssassin's default threshold of 5.0. Looking at the false negatives, lowering this to 4 wouldn't result in very many more of those 50 being identified, and might result in legitimate mail getting tagged, so I'm inclined to leave it where it is for a bit.

It's nice to be finally stepping into the year 2000. I think the slightly worse (so far) spam filtering is more than made up for by the convenience of management, and by the fact that I can finally do all of it from my phone.

My guess is that it will take some time for SA to learn what took that Thunderbird instance like 4 years to learn. I moved that thing around to at least 4 machines on 2 OS platforms. Why did I bother? I think this whole thing took all of 4 hours to get working end to end.

Man I'm lazy.

Oh, and with this, I found that Apple's Mail.app uses the timestamp of the files on the IMAP server to determine "received date", so I have a bunch of mail from the distant past showing incorrect timestamps. I just need to write a quick script to scrape the correct timestamps out of the headers and touch the files back to the right time. Thunderbird does not do this. I wonder if it was a speed thing for Apple?
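A rough Python sketch of what that script might look like, assuming Maildir storage (one file per message). It takes the time from the topmost `Received` header (the part after the last semicolon) and falls back to `Date` if that can't be parsed. The function names `header_timestamp` and `fix_mtimes` are made up for this example, and I'd try it on a copy of a folder first.

```python
import os
from email.parser import BytesHeaderParser
from email.utils import parsedate_to_datetime
from pathlib import Path


def header_timestamp(path):
    """Return a Unix timestamp taken from the message headers, or None."""
    with open(path, "rb") as fh:
        headers = BytesHeaderParser().parse(fh)  # reads headers only
    candidates = []
    received = headers.get("Received")  # topmost Received header
    if received is not None and ";" in str(received):
        # The delivery time follows the last ';' in a Received header.
        candidates.append(str(received).rsplit(";", 1)[1])
    if headers.get("Date") is not None:
        candidates.append(str(headers["Date"]))
    for value in candidates:
        try:
            return parsedate_to_datetime(value.strip()).timestamp()
        except (TypeError, ValueError):
            continue  # unparseable date, try the next candidate
    return None


def fix_mtimes(folder, dry_run=False):
    """Set each message file's atime/mtime from its headers; return count."""
    changed = 0
    for sub in ("cur", "new"):
        directory = Path(folder) / sub
        if not directory.is_dir():
            continue
        for msg in directory.iterdir():
            if not msg.is_file():
                continue
            ts = header_timestamp(msg)
            if ts is None:
                continue  # leave files without a usable date alone
            if not dry_run:
                os.utime(msg, (ts, ts))
            changed += 1
    return changed
```

Calling `fix_mtimes("/path/to/Maildir/.SomeFolder", dry_run=True)` first reports how many files would be touched without changing anything. Whether the IMAP server or the client picks up the new times right away may depend on their caches.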