I have a little black book listing all of the projects I'm working on, on and off, and all the things I need to do for them. When, as now, I don't have much else I should be doing, I go through it and finish something off. The project I've been playing with for the past couple of days is SpamMonkey, my cut-down SpamAssassin (SA) clone for blog/mail/whatever spam detection.
I got the idea at YAPC Europe, while battling blog spam and listening to Stowe's talk about using DNS BLs to block web spam. I coded up a basic SpamAssassin replacement in a couple of days and it's been deflecting a bit of spam from this very site. (Although not all of it, and I've other measures in use too.)
One thing that stops it being useful for mail too is the lack of RBL support. I haven't added that because there isn't anything like SpamAssassin's Received header parsing, which you really need for this job. So I've been working on Email::Received.
Unfortunately I couldn't just pull out the relevant subroutine from SA because it's 900 lines of ugly code. So I've been trying to make it less ugly, by turning it into something data-driven rather than code-driven. To do this, I invented a little ad hoc language - basically AWK-with-a-vengeance - and wrote all the parsing rules in that. So this:
if (/^\(/) { return; }
if (/\sid\s+([^\s<>;]{3,})/) { $id = $1; }
if (/ by .*? with (ESMTPA|ESMTPSA|LMTPA|LMTPSA|ASMTP|HTTP)\;? /i) { $auth = $1; }
becomes the slightly less horrific:
/^\(/ IGNORE "gateway noise";
/\sid\s+([^\s<>;]{3,})/ SET id = $1;
/ by .*? with (ESMTPA|ESMTPSA|LMTPA|LMTPSA|ASMTP|HTTP)\;? /i SET auth = $1;
I now have the functionality of 800 lines of code expressed in 200 lines of data. The data is easier to edit and to verify, and the process also makes redundancy in the rules easier to spot; I found quite a bit of it. The next stage is probably to write a translator from this into Perl; although, thinking about it, the rules are now decoupled from the code, so it would be possible to use PCRE and generate a really fast Received line parser in C from this data. (I'm not going to do that.)
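A translator from rules into Perl might look something like the sketch below. It is not the Email::Received code. It assumes only the two actions shown above (IGNORE and SET), one rule per line, and a rule file you trust, since patterns are pasted into the generated source verbatim. The names translate, $rules_text and %f are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical rule file: just the three example rules from this post.
my $rules_text = <<'RULES';
/^\(/ IGNORE "gateway noise";
/\sid\s+([^\s<>;]{3,})/ SET id = $1;
/ by .*? with (ESMTPA|ESMTPSA|LMTPA|LMTPSA|ASMTP|HTTP)\;? /i SET auth = $1;
RULES

# Turn rule text into the source of an anonymous Perl sub.
# Only IGNORE and SET are understood; anything else is an error.
sub translate {
    my ($text) = @_;
    my @body;
    for my $line (grep { /\S/ } split /\n/, $text) {
        # The greedy .* runs to the last "/" that is followed by an action.
        if ($line =~ m{^(/.*/[a-z]*)\s+IGNORE\s+"(.*)";\s*\z}) {
            push @body, "    return if $1;  # $2";
        }
        elsif ($line =~ m{^(/.*/[a-z]*)\s+SET\s+(\w+)\s*=\s*\$(\d+);\s*\z}) {
            push @body, "    \$f{$2} = \$$3 if $1;";
        }
        else {
            die "Cannot translate rule: $line\n";
        }
    }
    return join "\n", 'sub {', '    local $_ = shift;', '    my %f;',
        @body, '    return \%f;', '}';
}

my $source = translate($rules_text);
print "$source\n";

# Compile the generated code (assumption: the rule file is trusted).
my $parse = eval $source or die "Generated code did not compile: $@";
```

The generated sub takes one Received line and returns a hash reference of the fields it found, or nothing if an IGNORE rule matched. Because the output is ordinary source text, the same loop could just as well emit code for another language.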
Little languages like this are a great way to turn code into data, and by developing them in an ad hoc way you tend to produce a language that best expresses the task at hand - and with a little bit of Perl it's not too difficult to turn them into something executable, either interpreted or translated.
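To give a feel for the interpreted route, here is a minimal sketch of an interpreter for rules in the style shown earlier. It is an illustration and not the Email::Received implementation. It assumes one rule per line, only the IGNORE and SET actions, only the /i flag, and that every rule is tried in order with IGNORE abandoning the line. The names parse_rules, parse_received and $rules_text are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical rule file: just the three example rules from this post.
my $rules_text = <<'RULES';
/^\(/ IGNORE "gateway noise";
/\sid\s+([^\s<>;]{3,})/ SET id = $1;
/ by .*? with (ESMTPA|ESMTPSA|LMTPA|LMTPSA|ASMTP|HTTP)\;? /i SET auth = $1;
RULES

# Compile the rule text into a list of { re, action, field, group } hashes.
sub parse_rules {
    my ($text) = @_;
    my @rules;
    for my $line (grep { /\S/ } split /\n/, $text) {
        my ($pat, $flags, $action, $arg) =
            $line =~ m{^/(.*)/([a-z]*)\s+(IGNORE|SET)\s+(.*);\s*\z}
            or die "Cannot parse rule: $line\n";
        my %rule = (
            re     => ($flags =~ /i/ ? qr/$pat/i : qr/$pat/),
            action => $action,
        );
        if ($action eq 'SET') {
            # "id = $1" becomes field "id", capture group 1.
            @rule{qw(field group)} = $arg =~ /^(\w+)\s*=\s*\$(\d+)\z/
                or die "Bad SET in rule: $line\n";
        }
        push @rules, \%rule;
    }
    return \@rules;
}

# Run every rule against one Received line and collect the fields.
sub parse_received {
    my ($rules, $header) = @_;
    my %fields;
    for my $rule (@$rules) {
        my @caps = $header =~ $rule->{re} or next;
        return if $rule->{action} eq 'IGNORE';
        $fields{ $rule->{field} } = $caps[ $rule->{group} - 1 ];
    }
    return \%fields;
}

my $rules = parse_rules($rules_text);
my $info  = parse_received($rules,
    'from a.example.org by mx.example.net with ESMTPSA id 1AbCdE-000123-Xy; Mon, 1 Jan 2024 00:00:00 +0000');
print "$_ = $info->{$_}\n" for sort keys %$info;
```

Since the rules are compiled to qr// objects once up front, adding a new rule means editing the data and nothing else.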