Maia Mailguard

A Spam and Virus Management System

Version 1.0.2a


Caveats, Quirks, and Limitations

Maia Mailguard is a fairly complex collection of Perl and PHP code, any portion of which may contain bugs and inefficiencies. It also depends on a number of other packages (e.g. amavisd-new, SpamAssassin, Vipul's Razor, DCC, Pyzor, PEAR::DB, PEAR::Mail_Mime, JpGraph), so there are bound to be "behavioural quirks" and limitations as a result. Some of the more obvious ones are listed here, and you should judge for yourself whether Maia Mailguard is likely to meet your needs in spite of them.

Other developers can also treat this list as a set of opportunities to contribute to the Maia Mailguard project by providing solutions to some of these issues.

A dual-MTA arrangement or re-injection mechanism is required

Maia Mailguard requires that you (at least conceptually) have two SMTP servers in your mail system--one accepting inbound mail (your so-called MTA-RX), and one issuing outbound mail (your MTA-TX). In between these two servers sits amavisd-new, which does all of its scanning and filtering before it relays mail from MTA-RX to MTA-TX. The README.sendmail-dual file from the amavisd-new documentation illustrates this concept rather well, and it applies not just to sendmail but to other MTAs as well.

In practice, the same thing can be achieved using a single SMTP server, provided it has the ability to "re-inject" mail to itself on a different port, such that the mail sent to this second port does not get filtered. Postfix is one such mailer, and the README.postfix file from the amavisd-new documentation explains how to achieve this.

Maia Mailguard requires a dual-MTA (or re-injection) arrangement because of its "rescue" feature, which allows users to resubmit a quarantined e-mail for delivery. For this to work, Maia Mailguard must be able to submit the mail to an SMTP server in such a way that amavisd-new does not content-filter it again. Submitting the mail to MTA-TX, which is downstream from amavisd-new, ensures that this is always the case.
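As a rough illustration of the rescue path, the sketch below resubmits a quarantined message directly to the downstream SMTP listener. It is not Maia Mailguard's own code; the function name rescue, the host 127.0.0.1, and the port 10025 are assumptions chosen for the example, and your MTA-TX or re-injection listener may use different values.

```python
import smtplib

# Assumed address of MTA-TX (or the re-injection listener). These values
# are illustrative only and are not taken from Maia Mailguard's settings.
REINJECT_HOST = "127.0.0.1"
REINJECT_PORT = 10025


def rescue(raw_mail, sender, recipients, smtp_factory=smtplib.SMTP):
    """Resubmit a quarantined message downstream of the content filter.

    raw_mail is the complete original message as bytes. smtp_factory is
    injectable so the function can be tried without a live mail server.
    Returns the dict of refused recipients reported by sendmail().
    """
    with smtp_factory(REINJECT_HOST, REINJECT_PORT) as smtp:
        # The raw bytes are handed over unchanged, so the recipient sees
        # the mail exactly as it arrived before it was quarantined.
        return smtp.sendmail(sender, recipients, raw_mail)
```

Because the connection goes straight to the listener that sits after amavisd-new, the rescued mail is delivered without passing through the filter a second time.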

Performance implications for high-volume sites

Maia Mailguard offers a lot of bells and whistles, all of which require system resources, which may be in short supply on a really busy site. Some of this can be mitigated with careful configuration settings, while other performance improvements can be made through careful arrangement of the network itself.

Hosting amavisd-new and the database server on the same machine is one way to optimize performance, provided the machine is not so overloaded that the extra overhead of a database server actually slows things down. Your PHP-based web server can be located anywhere on the network, relative to the database server and the mail server, but clearly performance will be better when the network distance between them is short, particularly between the web server and the database server (the critical I/O path, as far as user responsiveness is concerned). Remember, a few extra seconds to deliver an e-mail is forgivable, but a few extra seconds to deliver a web page is intolerable!

Optimizing the database server's performance is vital in a high-volume environment, since the content filters and the web server both rely on the database. Choosing proper table types (e.g. InnoDB for MySQL) and tweaking other database-specific performance-related parameters can make an enormous difference in your mail system's throughput. In particular, you want to avoid table-locking scenarios, which can occur under heavy loads with certain databases. MySQL's InnoDB table type is a good choice for that reason alone--it locks individual rows rather than whole tables, which provides better read/write concurrency.

One simple way to boost performance is to use a ramdisk (e.g. ramfs) to serve as amavisd's workspace. If you have the memory to spare, this can speed up the process of unpacking, decoding, uncompressing, and scanning e-mail components, which would otherwise have been done on a slower, disk-based filesystem.

In general, high-volume sites are usually more I/O-bound than processor-bound. You can improve your system's performance by spreading your data across multiple drives with RAID-style disk-striping, which effectively increases the number of concurrent reads and writes you can perform on your data.

High-volume sites may also choose not to allow the JpGraph charts to be generated dynamically, since these incur a significant resource hit every time a user requests a new chart. By setting an auto-generation interval, however, you can have these charts generated in batches and displayed statically to users, which is a compromise between the dynamic chart generation and disabling charts altogether.
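The batch approach can be sketched as follows. This is only an illustration of the idea, not Maia's implementation: the function regenerate_stale_charts, the renderers mapping, and the file layout are all hypothetical. A scheduled job rewrites any chart file older than the interval, and the web pages simply serve whatever file is on disk.

```python
import os
import time


def regenerate_stale_charts(chart_dir, renderers, interval_seconds, now=None):
    """Rewrite every chart whose cached file is missing or too old.

    renderers maps a file name to a callable returning the image bytes
    (a stand-in for the real chart-drawing code). Returns the list of
    file names that were regenerated on this run.
    """
    now = time.time() if now is None else now
    regenerated = []
    for name, render in renderers.items():
        path = os.path.join(chart_dir, name)
        try:
            age = now - os.path.getmtime(path)
        except OSError:
            age = None  # no cached copy exists yet
        if age is None or age >= interval_seconds:
            with open(path, "wb") as fh:
                fh.write(render())
            regenerated.append(name)
    return regenerated
```

With this arrangement the cost of drawing a chart is paid at most once per interval, no matter how many users view it.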

In spite of all this, the nice thing about Maia's design is that it's a scalable solution. If you determine that you need to spread out the content-filtering load further to relieve an overloaded machine, you can always add another box in parallel, and load-balance the traffic across both boxes. With both boxes pointing to the same Maia database (on a third box), the change is transparent. Add as many content-filtering boxes as you need; it's all the same to Maia.

In an array scenario like the one above, the Bayes database is the only part that needs to be dealt with by special means. With SpamAssassin 2.63 and earlier versions, the Bayes database is filesystem-based, so if you want all of your content-filtering boxes to share a common Bayes database, you have two options. You can set up NFS, Samba, or some other network file system to let them all access and update the database. Alternatively, you can set up a master-slave arrangement, where all learning takes place on the master, and the Bayes database files are copied (using rsync) at regular intervals to read-only slaves. With SpamAssassin 3.0.0 and later, the Bayes database can be stored in an SQL database, which solves this problem much more cleanly.

Only one Bayes database for all users

SpamAssassin, when used on its own, creates a separate Bayes database for each user, so that the idiosyncrasies of a given user's e-mail can be used to learn patterns that are optimized for that particular user. When SpamAssassin is called by amavisd-new, however, it is not used in the same way. Only one Bayes database is maintained, and it learns from the collective e-mails of all the users on the system. Some people see this as a flaw, usually because they don't really understand how Bayesian classifiers work.

The notion that a per-user Bayes database is required to account for individual tastes is something of a myth. The mathematical Bayes models weren't designed with spam-filtering in mind, specifically, and therefore cannot rely on any assumptions about the type of tokens (text in this case) being analyzed. According to the theoretical model, if User A and User B both receive an e-mail entitled "Fwd: Available All. V1@Gra = :V:alium , :XANAX: _ V1co+din Pnt.e.rmin ( :Soma: xbngnhlkqjwn", they have independent probabilities of deciding that the mail is spam. In practice, we know that if we showed that subject line to 100 people, practically all of them would agree that it was spam. A site-wide Bayes database is therefore going to reflect that fact, which is exactly what we would want.

A site-wide Bayes database has another key advantage, too (besides taking up a much more practical amount of disk space)--it learns a lot faster than a bunch of individual user databases. If you've got 100 users with personal Bayes databases, not only is much of that data needlessly repeated, but each of those databases has only seen a relatively small sample of the mail. If you have a single, site-wide Bayes database, though, it has seen all of the mail for all of the users, and will reflect their collective tastes much faster, as they confirm their spam and ham.
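The difference in learning speed can be seen in a toy example. The sketch below is a deliberately simplified token counter, not SpamAssassin's actual Bayes implementation; the functions train and token_spamminess and the sample messages are invented for illustration.

```python
from collections import Counter


def train(messages):
    """Count how many spam and ham messages each token appears in."""
    spam, ham = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in messages:
        tokens = set(text.lower().split())
        if is_spam:
            spam.update(tokens)
            n_spam += 1
        else:
            ham.update(tokens)
            n_ham += 1
    return spam, ham, n_spam, n_ham


def token_spamminess(token, db):
    """Simplified estimate of how spammy a token is (None if never seen)."""
    spam, ham, n_spam, n_ham = db
    if spam[token] + ham[token] == 0:
        return None
    s = spam[token] / n_spam if n_spam else 0.0
    h = ham[token] / n_ham if n_ham else 0.0
    return s / (s + h)


# Each user has confirmed only a handful of messages.
alice = [("cheap valium now", True), ("lunch at noon", False)]
bob = [("xanax cheap offer", True), ("meeting notes attached", False)]

alice_db = train(alice)        # per-user database
site_db = train(alice + bob)   # site-wide database

print(token_spamminess("xanax", alice_db))  # None: Alice's database has never seen it
print(token_spamminess("xanax", site_db))   # 1.0: learned from Bob's confirmation
```

Alice's personal database knows nothing about a token that only Bob has reported, whereas the site-wide database already has an opinion about it the first time it turns up in Alice's mail.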

Mail Viewer doesn't always handle malformed e-mail well

The Mail Viewer allows users to take a look at the contents of a quarantined/cached mail item, either in decoded form or in its raw form. While the raw form always displays reliably, sometimes the decoded form looks rather ugly, often because the e-mail itself had a malformed MIME structure, or malformed HTML in one of its parts. This is generally not a serious problem, since the Mail Viewer is really only designed to help the user determine whether the mail is spam or something legitimate that needs to be rescued, but it's something to be aware of nevertheless.