The script

When users "confirm" spam or ham in Maia Mailguard, the mail items are marked accordingly and await processing by this script.

With confirmed spam and ham distinguished in this way, you can use this script to post-process these items at regular intervals (using a scheduler like cron). At the very least, you can use it to train the site-wide Bayes engine with specific examples of known spam and known ham. Optionally, if you have Razor2, DCC, Pyzor, and/or SpamCop configured in SpamAssassin, you can also use this script to report spam to these networks. Note that you must configure Razor2/DCC/Pyzor specifically for reporting, as these services often require reporters to be registered with them. See the Razor2/DCC/Pyzor/SpamCop documentation for more details about reporting spam.

When the script is done training the Bayes engine and reporting the spam items, the spam and ham items are deleted, which helps to keep the database from getting filled up.

This script should be scheduled to run on an hourly basis, to make sure that learning and reporting take place frequently, since these are important steps in maintaining and improving the effectiveness of your spam filter.

As of Maia 1.0, the script is assisted by a worker script. The main script calls the worker script in a loop until there are no items left to be processed. To invoke the new script, you must supply some command-line arguments:

--ham-only   : only process confirmed ham (non-spam) items
--spam-only  : only process confirmed spam items
--learn      : train the Bayes database 
--report     : report spam to Razor2/DCC/Pyzor/SpamCop
--no-razor   : don't report to Razor
--no-pyzor   : don't report to Pyzor
--no-dcc     : don't report to DCC
--no-spamcop : don't report to SpamCop
--limit=n    : process items in batches of n at a time (1-127)
--max-size=n : items larger than n (bytes) will not be processed
--help       : display this help text
--debug      : display detailed debugging information
--quiet      : display only error messages

Learning and reporting are separate actions, but you can combine both, of course, e.g. --learn --report

The --ham-only and --spam-only options allow you to process just one type of item in a given run, but for most purposes you should only need to worry about the --learn and --report options. Note that if you don't specify either --learn or --report, the script will simply delete the items it processes, which can be useful on occasion if you just want to purge your database without training your Bayes database or reporting spam.

The following settings in the /etc/maia.conf file affect how the script behaves:

$pid_file tells the script where to look for--and write--a lock file containing its process ID. This lock file prevents multiple instances of the script from running, in the event that a processing run takes long enough to overlap the next scheduled run.

# Location to write the lock/PID file (must be writeable by your
# amavis user)
$pid_file = "/var/amavisd/";
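
To illustrate the locking idea described above, here is a minimal sketch in Python. It is not the script's actual code: the function names (pid_is_running, acquire_lock, release_lock) are hypothetical, it assumes a POSIX system, and the check-then-write sequence is not atomic, so treat it only as an outline of how a PID lock file can keep overlapping runs apart.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID appears to exist (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 tests for existence without sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def acquire_lock(pid_file: str) -> bool:
    """Claim the lock; return False if another live instance already holds it."""
    try:
        with open(pid_file) as fh:
            old_pid = int(fh.read().strip())
        if pid_is_running(old_pid):
            return False
    except (FileNotFoundError, ValueError):
        pass  # no lock file, or unreadable contents: treat the lock as stale
    with open(pid_file, "w") as fh:
        fh.write(str(os.getpid()))
    return True


def release_lock(pid_file: str) -> None:
    """Remove the lock file at the end of a run, ignoring a missing file."""
    try:
        os.remove(pid_file)
    except FileNotFoundError:
        pass
```

A scheduled run would call acquire_lock first and exit quietly if it returns False, which is the behaviour that stops a slow run from colliding with the next one.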

$default_limit sets the number of mail items that will be processed at a time. This can be a number from 1 to 127, and defaults to 5 unless you set it yourself or override it on the command line (with the --limit=n option). Using larger batch sizes is more efficient, but consumes more system resources (especially memory), so it's usually advisable to stick to batch sizes of 5 or 10.

# Maximum number of spam/non-spam items to process at a time (1-127).
$default_limit = 5;
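
The batching behaviour can be pictured with a short Python sketch. This is an assumption-laden illustration, not the script's implementation: process_in_batches, validate_limit, and the worker callback are hypothetical names, and an in-memory list stands in for the database of confirmed items.

```python
from typing import Callable, Iterable, List

MIN_LIMIT, MAX_LIMIT = 1, 127  # the range given for --limit / $default_limit


def validate_limit(limit: int) -> int:
    """Reject batch sizes outside the documented 1-127 range."""
    if not MIN_LIMIT <= limit <= MAX_LIMIT:
        raise ValueError(f"limit must be between {MIN_LIMIT} and {MAX_LIMIT}")
    return limit


def process_in_batches(
    items: Iterable, limit: int, worker: Callable[[List], None]
) -> int:
    """Hand items to `worker` in batches of at most `limit`; return batch count."""
    validate_limit(limit)
    pending = list(items)
    batches = 0
    while pending:  # keep looping until nothing is left to process
        batch, pending = pending[:limit], pending[limit:]
        worker(batch)
        batches += 1
    return batches
```

With 12 pending items and a limit of 5, the worker would be called three times, with batches of 5, 5, and 2 items.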

$key_file is the location of your Blowfish key file, if you're using Maia Mailguard's encryption feature. This is needed in order to decrypt the encrypted mail items before learning and/or reporting them. If you're not using the encryption feature, set this to undef.

# Location of your encryption key file, or undef to disable
#$key_file = "/var/amavisd/blowfish.key";
$key_file = undef;

$default_max_size prevents oversized items from consuming a huge amount of system resources. Items larger than this number of bytes will not be used for learning or reporting, since doing so would put a severe strain on SpamAssassin and slow your server to a crawl. Don't worry, though--it's in spammers' best interests to keep spam small so that they can send out as many items as possible in a short amount of time, and as a result almost all spam is smaller than about 256 kB. You can also set this with the --max-size=n command-line option.

# Items larger than this size (in bytes) will not be learned/reported.
$default_max_size = 256*1024;
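
As a quick check of the arithmetic, 256*1024 works out to 262144 bytes. The Python sketch below shows one way such a size filter could look; within_size_limit is a hypothetical helper, and treating an item of exactly the limit as acceptable is an assumption based on the wording "larger than".

```python
DEFAULT_MAX_SIZE = 256 * 1024  # 262144 bytes, matching the setting above


def within_size_limit(raw_message: bytes, max_size: int = DEFAULT_MAX_SIZE) -> bool:
    """Return True if the item is small enough to be learned or reported."""
    return len(raw_message) <= max_size
```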

$learning_options tells the script whether it should use the user-confirmed spam and non-spam to train the Bayes database. If you're using SpamAssassin's Bayes database, you almost certainly want this enabled. This can also be specified on the command line with the --learn option, if you don't specify it here.

# Train the Bayes database?
#    0 = no
#    1 = yes (same as --learn)
$learning_options = 1;

$report_options determines whether the script should try to send spam reports to Razor, Pyzor, DCC, and/or SpamCop. The SpamCop reporting tool is already built into SpamAssassin 3.x, but Razor, Pyzor, and DCC all require external applications to be installed first. While it is possible to use these services for lookups without reporting to them, if you can spare the bandwidth, sending spam reports is a good way to "give back" to the anti-spam community.

NOTE: A small bug in SpamAssassin versions up to 3.1.4 makes it impossible to report to some services and not others--it's an all-or-nothing situation at present, unless you apply the three small patches I've attached to Ticket #288. With those patches applied, reporting works as intended. This should be corrected in SpamAssassin 3.1.5.

# Reporting options (add values together as desired):
#    0 = none (don't report spam)
#    1 = report to Razor
#    2 = report to Pyzor
#    4 = report to DCC
#    8 = report to SpamCop
#$report_options = 0;
$report_options = 1 + 2 + 4 + 8;
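
Because each service has its own power-of-two value, the sum acts as a bitmask. The following Python sketch decodes such a value; it only illustrates the arithmetic, and the names REPORT_FLAGS and enabled_services are hypothetical rather than taken from the script.

```python
from typing import List

# Values taken from the comment block above; each one is a distinct bit.
REPORT_FLAGS = {1: "Razor", 2: "Pyzor", 4: "DCC", 8: "SpamCop"}


def enabled_services(report_options: int) -> List[str]:
    """Return the services whose bit is set in the combined value."""
    return [name for bit, name in REPORT_FLAGS.items() if report_options & bit]
```

For example, a value of 5 (1 + 4) would select Razor and DCC only, while 1 + 2 + 4 + 8 = 15 selects all four services.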

Last modified on Aug 17, 2006, 5:24:07 AM