Auto-learning with SpamAssassin

SpamAssassin offers an "auto-learn" mechanism that trains your Bayes database automatically whenever a scanned message scores decisively enough to be classified without human confirmation. You can enable this feature and set conservative thresholds for it in your local.cf file:

bayes_auto_learn                         1
bayes_auto_learn_threshold_nonspam    -5.0
bayes_auto_learn_threshold_spam       15.0

The bayes_auto_learn_threshold_nonspam setting defines the cutoff level for SpamAssassin to auto-learn a non-spam item. As long as the item scores at or below this threshold, it will be learned automatically as non-spam. Since there aren't as many rules designed to identify non-spam as there are for identifying spam, this threshold usually doesn't need to be far below 0; a value of -5 is plenty in most cases.

The bayes_auto_learn_threshold_spam setting works the same way for spam, except that it applies to items that score at or above the threshold. A value of 15 or so is conservative enough in most cases, though you can also examine the system-wide statistics for your site to find the highest-scoring false positive and use that score as a starting point.
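As a rough illustration of the false-positive approach, the Python sketch below derives a starting value for the spam threshold from a list of scored, human-confirmed items. The (score, label) record layout, the function name, and the safety margin are all assumptions made for this example; it does not read Maia's actual statistics.

```python
def suggest_spam_threshold(confirmed_items, margin=1.0):
    """Suggest a starting value for bayes_auto_learn_threshold_spam.

    confirmed_items: iterable of (score, label) pairs, where label is the
    human-confirmed verdict, "spam" or "nonspam" (hypothetical layout).
    margin: hypothetical safety gap added above the worst false positive.
    Returns None when there are no confirmed non-spam items to learn from.
    """
    # The highest-scoring false positive is the confirmed non-spam item
    # with the largest score.
    nonspam_scores = [score for score, label in confirmed_items
                      if label == "nonspam"]
    if not nonspam_scores:
        return None
    return max(nonspam_scores) + margin


# Made-up sample data: one legitimate message scored as high as 16.3.
sample = [(22.4, "spam"), (7.9, "nonspam"), (16.3, "nonspam"), (-2.1, "nonspam")]
print(suggest_spam_threshold(sample))
```

With this sample the suggestion lands just above 16.3, which is higher than the usual value of 15, so the higher figure would be the safer starting point.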

Mail that scores anywhere between those two thresholds will not be learned automatically by the Bayes engine, so it will need to be confirmed as spam or non-spam by a human being using Maia's web interface if it is to be used for learning at all.
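The three possible outcomes can be modelled in a few lines of Python. This is only an illustrative model of the behaviour described on this page, not SpamAssassin's own code: the function name is made up, the defaults mirror the example local.cf values, and the treatment of a score that lands exactly on a threshold follows this page's wording and should be taken as an assumption.

```python
def auto_learn_decision(score, nonspam_threshold=-5.0, spam_threshold=15.0):
    """Return "nonspam", "spam", or None (left for human confirmation)."""
    if score <= nonspam_threshold:
        return "nonspam"   # decisively low: learned as non-spam
    if score >= spam_threshold:
        return "spam"      # decisively high: learned as spam
    return None            # in between: not auto-learned


for s in (-6.2, 0.4, 9.8, 15.0, 31.5):
    print(s, auto_learn_decision(s))
```

Only the first and the last two scores in the loop are auto-learned; the two in the middle would have to be confirmed through the web interface.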

Why you might want to use auto-learning

It can be difficult to get users to confirm their spam and non-spam regularly, in spite of an easy-to-use web interface and the convenience of quarantine digest e-mails. If users neglect their quarantines and caches long enough, their mail items are eventually expired by the expire-quarantine-cache.pl script. When that happens, those items are simply deleted and can no longer be used to train your Bayes database.

By enabling auto-learning, you can have SpamAssassin train the Bayes engine on the items that score beyond your thresholds at the time the mail is scanned. If your thresholds are conservative enough, you can be reasonably sure that the Bayes engine is learning the right things without human confirmation for those items. The advantage is that you get more value out of the items Maia processes, whether or not your users are diligent about confirming them. Unconfirmed items that score between the two thresholds are still thrown away once they expire, but training on some of your mail is better than training on none of it.

A similar principle applies if you've configured Maia to discard spam, or to not cache non-spam. In these cases, the items aren't stored in Maia's database, so user-confirmation is impossible. But because SpamAssassin does its auto-learning at the time the mail is scanned, you can still get some Bayes training done with many of these items.

Why you might not want to use auto-learning

Auto-learning takes place at the time SpamAssassin scans an e-mail, adding a number of milliseconds to each scan. To learn a message, SpamAssassin has to search the Bayes database for existing token matches, insert any new tokens found in the e-mail, and update token references. The resulting delay adds up, particularly if your database server is on a different host from the one running amavisd-maia. Rather than incurring a 50-200 ms delay every time SpamAssassin scans an e-mail, busy sites might prefer to disable auto-learning and leave all Bayes training to the process-quarantine.pl script.
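To get a feel for how that delay adds up, the short Python calculation below multiplies the 50-200 ms range by a daily message count. The volume of 100,000 messages is a made-up figure; substitute your own site's numbers.

```python
def added_scan_hours(messages_per_day, delay_ms):
    """Total extra scan time per day, in hours, for a per-message delay."""
    return messages_per_day * delay_ms / 1000 / 3600


volume = 100_000  # hypothetical daily message count
low = added_scan_hours(volume, 50)
high = added_scan_hours(volume, 200)
print(f"{low:.2f} to {high:.2f} hours of extra scan time per day")
```

The result is cumulative time summed over all scanner processes, not wall-clock delay, but it gives a sense of the extra load a busy site would be taking on.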

Other administrators may choose to disable auto-learning out of concern about feedback errors. Auto-learning feeds back on itself to some extent: SpamAssassin rules determine a message's score, that score decides whether the message is learned as spam or non-spam, and the result adjusts the Bayes database that will help score future items. A mistake at any step can therefore be amplified over time. This is usually only a problem when the auto-learning thresholds are not set conservatively enough, however.

Still others may choose to disable auto-learning as a matter of principle, believing that only human-confirmed items should be used for training the Bayes database. This is most practical at sites with a relatively small number of users, and/or users who are diligent about confirming their spam and non-spam.

To disable auto-learning, set the following in your local.cf file:

bayes_auto_learn    0

NOTE: If you have bayes_auto_learn_threshold_nonspam or bayes_auto_learn_threshold_spam defined in your local.cf file, SpamAssassin will enable auto-learning even if you have bayes_auto_learn set to 0. This is most likely a bug, but to be safe, remove (or comment out) both of those threshold definitions if you want to disable auto-learning.
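A quick way to catch this combination is to scan the configuration text for it. The Python sketch below is a simplified check written for this page, not a SpamAssassin tool: the function name is made up, it treats everything after a "#" as a comment, and it looks only at the text it is given rather than following any other configuration files.

```python
def auto_learn_conflicts(config_text):
    """Return threshold settings left active while bayes_auto_learn is 0."""
    settings = {}
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split(None, 1)          # setting name, then its value
        if len(parts) == 2:
            settings[parts[0]] = parts[1].strip()
    if settings.get("bayes_auto_learn") != "0":
        return []                            # auto-learning is not disabled
    return sorted(name for name in settings
                  if name.startswith("bayes_auto_learn_threshold_"))


example = """\
bayes_auto_learn 0
bayes_auto_learn_threshold_nonspam -5.0
# bayes_auto_learn_threshold_spam 15.0
"""
print(auto_learn_conflicts(example))
```

Here the spam threshold is already commented out, so only the non-spam threshold is reported as still needing removal.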



Last modified on Apr 25, 2006, 8:20:42 PM