Expiring obsolete tokens from your Bayes database

Over time, your Bayes database will grow to include a large number of tokens (in the bayes_token table) and records of the messages already learned (in the bayes_seen table). Some of these tokens may only have been seen once, or a small handful of times, making them nearly useless at discriminating spam from non-spam; they're effectively just taking up space. As your database tables grow, you may periodically wish to do a bit of maintenance to purge the Bayes database of these less useful tokens.

First, as the amavis/maia user (so that sa-learn reads the right Bayes database), assess the contents of your Bayes database:

sa-learn --dump magic

SpamAssassin has an auto-expiry mechanism that does this sort of pruning for you at intervals. It uses a somewhat complicated set of rules to decide when to perform this expiry, but you can always force an expiry run manually with:

sa-learn --force-expire

Note that while this forces an expiry run to take place, it doesn't guarantee that any tokens will actually be deleted. The usual expiry logic still applies (quoted from the sa-learn man page):

* figure out how many tokens to keep. take the larger of either bayes_expiry_max_db_size * 75% or 100,000 tokens. therefore, the goal reduction is number of tokens - number of tokens to keep.

* if the reduction number is < 1000 tokens, abort (not worth the effort).

* if an expire has been done before, guesstimate the new atime delta based on the old atime delta. (new_atime_delta = old_atime_delta * old_reduction_count / goal)

* if no expire has been done before, or the last expire looks "weird", do an estimation pass. The definition of "weird" is (any of the following):

  • last expire over 30 days ago
  • last atime delta was < 12 hrs
  • last reduction count was < 1000 tokens
  • estimated new atime delta is < 12 hrs
  • the difference between the last reduction count and the goal reduction count is > 50%
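The decision logic above can be sketched in Python. This is a paraphrase of the man page, not SpamAssassin's actual code; the function name is mine, and only two of the five "weird" checks are shown:

```python
# Sketch of sa-learn's expiry decision logic, paraphrased from the man
# page quoted above -- illustrative only, not SpamAssassin's real code.

def expiry_plan(ntokens, max_db_size=150_000,
                last_atime_delta=None, last_reduction=None):
    """Return 'skip', 'estimate', or a guesstimated atime delta in seconds."""
    keep = max(int(max_db_size * 0.75), 100_000)   # tokens to keep
    goal = ntokens - keep                          # goal reduction
    if goal < 1000:
        return "skip"                              # not worth the effort
    if not last_atime_delta or not last_reduction:
        return "estimate"                          # no history: estimation pass
    new_delta = last_atime_delta * last_reduction // goal
    # A "weird"-looking history also forces an estimation pass; two of
    # the five man-page checks, as examples:
    if last_atime_delta < 12 * 3600 or new_delta < 12 * 3600:
        return "estimate"
    return new_delta
```

Plugging in the pre-expiry numbers from the dump shown further down (369,488 tokens, last atime delta 225,259 s, last reduction 7,911 tokens), the guesstimated new delta comes out well under 12 hours, so sa-learn would fall back to an estimation pass.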

In your local.cf file, you can control some of the expiry behaviours with the following settings:

bayes_auto_expire (default: 1)
bayes_expiry_max_db_size (default: 150,000 tokens)
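For example, a local.cf fragment that tightens the token cap might look like this (the values shown are illustrative, not recommendations):

```
# Example local.cf fragment -- illustrative values only
bayes_auto_expire        1         # leave automatic expiry enabled
bayes_expiry_max_db_size 100000    # cap the database at ~100k tokens
```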

You can reduce bayes_expiry_max_db_size, of course, but first try forcing a manual expiry run (sa-learn --force-expire) to see whether that turns out to be enough to do the job. For example, here's what I just generated on one of my test machines:

$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0     317098          0  non-token data: nspam
0.000          0      38892          0  non-token data: nham
0.000          0     369488          0  non-token data: ntokens
0.000          0 1143476216          0  non-token data: oldest atime
0.000          0 1143719728          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1143701416          0  non-token data: last expiry atime
0.000          0     225259          0  non-token data: last expire atime delta
0.000          0       7911          0  non-token data: last expire reduction count

$ sa-learn --force-expire
expired old bayes database entries in 54 seconds
263622 entries kept, 105954 deleted
token frequency: 1-occurrence tokens: 87.28%
token frequency: less than 8 occurrences: 9.37%

$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0     317106          0  non-token data: nspam
0.000          0      38892          0  non-token data: nham
0.000          0     263661          0  non-token data: ntokens
0.000          0 1143677330          0  non-token data: oldest atime
0.000          0 1143720210          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1143720129          0  non-token data: last expiry atime
0.000          0      43200          0  non-token data: last expire atime delta
0.000          0     105954          0  non-token data: last expire reduction count

As you can see from this example, 87% of the tokens in this Bayes database had only one occurrence, and a further 9% occurred only 2-7 times, so the vast majority of the tokens were pretty useless. Only about 3.5% of the tokens in the database were making any difference in terms of influencing the naive-Bayesian algorithm one way or the other. A token that appears only a handful of times doesn't provide enough evidence to sway SpamAssassin much beyond the 50% mark (the "I don't know" level), so those tokens are just taking up space. Consequently, those are the first tokens to be deleted when an expiry takes place.
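That "vast majority" figure is just arithmetic on the token-frequency lines sa-learn printed above:

```python
# Token-frequency percentages reported by sa-learn --force-expire above
one_occurrence = 87.28   # tokens seen exactly once
two_to_seven   = 9.37    # tokens seen 2-7 times

useful = 100 - one_occurrence - two_to_seven
print(f"{useful:.2f}% of tokens were seen 8 or more times")
```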

The man page for sa-learn has a lot more information about the token expiry process, if you're curious, or a compulsive tweaker ;)
