New Threshold Proposal

Proposed by Robert LeBlanc

A major overhaul of amavisd-maia involving tickets #55, #99, #163, #164, #191, and #200 will be underway in the trunk, which is now geared toward the 1.1 release, though not necessarily in the way those tickets propose it. This proposal would also replace the "4th level" patch from the stats branch with an equivalent but more flexible mechanism.

Score Ranges

Breaking the score spectrum down into sections, we have four useful ranges to consider:

High-Probability Ham (HPH): items that score so low on the spectrum that we're effectively certain they're ham, even without human confirmation. These are the items we rarely bother to look at when we skim our ham cache, so in effect we often allow them to be automatically confirmed as such.

Low-Probability Ham (LPH): items that score below the spam threshold, but not so low that we can confidently diagnose them as ham without human assistance. These items benefit from human inspection, as false negatives are likely within this range.

Low-Probability Spam (LPS): items that score above the spam threshold, but not so highly that we can confidently diagnose them as spam without human assistance. These items benefit from human inspection, as false positives are likely within this range.

High-Probability Spam (HPS): items that score so highly on the spectrum that we're effectively certain they're spam, even without human confirmation. These are the items we rarely bother to look at when we skim our spam quarantine, so in effect we often allow them to be automatically confirmed as such.

Thresholds

The three thresholds that define the four score ranges are:

[Tham] Ham action threshold: at or below this score, mail is deemed with high confidence to be ham, so we can take some sort of automated action based on that diagnosis. If auto-learning is enabled, [Tham] should correspond to bayes_auto_learn_threshold_nonspam.

[Tpivot] Spam discrimination threshold: at or above this score, mail is suspected to be spam; below this score, mail is suspected to be ham.

[Tspam] Spam action threshold: at or above this score, mail is deemed with high confidence to be spam, so we can take some sort of automated action based on that diagnosis. If auto-learning is enabled, [Tspam] should correspond to bayes_auto_learn_threshold_spam.

In terms of the score spectrum and its ranges, then:

HPH: Score <= [Tham]
LPH: [Tham] < Score < [Tpivot]
LPS: [Tpivot] <= Score < [Tspam]
HPS: Score >= [Tspam]

This effectively enforces the following relationship between the three thresholds:

[Tham] < [Tpivot] <= [Tspam]

Clearly, then, if [Tpivot] == [Tspam], the LPS range vanishes, and all spam is deemed to be HPS, which may be a useful side-effect in certain configurations.
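The range definitions and the ordering constraint above can be sketched in a few lines. This is only an illustration of the proposed logic, not code from amavisd-maia; the Thresholds class, the classify function, and the sample threshold values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container for the three proposed thresholds."""
    t_ham: float    # [Tham]
    t_pivot: float  # [Tpivot]
    t_spam: float   # [Tspam]

    def __post_init__(self):
        # Enforce [Tham] < [Tpivot] <= [Tspam].
        if not (self.t_ham < self.t_pivot <= self.t_spam):
            raise ValueError("expected Tham < Tpivot <= Tspam")


def classify(score: float, t: Thresholds) -> str:
    """Return the score range ('HPH', 'LPH', 'LPS' or 'HPS') for a score."""
    if score <= t.t_ham:
        return "HPH"
    if score < t.t_pivot:
        return "LPH"
    if score < t.t_spam:
        return "LPS"
    return "HPS"


# Illustrative values only; real thresholds are site-specific.
example = Thresholds(t_ham=0.0, t_pivot=5.0, t_spam=10.0)
collapsed = Thresholds(t_ham=0.0, t_pivot=5.0, t_spam=5.0)
```

With the collapsed thresholds, a score of 5.0 skips the LPS branch entirely and is classified as HPS, which is the side-effect described above.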

Support for these three thresholds would be added to the policy table as follows:

ALTER TABLE policy
    CHANGE spam_tag2_level spam_level FLOAT DEFAULT '999',
    CHANGE spam_kill_level spam_action_level FLOAT DEFAULT '999',
    ADD COLUMN ham_action_level FLOAT DEFAULT '-999';

This renames "spam_tag2_level" to "spam_level" and "spam_kill_level" to "spam_action_level", and adds a new "ham_action_level" column. The renaming, while not strictly necessary, serves to clarify the code considerably.

From the standpoint of the Maia GUI, the settings.php and domainsettings.php pages and their templates would have their spam filtering options modified to reflect the new thresholds:

It's almost always non-spam when score is <= [Tham]
It's probably spam when score is >=          [Tpivot]
It's almost always spam when score is >=     [Tspam]

These interface items would replace the current "Consider mail 'Spam' when Score is >=" and "Quarantine Spam when Score is >=" items.

Actions (formerly "Destinies")

Each of the four score ranges has a number of available actions that can be configured (e.g. with a drop-down selector) to take place on mail items that fall into that range:

Options for HPH:

  • [TH] Deliver and Learn from it (default)
  • [DH] Deliver and Discard it

Options for LPH:

  • [CH] Deliver and Cache it (default)
  • [DH] Deliver and Discard it

Options for LPS:

  • [CS] Deliver and Cache it
  • [QS] Quarantine it (default)
  • [DS] Discard it

Options for HPS:

  • [TS] Learn from it and Discard it (default)
  • [RS] Learn from it, Report it and Discard it
  • [DS] Discard it

In all cases except for the discard actions, the mail item is stored in the database. The differences between the other actions have to do with whether the mail is delivered or blocked, whether it appears in the quarantine/cache lists, and how it gets handled by the process-quarantine post-process.

[CH|CS] "Deliver and Cache it": for LPS this is equivalent to the current "labeling" option: deliver the LPS item to the recipient, so that a downstream process (e.g. the recipient's MUA) can do its own filtering based on either subject prefix or X-Spam headers. For LPH this is equivalent to the current behaviour when false negative management is enabled--items are delivered, but a copy remains stored in the database for reporting false negatives.

[QS] "Quarantine it": equivalent to the current "quarantining" option, blocking delivery of an LPS item until/unless it is released manually using the GUI or a link in a quarantine digest e-mail.

[DH] "Deliver and Discard it": HPH and LPH items get delivered, then the item is discarded. No mail is stored in the database, however, so false negative reporting and Bayes training on these items are impossible.

[DS] "Discard it": LPS and HPS items are blocked from delivery and the item is quietly discarded. No mail is stored in the database, however, so recovering, training and reporting are impossible.

[TH] "Deliver and Learn from it": HPH items are delivered and marked to be learned as ham by the Bayes database. The process-quarantine post-process does the actual training.

[TS] "Learn from it and Discard it": HPS items are blocked from delivery and marked to be learned as spam, but they do not appear in the spam quarantine list. The process-quarantine post-process does the actual training. No reporting is performed for these HPS items.

[RS] "Learn from it, Report it and Discard it": identical to the "Learn from it and Discard it" action ([TS]), except that reporting is performed in addition to the learning process. Previously I would never have considered adding this option, since the usage terms of reporting services like Razor expressly forbade reports generated by automata, but it seems things have changed over the past year. DCC, Pyzor, and SpamCop no longer have any stated position on the matter, and even Razor has softened its language from "must not" to "in general, should not". In practice, we certainly know that the probability of false positives drops off dramatically as the SpamAssassin score increases, ultimately reaching a vanishingly small number past a certain threshold. Provided [Tspam] is set conservatively, this option should be safe.

To add support for these four score ranges and their actions, we'd need to add four columns to the policy table:

ALTER TABLE policy
    ADD COLUMN hph_action CHAR(2) DEFAULT 'TH', -- 'TH', 'DH'
    ADD COLUMN lph_action CHAR(2) DEFAULT 'CH', -- 'CH', 'DH'
    ADD COLUMN lps_action CHAR(2) DEFAULT 'QS', -- 'CS', 'QS', 'DS'
    ADD COLUMN hps_action CHAR(2) DEFAULT 'TS'; -- 'TS', 'RS', 'DS'

The two-character codes that define each available action are described in detail above, but the complete list of action codes is:

  • CH: Cache as Ham
  • TH: Train as Ham
  • DH: Discard as Ham
  • CS: Cache as Spam
  • QS: Quarantine as Spam
  • TS: Train as Spam
  • RS: Report as Spam
  • DS: Discard as Spam
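To make the per-range options concrete, here is a rough sketch of how the allowed action codes and their defaults could be validated. The table contents come straight from the option lists above; the helper names (resolve_action, is_delivered, is_stored) are hypothetical and do not exist in Maia.

```python
# Allowed action codes per score range; the first entry is the proposed default.
ALLOWED_ACTIONS = {
    "HPH": ("TH", "DH"),
    "LPH": ("CH", "DH"),
    "LPS": ("QS", "CS", "DS"),
    "HPS": ("TS", "RS", "DS"),
}

# Actions that let the message through to the recipient.
DELIVERING_ACTIONS = {"TH", "CH", "CS", "DH"}

# Discard actions: nothing is stored in the database for these.
DISCARD_ACTIONS = {"DH", "DS"}


def resolve_action(score_range, configured=None):
    """Return the configured action, or the default when none is set."""
    allowed = ALLOWED_ACTIONS[score_range]
    if configured is None:
        return allowed[0]
    if configured not in allowed:
        raise ValueError(f"{configured!r} is not valid for {score_range}")
    return configured


def is_delivered(action):
    """True if the action delivers the item to the recipient."""
    return action in DELIVERING_ACTIONS


def is_stored(action):
    """True if the item is kept in the database for later processing."""
    return action not in DISCARD_ACTIONS
```

For example, resolve_action("LPS") falls back to "QS", while resolve_action("HPH", "QS") is rejected because quarantining is not offered for ham ranges.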

In terms of the Maia GUI elements, the actions for the four score ranges would be specified using a drop-down selector:

If it's almost certainly non-spam... [HPH Action selector]
If it's probably non-spam...         [LPH Action selector]
If it's probably spam...             [LPS Action selector]
If it's almost certainly spam...     [HPS Action selector]

These interface items replace the "Detected spam should be..." item. For consistency, it would be a good idea to change the radio buttons used for the other mail types (e.g. "Labeled", "Quarantined", and "Discarded") to drop-down selectors as well, with appropriate names (e.g. "Deliver and Cache it", "Quarantine it", and "Discard it").

Another small addition may be to give the superadmin the option on the System Configuration page to globally enable/disable the use of the auto-report option for HPS. That way a superadmin who wants to adhere to a strict policy of only reporting human-confirmed spam can do so.

An extra column in the maia_config table would take care of that:

ALTER TABLE maia_config
    ADD COLUMN enable_auto_reporting CHAR(1) DEFAULT 'N' NOT NULL;

A matching interface item for adminsystem.php and its template:

Allow users to auto-report high-scoring spam?  [ ] Yes  [x] No

Obviously when this setting is disabled, the "Learn from it, Report it and Discard it" (i.e. "RS") action option for HPS would not be available. The act of setting this to "No" should forcibly change any policy.hps_action instances of "RS" to "TS" (Train as Spam).
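The forced change from "RS" to "TS" amounts to a single UPDATE run alongside the configuration change. The sketch below exercises that idea against an in-memory SQLite database standing in for the real tables; the disable_auto_reporting helper and the cut-down table definitions are assumptions for illustration, and the actual GUI code would issue the equivalent statements through its own database layer.

```python
import sqlite3


def disable_auto_reporting(conn):
    """Turn the global switch off and demote every 'RS' policy to 'TS'.

    Returns the number of policy rows that were changed.
    """
    with conn:  # commit both statements together, or roll back on error
        conn.execute("UPDATE maia_config SET enable_auto_reporting = 'N'")
        cur = conn.execute(
            "UPDATE policy SET hps_action = 'TS' WHERE hps_action = 'RS'"
        )
    return cur.rowcount


# Minimal stand-ins for the real tables (only the relevant columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE maia_config "
    "(enable_auto_reporting CHAR(1) DEFAULT 'N' NOT NULL)"
)
conn.execute(
    "CREATE TABLE policy "
    "(id INTEGER PRIMARY KEY, hps_action CHAR(2) DEFAULT 'TS')"
)
conn.execute("INSERT INTO maia_config VALUES ('Y')")
conn.executemany(
    "INSERT INTO policy (hps_action) VALUES (?)",
    [("RS",), ("TS",), ("DS",), ("RS",)],
)

changed = disable_auto_reporting(conn)
remaining = [row[0] for row in
             conn.execute("SELECT hps_action FROM policy ORDER BY id")]
```

Policies already set to "TS" or "DS" are left untouched; only the "RS" rows are rewritten.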

Whitelists and Blacklists

Currently, items that are whitelisted or blacklisted are not scanned by SpamAssassin, and not stored in the database. Because of the latter, these items are not available for learning or reporting, which is unfortunate since they're often excellent sources of ham and spam. The temptation many users experience to whitelist their entire address book is strong, but doing so deprives the Bayes engine of the opportunity to see what legitimate mail looks like.

On the other hand, there are occasions on which a user may wish to whitelist a sender specifically to avoid corrupting the Bayes database with tokens from spammy-looking mail (e.g. dirty jokes, discussions about sexual abuse, solicited advertising mail-outs, etc.). Such items should certainly not be used for Bayes training.

In compromising between these two contradictory requirements, I propose that we store whitelisted and blacklisted items in the database (though without spam-checking them at all) and assign an appropriate action code to indicate how these items should be handled later by the process-quarantine post-process. Four action codes are required:

  • TW: Train as Whitelisted
  • DW: Discard as Whitelisted
  • RB: Report as Blacklisted
  • DB: Discard as Blacklisted

The discard actions (i.e. "DW" and "DB") indicate that the item should be treated the same way current whitelisted ("W") and blacklisted ("L") items are--they're not stored at all. DW items are delivered and discarded, while DB items are just discarded.

The two new options--"TW" and "RB"--indicate that the items may be stored and used for Bayes training and reporting, as either ham (TW) or spam (RB).

This action code does not belong in the policy table, however, but in the wblist table, as a per-sender code, since it should be possible to use mail from some senders for training but not others:

ALTER TABLE wblist ADD COLUMN action CHAR(2);

For the sake of data integrity, if wblist.wb is "W", the only valid values for wblist.action are "TW" and "DW"; when wblist.wb is "B", wblist.action must be either "RB" or "DB". For backward-compatibility, if action is NULL, it should be assumed to be the appropriate discard ("DW" if wb is "W", "DB" if wb is "B").
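These integrity rules, including the NULL fallback, could be expressed roughly as follows. The effective_action helper is hypothetical; it simply restates the constraints above, with None standing in for a NULL wblist.action.

```python
# Valid wblist.action values for each wblist.wb value.
VALID_WB_ACTIONS = {
    "W": ("TW", "DW"),
    "B": ("RB", "DB"),
}

# Backward-compatible fallback when wblist.action is NULL.
DEFAULT_WB_ACTION = {"W": "DW", "B": "DB"}


def effective_action(wb, action):
    """Return the action to apply for a wblist row.

    `wb` is wblist.wb ('W' or 'B'); `action` is wblist.action, where
    None represents NULL.
    """
    if wb not in VALID_WB_ACTIONS:
        raise ValueError(f"unexpected wblist.wb value {wb!r}")
    if action is None:
        return DEFAULT_WB_ACTION[wb]
    if action not in VALID_WB_ACTIONS[wb]:
        raise ValueError(f"{action!r} is not valid when wb is {wb!r}")
    return action
```

Rows created before the new column existed therefore keep behaving exactly as they do today.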

From the Maia GUI's standpoint, the wblist.php page and its template would need an additional column added per row, for a checkbox, e.g.:

Train?
 [x]

The wblist.action column can then be determined based on the combination of the checkbox boolean and the radio button setting, i.e.:

Whitelist and Training = TW
Whitelist and !Training = DW
Blacklist and Training = RB
Blacklist and !Training = DB

The superadmin should be given the option on the System Configuration page to globally enable/disable the use of whitelisted and blacklisted items for Bayes training. That way a superadmin who wants to maintain the current whitelist/blacklist behaviour can do so.

An extra column in the maia_config table would take care of that:

ALTER TABLE maia_config
    ADD COLUMN enable_wblist_training CHAR(1) DEFAULT 'N' NOT NULL;

A matching interface item for adminsystem.php and its template:

Allow white/blacklisted items to be used for training?  [ ] Yes  [x] No

Obviously when this setting is disabled, the "Train?" column should not be displayed on the wblist.php page. The act of setting this to "No" should forcibly change any wblist.action instances of "TW" to "DW" (Discard as Whitelisted), and any instances of "RB" to "DB" (Discard as Blacklisted).
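Putting the checkbox mapping and the global switch together, the value written to wblist.action might be derived along these lines. The wblist_action function and its parameter names are hypothetical; the sketch assumes that a disabled global setting always wins over the per-row checkbox.

```python
def wblist_action(wb, train_checked, training_enabled):
    """Derive wblist.action from the GUI state.

    wb               -- 'W' (whitelist) or 'B' (blacklist)
    train_checked    -- state of the per-row "Train?" checkbox
    training_enabled -- global enable_wblist_training setting
    """
    if wb not in ("W", "B"):
        raise ValueError(f"unexpected wblist.wb value {wb!r}")
    # With the global switch off, only the discard codes are allowed.
    train = train_checked and training_enabled
    if wb == "W":
        return "TW" if train else "DW"
    return "RB" if train else "DB"
```

With training_enabled set to False, every row resolves to "DW" or "DB", which matches the forced downgrade described above.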

New Mail Type Codes

Given the need to add more type codes, it starts to make sense to expand the type column of maia_mail_recipients to 2 characters to make the values more meaningful and less likely to promote confusion:

ALTER TABLE maia_mail_recipients MODIFY type CHAR(2) NOT NULL;

The following two-character types would then be used:

  • VM: Virus/Malware item (formerly V)
  • BF: Banned File Attachment (formerly F)
  • IH: Invalid Mail Header (formerly B)
  • XX: Unknown/In-process

For spam-related items, the type code maps directly to the action code:

  • CH: Cache as Ham (formerly H)
  • TH: Train as Ham (formerly G)
  • CS: Cache as Spam
  • QS: Quarantine as Spam (formerly S)
  • TS: Train as Spam
  • RS: Report as Spam (formerly C)
  • TW: Train as Whitelisted item
  • RB: Report as Blacklisted item

There's no need to record the discard types (i.e. "DH", "DS", "DW", and "DB"), since those action codes indicate that the item should not be stored at all, and hence no entry in maia_mail_recipients would be created.
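The "formerly" annotations above imply a one-to-one mapping from the old single-character types to the new codes, which suggests a simple migration for existing rows. The following is a sketch only: the mapping is taken from the lists above, while the helper names are hypothetical, and it assumes the type column has already been widened to two characters before the updates run.

```python
# Old single-character type -> proposed two-character type.
LEGACY_TYPE_MAP = {
    "V": "VM",  # Virus/Malware
    "F": "BF",  # Banned File Attachment
    "B": "IH",  # Invalid Mail Header
    "H": "CH",  # Cache as Ham
    "G": "TH",  # Train as Ham
    "S": "QS",  # Quarantine as Spam
    "C": "RS",  # Report as Spam
}


def migrate_type(old_type):
    """Translate a legacy type code, refusing anything unmapped."""
    try:
        return LEGACY_TYPE_MAP[old_type]
    except KeyError:
        raise ValueError(
            f"no proposed mapping for legacy type {old_type!r}"
        ) from None


def migration_statements():
    """Build one UPDATE per legacy code for maia_mail_recipients."""
    return [
        f"UPDATE maia_mail_recipients SET type = '{new}' "
        f"WHERE type = '{old}';"
        for old, new in LEGACY_TYPE_MAP.items()
    ]
```

Because every new code is two characters long and every old code is one, the updates cannot collide with each other regardless of the order in which they run.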

Human confirmation of a ham or spam item should set the type to "TH" (for confirmed ham) or "RS" (for confirmed spam), to indicate that these items are eligible for both learning and reporting (as determined ultimately by the process-quarantine options).

Last modified on Jun 22, 2008, 2:54:56 PM