A spam filter is a diagnostic test, and its performance can be measured in much the same way the medical profession measures how well a given test predicts a medical condition. A physician, for example, might check a patient against a list of symptoms in order to determine whether the patient has a particular disease, and it becomes important to know how "predictive" that symptom list really is. If you have a runny nose, how likely is it that you have a cold? Or conversely, if you have a cold, how likely are you to have a runny nose?

It's no different with spam, really. In our case we use rules ("symptoms") as our tests, and what we're trying to determine ("diagnose") is whether the e-mail ("patient") is spam ("diseased") or non-spam ("healthy"). SpamAssassin applies a battery of 800+ tests for this purpose, and does a very good job overall at identifying spam, but as with any diagnostic process, we want to know how well it's doing.

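Fortunately the scientific community has some well-established statistical measures for evaluating diagnostic processes like ours, so there's no need for us to reinvent the wheel. In the formulas below, "spam" and "ham" stand for the numbers of correctly classified spam and non-spam items (the true positives and true negatives), while FP and FN stand for the numbers of false positives and false negatives.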

A False Positive in our case is a non-spam item that was mistakenly classified by SpamAssassin as spam. SpamAssassin's tests told us the item was spam, when in fact it was non-spam. With a spam filter, this is generally considered to be the worst kind of failure, since it can result in legitimate mail being quarantined, delaying its delivery. [False Positive Rate = FP / (ham + spam + FP + FN)]

A False Negative is the reverse--a spam item that was classified mistakenly as non-spam, and allowed to slip through the filter. SpamAssassin's tests suggested the item was non-spam, but it turned out to be spam. The practical impact of a false negative error is more spam in your mailbox. [False Negative Rate = FN / (ham + spam + FP + FN)]
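
As a rough illustration of the two error rates defined above, the following Python sketch computes them from four counts. The function name and the sample counts are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from Maia or SpamAssassin. Note that, as defined here, both rates are fractions of all mail processed.

```python
def error_rates(spam, ham, fp, fn):
    """Return (false_positive_rate, false_negative_rate) as fractions of all mail.

    spam and ham are the counts of correctly classified spam and non-spam;
    fp and fn are the counts of false positives and false negatives.
    """
    total = spam + ham + fp + fn
    if total == 0:
        raise ValueError("no mail items to evaluate")
    return fp / total, fn / total


# Hypothetical counts for illustration only.
fp_rate, fn_rate = error_rates(spam=9000, ham=900, fp=10, fn=90)
print(f"False Positive Rate: {fp_rate:.2%}")
print(f"False Negative Rate: {fn_rate:.2%}")
```

With these made-up counts, 10 of 10,000 items are false positives (0.10%) and 90 of 10,000 are false negatives (0.90%).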

Sensitivity is the "true positive" rate. If it's actually spam, how likely is SpamAssassin to say it's spam? How much of the spam gets correctly identified as such? This is a measure of how accurately SpamAssassin identifies spam, but ignores its performance with regard to non-spam. [Sensitivity = spam / (spam + FN)]

Specificity is the "true negative" rate. If it's actually non-spam, how likely is SpamAssassin to say it's non-spam? How much of the non-spam gets correctly identified as such? This measures how well SpamAssassin identifies non-spam, ignoring its performance with regard to spam. [Specificity = ham / (ham + FP)]
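
The Sensitivity and Specificity formulas above can be sketched in a few lines of Python. The function names and the sample counts are assumptions made for illustration; "spam" and "ham" again mean correctly classified items.

```python
def sensitivity(spam, fn):
    """Share of actual spam that was classified as spam."""
    # Assumes at least one actual spam item (spam + fn > 0).
    return spam / (spam + fn)


def specificity(ham, fp):
    """Share of actual non-spam that was classified as non-spam."""
    # Assumes at least one actual non-spam item (ham + fp > 0).
    return ham / (ham + fp)


# Hypothetical counts for illustration only.
sens = sensitivity(spam=9000, fn=90)
spec = specificity(ham=900, fp=10)
print(f"Sensitivity: {sens:.4f}")
print(f"Specificity: {spec:.4f}")
```

Each function looks at only one kind of mail: Sensitivity never sees the non-spam counts, and Specificity never sees the spam counts, which is why neither statistic tells the whole story on its own.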

PPV is the Positive Predictive Value. If SpamAssassin says it's spam, how likely is it to actually be spam? If we only look at cases where SpamAssassin predicted spam, how often was it right? The more specific the test, the higher the PPV. [PPV = spam / (spam + FP)]

NPV is the Negative Predictive Value. If SpamAssassin says it's non-spam, how likely is it to actually be non-spam? If we only look at cases where SpamAssassin predicted non-spam, how often was it right? The more sensitive the test, the higher the NPV. [NPV = ham / (ham + FN)]
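
A similar sketch for the two predictive values follows. Where Sensitivity and Specificity start from what the mail actually was, PPV and NPV start from what the filter said. The function names and counts are hypothetical.

```python
def ppv(spam, fp):
    """Of everything classified as spam, the share that really was spam."""
    # Assumes the filter flagged at least one item as spam.
    return spam / (spam + fp)


def npv(ham, fn):
    """Of everything classified as non-spam, the share that really was non-spam."""
    # Assumes the filter passed at least one item as non-spam.
    return ham / (ham + fn)


# Hypothetical counts for illustration only.
ppv_value = ppv(spam=9000, fp=10)
npv_value = npv(ham=900, fn=90)
print(f"PPV: {ppv_value:.4f}")
print(f"NPV: {npv_value:.4f}")
```

With these made-up counts the NPV comes out noticeably lower than the PPV. The pool of items classified as non-spam is small here, so the 90 spam items that slipped through make up a larger share of it.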

Efficiency is the ratio of true positives and true negatives to total mail items processed--that is, the percentage of mail that was correctly classified. This is the best "overall" measure of a spam filter's performance, and it's what most people expect a vendor's claim to represent. [Efficiency = (spam + ham) / (spam + ham + FP + FN)]
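
The Efficiency formula is the simplest of the lot, as this minimal Python sketch suggests. The function name and the counts are hypothetical.

```python
def efficiency(spam, ham, fp, fn):
    """Share of all mail items that were classified correctly."""
    total = spam + ham + fp + fn
    if total == 0:
        raise ValueError("no mail items to evaluate")
    return (spam + ham) / total


# Hypothetical counts for illustration only.
eff = efficiency(spam=9000, ham=900, fp=10, fn=90)
print(f"Efficiency: {eff:.2%}")
```

Here 9,900 of 10,000 items were classified correctly, so the sketch prints an Efficiency of 99.00%.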

An ideal filter would be both highly sensitive and highly specific, but this is rarely achievable in the real world. Some sort of tradeoff must be made, and since most people consider false positives much worse than false negatives, the SpamAssassin developers have biased their scoring system to favour false negatives by a factor of about 10:1. It's also true that the vast majority of SpamAssassin's rules are designed to recognize spam, with very few rules designed to recognize non-spam, and this means that you're likely to see higher Sensitivity than Specificity.

As you can imagine, the subtle differences among these statistics are often exploited by vendors when they advertise the performance of their hardware or software filters. While the Efficiency statistic is probably the most comprehensive measure of performance, it may not always be the one with the highest value, and a vendor might choose to advertise its PPV statistic instead, if that statistic is more impressive (as it often is). When comparing statistics, then, make sure you're comparing apples to apples--find out from the vendor which performance measure they're actually using.
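
To see how much the choice of statistic can matter, the sketch below computes all five measures from one set of counts and reports which is the most flattering. The counts are invented for illustration and do not describe any real product.

```python
def all_statistics(spam, ham, fp, fn):
    """Compute the five measures from one set of counts.

    Assumes every denominator is non-zero, i.e. each kind of
    outcome occurs at least once in the sample.
    """
    return {
        "Sensitivity": spam / (spam + fn),
        "Specificity": ham / (ham + fp),
        "PPV": spam / (spam + fp),
        "NPV": ham / (ham + fn),
        "Efficiency": (spam + ham) / (spam + ham + fp + fn),
    }


# Hypothetical counts for illustration only.
stats = all_statistics(spam=9000, ham=900, fp=10, fn=90)
for name, value in stats.items():
    print(f"{name:12s} {value:.2%}")

# The statistic with the highest value for this sample.
best = max(stats, key=stats.get)
print(f"Most flattering statistic: {best}")
```

For this sample the PPV (about 99.89%) is higher than the Efficiency (99.00%), so the same filter could be described with either figure depending on which statistic is quoted.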

When SpamAssassin is used with Maia Mailguard, it's not unreasonable to expect to see Efficiency figures above 99% in a well-tuned setup. By taking advantage of a broad range of tests (e.g. Razor, Pyzor, DCC, SpamCop, SURBL, SARE rulesets, Bayes, etc.), keeping rules current (e.g. with sa-update and RulesDuJour), and encouraging users to diligently confirm their spam and non-spam, that Efficiency statistic can exceed 99.5%.

Use the statistics Maia provides to help you fine-tune your installation, aiming to balance Sensitivity and Specificity as best you can, while increasing overall Efficiency. You don't want to end up with a filter that overzealously detects spam (at the cost of lots of false positives), or one that lets too much spam through. If your numbers are lopsided, consider adding a wider range of tests, and examine your spam threshold to make sure it isn't unreasonably high or low (SpamAssassin's standard rule scores are calibrated based on a spam threshold of 5.0). If you suspect that your users may be neglecting their quarantines and caches, consider enabling SpamAssassin's auto-learning mechanism. Be patient, though; it may take several days--or even weeks--for the statistics to reflect the changes you make, particularly if your filter has processed millions of items.
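
The effect of the spam threshold on the two kinds of error can be sketched with a toy example. The scores and labels below are invented, and the sketch assumes an item is treated as spam when its score is at or above the threshold; it does not call SpamAssassin or Maia.

```python
# Hypothetical (score, actually_spam) pairs; not real SpamAssassin output.
SAMPLE = [
    (0.2, False), (1.1, False), (3.4, False), (5.6, False),
    (4.1, True), (6.3, True), (9.8, True), (15.2, True),
]


def count_errors(items, threshold):
    """Return (false_positives, false_negatives) at the given threshold."""
    # Assumption: score >= threshold means "classified as spam".
    fp = sum(1 for score, is_spam in items if score >= threshold and not is_spam)
    fn = sum(1 for score, is_spam in items if score < threshold and is_spam)
    return fp, fn


errors_by_threshold = {t: count_errors(SAMPLE, t) for t in (3.0, 5.0, 8.0)}
for threshold, (fp, fn) in errors_by_threshold.items():
    print(f"threshold {threshold}: {fp} false positives, {fn} false negatives")
```

On this toy data, lowering the threshold to 3.0 removes the false negatives but produces two false positives, while raising it to 8.0 does the opposite. Real mail will behave differently, so treat this only as a picture of the tradeoff.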

Last modified on May 5, 2006, 3:32:22 AM