
Using OCR to detect image-based spam

As anti-spam tactics have improved over the years, some spammers have turned to image-based spam to try to bypass filters. Instead of putting their message in the body of the mail, they produce an image that contains the same text. These HTML-based emails therefore contain a single inline image, along with some randomly selected passages from public-domain literature that are likely to look inoffensive to spam filters.

Indeed, to most spam filters these emails look like legitimate mail that happens to include an image. Bayes is all but useless against this sort of tactic, since it can only operate on the text portion of the mail (which is of course innocuous). SURBL is also powerless, since there are no URIs to find in the mail; they're displayed in the image, albeit in non-clickable form.

What's needed, then, is the ability to extract the text from these images and process it. That's a job for OCR (Optical Character Recognition), and with a little effort it's possible to add this functionality to SpamAssassin (v. 3.1.1 and later).
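
As a rough illustration of the idea (not of how FuzzyOcr or SpamAssassin implement it), the Python sketch below pulls the image parts out of a mail, hands each one to an OCR routine, and checks the recognised text against a word list. The names `image_parts`, `ocr_hits`, and `SUSPECT_WORDS` are hypothetical, and the OCR engine itself is assumed to be supplied by the caller as a function that takes image bytes and returns text.

```python
from email.message import EmailMessage
from typing import Callable, List, Sequence

# Hypothetical word list, for illustration only; a real filter
# would use its own rules and scoring.
SUSPECT_WORDS = ("viagra", "stock alert", "pharmacy")


def image_parts(msg: EmailMessage) -> List[bytes]:
    """Return the decoded payload of every image/* part in the message."""
    return [
        part.get_payload(decode=True)
        for part in msg.walk()
        if part.get_content_maintype() == "image"
    ]


def ocr_hits(
    msg: EmailMessage,
    ocr: Callable[[bytes], str],
    words: Sequence[str] = SUSPECT_WORDS,
) -> List[str]:
    """Run the caller-supplied OCR function over each image and
    return the suspect words found in the recognised text."""
    hits: List[str] = []
    for data in image_parts(msg):
        text = ocr(data).lower()
        for word in words:
            if word in text and word not in hits:
                hits.append(word)
    return hits
```

Here `ocr` could be any wrapper around an OCR program. The exact substring match is only a simplifying assumption; OCR output is often noisy, so a practical filter would likely need some tolerance for misread characters.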

The most effective and best-supported OCR plugin for SpamAssassin is FuzzyOcr, though you'll likely need to use the latest SVN version if you're using SpamAssassin 3.2.x or newer.

NOTE: Image spam seems to have been an experimental phase for spammers during the fall of 2007; since then it appears to have fallen out of favour and has not been seen in significant numbers. With that in mind, it's hard these days to justify the extra resources required to perform OCR scanning as part of the spam-filtering process. In short, if you are still encountering image spam, try the latest FuzzyOcr plugin; otherwise there's not much need.



Last modified on May 8, 2008, 2:13:36 PM