Version 2 (modified by rjl, 13 years ago)

Using OCR to detect image-based spam

As anti-spam tactics have improved over the years, an increasing number of spammers are turning to image-based spam to try to bypass filters. Rather than including their messages in the body of the mail, they render the same text into an image. These HTML-based emails then contain a single inline image, along with some randomly selected passages from public-domain literature that are unlikely to trip a spam filter.

Indeed, to most spam filters these emails look like legitimate mail that happens to include an image. Bayes is all but useless against this sort of tactic, since it can only operate on the text portion of the mail (which is of course innocuous). SURBL is also powerless, since there are no URIs to find in the mail--they're displayed in the image, albeit in non-clickable form.

What's needed, then, is the ability to extract the text from these images and process it. That's a job for OCR (Optical Character Recognition), and with a little effort it's possible to add this functionality to SpamAssassin (v. 3.1.1 and later).

Netpbm

1. Install the Netpbm tools and libraries:

The first thing you'll need is a set of image manipulation tools, provided by the popular Netpbm library. If you're not building from the full source code, you'll need at least the binaries themselves, as well as the libraries and header files. These packages might be called netpbm-progs and netpbm-devel or something similar, depending on your distribution.
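
Once the packages are installed, a quick check that the converters are on your PATH can save some debugging later. The following is only a sketch: missing_tools is a hypothetical helper name, and the three tool names are the Netpbm converters referred to later on this page.

```shell
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is a hypothetical helper, not part of Netpbm.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running `missing_tools giftopnm pngtopnm jpegtopnm` should print nothing if all three converters are installed.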

Image::ExifTool

2. Install the Image::ExifTool Perl module:

You'll also need the Image::ExifTool module, which you may be able to get from your favourite distribution's repository, or directly from CPAN. This will allow you to read the EXIF text encoded into certain image formats.

3. Patch the Image::ExifTool module:

Unfortunately Image::ExifTool needs a small patch to correct the way it handles GIFs with empty colour tables. This is not normally a big concern with most image-viewing software, but when a malformed image could crash your mail filter you want to make sure your image-scanning tools are robust enough to deal with such things gracefully.

Go to the directory where the Image::ExifTool module was installed, most likely somewhere under your /usr/lib/perl5 tree, and apply the patch to the GIF.pm file.

cd /usr/lib/perl5/site_perl/5.8.6/Image/ExifTool/
patch -p3 < patch-GIF-Colortable

GOCR

4. Download the GOCR source code:

The OCR process is handled by GOCR, but while there are binary packages available for a number of distributions, you'll want the source package in this case, because there's a small patch you need to apply.

5. Patch the GOCR source code:

Some grey images have been known to trigger segmentation faults in GOCR 0.40, so a small patch has been devised to fix this vulnerability. Once again, this is not much of an issue in most normal OCR environments, since choking on an input image doesn't usually have serious consequences, but in a spam-filtering environment we need to be more graceful in how we handle such situations.

Once you've downloaded the GOCR 0.40 source code package and unpacked it, go to the src subdirectory and apply the patch to the pgm2asc.c file:

cd src
patch -p1 < patch-gocr-segfault

6. Compile and install GOCR:

From there, building GOCR is straightforward:

./configure
make
make install

7. Test your OCR setup:

To make sure that everything you've installed so far works, test it with a sample image copied from some spam you've received, preferably an image that contains some text. If it's a GIF image, for instance, run it through giftopnm:

giftopnm image001.gif > image001.pnm
giftopnm: too much input data, ignoring extra...
giftopnm: bogus character 0x00, ignoring

Don't be distressed by the informational messages you receive as a result; remember that spammers aren't always going to supply you with a standards-compliant image to work with. In fact, they often hope that a deliberately malformed image will break your scanner, or at least trigger an error that leaves the scanner unsure how to classify the mail, so that it gets the benefit of the doubt.

Now run the output through GOCR:

gocr image001.pnm

A second or two later, you should see the text as read from the image (presuming it had any to begin with, of course). There's likely to be some garbage too, and not all of the words will be read correctly. In particular, the OCR software has trouble distinguishing 'r' from 'n', and 'I' from 'l'. For the most part, though, you should be able to recognize the words, and that's good enough for our purposes.
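
The two manual steps above can be combined into a small helper for testing several sample images. This is only a sketch: pick_converter and ocr_image are hypothetical names, and choosing the converter by file name extension is a simplification, since spam images may carry a misleading extension or content type.

```shell
# Map an image file name to the Netpbm converter for its format.
# pick_converter and ocr_image are hypothetical helper names.
pick_converter() {
    case "$1" in
        *.gif|*.GIF)               echo giftopnm ;;
        *.png|*.PNG)               echo pngtopnm ;;
        *.jpg|*.jpeg|*.JPG|*.JPEG) echo jpegtopnm ;;
        *) return 1 ;;
    esac
}

# Convert an image to PNM in a temporary file and run GOCR on the result.
ocr_image() {
    conv=$(pick_converter "$1") || { echo "unsupported image: $1" >&2; return 1; }
    tmp=$(mktemp) || return 1
    # Converter messages about malformed input are written to stderr.
    "$conv" "$1" > "$tmp" && gocr "$tmp"
    status=$?
    rm -f "$tmp"
    return $status
}
```

With both Netpbm and GOCR installed, `ocr_image image001.gif` should print whatever text GOCR recognizes in the image.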

OCR-Text Plugin for SpamAssassin

8. Download and install the OCR plugin for SpamAssassin:

Now that you've got the underlying tools installed and working, you can download the OCR-Text Plugin for SpamAssassin. It's distributed as a patch file, but rather than modifying existing files it creates new ones. To install it, copy it to your SpamAssassin directory, wherever your local.cf file is located (e.g. /etc/mail/spamassassin), and extract the files from the patch file:

cd /etc/mail/spamassassin
patch < patch-ocrtext

This should produce two files: ocrtext.pm (the plugin itself) and ocrtext.cf (its configuration file).

9. Tell SpamAssassin to load the OCR-Text plugin:

Add the following lines to your v310.pre file, so that the plugin gets loaded at startup:

# OCR - performs Optical Character Recognition on spam images
#
loadplugin ocrtext /etc/mail/spamassassin/ocrtext.pm
loadplugin Mail::SpamAssassin::Timeout

Note that some binary packages of SpamAssassin don't seem to include the Timeout plugin, so if you don't have a Timeout.pm file in your SpamAssassin perl library you may need to download the full SpamAssassin source package for your version and copy the Timeout.pm file from it.

10. Edit the plugin configuration:

Edit the gocr_path and pnmtools_path at the top of the ocrtext.cf file, setting these paths to the appropriate values for your installation:

## This points to the gocr binary itself, not just its directory.  Try 'which gocr'.
gocr_path       /usr/local/bin/gocr
## This is JUST the directory containing your pnm binaries ( i.e. pngtopnm, giftopnm, jpegtopnm )
pnmtools_path   /usr/bin
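
Before moving on, it may be worth confirming that the two paths point at real executables. The sketch below assumes the simple "name value" layout shown above; cf_value and check_ocr_paths are hypothetical helper names, not part of the plugin.

```shell
# Read one setting (e.g. gocr_path) from a "name value" style config file.
# cf_value and check_ocr_paths are hypothetical helper names.
cf_value() {
    # $1 = setting name, $2 = config file. Comment lines never match,
    # because their first field is "#" or "##", not the setting name.
    awk -v key="$1" '$1 == key { print $2; exit }' "$2"
}

# Report whether the two paths in the given ocrtext.cf look usable.
check_ocr_paths() {
    cf=$1
    gocr=$(cf_value gocr_path "$cf")
    pnm=$(cf_value pnmtools_path "$cf")
    [ -x "$gocr" ] || { echo "gocr_path is not an executable: $gocr" >&2; return 1; }
    for tool in giftopnm pngtopnm jpegtopnm; do
        [ -x "$pnm/$tool" ] || { echo "missing $pnm/$tool" >&2; return 1; }
    done
}
```

Calling `check_ocr_paths /etc/mail/spamassassin/ocrtext.cf` returns success silently when everything is in place, and otherwise names the first path it could not find.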

11. Test the installation:

Now you can verify that you've got all the paths set properly and that you have all of the necessary pieces in place. As your amavis/maia user, run:

spamassassin -D --lint

If everything is working properly, this shouldn't produce any errors, and in particular you should see something like:

...
dbg: plugin: loading ocrtext from /etc/mail/spamassassin/ocrtext.pm
dbg: plugin: registered ocrtext=HASH(0x9d72bf8)
dbg: plugin: loading Mail::SpamAssassin::Timeout from @INC
dbg: plugin: registered Mail::SpamAssassin::Timeout=HASH(0x9466430)
...
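
The debug output is long, so it can help to filter it for the lines that matter. The following is a sketch that looks only for the two ocrtext lines shown above; ocr_plugin_loaded is a hypothetical helper name.

```shell
# Succeed only if SpamAssassin debug output (read from stdin) shows that the
# ocrtext plugin was both loaded and registered.
# ocr_plugin_loaded is a hypothetical helper name.
ocr_plugin_loaded() {
    out=$(cat)
    printf '%s\n' "$out" | grep -q 'plugin: loading ocrtext' &&
    printf '%s\n' "$out" | grep -q 'plugin: registered ocrtext='
}
```

Debug messages are written to stderr, so a typical invocation would be `spamassassin -D --lint 2>&1 | ocr_plugin_loaded && echo "OCR plugin loaded"`.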

12. Tell Maia about the new rules:

If everything is working properly, you'll want to run the load-sa-rules script to make sure that Maia discovers the new rules you just added (in the ocrtext.cf file). There should be 26 of them:

SPAMPIC_FORGED_CT       Forged content-type in mime header
SUSPECT_GIF             Suspect gif image found
SUSPECT_JPG             Suspect jpeg image found
SUSPECT_PNG             Suspect png image found
SPAMPIC_UNKNOWN_GIF     Failed to read gif image header
SPAMPIC_UNKNOWN_JPG     Failed to read jpeg image header
SPAMPIC_UNKNOWN_PNG     Failed to read png image header
NONSTD_GIF              Non standard gif image header
NONSTD_JPG              Non standard jpeg image header
NONSTD_PNG              Non standard png image header
SPAMPIC_BROKEN_GIF      Contains damaged gif image
SPAMPIC_BROKEN_JPG      Contains damaged jpeg image
SPAMPIC_BROKEN_PNG      Contains damaged png image
SPAMPIC_ALPHA_1         Image contains many alphanumeric chars
SPAMPIC_ALPHA_2         Image contains many alphanumeric chars
SPAMPIC_ALPHA_3         Image contains many alphanumeric chars
SPAMPIC_MULTI_1         Contains inline pics (2)
SPAMPIC_MULTI_2         Contains inline pics (3)
SPAMPIC_MULTI_3         Contains inline pics (4)
SPAMPIC_MULTI_4         Contains inline pics (5)
SPAMPIC_MULTI_5         Contains inline pics (6)
SPAMPIC_MULTI_6         Contains inline pics (7+)
SPAMPIC_WORDS_1         Contains inline spam picture (1)
SPAMPIC_WORDS_2         Contains inline spam picture (2)
SPAMPIC_WORDS_3         Contains inline spam picture (3)
SPAMPIC_WORDS_4         Contains inline spam picture (4+)
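
As a rough cross-check of that count, you can count the rule descriptions in the configuration file. This sketch assumes that ocrtext.cf gives each rule exactly one standard SpamAssassin "describe" line, which is not confirmed here; count_described_rules is a hypothetical helper name.

```shell
# Count the "describe" lines in a SpamAssassin rules file.
# Assumes one describe line per rule; count_described_rules is a
# hypothetical helper name.
count_described_rules() {
    grep -c '^[[:space:]]*describe[[:space:]]' "$1"
}
```

If the assumption holds, `count_described_rules /etc/mail/spamassassin/ocrtext.cf` should report the same number of rules as the list above.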

13. Restart amavisd-maia:

Now you can restart amavisd-maia and start looking for these rules in your log files, and in Maia's mail viewer, once you begin receiving mail items that contain images with text in them. The processing time on such items will be a few seconds longer than usual, but mail items without images in them won't be affected, since the OCR-Text plugin won't be called in those cases.