Installing the OCR-Text plugin
The original OCR plugin for SpamAssassin (called "OCR-Text") uses regular expression patterns to detect spammy text within the contents of inline images. The slightly more advanced FuzzyOCR plugin improves upon it with "fuzzy" matching, but cannot detect malformed JPEG and PNG images and penalize them accordingly (though it does detect malformed GIF images). For that reason, the original version remains marginally useful.
Netpbm
1. Install the Netpbm tools and libraries:
The first thing you'll need is a set of image manipulation tools, provided by the popular Netpbm library. If you're not building from the full source code, you'll need at least the binaries, as well as the libraries and header files. These packages may be called netpbm-progs and netpbm-devel, or something similar, depending on your distribution.
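Once the packages are installed, a quick check that the converters are actually on your PATH can save some confusion later. The following is only a sketch: missing_tools is a hypothetical helper name, and the three program names are the Netpbm converters that the plugin configuration refers to further down.

```shell
# missing_tools NAME...
# Prints one "missing: NAME" line for each program not found on PATH.
# Returns 0 only when every named program is present.
# (Hypothetical helper, not part of Netpbm or the plugin.)
missing_tools() (
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf 'missing: %s\n' "$tool"
            status=1
        fi
    done
    exit "$status"
)

# Check the three converters the OCR-Text plugin relies on.
missing_tools giftopnm jpegtopnm pngtopnm \
    || echo "install your distribution's Netpbm programs package"
```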
Image::ExifTool
2. Install the Image::ExifTool Perl module:
You'll also need the Image::ExifTool module, which you may be able to get from your favourite distribution's repository, or directly from CPAN. This will allow you to read the EXIF text encoded into certain image formats.
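To confirm that Perl can find the module after installation, something along these lines should work. It is a minimal sketch: perl_module_version is a hypothetical helper name, and it relies only on Perl's standard -M switch and the universal VERSION method.

```shell
# perl_module_version MODULE
# Prints the module's version if Perl can load it; fails otherwise.
# (Hypothetical helper; errors from Perl are discarded.)
perl_module_version() {
    perl -M"$1" -e 'print $ARGV[0]->VERSION, "\n"' "$1" 2>/dev/null
}

if version=$(perl_module_version Image::ExifTool); then
    echo "Image::ExifTool $version is installed"
else
    echo "Image::ExifTool could not be loaded"
fi
```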
3. Patch the Image::ExifTool module:
Unfortunately, Image::ExifTool needs a small patch to correct the way it handles GIFs with empty colour tables. This is not normally a big concern for image-viewing software, but when a malformed image could crash your mail filter you want your image-scanning tools to be robust enough to deal with such things gracefully.
Go to the directory where the Image/ExifTool module was installed, most likely somewhere under your /usr/lib/perl5 tree, and apply the patch to the GIF.pm file.
    cd /usr/lib/perl5/site_perl/5.8.6/Image/ExifTool/
    patch -p3 < patch-GIF-Colortable
GOCR
4. Download the GOCR source code:
The OCR process is handled by GOCR, but while there are binary packages available for a number of distributions, you'll want the source package in this case, because there's a small patch you need to apply.
5. Patch the GOCR source code:
Some grey images have been known to trigger segmentation faults in GOCR 0.40, so a small patch has been devised to fix this vulnerability. Once again, this is not much of an issue in most normal OCR environments, since choking on an input image doesn't usually have serious consequences, but in a spam-filtering environment we need to be more graceful in how we handle such situations.
Once you've downloaded the GOCR 0.40 source code package and unpacked it, go to the src subdirectory and apply the patch to the pgm2asc.c file:
    cd src
    patch -p1 < patch-gocr-segfault
6. Compile and install GOCR:
From there, building GOCR is straightforward:
    ./configure
    make
    make install
7. Test your OCR setup:
To make sure that everything you've installed so far works, test it with a sample image copied from some spam you've received, preferably an image that contains some text. If it's a GIF image, for instance, run it through giftopnm:
    giftopnm image001.gif > image001.pnm
    giftopnm: too much input data, ignoring extra...
    giftopnm: bogus character 0x00, ignoring
Don't be distressed by the informational messages you receive as a result; remember that spammers aren't always going to supply you with a standards-compliant image to work with. In fact, they often hope that a deliberately malformed image will break your scanner, or at least raise an error that leaves the scanner unsure how to classify the message, so that the mail gets the benefit of the doubt.
Now run the output through GOCR:
    gocr image001.pnm
A second or two later, you should see a bunch of text as read from the image (presuming it had any to begin with, of course). There's likely to be some other garbage too, and not all of the words will be read properly. In particular, the OCR software has trouble distinguishing 'r' from 'n', and 'I' from 'l'. For the most part, though, you should be able to recognize the words, and that's good enough for our purposes.
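If you want to try several sample images of different types, the two manual steps above can be wrapped in a small script. This is only a sketch: converter_for and ocr_image are hypothetical helper names, and the mapping assumes the file extension matches the real image type, which spam images do not always honour.

```shell
# converter_for FILE
# Prints the Netpbm converter that matches the file's extension,
# or fails for extensions it does not recognize.
converter_for() {
    case "$1" in
        *.gif|*.GIF)               echo giftopnm ;;
        *.jpg|*.jpeg|*.JPG|*.JPEG) echo jpegtopnm ;;
        *.png|*.PNG)               echo pngtopnm ;;
        *) return 1 ;;
    esac
}

# ocr_image FILE
# Converts FILE to a temporary PNM file, then runs gocr on it,
# mirroring the two manual commands. Converter warnings stay visible.
ocr_image() (
    conv=$(converter_for "$1") || {
        echo "unsupported image type: $1" >&2
        exit 1
    }
    tmp=$(mktemp) || exit 1
    trap 'rm -f "$tmp"' EXIT
    "$conv" "$1" > "$tmp"
    # Only call gocr if the converter produced some output.
    [ -s "$tmp" ] && gocr "$tmp"
)
```

For example, running ocr_image image001.gif should print the same text as the two separate commands.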
OCR-Text Plugin for SpamAssassin
8. Download and install the OCR plugin for SpamAssassin:
Now that you've got the underlying tools installed and working, you can download the OCR-Text Plugin for SpamAssassin. It's distributed as a patch file, but it's not really a patch. To install it, copy it to your SpamAssassin directory, wherever your local.cf file is located (e.g. /etc/mail/spamassassin) and extract it from the patch file:
    cd /etc/mail/spamassassin
    patch < patch-ocrtext
This should produce two files: ocrtext.pm (the plugin itself) and ocrtext.cf (its configuration file).
9. Tell SpamAssassin to load the OCR-Text plugin:
Add the following lines to your v310.pre file, so that the plugin gets loaded at startup:
    # OCR - performs Optical Character Recognition on spam images
    #
    loadplugin ocrtext /etc/mail/spamassassin/ocrtext.pm
    loadplugin Mail::SpamAssassin::Timeout
Note that some binary packages of SpamAssassin don't seem to include the Timeout plugin, so if you don't have a Timeout.pm file in your SpamAssassin Perl library, you may need to download the full SpamAssassin source package for your version and copy the Timeout.pm file from it.
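One way to see whether Timeout.pm is already present is to search Perl's module directories for it. The sketch below assumes nothing about SpamAssassin itself: find_module_file is a hypothetical helper that looks for a relative path in whichever directories you pass to it.

```shell
# find_module_file REL_PATH DIR...
# Prints the first DIR/REL_PATH that exists as a regular file;
# fails if none of the directories contains it.
# (Hypothetical helper, not part of SpamAssassin.)
find_module_file() (
    rel=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$rel" ]; then
            printf '%s\n' "$dir/$rel"
            exit 0
        fi
    done
    exit 1
)

# Typical use: search Perl's own include path (assumes the
# directory names contain no spaces):
#   find_module_file Mail/SpamAssassin/Timeout.pm $(perl -e 'print "@INC"')
```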
10. Edit the plugin configuration:
Edit the gocr_path and pnmtools_path at the top of the ocrtext.cf file, setting these paths to the appropriate values for your installation:
    ## This points to your gocr binary not just the path. Try 'which gocr'.
    gocr_path /usr/local/bin/gocr
    ## This is JUST the path to your pnm binarys ( i.e. pngtopnm, giftopnm, jpegtopnm )
    pnmtools_path /usr/bin
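Because one setting names a binary and the other names a directory, it is easy to mix the two up. A small check like the one below can catch that before you run the lint test. Treat it as a sketch: cf_value and check_ocr_paths are hypothetical helpers, and they assume each setting sits on its own line as "name value" with no spaces in the path.

```shell
# cf_value FILE KEY
# Prints the value of the first "KEY value" line in FILE.
# Comment lines start with "#", so they never match KEY.
cf_value() {
    awk -v key="$2" '
        $1 == key { print $2; found = 1; exit }
        END       { exit !found }
    ' "$1"
}

# check_ocr_paths FILE
# Verifies that gocr_path is an executable file and that
# pnmtools_path is a directory holding the three converters.
check_ocr_paths() {
    gocr=$(cf_value "$1" gocr_path) \
        || { echo "gocr_path is not set"; return 1; }
    pnmdir=$(cf_value "$1" pnmtools_path) \
        || { echo "pnmtools_path is not set"; return 1; }
    [ -f "$gocr" ] && [ -x "$gocr" ] \
        || { echo "gocr_path is not an executable file: $gocr"; return 1; }
    [ -d "$pnmdir" ] \
        || { echo "pnmtools_path is not a directory: $pnmdir"; return 1; }
    for tool in giftopnm jpegtopnm pngtopnm; do
        [ -x "$pnmdir/$tool" ] \
            || { echo "missing converter: $pnmdir/$tool"; return 1; }
    done
    echo "paths look consistent"
}
```

With the default location used in this guide, you would call check_ocr_paths /etc/mail/spamassassin/ocrtext.cf.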
11. Test the installation:
Now you can verify that you've got all the paths set properly and that you have all of the necessary pieces in place. As your amavis/maia user, run:
spamassassin -D --lint
If everything is working properly, this shouldn't produce any errors, and in particular you should see something like:
    ...
    dbg: plugin: loading ocrtext from /etc/mail/spamassassin/ocrtext.pm
    dbg: plugin: registered ocrtext=HASH(0x9d72bf8)
    dbg: plugin: loading Mail::SpamAssassin::Timeout from @INC
    dbg: plugin: registered Mail::SpamAssassin::Timeout=HASH(0x9466430)
    ...
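The debug output is long, so it can help to filter it for the two registration lines rather than reading it by eye. The sketch below assumes the "plugin: registered NAME=" wording shown in the sample output; other SpamAssassin versions may phrase it differently. plugin_registered and check_ocr_plugins are hypothetical helper names.

```shell
# plugin_registered NAME
# Reads SpamAssassin debug output on stdin and succeeds if it
# contains a "plugin: registered NAME=" line.
plugin_registered() {
    grep -F -q "plugin: registered $1="
}

# check_ocr_plugins
# Runs the lint test once and reports on both plugins.
# Debug messages go to stderr, hence the 2>&1.
check_ocr_plugins() {
    lint_output=$(spamassassin -D --lint 2>&1)
    for plugin in ocrtext Mail::SpamAssassin::Timeout; do
        if printf '%s\n' "$lint_output" | plugin_registered "$plugin"; then
            echo "registered: $plugin"
        else
            echo "NOT registered: $plugin"
        fi
    done
}
```

Run check_ocr_plugins as the same amavis/maia user you used for the lint test.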
12. Tell Maia about the new rules:
If everything is working properly, you'll want to run the load-sa-rules script to make sure that Maia discovers the new rules you just added (in the ocrtext.cf file). There should be 26 of them:
    SPAMPIC_FORGED_CT    Forged content-type in mime header
    SUSPECT_GIF          Suspect gif image found
    SUSPECT_JPG          Suspect jpeg image found
    SUSPECT_PNG          Suspect png image found
    SPAMPIC_UNKNOWN_GIF  Failed to read gif image header
    SPAMPIC_UNKNOWN_JPG  Failed to read jpeg image header
    SPAMPIC_UNKNOWN_PNG  Failed to read png image header
    NONSTD_GIF           Non standard gif image header
    NONSTD_JPG           Non standard jpeg image header
    NONSTD_PNG           Non standard png image header
    SPAMPIC_BROKEN_GIF   Contains damaged gif image
    SPAMPIC_BROKEN_JPG   Contains damaged jpeg image
    SPAMPIC_BROKEN_PNG   Contains damaged png image
    SPAMPIC_ALPHA_1      Image contains many alphanumeric chars
    SPAMPIC_ALPHA_2      Image contains many alphanumeric chars
    SPAMPIC_ALPHA_3      Image contains many alphanumeric chars
    SPAMPIC_MULTI_1      Contains inline pics (2)
    SPAMPIC_MULTI_2      Contains inline pics (3)
    SPAMPIC_MULTI_3      Contains inline pics (4)
    SPAMPIC_MULTI_4      Contains inline pics (5)
    SPAMPIC_MULTI_5      Contains inline pics (6)
    SPAMPIC_MULTI_6      Contains inline pics (7+)
    SPAMPIC_WORDS_1      Contains inline spam picture (1)
    SPAMPIC_WORDS_2      Contains inline spam picture (2)
    SPAMPIC_WORDS_3      Contains inline spam picture (3)
    SPAMPIC_WORDS_4      Contains inline spam picture (4+)
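To cross-check the rule count against the configuration file itself, you can list the rule names it describes. This sketch assumes that ocrtext.cf declares each rule with a standard SpamAssassin "describe NAME text" line, which has not been verified against the actual file. described_rules and count_described_rules are hypothetical helper names.

```shell
# described_rules FILE
# Prints the unique rule names that have a "describe" line in FILE.
described_rules() {
    awk '$1 == "describe" { print $2 }' "$1" | sort -u
}

# count_described_rules FILE
# Prints how many distinct rules FILE describes.
# tr strips the padding some wc implementations add.
count_described_rules() {
    described_rules "$1" | wc -l | tr -d ' '
}
```

If the assumption holds, count_described_rules /etc/mail/spamassassin/ocrtext.cf should print 26.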
13. Restart amavisd-maia:
Now you can restart amavisd-maia and start looking for these rules in your log files, and in Maia's mail viewer, once you begin receiving mail items that contain images with text in them. The processing time on such items will be a few seconds longer than usual, but mail items without images in them won't be affected, since the OCR-Text plugin won't be called in those cases.

