Installing FuzzyOCR 2.3b

The FuzzyOCR Plugin for SpamAssassin improves somewhat upon the standard OCR Plugin, in that it is capable of performing "fuzzy" matching of text strings. This makes it able to handle the innate inaccuracies of OCR engines, spelling mistakes, and deliberate obfuscation of words by spammers, without having to write a lot of explicit regular expression patterns to catch these variations.

This plugin does everything that the original plugin does, minus the detection of malformed JPEGs and PNGs. It does detect malformed GIFs in this version, though, and presumably it will eventually detect malformed images of other types as well.

In addition, this version offers the ability to run multiple OCR scans of the same image at different resolutions and tolerances for more thorough analysis, and a local database to cache hashes of images recognized as spam so that the resource-expensive OCR process can be avoided in the future for images that have already been seen.

The FuzzyOCR plugin is in very active development, so newer versions may also exist at http://users.own-hero.net/~decoder/fuzzyocr/. If you're feeling particularly conservative, you may wish to run the older 2.1c release.

Note for Gentoo users:

Tóth Csaba has provided an ebuild for Gentoo that covers the installation steps described in this document, including patched versions of gocr, giftext, and FuzzyOcr.pm. Download his fuzzyocr-gentoo-2.tar.bz2 package and unpack it into /usr/local, then enable the overlay if necessary in /etc/make.conf:

PORTDIR_OVERLAY="/usr/local/portage"

From there, install the mail-filter/spamassassin-fuzzyocr package. If the install fails due to a digest mismatch, this just means the FuzzyOCR plugin author has updated the 2.3b tarball without changing the version number. If that happens, regenerate the digest:

cd /usr/local/portage/mail-filter/spamassassin-fuzzyocr
ebuild spamassassin-fuzzyocr-2.3b.ebuild digest

Netpbm

1. Install the Netpbm tools and libraries:

The first thing you'll need is a set of image manipulation tools, provided by the popular Netpbm library. If you're not downloading the full source code, you'll at least require the binaries themselves, as well as the libraries and header files. These packages might be referred to as netpbm-progs and netpbm-devel, libnetpbm and netpbm or somesuch, depending on your distribution.

ImageMagick

2. Install the ImageMagick suite:

The FuzzyOCR plugin uses the convert utility from the ImageMagick suite to unpack animated GIF images that spammers use to try to confuse ordinary OCR tools. Without it, only the first frame of an animated GIF would be scanned, and clever spammers simply leave the first frame blank to exploit this. Your favourite distribution most likely has a binary package available, possibly called ImageMagick or imagemagick.

Libungif

3. Install the Libungif tools and libraries

Next you'll want to install the Libungif library and its associated tools. Specifically, it's the giffix and giftext utilities you want, in order to be able to "fix" prematurely truncated GIF images, since spammers aren't known for providing well-formed images. While this library is often available in a binary package, you'll want the source package in this case, since there's a small patch to apply to it in the next step.

4. Patch the Libungif source code

In order to harden the giftext utility against a particular exploit that can cause it to crash, a small patch is required. Download the patch and apply it to the Libungif source code as follows:

cd util
patch -p0 < giftext-segfault.patch

5. Compile and install Libungif:

Once the patch has been applied, build the Libungif library and utilities:

./configure --prefix=/usr
make
make install

String::Approx

6. Install the String::Approx Perl module:

The String::Approx Perl module provides "fuzzy" matching for text strings, which is helpful for detecting misspelled words, words that an OCR engine misreads, and words that spammers intentionally obfuscate. For instance, '1' and 'l' look very similar to an OCR engine, so a word like "email" could be read as "emai1" by mistake, but with fuzzy matching the two words are treated as equivalent. You can get this module from your favourite distribution's repository, or directly from CPAN.
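
To make the idea of "fuzzy" equivalence concrete, here is a minimal Python sketch. It is not String::Approx's actual algorithm or API; it simply assumes that "fuzz" can be approximated as the edit distance between two words divided by the length of the target word. The function names edit_distance and fuzzy_equal are invented for this illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def fuzzy_equal(target: str, seen: str, threshold: float = 0.3) -> bool:
    """Assumption: two words match if their relative distance is within threshold."""
    if not target:
        return not seen
    distance = edit_distance(target.lower(), seen.lower())
    return distance / len(target) <= threshold


print(fuzzy_equal("email", "emai1"))       # one substitution in five letters
print(fuzzy_equal("email", "emai1", 0.0))  # a threshold of 0 demands an exact match
```

With the 0.3 threshold assumed here, "emai1" is one substitution away from the five-letter "email" (a relative distance of 0.2), so the first call reports a match and the second does not.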

GOCR

7. Download the GOCR source code:

The OCR process is handled by GOCR, but while there are binary packages available for a number of distributions, you'll want the source package in this case, because there's a small patch you need to apply.

8. Patch the GOCR source code:

Some grey images have been known to trigger segmentation faults in GOCR 0.40, so a small patch has been devised to fix this vulnerability. This is not much of an issue in most normal OCR environments, since choking on an input image doesn't usually have serious consequences, but in a spam-filtering environment we need to handle such situations more gracefully.

Once you've downloaded the GOCR 0.40 source code package and unpacked it, go to the src subdirectory and apply the patch to the pgm2asc.c file:

cd src
patch -p1 < patch-gocr-segfault

9. Compile and install GOCR:

From there, building GOCR is straightforward:

./configure --prefix=/usr
make
make install

10. Test your OCR setup:

To make sure that everything you've installed so far works, test it with a sample image copied from some spam you've received, preferably an image that contains some text. If it's a GIF image, for instance, run it through giftopnm:

giftopnm image001.gif > image001.pnm
giftopnm: too much input data, ignoring extra...
giftopnm: bogus character 0x00, ignoring

Don't be distressed by the informational messages you receive as a result; remember that spammers aren't always going to supply you with a standards-compliant image to work with. In fact, they often hope that a deliberately-malformed image will break your scanner, or at least register an error that leaves your scanner unsure how to classify it, hoping for the benefit of the doubt.

That's where giffix comes in. Try repairing the same image, and then run it through giftopnm again:

giffix image001.gif > image001-fixed.gif
giftopnm image001-fixed.gif > image001-fixed.pnm

This time the warning messages should be gone.

Now run the output through GOCR:

gocr image001-fixed.pnm

A second or two later, you should see a bunch of text as read from the image (presuming it had any to begin with, of course). There's likely to be some other garbage too, and not all of the words will be properly read--in particular, the OCR software has trouble distinguishing 'r' from 'n', and 'I' from 'l', but for the most part you should be able to recognize the words--and that's good enough for our purposes, especially with our "fuzzy" matching tools.
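
If you'd like to repeat this test on several images, the three commands above can be chained in a small script. The following Python sketch is only one way to do it: the function names ocr_commands and run_ocr are invented here, the tools are assumed to be on your PATH, and giftopnm is assumed to read the GIF from standard input when no file name is given.

```python
import subprocess
from typing import List


def ocr_commands(gif_path: str) -> List[List[str]]:
    """Return the chain used in this step: repair, convert to PNM, then OCR."""
    return [
        ["giffix", gif_path],  # writes the repaired GIF to stdout
        ["giftopnm"],          # assumed to read the GIF from stdin
        ["gocr", "-i", "-"],   # reads the PNM from stdin, prints the text
    ]


def run_ocr(gif_path: str) -> str:
    """Feed each tool's stdout into the next tool's stdin and return the OCR text."""
    data = None
    for argv in ocr_commands(gif_path):
        result = subprocess.run(argv, input=data,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                check=False)
        data = result.stdout
    return data.decode("utf-8", errors="replace")


if __name__ == "__main__":
    import sys
    if len(sys.argv) > 1:
        print(run_ocr(sys.argv[1]))
```

Warnings from the tools are captured rather than displayed, so if an image produces no text at all, run the individual commands by hand to see what went wrong.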

FuzzyOCR Plugin for SpamAssassin

11. Download and install the FuzzyOCR plugin for SpamAssassin:

Now that you've got the underlying tools installed and working, you can download the FuzzyOCR Plugin for SpamAssassin. To install it, unpack the tarball in a temporary subdirectory and copy FuzzyOcr.pm (the plugin itself) and FuzzyOcr.cf (its configuration file) to your SpamAssassin directory, wherever your local.cf file is located (e.g. /etc/mail/spamassassin).

Note: If there's a loadplugin line at the top of FuzzyOcr.cf, delete it; that line belongs elsewhere, as the next step explains.

12. Tell SpamAssassin to load the FuzzyOCR plugin

Add the following lines to your v310.pre file, so that the plugin gets loaded at startup:

# FuzzyOCR - performs fuzzy Optical Character Recognition on spam images
#
loadplugin FuzzyOcr /etc/mail/spamassassin/FuzzyOcr.pm
loadplugin Mail::SpamAssassin::Timeout

Note that some binary packages of SpamAssassin don't seem to include the Timeout plugin, so if you don't have a Timeout.pm file in your SpamAssassin Perl library you may need to download the full SpamAssassin source package for your version and copy the Timeout.pm file from it. If you have to do so, be sure to place the Timeout.pm file where the rest of your SpamAssassin modules are found, usually something like /usr/lib/perl5/site_perl/5.8.6/Mail/SpamAssassin.

13. Edit the word list

Copy the FuzzyOcr.words.sample file to FuzzyOcr.words in your SpamAssassin directory and edit it, adding target words to the default list (or removing some):

# Here we define the words to scan for
# Stock
alert
charts
profit
news::0.2
breaking
symbol
alert
stock
investor
international
company
money::0
million
thousand
buy
price::0.2
trade
target
banking
service
recommendation
# Pills
viagra
cialis
xanax
valium
meridia
zanaflex
levitra
medicine
legal::0.2
penis::0
medication
growth
drugs
pharmacy
prescription
# Misc
click here
software
kunde::0.2
volksbank
sparkasse

Notice that a target word can optionally carry a second parameter that specifies how "exact" the match must be. By default, the plugin uses the focr_threshold value from the FuzzyOcr.cf file (default: 0.3) to determine how loosely it should match words, but for some words this can be too loose, resulting in false positives. You can override the threshold for a specific word by appending a value to the word itself, separated by ::. For example:

alpha
beta::0
gamma::0.2

In this example, the word alpha is matched with the usual focr_threshold value. The word beta is matched using a threshold of 0, which is essentially an "exact" match, while gamma is matched with a threshold of 0.2.

As a rule of thumb, if you start to see false positives with a particular word, reduce its threshold in small steps--say 0.1 at a time--until the false positives stop occurring.
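
The word-list format is simple enough to check with a few lines of code before you deploy it. The sketch below is an illustration in Python, not the plugin's own Perl parser; the function name parse_wordlist is invented here, and it assumes that blank lines and lines starting with # are ignored.

```python
from typing import Dict


def parse_wordlist(text: str, default_threshold: float = 0.3) -> Dict[str, float]:
    """Map each target word to its threshold, honouring word::threshold overrides."""
    words: Dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments such as "# Pills"
        word, sep, value = line.partition("::")
        words[word.strip()] = float(value) if sep else default_threshold
    return words


sample = "alpha\nbeta::0\ngamma::0.2\n"
print(parse_wordlist(sample))
# {'alpha': 0.3, 'beta': 0.0, 'gamma': 0.2}
```

A malformed override such as "beta::abc" raises a ValueError in this sketch, which makes typos in the list easy to spot.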

14. Edit the FuzzyOcr.cf file

Logging Options
# Verbosity level (see manual) Attention: Don't set to 0, but to 0.0 for quiet operation. (Default value: 1)
#focr_verbose 1
#
# Logfile (make sure it is writable by the plugin) (Default value: /etc/mail/spamassassin/FuzzyOcr.log)
focr_logfile /etc/mail/spamassassin/FuzzyOcr.log

The plugin logs its activities to focr_logfile, which by default is /etc/mail/spamassassin/FuzzyOcr.log. This file must be writable by your amavis/maia user. Three verbosity levels are supported (via focr_verbose):

  • 0.0 : Quiet mode.
  • 1 : All words and their corresponding measured distance ("fuzz"), e.g.
6.0 FUZZY_OCR    BODY: Mail contains an image with common spam text inside
                        Words found:
                        "viagra" with fuzz of 0.2
                        "cialis" with fuzz of 0
                        "viagra" with fuzz of 0.2
                        "levitra" with fuzz of 0
                        (4 word occurrences found)
  • 2 : Additional debugging information.

Word Lists
# Here we define the words to scan for (Default value: /etc/mail/spamassassin/FuzzyOcr.words)
focr_global_wordlist /etc/mail/spamassassin/FuzzyOcr.words
#
# This is the path RELATIVE to the respective home directory for the personalized list
# This list is merged with the global word list on execution (Default value: .spamassassin/fuzzyocr.words)
#focr_personal_wordlist .spamassassin/fuzzyocr.words

You can specify a global list of target words in a text file with one word per line (as explained in Step 13, above). The default for this file is /etc/mail/spamassassin/FuzzyOcr.words, but you can change this to point to any file you like by editing focr_global_wordlist.

While this version of the plugin also offers the ability to add per-user word lists (with the focr_personal_wordlist setting), this has no usefulness in a Maia Mailguard context, where there's only one SpamAssassin user (i.e. your amavis/maia user).

SpamAssassin Version
# Set this to 1 if you are running a version < 3.1.4.
# This will disable a function used in conjunction with animated gifs that isn't available in earlier versions (Default value: 0.0)
#focr_pre314 0.0

The plugin will work with SpamAssassin versions 3.1 and later, but for best performance you should be using version 3.1.4 or later, which includes better support for dealing with animated GIFs. If you're using an earlier version of SpamAssassin, set focr_pre314 to 1 to use a less-efficient (but more compatible) alternative.
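
If you manage several servers, the decision can be reduced to a version comparison. This Python sketch assumes you already know the SpamAssassin version as a plain dotted string such as "3.1.3"; the function name needs_pre314 is invented for the illustration.

```python
def needs_pre314(version: str) -> bool:
    """True if focr_pre314 should be set to 1 for this SpamAssassin version."""
    # Compare numerically, so that 3.1.10 counts as later than 3.1.4.
    parts = tuple(int(piece) for piece in version.split(".")[:3])
    return parts < (3, 1, 4)


for version in ("3.1.3", "3.1.4", "3.1.10"):
    print(version, "focr_pre314 1" if needs_pre314(version) else "focr_pre314 0.0")
```

Version strings with suffixes (release candidates, distribution tags) are not handled here and would need to be trimmed first.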

Path Settings
#focr_bin_giffix /usr/bin/giffix
#focr_bin_giftext /usr/bin/giftext
#focr_bin_gifasm /usr/bin/gifasm
#focr_bin_gifinter /usr/bin/gifinter
#focr_bin_giftopnm /usr/bin/giftopnm
#focr_bin_jpegtopnm /usr/bin/jpegtopnm
#focr_bin_pngtopnm /usr/bin/pngtopnm
#focr_bin_ppmhist /usr/bin/ppmhist
#focr_bin_convert /usr/bin/convert
#focr_bin_identify /usr/bin/identify
#focr_bin_gocr /usr/bin/gocr

By default, the plugin expects to find all of the utilities for Libungif (giffix, giftext, gifasm, gifinter), Netpbm (giftopnm, jpegtopnm, pngtopnm, ppmhist), ImageMagick (convert, identify), and GOCR (gocr) in /usr/bin. If you've installed these files elsewhere, you'll want to override the path settings for them here so the plugin can find them.
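
Before going further, it can save time to confirm that every helper really is where the plugin will look for it. The following Python sketch checks a directory for the eleven tools named above; the function name missing_tools is invented here, and it checks only a single directory, so run it once per location if your tools are spread out.

```python
import os
from typing import List, Sequence

# The utilities listed in the Path Settings section of FuzzyOcr.cf.
TOOLS = ("giffix", "giftext", "gifasm", "gifinter",
         "giftopnm", "jpegtopnm", "pngtopnm", "ppmhist",
         "convert", "identify", "gocr")


def missing_tools(bin_dir: str = "/usr/bin", tools: Sequence[str] = TOOLS) -> List[str]:
    """Return the tools that are absent or not executable in bin_dir."""
    missing = []
    for tool in tools:
        path = os.path.join(bin_dir, tool)
        if not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(tool)
    return missing


if __name__ == "__main__":
    for tool in missing_tools():
        print("not found in /usr/bin:", tool)
```

Run it as your amavis/maia user, since that is the account whose permissions matter.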

Scan Sets
##### Scansets, comma separated (Default value: $gocr -i -, $gocr -l 180 -d 2 -i -) #####
# Each scanset consists of one or more commands which make text out of pnm input.
# Each scanset is run separately on the PNM data, results are combined in scoring.
#focr_scansets $gocr -i -, $gocr -l 180 -d 2 -i -
#
# To use only one scan with default values, uncomment the next line instead
#focr_scansets $gocr -i -
#
# Some examples for more advanced sets
# This one first uses the standard scan, then a scanset which first reduces the image to 3 colors and then scans it with custom settings
# and then it scans again only with these custom settings
# NOTE: This is for advanced users only, if you have questions how to use this, ask on the ML or on IRC
#focr_scansets $gocr -i -, pnmnorm 2>$errfile | pnmquant 3 2>>$errfile | pnmnorm 2>>$errfile | $gocr -l 180 -d 2 -i -, $gocr -l 180 -d 2 -i -

This version of the plugin lets you perform multiple OCR scans on the same image, using different scan resolutions and tolerances. This more thorough approach results in better word-matches, at the cost of more processing time. This is particularly useful for handling images that use odd combinations of foreground and background colours (e.g. white text on coloured backgrounds), lines, dots and other "noise" patterns intended to throw off OCR engines. Scanning images with just one resolution and tolerance setting is necessarily a compromise--you end up choosing settings that work well for most images, but a lot of the edge cases slip through. By running the OCR routines two or three times with different scanning parameters, the plugin can catch more of those edge cases.

The focr_scansets setting lets you specify the command-line options to GOCR and other utilities for one or more scan sets, separated by commas.

If you're just interested in doing a single scan at the default resolution and tolerances, you can still do so by specifying just one scan set:

focr_scansets $gocr -i -

The default, however, is to run two scan sets--one at the default resolution and tolerances, the second at a grey level of 180 and dust size of 2 pixels:

focr_scansets $gocr -i -, $gocr -l 180 -d 2 -i -

As the commented "advanced" example illustrates, you can specify any command-line you like as a scanset, not just GOCR commands. You can chain image-manipulation tools together as desired, as long as the chain begins with the image in PNM format and ends with a call to GOCR.
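
To see how a focr_scansets value breaks down into individual scans, here is a rough Python sketch. It is an assumption about the general shape of the setting, not the plugin's actual parsing code: it splits the value on commas and replaces the $gocr placeholder with the configured GOCR path. The function name parse_scansets is invented here.

```python
from typing import List


def parse_scansets(setting: str, gocr_path: str = "/usr/bin/gocr") -> List[str]:
    """Split a comma-separated scanset value and expand the $gocr placeholder."""
    scansets = []
    for part in setting.split(","):
        command = part.strip()
        if command:
            scansets.append(command.replace("$gocr", gocr_path))
    return scansets


for command in parse_scansets("$gocr -i -, $gocr -l 180 -d 2 -i -"):
    print(command)
# /usr/bin/gocr -i -
# /usr/bin/gocr -l 180 -d 2 -i -
```

Because the separator is a comma, a scanset built this way cannot itself contain a literal comma in one of its commands.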

If you decide to experiment with command-line options and tool-chains, the plugin's author offers the following advice:

  • pnmnorm, pnminvert and pnmquant are useful with white text or text with many colors.
  • If you get images which are littered with small dots/lines, try -d 2 as an argument to gocr.
  • The -l setting often helps, try values like 180, 140, or 100.

Miscellaneous Settings
# Timeout for the plugin, in seconds. (Maximum runtime of the plugin) (Default value: 10)
#focr_timeout 10
#
# Default detection threshold (see manual) (Default value: 0.3) (Can be changed on a per word basis in the wordlist).
#focr_threshold 0.3
#
# This is the score for a hit after focr_counts_required matches
#focr_base_score 4
#
# This is the additional score for every additional match after focr_counts_required matches (Default value: 1)
#focr_add_score 1
#
# This is the score to give for a wrong content-type (e.g. JPEG image but content type says GIF) (Default value: 1.5)
#focr_wrongctype_score 1.5
#
# This is the score to give for a corrupted image (This currently affects only GIF images) (Default value: 2.5)
#focr_corrupt_score 2.5
#
# This is the score to give for a corrupted unfixable image (This currently affects only GIF images) (Default value: 5)
#focr_corrupt_unfixable_score 5
#
# This is used to disable the OCR engine if the message has already more points than this value (Default value: 10)
#focr_autodisable_score 10
#
# Number of minimum matches before the rule scores (Default value: 2)
#focr_counts_required 2
#
# Specifies, how many frames an animated gif must contain, so the second (less resource consuming) animated gif test is used. (Default value: 5)
#focr_gif_max_frames 5

The OCR process can take time, especially if you're running multiple scan sets. The focr_timeout setting, however, lets you set an upper bound (in seconds) on how much time the plugin will spend before returning its results, if any (default: 10).

The focr_threshold setting determines how "fuzzy" a word match is allowed to be. Higher settings result in more matches, but also more false positives; lower settings result in fewer matches, and more false negatives. Finding the right tolerance is the key, and it may vary from word to word, depending on how good the OCR engine is. As explained above in Step 13, you can specify this tolerance on a per-word basis in the word list as well.

It's useful to understand how the plugin assigns its score value to the FUZZY_OCR rule. The rule is only triggered if there are at least focr_counts_required word matches (default: 2) in the image. At that point, the rule's score is focr_base_score, plus focr_add_score for every word match beyond focr_counts_required (default: 4, plus 1 per match after the second). At default values, then, two matching words would score a total of 4 points; three matching words would score 5 points; four would score 6 points, etc. Feel free to adjust these values to your tastes. Don't forget to uncomment these values if you change them!
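
If you plan to adjust these values, it helps to see the arithmetic in one place. The Python sketch below restates the scoring rule as described in this step; it is not taken from the plugin's source, and the function name fuzzy_ocr_score is invented for the illustration.

```python
def fuzzy_ocr_score(matches: int,
                    base_score: float = 4.0,
                    add_score: float = 1.0,
                    counts_required: int = 2) -> float:
    """Score for the FUZZY_OCR rule, given the number of word matches found."""
    if matches < counts_required:
        return 0.0  # too few matches: the rule does not fire
    return base_score + add_score * (matches - counts_required)


for matches in (1, 2, 3, 4, 19):
    print(matches, fuzzy_ocr_score(matches))
# 1 0.0
# 2 4.0
# 3 5.0
# 4 6.0
# 19 21.0
```

Trying a few match counts against your own values shows quickly whether a single image could push a message past your spam threshold on its own.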

The focr_wrongctype_score setting lets you penalize mail that contains images that claim to be one type but are actually another, such as a GIF that's advertised as a JPEG in the MIME content-type header. focr_corrupt_score similarly penalizes malformed GIF images, and focr_corrupt_unfixable_score penalizes GIF images so badly malformed that they can't be repaired. Eventually perhaps these will penalize malformed images of other types.

The focr_autodisable_score setting is more controversial. In principle it's a way to save some processing cycles by avoiding an OCR scan if there are already enough other rules triggering on the mail to achieve this minimum score (default: 10). The downside is that this mucks with efforts to statistically measure the performance of the OCR-based rules, since there's no longer any guarantee that these rules will be called every time they should be. Upcoming Maia features such as Dynamic Score Balancing will not work properly if this setting is used, so unless you're truly strapped for processor cycles it's advisable to set this to an unrealistically high value (e.g. 999) to effectively disable it.

When it comes to handling animated GIFs, the plugin can use one of two tools to unpack the frames--ImageMagick's convert or Libungif's gifasm. The convert utility is fast, but for images with a lot of frames the gifasm tool is more efficient. The focr_gif_max_frames setting lets you determine the frame-count at which the gifasm tool should be used instead of convert (default: 5). If you want to use gifasm all the time, of course, just set this to 1.

Image Hash Database
##### Image Hash Database settings (Experimental, disabled by default) #####
#
# Set this to 1 to enable the Image Hash database feature (Default value: 0.0)
#focr_enable_image_hashing 0.0
#
# The score is saved with the hash in the database, so no extra scoring for a db hit is required.
#
# If the image hash database feature is enabled, specify the file here to use as database (Default value: /etc/mail/spamassassin/FuzzyOcr.hashdb)
#focr_digest_db /etc/mail/spamassassin/FuzzyOcr.hashdb
#
# Automatically add hashes of spam images recognized by OCR to the Image Hash database, to disable, set to 0.0 (Default value: 1)
#focr_hashing_learn_scanned 1

This version of the plugin includes a custom hash database that serves as a local cache for previously recognized images, so that the OCR engine won't need to be called if the image has been received before. The default location of this database is /etc/mail/spamassassin/FuzzyOcr.hashdb, but you can change this by setting an explicit path for focr_digest_db. The feature is still considered "experimental" and is disabled by default, so if you wish to use it, you need to enable it by setting focr_enable_image_hashing to 1. If you do enable it, leave focr_hashing_learn_scanned at 1 (its default) as well, to ensure that the plugin not only reads the database but writes to it as well. Almost needless to say, your amavis/maia user needs to be able to write to this file in that case.
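
The caching idea can be pictured with a short Python sketch. It assumes the database is a plain text file holding one score::digest entry per line, which is the format written by the print statement patched in Step 15; everything else is an illustration rather than the plugin's own code. The function names lookup_hash and learn_hash are invented here, and the file locking that the plugin performs is left out for brevity.

```python
import os
from typing import Optional


def lookup_hash(db_path: str, digest: str) -> Optional[float]:
    """Return the stored score for a known image digest, or None if unseen."""
    if not os.path.exists(db_path):
        return None
    with open(db_path, encoding="utf-8") as db:
        for line in db:
            # Split on the first "::" only, since a digest may contain "::" itself.
            score, sep, stored = line.rstrip("\n").partition("::")
            if sep and stored == digest:
                return float(score)
    return None


def learn_hash(db_path: str, digest: str, score: float) -> None:
    """Append a newly recognized spam image so the next scan can skip OCR."""
    with open(db_path, "a", encoding="utf-8") as db:
        db.write(f"{score}::{digest}\n")
```

A caller would try lookup_hash first and fall back to the OCR scan, followed by learn_hash, only when the lookup returns None.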

15. Patch the FuzzyOcr.pm file

There are a couple of small fixes to be made to the FuzzyOcr.pm file to make the hashing database work properly. Until these are eventually corrected by the plugin's author, the fix is to apply the following patches:

*** FuzzyOcr.pm-orig  2006-08-27 04:35:12.000000000 -0700
--- FuzzyOcr.pm       2006-08-30 15:10:17.934275225 -0700
***************
*** 490,494 ****
      flock( DB, LOCK_EX );
      seek( DB, 0, 2 );
!     print DB "$score::$digest\n";
      flock( DB, LOCK_UN );
      close(DB);
--- 490,494 ----
      flock( DB, LOCK_EX );
      seek( DB, 0, 2 );
!     print DB "${score}::${digest}\n";
      flock( DB, LOCK_UN );
      close(DB);

Copy that block of text to a file called hashdb.patch or somesuch, and apply it with:

patch -p0 < hashdb.patch

Note: If the patching process fails, don't worry--that most likely just means the plugin author has fixed the problem and updated the tarball, so the version you've downloaded already contains the fix.

The second patch is a safeguard against the "poisoning" of your hash database. Without this patch, spammers could include innocuous images (e.g. logos from businesses like eBay, Amazon, PayPal, etc.) alongside their spam images, and the FuzzyOCR plugin would add those to its hash database as well. The patch ensures that only images that contain at least one matching word from the word-list get added to the hash database. Apply the patch as usual:

patch -p0 < fuzzyocr-23b-hashdb-poison.patch

16. Test the installation

Now you can verify that you've got all the paths set properly and that you have all of the necessary pieces in place. As your amavis/maia user, run:

spamassassin -D --lint

If everything is working properly, this shouldn't produce any errors, and in particular you should see something like:

...
plugin: loading FuzzyOcr from /etc/mail/spamassassin/FuzzyOcr.pm
plugin: registered FuzzyOcr=HASH(0xb9fde84)
plugin: loading Mail::SpamAssassin::Timeout from @INC
plugin: registered Mail::SpamAssassin::Timeout=HASH(0xb18501c)
...

If for some reason you don't see the FuzzyOCR module being loaded, it may be because security-related settings in your operating system require Perl modules to have their execute bits set. Usually this is unnecessary (and inadvisable), but one Maia user has reported that this was necessary to get the plugin to load properly:

chmod 744 FuzzyOcr.pm

The plugin also comes with a number of test emails in the samples subdirectory. As your amavis/maia user, you can test each of these to make sure the plugin detects them properly.

spamassassin -t < animated-gif.eml
...
  21 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "alert" in 4 lines
                            "charts" in 1 lines
                            "symbol" in 1 lines
                            "alert" in 4 lines
                            "stock" in 2 lines
                            "company" in 3 lines
                            "trade" in 1 lines
                            "xanax" in 1 lines
                            "meridia" in 1 lines
                            "growth" in 1 lines
                            (19 word occurrences found)

spamassassin -t < corrupted-gif.eml
...
 1.5 FUZZY_OCR_WRONG_CTYPE  BODY: Mail contains an image with wrong
                            content-type set
                            Image has format "GIF" but content-type is
                            "image/jpeg"
 2.5 FUZZY_OCR_CORRUPT_IMG  BODY: Mail contains a corrupted image
                            Corrupt image: GIF-LIB error: Image is
                            defective, decoding aborted.
  12 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "alert" in 1 lines
                            "alert" in 1 lines
                            "stock" in 2 lines
                            "investor" in 1 lines
                            "company" in 1 lines
                            "trade" in 1 lines
                            "target" in 1 lines
                            "service" in 1 lines
                            "recommendation" in 1 lines
                            (10 word occurrences found)

spamassassin -t < jpeg.eml
...
 4.0 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "viagra" in 2 lines
                            "cialis" in 1 lines
                            "levitra" in 1 lines
                            (4 word occurrences found)

spamassassin -t < png.eml
...
  28 FUZZY_OCR              BODY: Mail contains an image with common spam text inside
                            Words found:
                            "alert" in 2 lines
                            "news" in 2 lines
                            "symbol" in 1 lines
                            "alert" in 2 lines
                            "stock" in 1 lines
                            "investor" in 3 lines
                            "company" in 2 lines
                            "buy" in 1 lines
                            "price" in 2 lines
                            "trade" in 2 lines
                            "target" in 2 lines
                            "service" in 2 lines
                            "recommendation" in 1 lines
                            "levitra" in 1 lines
                            "software" in 2 lines
                            (26 word occurrences found)

Next you can verify that the hashing database is working properly (if you've enabled it, that is), by trying to test one of those emails a second time:

spamassassin -t < animated-gif.eml
...
  21 FUZZY_OCR_KNOWN_HASH   BODY: Mail contains an image with known hash
                            Hash
                            "1009752:453:743:28::255:255:255:255:308010::0
                            :0:0:0:17164::0:64:128:52:3217::128:0:0:38:303
                            1::128:128:0:113:1452" is in the database.

Note: If the animated-gif.eml test fails, try setting focr_gif_max_frames to 1 and try again. This will use an alternate method for unpacking the image frames that may work better for you than the default.

Note: If the png.eml test fails, you're probably using an earlier build of this plugin that has a small typo. Download the latest version of the plugin, as the tarball has been updated with the fix.

17. Tell Maia about the new rules

If everything is working properly, you'll want to run the load-sa-rules script to make sure that Maia discovers the new rules you just added (in the FuzzyOcr.cf file). There should be a handful of new rules:

[load-sa-rules] Adding new rule: FUZZY_OCR (Mail contains an image with common spam text inside)
[load-sa-rules] Adding new rule: FUZZY_OCR_WRONG_CTYPE (Mail contains an image with wrong content-type set)
[load-sa-rules] Adding new rule: FUZZY_OCR_CORRUPT_IMG (Mail contains a corrupted image)
[load-sa-rules] Adding new rule: FUZZY_OCR_KNOWN_HASH (Mail contains an image with known hash)
[load-sa-rules] 4 new rules added (3214 rules total), all scores updated.

18. Restart amavisd-maia

Now you can restart amavisd-maia and start looking for these rules in your log files, and in Maia's mail viewer, once you begin receiving mail items that contain images with text in them. The processing time on such items will be a few seconds longer than usual, but mail items without images in them won't be affected, since the FuzzyOCR plugin won't be called in those cases.

As a side-note, you may notice some unusual warnings when you run the process-quarantine script, such as:

Subroutine new redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 116.
Subroutine parse_config redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 126.
Subroutine dummy_check redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 223.
Subroutine fuzzyocr_check redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 227.
Subroutine load_global_words redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 237.
Subroutine load_personal_words redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 255.
Subroutine parse_scansets redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 278.
Subroutine max redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 285.
Subroutine reorder redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 293.
Subroutine pipe_io redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 298.
Subroutine handle_error redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 410.
Subroutine logfile redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 416.
Subroutine check_image_hash_db redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 435.
Subroutine add_image_hash_db redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 475.
Subroutine calc_image_hash redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 497.
Subroutine debuglog redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 537.
Subroutine wrong_ctype redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 543.
Subroutine corrupt_img redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 562.
Subroutine known_img_hash redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 587.
Subroutine check_fuzzy_ocr redefined at /etc/mail/spamassassin/FuzzyOcr.pm line 602.

These warnings are annoying but harmless noise that results from what seems to be an oversight on the part of the plugin author. We can hope that he will eventually modify his plugin to make it behave properly when more than one SpamAssassin object is loaded into memory at the same time, but until then, you can safely ignore these warnings. A workaround in the process-quarantine script that ships with Maia 1.0.2 should also take care of this, if the plugin author hasn't fixed it by then.


Last modified on Sep 6, 2006, 11:24:49 PM