Changes between Version 3 and Version 4 of OCRmain


Timestamp:
May 8, 2008, 2:13:36 PM
Author:
rjl
Comment:

--

  • OCRmain

    v3 v4  
    11= Using OCR to detect image-based spam =
    22
    3 As anti-spam tactics have improved over the years, an increasing number of spammers are turning to image-based spam to try to bypass filters.  Instead of including their messages in the body of the mail, they've resorted to producing images that contain this same text instead.  These HTML-based emails, then, contain a single inline image, along with some randomly-selected text passages from public-domain literature likely to be inoffensive to spam filters.
     3As anti-spam tactics have improved over the years, some spammers are turning to image-based spam to try to bypass filters.  Instead of including their messages in the body of the mail, they've resorted to producing images that contain this same text instead.  These HTML-based emails, then, contain a single inline image, along with some randomly-selected text passages from public-domain literature likely to be inoffensive to spam filters.
    44
    55Indeed, to most spam filters these emails look like legitimate mail that happens to include an image.  Bayes is all but useless against this sort of tactic, since it can only operate on the text portion of the mail (which is of course innocuous).  SURBL is also powerless, since there are no URIs to find in the mail--they're displayed in the image, albeit in non-clickable form.
     
    77What's needed, then, is the ability to extract the text from these images and process it.  That's a job for OCR (Optical Character Recognition), and with a little effort it's possible to add this functionality to SpamAssassin (v. 3.1.1 and later).
    88
    9 There are currently two OCR plugins available for SpamAssassin:
     9The most effective and best-supported OCR plugin for SpamAssassin is [http://fuzzyocr.own-hero.net/ FuzzyOcr],
     10though you'll likely need to use the latest SVN version if you're using SpamAssassin 3.2.x or newer.
    1011
    11 The [wiki:OCR original OCR plugin] was the first OCR plugin to be developed and works quite well, but it relies on a set of regular expression patterns to match image-based text, and these patterns are not particularly easy to add to.  On the other hand, it has very good detection for malformed images, and penalizes them accordingly, which is quite useful.
    12 
    13 The [wiki:FuzzyOCR23 "fuzzy" OCR plugin] is based on the original plugin, but with the improvement of adding "fuzzy" text matching to make it easier to add new target words without having to define regular expression patterns.  Unfortunately it lacks the original plugin's malformed JPEG and PNG image detection at the moment (it does detect malformed GIFs in the current version), though this is likely to change in the future.
    14 
     12'''NOTE:''' Image spam seems to have been an experimental phase for spammers during the fall of 2007; since
     13then it appears to have fallen out of favour and has not been seen in significant numbers.  With that in
     14mind, it's hard these days to justify the extra resources required to perform OCR scanning as part of the
     15spam-filtering process.  Basically, if image spam is something you're still encountering, try the latest
     16[http://fuzzyocr.own-hero.net/ FuzzyOcr] plugin, otherwise there's not much need.
    1517
    1618----