Opened 15 years ago

Closed 13 years ago

#358 closed enhancement (invalid)

Add OCR support

Reported by: rjl
Owned by: rjl
Priority: normal
Milestone: 1.1.0
Component: amavisd-maia
Version: 1.0.1
Severity: normal
Keywords: OCR image


As we face an increasing tide of image-spam, OCR is becoming a more useful tool in our arsenal. A plugin for SpamAssassin makes use of OCR tools to score images that contain text, but in order to properly subject the OCR-extracted text to the full battery of SpamAssassin tests, the text needs to be extracted before it gets handed to SpamAssassin in the first place. Amavisd-maia should be doing this, as part of its unpacking and decoding duty.

When the text is extracted, it should be appended to the body of the mail for the purpose of submitting the whole thing to SpamAssassin. That way SpamAssassin doesn't need to know anything about OCR, and no plugins are needed. The OCR-extracted text is then treated like any other text in the body of the mail, so its URLs can be tested against SURBL, and the words and other tokens can be tested against the regular expression rules and the Bayes database.

Note that the OCR-extracted text should be appended to the body only for the SpamAssassin scan; it should not become part of the actual mail contents stored in the database or reported to hashing systems. The pristine original should be used for those purposes.
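To make the idea concrete, here is a minimal sketch in Python (amavisd-maia itself is Perl, so this is illustration only, not a patch). The function name `build_scan_copy` and the `ocr` callable are hypothetical; `ocr` stands in for whatever OCR tool would actually be invoked. The sketch attaches the extracted text as an extra text/plain part of a scan-only copy and leaves the original bytes untouched.

```python
from email import message_from_bytes
from email.mime.text import MIMEText
from typing import Callable


def build_scan_copy(raw: bytes, ocr: Callable[[bytes], str]) -> bytes:
    """Return a scan-only copy of `raw` with OCR text attached.

    `ocr` is a hypothetical callable: image bytes in, extracted text out.
    `raw` itself is never modified, so it can still be stored and
    reported in its pristine form.
    """
    msg = message_from_bytes(raw)
    extracted = []
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            image_bytes = part.get_payload(decode=True)
            if image_bytes:
                text = ocr(image_bytes).strip()
                if text:
                    extracted.append(text)

    # Nothing to add, or a single-part mail we cannot attach to:
    # hand back the original bytes unchanged.
    if not extracted or not msg.is_multipart():
        return raw

    # Attach the OCR text as one more text/plain part of the scan copy.
    msg.attach(MIMEText("\n".join(extracted), "plain", "utf-8"))
    return msg.as_bytes()
```

The returned bytes would be what gets handed to SpamAssassin, while `raw` is what gets stored and reported. A mail consisting of a single bare image part is passed through unchanged here; handling that case would mean rewrapping it as multipart, which this sketch does not attempt.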


Change History (2)

comment:1 Changed 15 years ago by rjl

  • Status changed from new to assigned

One potential complication: this risks breaking the hashing systems (e.g. Razor, Pyzor, DCC, SpamCop), since these hash tests are performed by SpamAssassin based on the mail that gets submitted to it for scanning purposes. By modifying that mail before we submit it to SpamAssassin, we effectively change the hash that SpamAssassin will try to look up against these databases, so the lookups will almost certainly fail. Even if we're careful to only submit the pristine original when we do spam reporting, we still have to solve the problem of hash lookups at scan-time.

One possible solution is of course to move the hashing tests out of SpamAssassin and into amavisd-maia, where we have more control over what gets submitted to them. That's messy, but the alternative would appear to be patching SpamAssassin itself, which may be even messier.
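A rough Python sketch of that split, for illustration only: the digest lookup runs against the pristine bytes, while the content rules see the OCR-augmented copy. Everything here is an assumption. `scan`, `ScanResult`, `digest_lookup` and `content_score` are hypothetical names, and SHA-256 is only a stand-in for a body-derived digest, not the algorithm Razor, Pyzor or DCC actually use.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScanResult:
    digest_hit: bool      # did the pristine mail match a known digest?
    content_score: float  # score from content rules on the augmented copy


def scan(
    pristine: bytes,
    augmented: bytes,
    digest_lookup: Callable[[str], bool],
    content_score: Callable[[bytes], float],
) -> ScanResult:
    """Route each test to the copy of the mail it should see.

    `digest_lookup` and `content_score` are hypothetical hooks standing
    in for the hashing networks and the SpamAssassin content scan.
    """
    # Stand-in digest: computed from the pristine bytes only, so the
    # appended OCR text cannot change what gets looked up.
    digest = hashlib.sha256(pristine).hexdigest()
    return ScanResult(
        digest_hit=digest_lookup(digest),
        content_score=content_score(augmented),
    )
```

The point of the split is that the lookup still matches reports made from unmodified mail, while the regular expression rules, SURBL checks and Bayes tokens get the benefit of the OCR text.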

comment:2 Changed 13 years ago by rjl

  • Resolution set to invalid
  • Status changed from assigned to closed

This is being handled jointly by the FuzzyOcr developers and the SpamAssassin developers, such that Maia should not need to do anything special.
