E-mail management and junk/duplicate messages

I get a lot of duplicate junk e-mail. I'm not referring here to unsolicited commercial e-mail (spam), but to duplicate messages from people at Purdue or people I know. Most are seminar announcements; I regularly receive five announcements for the same talk. Some cases are more innocent than others. Sometimes a recipient is accidentally on a distribution list twice. Sometimes a seminar announcement is sent by multiple sources. And sometimes it is school policy to flood us with repeat announcements. (In Fall 2003, it was official policy from Professor Krause to send ECE694 announcements "starting the Friday before the seminar and every other day after that with a reminder the final day.")

I have no desire to receive the same message more than once. However, offending senders occasionally send important messages, and I would like to receive the initial copy. Therefore, coarse solutions, such as blocking all mail from certain senders, are inadequate for me. To manage the problem, I use the following three filters with procmail, a mail-processing tool (see "man procmail" for details). Each moves the duplicate messages to a mail folder called "dupes."

(1) Eliminate messages with duplicate Message-ID: fields:
This one helps if your e-mail address is on a distribution list more than once or if someone uses "bounce" forwarding to send you a duplicate message. Add the following to your .procmailrc:

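# Check the Message-ID against a cache of roughly 8192 bytes
:0 Whc: msgid.lock
| formail -D 8192 msgid.cache
# If formail reported a duplicate, file this copy in the dupes folder
:0 a:
mail/dupes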
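
To make the idea behind the Message-ID cache concrete, here is a minimal Python sketch of the same kind of check. It is not how formail is implemented: the class name MessageIdCache and the max_entries limit are hypothetical, and the sketch bounds the cache by entry count rather than by a byte size as "formail -D" does.

```python
import email
from collections import OrderedDict


class MessageIdCache:
    """Bounded memory of recently seen Message-ID values (hypothetical helper)."""

    def __init__(self, max_entries=100):
        # Assumption: limit by number of IDs, not by bytes as formail does.
        self.max_entries = max_entries
        self._seen = OrderedDict()

    def is_duplicate(self, raw_message):
        """Return True if this message's Message-ID has been seen before."""
        msg = email.message_from_string(raw_message)
        msg_id = (msg.get("Message-ID") or "").strip()
        if not msg_id:
            return False  # nothing to compare against, so let it through
        if msg_id in self._seen:
            self._seen.move_to_end(msg_id)  # mark as recently seen
            return True
        self._seen[msg_id] = True
        if len(self._seen) > self.max_entries:
            self._seen.popitem(last=False)  # forget the oldest ID
        return False
```

The first copy of a message returns False and is remembered; any later copy with the same Message-ID returns True, which corresponds to the case the recipe files under mail/dupes.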

(2) Eliminate duplicates based on MD5 checksum of message body:
This is a fairly fancy script (which I did not write but did modify a bit) that will keep a list of MD5 checksums for all messages received and identify duplicates. You may want to run a cron job to keep the log of checksums from growing wildly.

The script is here and can be placed in the .procmail directory in your home directory. Then add the following to your .procmailrc:

INCLUDERC=/{path-to-your-homedir}/.procmail/dupcheck.rc
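
For readers who want to see the checksum idea in isolation, the following Python sketch shows one way it could work. It is an illustration only, not the dupcheck.rc script: the function names, the one-checksum-per-line log format, and the keep_last limit are all assumptions.

```python
import hashlib
from pathlib import Path


def body_checksum(raw_message):
    """MD5 hex digest of everything after the first blank line (the body)."""
    text = raw_message.replace("\r\n", "\n")
    _headers, _sep, body = text.partition("\n\n")
    return hashlib.md5(body.encode("utf-8")).hexdigest()


def is_duplicate_body(raw_message, log_path):
    """Return True if the body checksum is already in the log; else record it."""
    digest = body_checksum(raw_message)
    log = Path(log_path)
    seen = log.read_text().split() if log.exists() else []
    if digest in seen:
        return True
    with log.open("a") as handle:
        handle.write(digest + "\n")
    return False


def trim_log(log_path, keep_last=1000):
    """Keep only the newest checksums; the sort of cleanup a cron job could run."""
    log = Path(log_path)
    if not log.exists():
        return
    lines = log.read_text().split()
    log.write_text("".join(line + "\n" for line in lines[-keep_last:]))
```

Because only the body is hashed, two copies that arrive with different headers (for example, through different distribution lists) still produce the same checksum.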

(3) Target duplicate seminar announcements:
Many repeated seminar announcements will not have exactly the same text, thereby sneaking past a checksum-based filter. To counter this problem, I wrote a script to parse e-mails and try to find a date, time, venue, and "seminar-announcement-like" words (e.g., abstract, speaker, seminar, faculty candidate). If two such e-mails have the same date, time, and venue and both have a prespecified minimum number of seminar-announcement-like words, the latter is assumed to be a duplicate and the header "X-Dupe-Seminar: Yes" is added for procmail to catch.
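
The matching rule described above can be sketched in a few lines of Python. This is not the actual script: the venue names, the keyword threshold, and the two date/time patterns below are hypothetical stand-ins, and a real filter would need many more formats.

```python
import re

KEYWORDS = ("abstract", "speaker", "seminar", "faculty candidate")
KNOWN_VENUES = ("Room 101", "Main Hall 220")  # hypothetical venue names
MIN_KEYWORDS = 2  # hypothetical threshold

MONTHS = ("jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
          "aug(?:ust)?|sep(?:t(?:ember)?)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?")
# Only "Month D" dates and "H:MM am/pm" times are recognized in this sketch.
DATE_RE = re.compile(rf"\b({MONTHS})\.?\s+(\d{{1,2}})\b", re.IGNORECASE)
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2})\s*([ap])\.?m\b", re.IGNORECASE)


def seminar_key(body):
    """Return (date, time, venue) if the body looks like an announcement, else None."""
    lowered = body.lower()
    if sum(word in lowered for word in KEYWORDS) < MIN_KEYWORDS:
        return None
    date = DATE_RE.search(body)
    time = TIME_RE.search(body)
    venue = next((v for v in KNOWN_VENUES if v.lower() in lowered), None)
    if not (date and time and venue):
        return None
    return (
        (date.group(1).lower()[:3], int(date.group(2))),
        (int(time.group(1)), int(time.group(2)), time.group(3).lower()),
        venue,
    )


def is_repeat_announcement(body, seen_keys):
    """Return True if an announcement with the same date, time, and venue was seen."""
    key = seminar_key(body)
    if key is None:
        return False
    if key in seen_keys:
        return True
    seen_keys.add(key)
    return False
```

A True result corresponds to the case where the "X-Dupe-Seminar: Yes" header would be added. Note that the two copies do not need identical text, only the same normalized date, time, and venue plus enough keywords.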

This script is imperfect and faces several challenges. First, it recognizes only known venues. New ones are easily added, but not automatically. Venue descriptions are not regular or similar enough for automatic detection, and I lack the patience to enumerate every possible expression of every campus venue (including common misspellings). Second, not all seminar announcements contain words like abstract, speaker, or even the word seminar itself. However, eliminating the keyword requirement would cause many false positives. Third, Purdue staff have taken to using tactics apparently designed to obscure seminar announcements, very similar to the tactics commercial spammers use to foil filters (e.g., gappy text such as "s e m i n a r" instead of "seminar"); defeating these tactics requires updating the keywords in the script. Fourth, there are many ways to express the date and time of a seminar, and some of them are quite similar to the e-mail "Date:" header often included in the body of forwarded seminar announcements; I try to address this issue by looking for many different date/time formats but ignoring any date/time that is given to the nearest second or includes a GMT offset (e.g., "Tue, 22 Aug 2000 11:22:55 -0500" is ignored). The script needs to be extended to detect more dates without years.
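
Two of these challenges lend themselves to a short illustration. The Python sketch below shows one possible way to collapse gappy text and to skip timestamps that look like a forwarded "Date:" header. It is an assumption-laden example, not code from the script: the function names are hypothetical, the timestamp check works on a whole line rather than on an individual date, and collapsing gappy text is an alternative to updating the keyword list.

```python
import re

# Three or more single characters separated by single spaces, e.g. "s e m i n a r".
GAPPY_RE = re.compile(r"\b(?:\w ){2,}\w\b")
# A time given to the nearest second, e.g. "11:22:55".
SECONDS_RE = re.compile(r"\b\d{1,2}:\d{2}:\d{2}\b")
# A GMT offset such as "-0500" that starts a whitespace-delimited token.
GMT_OFFSET_RE = re.compile(r"(?<!\S)[+-]\d{4}\b")


def collapse_gappy(text):
    """Rejoin letter-spaced words so keyword matching can see them."""
    return GAPPY_RE.sub(lambda m: m.group(0).replace(" ", ""), text)


def looks_like_header_timestamp(line):
    """Return True if the line carries seconds or a GMT offset, as mail headers do."""
    return bool(SECONDS_RE.search(line) or GMT_OFFSET_RE.search(line))
```

With these helpers, "Tue, 22 Aug 2000 11:22:55 -0500" would be skipped when searching for the seminar date, while a line such as "Tuesday, Aug 22 at 11:30 am" would still be considered.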

However, I find the script quite effective.

The duplicate seminar script is linked here.

Example usage in a .procmailrc file:

# duplicate seminar check

# check body only for seminar details. headers would confuse it
:0bw: seminar.lock
* < 512000
HEADER=|.procmail/seminar.pl

# added size check 9/5/2005
:0hfw: seminar.lock
* < 512000
|formail -A "$HEADER"

:0:
* ^X-Dupe-Seminar: Yes
mail/dupes


Copyright (C) 1999-2005 by Michael D. Powell.  Last Updated December 8, 2005.