I have no desire to receive the same message more than once. However, offending senders occasionally send important messages, and I would like to receive the first copy. Coarse solutions, such as blocking all mail from certain senders, are therefore inadequate. To manage the problem, I use the following tools with procmail, a mail-processing program (see "man procmail" for details). Each moves duplicate messages to a mail folder called "dupes."
(1) Eliminate messages with duplicate Message-ID: fields:
This one helps if your e-mail address is on a distribution list more than
once or if someone uses "bounce" forwarding to send you a duplicate
message. Add the following to your .procmailrc:
:0 Whc: msgid.lock
| formail -D 8192 msgid.cache
:0 a:
mail/dupes
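To make the logic of this recipe concrete, here is a minimal Python sketch of the idea behind a Message-ID cache: remember recently seen IDs and flag any message whose ID is already in the cache. It is an illustration only, not how formail is implemented. The function name seen_before and the entry-count limit are assumptions of the sketch; formail bounds its cache by approximate file size (8192 in the recipe above) rather than by number of entries.

```python
from collections import deque
from email import message_from_string


def seen_before(raw_message, cache):
    """Return True if this message's Message-ID is already in the cache.

    `cache` is a deque with a maxlen, so the oldest IDs fall out
    automatically once the cache is full (an assumption of this sketch).
    """
    msg = message_from_string(raw_message)
    msg_id = (msg.get("Message-ID") or "").strip()
    if not msg_id:
        return False  # no Message-ID header, nothing to compare against
    if msg_id in cache:
        return True
    cache.append(msg_id)
    return False


if __name__ == "__main__":
    cache = deque(maxlen=100)
    mail = "Message-ID: <abc@example.org>\nSubject: hi\n\nbody\n"
    print(seen_before(mail, cache))  # first copy
    print(seen_before(mail, cache))  # second copy
```

In the recipe itself, the first rule plays the role of seen_before, and the second rule (flag "a") files the message in mail/dupes only when the first rule succeeded.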
(2) Eliminate duplicates based on MD5 checksum of message body:
This is a fairly fancy script (which I did not write but did modify a bit)
that will keep a list of MD5 checksums for all messages received and
identify duplicates. You may want to run a cron job to keep the log of
checksums from growing wildly.
The script is here and can be placed
in the .procmail directory in your home directory. Then add the following
to your .procmailrc:
INCLUDERC=/{path-to-your-homedir}/.procmail/dupcheck.rc
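For readers who want to see the idea without reading the script, the following Python sketch shows checksum-based duplicate detection in general terms. It is not the dupcheck.rc script. The function names, the in-memory set of checksums, and the one-checksum-per-line log format assumed by prune_log are all hypothetical choices made for illustration.

```python
import hashlib


def body_checksum(raw_message):
    """MD5 of the message body, i.e. everything after the first blank line."""
    _headers, _sep, body = raw_message.partition("\n\n")
    return hashlib.md5(body.encode("utf-8")).hexdigest()


def is_duplicate_body(raw_message, seen):
    """Return True if a message with the same body checksum was seen before."""
    digest = body_checksum(raw_message)
    if digest in seen:
        return True
    seen.add(digest)
    return False


def prune_log(path, keep=1000):
    """Keep only the newest `keep` lines of a checksum log.

    Assumes one checksum per line with the newest entries at the end,
    which may not match the format the real script uses.
    """
    with open(path) as f:
        lines = f.readlines()
    with open(path, "w") as f:
        f.writelines(lines[max(len(lines) - keep, 0):])
```

Something like prune_log is the kind of job cron could run periodically to keep the log from growing without bound. Because only the body is hashed, two copies with different Message-ID: fields are still caught, which is what distinguishes this approach from (1).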
(3) Target duplicate seminar announcements:
Many repeated seminar announcements will not have exactly the same text,
thereby sneaking past a checksum-based filter. To counter this problem, I
wrote a script to parse e-mails and try to find a date, time, venue, and
"seminar-announcement-like" words (e.g., abstract, speaker, seminar,
faculty candidate). If two such e-mails have the same date, time, and
venue and both have a prespecified minimum number of
seminar-announcement-like words, the later one is assumed to be a duplicate
and the header "X-Dupe-Seminar: Yes" is added for procmail to catch.
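The matching rule described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual script (which is written in Perl and recognizes many more formats). The venue list, the keyword threshold, the two regular expressions, and the function names are assumptions chosen to keep the example short.

```python
import re

# Hypothetical values for illustration; the real script's lists differ.
KNOWN_VENUES = ("room 101", "main auditorium")
KEYWORDS = ("abstract", "speaker", "seminar", "faculty candidate")
MIN_KEYWORDS = 2

# Only two simple formats are handled here: "September 5" and "3:30 pm".
DATE_RE = re.compile(
    r"\b(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)[a-z]*\.? (\d{1,2})\b"
)
TIME_RE = re.compile(r"\b(\d{1,2}):(\d{2}) ?(am|pm)\b")


def fingerprint(body):
    """Return a (date, time, venue) key, or None if this is not seminar-like."""
    text = body.lower()
    if sum(word in text for word in KEYWORDS) < MIN_KEYWORDS:
        return None
    date = DATE_RE.search(text)
    time = TIME_RE.search(text)
    venue = next((v for v in KNOWN_VENUES if v in text), None)
    if not (date and time and venue):
        return None
    return (date.groups(), time.groups(), venue)


def dupe_header(body, seen):
    """Return the header line to add, or "" if the message is not a duplicate."""
    key = fingerprint(body)
    if key is None:
        return ""
    if key in seen:
        return "X-Dupe-Seminar: Yes"
    seen.add(key)
    return ""
```

Two announcements with different wording but the same date, time, and venue produce the same key, so the second one gets the header even though a checksum filter would let it through.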
This script is imperfect and faces several challenges. First, it
recognizes only known venues. New ones are easily added by hand, but the
script does not discover them automatically. Venue descriptions do not have
enough regularity or similarity
for automatic detection, and I lack the patience to enumerate every
possible expression of every campus venue (including common misspellings).
Second, not all seminar announcements contain words like abstract,
speaker, or even the word seminar itself. However, eliminating the
keyword requirement would cause many false positives. Third, Purdue staff
have taken to using tactics, apparently designed to obscure seminar
announcements, that closely resemble those used by commercial spammers to
foil filters (e.g., gappy text such as "s e m i n a r" instead of
"seminar"); defeating these tactics requires updating the keywords in the
script.
Fourth, there are many ways to express the date and time of a seminar, and
some of them are quite similar to the email "Date:" header often included
in the body of forwarded seminar announcements; I try to address this
issue by looking for many different date/time formats but ignoring any
date/time that is given to the nearest second or includes a GMT offset
(e.g., "Tue, 22 Aug 2000 11:22:55 -0500" is ignored). The script needs to
be extended to detect more dates without years.
However, I find the script quite effective.
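Two of the challenges above lend themselves to short illustrations. The Python sketch below shows (a) one way to recognize a header-style timestamp, meaning one given to the second or carrying a GMT offset, so that it can be ignored, and (b) a way to collapse gappy text before keyword matching. The second function is an alternative technique and not what the script does; the script handles gappy text by adding keywords. Both regular expressions and both function names are assumptions of this sketch.

```python
import re

# A time given to the nearest second, e.g. "11:22:55".
HAS_SECONDS = re.compile(r"\b\d{1,2}:\d{2}:\d{2}\b")
# A numeric GMT offset such as "-0500" or "+0100", preceded by whitespace.
HAS_OFFSET = re.compile(r"(?<!\S)[+-]\d{4}\b")
# Three or more single letters separated by single spaces, e.g. "s e m i n a r".
GAPPY = re.compile(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b")


def looks_like_header_timestamp(line):
    """True if the line contains a date/time that resembles a "Date:" header."""
    return bool(HAS_SECONDS.search(line) or HAS_OFFSET.search(line))


def collapse_gappy(text):
    """Remove the spaces inside runs of single spaced-out letters."""
    return GAPPY.sub(lambda m: m.group(0).replace(" ", ""), text)
```

For example, looks_like_header_timestamp flags "Tue, 22 Aug 2000 11:22:55 -0500" but not "Tuesday, Aug 22 at 3:30 pm". The gappy-text pattern is deliberately crude and will also merge a legitimate run of single-letter words.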
The duplicate seminar script is linked here.
Example usage in a .procmailrc file:
# duplicate seminar check
# check body only for seminar details. headers would confuse it
:0bw: seminar.lock
* < 512000
HEADER=|.procmail/seminar.pl
# added size check 9/5/2005
:0hfw: seminar.lock
* < 512000
|formail -A "$HEADER"
:0:
* ^X-Dupe-Seminar: Yes
mail/dupes