Anti-Spam Tips/Rant

These tips are now obsolete--get SpamAssassin instead! It's much better than any of the tips listed below.


I get a lot of spam. I don't like spam. Spam is bad. Filtering by From: headers or subject lines is generally ineffective due to forged From: headers and the fact that nearly every message has a different Subject: header.

First, a few questions and answers:

What is spam?

Actually, this is an interesting question. The term originally referred only to a canned meat product by Hormel, which allegedly resembles ham. This meaning continues to apply, and the term is properly spelled with an upper-case "S" when used in this context.

More recently, the term has come into use in relation to electronic communications. One of the two most common meanings for "spam" (lower-case "s") in this context is that of posts to Usenet groups which are essentially commercial advertisements; these are usually (but not always) posted to multiple groups, and are almost always off-topic for the groups to which they are posted. Since commercial advertising is prohibited on Usenet outside of the biz.* hierarchy, excessive crossposting (ECP) or excessive multiposting (EMP) of commercial advertisements tends to annoy many people, as the practice drastically reduces the signal-to-noise ratio of Usenet groups and, more importantly, forces others to subsidize the cost of someone's advertising. There are various content-neutral definitions for Usenet spam, which help to reduce the cries of "censorship!" when third parties send "cmsg cancel" control messages for these posts; most of these definitions relate to the number of groups to which the message was posted, on the (reasonable) assumption that no message can possibly be on-topic for more than a small handful of groups. This is an interesting issue in itself, but is not one which will be discussed here.

The definition of "spam" (lower-case "s") in the context below is as follows: unsolicited commercial electronic mail (UCE). This consists of email messages sent to individuals who have not explicitly agreed to receive them. This is a fairly new practice, which did not become widespread until 1996 or 1997, and which has grown rapidly since then. The idea behind it is simple: electronic mail costs nearly nothing to send (since the recipient partially subsidizes the bandwidth and storage costs), which makes it very easy for an advertiser to send out thousands or millions of advertisements at basically zero cost to himself. Even a very low response rate can therefore be profitable.

Why is spam bad?

This should be fairly obvious, but I will provide some quick notes here anyway:

First and foremost, email spam (UCE) forces the cost of advertising onto the recipient rather than the advertiser. It is as if someone were to send postage-due "junk mail" via the US Postal Service. Clearly, this would be unacceptable to nearly everyone, and few would actually pay the cost to receive it. Unfortunately, with email spam, the recipient simply does not have the option to refuse delivery of such junk mail, as simply having a valid email address makes one a target.

Many will respond to this by saying something like "I pay $19.95 a month for unlimited Internet access! I don't care how much email I get, since I'm not paying by the message, anyway." This is true in the short term, but realize that it costs money to provide email service: the ISP must provide bandwidth and storage capacity for its users. If the volume of incoming email suddenly increases, the ISP must upgrade its facilities. It is more than just a matter of disk space as well. Disks need to be purchased, installed, backed up, and replaced when they fail. The cost of server-grade disk space (SCSI or Fibre Channel, RAID arrays, backup devices) is nontrivial. Further, dialup ISPs suffer if customers need to stay connected to their networks for longer periods of time to download or read the extra mail volume, and thus will need to increase their dialup line capacity. Finally, bandwidth (information-carrying capacity) is not free, and spam (and the resultant user complaints) contributes significantly to the volume of information which must pass across the ISP's network link (and the backbone links which connect it to the rest of the world), all of which will increase costs over time.

Second, and somewhat related to the above, spam tends to reduce the overall reliability of an email system. This can happen in several ways, usually involving the overloading of the mail server. Additionally, many ISPs impose a disk quota or limit on the total size of a user's mail spool. If the user's incoming mail volume is sufficient to exceed that limit, the user will either be charged for use of the extra disk space, or all mail sent to the user will bounce after the limit is reached. Either event is clearly undesirable.

Finally, spam does not scale. If users are annoyed about having one or two spam messages in their mailboxes every day, they will be very angry about having fifty, and totally overwhelmed if they have a thousand. Unlike other advertising methods, spam is essentially cost-free to the sender, and so there is no real desire to even attempt to send spam only to those who might be interested in the product or service that is being advertised. Imagine if every advertiser decided that spam was the next great thing and started sending out millions of email messages per day. Clearly, the ratio of spam to useful mail would quickly grow so high as to make email useless for general communication.

Why are they spamming me?

"Where did they get my email address?" is a common question. Most spammers seem to buy lists of addresses that are compiled by others. These lists seem to derive their addresses from several sources. Anyone who has ever posted anything to Usenet is an obvious target, as these addresses are very easy for spammers and their cohorts to "harvest" efficiently; Usenet posting is the easiest way to get on a spammer's list. Additionally, spammers tend to collect addresses from various web pages, guestbooks, etc. Some ISPs and systems have mechanisms, designed when the net was a friendlier place, which will explicitly give out lists of usernames: the finger service and the SMTP EXPN command (as implemented by Sendmail), for example, are popular among the more creative spammers; unfortunately, these also have substantial legitimate use and probably should not be disabled for most sites.

Interestingly, some users tend to get more spam than others. Those who post to any of the alt.* groups on Usenet tend to be major targets (which is a problem since I post to alt.movies.silent, a legitimate group with plenty of well-informed participants, which nevertheless seems to get lumped into the same category as alt.sex.fetish.furry and other groups which tend to attract a less enlightened crowd). Users who have email addresses in the "aol.com" or "hotmail.com" domains are also frequent targets, as are users of mailing lists (since most mailing-list software [listserv, majordomo, etc.] makes it relatively easy to generate a list of subscribers to the list).

Users who are listed as contacts in the InterNIC's whois database are also often targeted, again, because it is relatively easy to write a small program to quickly generate large lists of addresses using this service. Again, however, this is a useful enough service that closing off public access is not an acceptable option. It is actually kind of amusing that spammers would mine through the whois database for addresses, as the people listed there tend to be network administrators and others who would tend to be very technically savvy and vehemently anti-spam.

Finally, users who have email addresses in the .edu and .gov top-level domains seem to get less spam than those who have addresses in the .com, .net, and .org TLDs, although getting such an address is not an option for many people.

What to do about spam?

Address munging

Bad idea! Some people like to post to Usenet with forged From: headers, like <no@spam.com> and other such nonsense. This is bad for several reasons: first, it violates the standard for conduct on Usenet, RFC1036. Second, it is considered antisocial behavior, as it makes it very difficult or impossible for legitimate users to get in touch with the poster via private email. Third, not posting with one's legitimate email address and full name tends to reduce the credibility of one's posts. Finally, if not done properly, this sort of activity will cause excessive use and abuse of someone else's resources. In the example above, the owner of spam.com (which happens to be Hormel) would have his DNS resources used if someone tried to send mail to <no@spam.com> and, if that site happened to be running mail service, would have his mail server and bandwidth capacity used up, whether or not the message bounced.

A slightly better idea would be to munge an address to something like (in my case) <snorwood+nospam@redballoon.net> -- a far superior alternative for many reasons. First, in most cases, "plussed" addresses are valid addresses on systems running Sendmail, thus side-stepping the RFC-violation issue as well as allowing legitimate contact. Second, the incoming mail to this address can easily be filtered and dumped into a "probably-spam" mailbox, to be reviewed at a more leisurely pace than the user's primary mailbox. However, this still doesn't really solve the problem of spam, and actually tends to perpetuate it, as spammers often sort through lists of addresses and remove any string resembling "nospam"; additionally, this also tends to confuse users who try to send legitimate mail to that address, as it is often unclear whether the address is actually valid.
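The sorting step described above (mail sent to the "plussed" address goes to a "probably-spam" mailbox) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names split_plus_tag and mailbox_for and the mailbox name "inbox" are made up for the example, and a real setup would do this in the mail delivery agent rather than in a script.

```python
from email.utils import parseaddr

def split_plus_tag(address):
    """Return (base_address, tag); tag is '' when there is no plus part."""
    _, addr = parseaddr(address)              # strips "<...>" and any display name
    local, _, domain = addr.rpartition("@")
    user, _, tag = local.partition("+")
    return f"{user}@{domain}", tag

def mailbox_for(recipient, suspect_tags=("nospam",)):
    """Pick a mailbox name from the plus tag of the recipient address."""
    _, tag = split_plus_tag(recipient)
    return "probably-spam" if tag.lower() in suspect_tags else "inbox"

print(mailbox_for("<snorwood+nospam@redballoon.net>"))
print(mailbox_for("<snorwood@redballoon.net>"))
```

Both addresses reach the same person; only the mailbox the message lands in differs.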

Turn off third-party mail relay

Most spam is relayed through servers belonging to some entity other than the sender's ISP, in an attempt to avoid identification. This invites even further abuse of resources, of course. Anyone who runs a mail server of any type (ranging from ISPs to DSL users running a Unix variant who have Sendmail configured) should turn off third-party mail relay. It is a marginally useful feature that was once considered polite to allow, but it now needs to be turned off in order to avoid contributing to the spam problem (and to avoid receiving hundreds of bounced messages in the postmaster mailbox).

Speak up!

A much better response to spam is to complain loudly to the offender's ISP. Most legitimate ISPs now have something resembling a "no spam" clause in their terms of service and will disconnect customers who use their service to send spam. Learn how to read mail headers and use the whois database. Ignore the From: header in the message, since spammers usually forge it; instead, determine the actual originating site from the mail headers. If the message originated at the "example.com" domain, send mail to <postmaster@example.com>, <abuse@example.com>, and <root@example.com>. Don't be surprised if some of these bounce; the only valid address that every site is required to have by RFC822 is "postmaster." Also send mail to any of the contacts listed in the whois database. If the sender was coming from an IP block in a different domain from where he relayed the message, use whois.arin.net to find the owner of that IP block and complain to that person as well. In all cases, be sure to forward the original spam message with full headers intact, so that the source can be traced. If you have the time, phone calls to the offender's ISP are often more effective than email complaints. If ISPs become annoyed enough at the complaints, they will be increasingly likely to take action against the spammer and to be more careful about not selling service to potential spammers in the future.
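Reading the headers by hand gets tedious, so here is a rough Python sketch of the first step: listing the Received: headers and pulling out the bracketed IP address from each. Each server that handles a message adds its own Received: header at the top, so the list runs from your own server back toward the sender; lines added by servers you do not control may be forged. The function name received_hops and the sample message are hypothetical (the hostnames and addresses are documentation-only examples), and real Received: lines vary a lot between mail servers, so treat the pattern as a starting point.

```python
import re
from email import message_from_string

# A dotted-quad IP in square brackets, the way many mail servers
# record the host that connected to them.
IP_IN_BRACKETS = re.compile(r"\[(\d{1,3}(?:\.\d{1,3}){3})\]")

def received_hops(raw_message):
    """Return (ip_or_None, header_text) for each Received: header, newest first."""
    msg = message_from_string(raw_message)
    hops = []
    for value in msg.get_all("Received", []):
        flat = " ".join(value.split())        # unfold continuation lines
        match = IP_IN_BRACKETS.search(flat)
        hops.append((match.group(1) if match else None, flat))
    return hops

# Hypothetical message using documentation-only names and addresses.
SAMPLE = (
    "Received: from relay.example.net (relay.example.net [192.0.2.25])\n"
    "\tby mail.example.org with SMTP; Tue, 23 Jul 2002 10:00:02 -0400\n"
    "Received: from foo (unknown [203.0.113.7])\n"
    "\tby relay.example.net with SMTP; Tue, 23 Jul 2002 10:00:00 -0400\n"
    "From: forged@example.com\n"
    "Subject: test\n"
    "\n"
    "body\n"
)

for ip, text in received_hops(SAMPLE):
    print(ip, "|", text)
```

The addresses found this way are what you would feed to whois; the From: header is never consulted.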

If you can't be bothered to take these actions yourself, then consider something like SpamCop, which automates the reporting tasks.

DO NOT reply to any message and request to be "removed" from their spam list. This is totally ineffective (it will bounce most of the time, anyway), and actually can be counterproductive, as doing so proves to the spammer that your email address is not only valid, but that you actually read messages sent to it.

Also, support legislative efforts to make spam a crime (after all, it involves abuse of others' resources without their permission!). This is the only long-term way to ensure that the spam problem does not continue to worsen.

Filter it

This is the least desirable option, since it does nothing to solve the problem, yet is often necessary in order to make email useful again as a means of communication with individuals. In short, it isn't the solution, but rather a temporary band-aid which can help the situation until anti-spam legislation passes on a national level.

The problem with filtering spam is interesting, since the goal is to junk the greatest possible percentage of spam while having a near-zero rate of "false positives." Obviously, the aggressiveness of any filtering scheme should be determined by one's own preferences; personally, I feel that it is more important to keep the "false positive" rate as low as possible and less important to junk a large proportion of the spam messages. Your needs (and mileage) may vary.

The best method that I've found so far to filter out spam from email that I actually care about involves only two simple filters; together, they can get rid of about 10% of the junk email that I get, on average.

The first rule junks anything with a missing or syntactically invalid Message-Id: header. The idea here is that any such message was sent by some horribly broken piece of software. Since most people use mail client software which is reasonably well-behaved, and since reputable software developers don't write spam software, this rule produces few false positives, and what it catches is almost entirely spam.

The second rule junks anything which contains an "Apparently-To:" header, which Sendmail inserts only when the original message has no recipient header at all; RFC822 requires at least one of To:, Cc:, or BCC:. Again, this will only junk mail sent from seriously broken mail software or mail which was incompetently forged (as is much spam).

Warning: If you happen to have a domain name registered with Network Solutions, beware of the Apparently-To: filter rule. Their web-based domain modification forms are broken and send mail without any recipient header (RFC822 requires at least one of To:, Cc:, or BCC:). This causes Sendmail to insert an Apparently-To: header, which will cause the message to get filtered into ~/mail/spam.procmail by the relevant procmail rule. I have complained about this blatant standards violation to NSI, but it has not yet been fixed as far as I know.

Another warning: I've talked with people who suggest dumping anything that does not have your own address in the To: header. To some extent, this works, yet I believe that it is a bad idea and will result in an unacceptably high number of "false positives." This is because many people send legitimate mail to large groups of people; when they do not want to disclose the recipient list, they will often put their own address or a known-bogus address in the To: header and everyone else's address in the BCC: header; the behavior of most mail clients is to issue an RCPT TO: for each of the addresses listed in the BCC: header, but not to pass along the BCC: header itself, resulting in a message with identical To: and From: headers. This can be used for legitimate purposes, and so I suggest not rejecting mail sent in this way.

Here's how to implement these rules using procmail. Put the following in your ~/.procmailrc:

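# Rule 1: file anything carrying an Apparently-To: header as probable spam.
# The trailing colon in ":0:" asks procmail to use a local lockfile.
:0:
* ^Apparently-To:
$HOME/mail/spam.procmail

# Rule 2: file anything without a syntactically valid Message-Id: header.
# Each bracket expression below is meant to hold one space and one tab.
:0:
* ! ^Message-Id:[       ]*<[^   <>@]+@[^        <>@]+>[         ]*$
$HOME/mail/spam.procmail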

Then, if it is required for your system, put the following in your ~/.forward (obviously, you'll want to replace snorwood with your own username):


"|IFS=' ' && exec /usr/local/bin/procmail -f- || exit 75 #snorwood"

I don't know how to implement these recommendations with other filtering software or mail clients which don't use standard procmail syntax.
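For what it's worth, the two tests themselves are simple enough to sketch in Python with the standard email module. This is only a rough illustration, not a drop-in replacement for the procmail recipes: the function name looks_like_spam is made up here, the pattern is a straight translation of the procmail one, and only the first Message-Id: header is examined.

```python
import re
from email import message_from_string

# Same shape as the procmail pattern: <something@something>, with no
# whitespace, angle brackets, or extra "@" inside.
VALID_MESSAGE_ID = re.compile(r"^[ \t]*<[^ \t<>@]+@[^ \t<>@]+>[ \t]*$")

def looks_like_spam(raw_message):
    """Apply the Apparently-To: and Message-Id: tests to one raw message."""
    msg = message_from_string(raw_message)
    # Test 1: an Apparently-To: header is present at all.
    if msg.get("Apparently-To") is not None:
        return True
    # Test 2: the Message-Id: header is missing or malformed.
    message_id = msg.get("Message-Id")
    return message_id is None or not VALID_MESSAGE_ID.match(message_id)

ok = "To: b@example.com\nMessage-Id: <123.abc@mail.example.com>\n\nhi\n"
bad = "To: b@example.com\nMessage-Id: <123>\n\nhi\n"
print(looks_like_spam(ok), looks_like_spam(bad))
```

Header lookups in the email module ignore case, so "Message-ID:" and "Message-Id:" are treated alike.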

New addition!

If you can possibly filter out HTML email without annoying too many people who are using misconfigured or otherwise broken mail clients, this should get rid of a huge amount of spam! The above two filters plus the HTML filter should get rid of at least 50% of total spam as measured by number of messages. Here's something I wrote about how to set this up using procmail; I'll include it in the above information later when I have more free time (yeah, right).

I finally got this set up.  It is actually really simple; it uses
"vacation" to send the contents of .vacation.msg to anyone who sends me
stuff with "Content-Type: text/html"; it also saves a copy in
~/mail/htmlmail.

It seems to work with HTML mail that I have sent myself using the
Netscape mailer, but I have no other way to test it.  It requires
running "vacation -I" to initialize .vacation.pag and
.vacation.dir.  It definitely works on Solaris and Linux; I assume it
works on NetBSD as well.

My procmail knowledge is pretty limited; there may be a more efficient
way to do this that doesn't require vacation.  If you find it, let me
know.

From .procmailrc (probably should be the last thing listed):

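     # Tested against the message header only (procmail's default).
     :0
     * ^Content-Type:.* text/html
     {
        # Keep a copy in ~/mail/htmlmail (trailing colon = local lockfile).
        :0 c:
        $HOME/mail/htmlmail
        # Hand the message to vacation, which sends the autoreply.
        :0
        | /usr/bin/vacation -t1s snorwood
     }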

From .vacation.msg:

     Subject: HTML email (autoreply)

     I am sorry, but I do not accept HTML-format email.
     ...
     [snip stuff about the spam problem, etc.]


This page courtesy of "vi," the world's greatest HTML editor! :)
This page last updated on July 23, 2002