An Application Agnostic Review of Current Spam Filtering Techniques


This is very old. Don't take this as current or authoritative. I will leave this up for the time being because there is still some good info to be gained from this document. 


This paper looks at the major spam filtering techniques in current use. For each method, both success rates and possible problems are explored. Methods discussed include key word filtering, open relay filtering, open proxy filtering, dial-up filtering, non-confirming mailing list filtering, cooperative sharing of spam samples, known spam origin filtering, Bayesian filtering, Markovian discrimination, gray listing and challenge response.




Daniel Owen






The Problem

The first ever spam message was sent on March 5, 1994 (Moody, 2004). In the last 11 years spam has expanded to comprise approximately 65% of all e-mail (“Filtering Technologies in Symantec Brightmail AntiSpam 6.0,” 2004). As spam becomes more prevalent it threatens to make e-mail unusable. With this in mind, I will review several different approaches to spam filtering. Special attention will be paid to how these different types of filters operate, how they collect data, and the problems that the filters themselves can present.

In the last 11 years spam has grown from a problem that shocked Usenet users, who could not believe that someone would be so crass as to advertise on the Internet, but that did not hinder normal Internet communications, into a problem sufficiently pervasive that national governments are trying to find a way to stop, or at least limit, the amount of spam received by Internet users. To understand why spam is so despised by average e-mail users and systems administrators alike, you must look at the amount of spam that is sent on a daily basis. Every day AOL filters 2.4 billion spam messages. That translates to blocking 70 e-mails per user per day (Vaughan-Nichols, 2003). As an example of how bad things can easily get if spam is not curtailed, consider that there are 24 million small businesses in the United States. If 1% of these companies got your e-mail address and each sent one message per year, you would receive 657 extra e-mails every day (Schwartz, 2003).

Beyond the annoyance factor, there is a cost to the spam recipient. This cost can be either lost productivity or the monetary cost of filtering spam. Assuming that an employee can accurately delete all spam in thirty seconds per day, a company with 10,000 employees can expect to spend $675,000 per year on spam deletion (“The State of Spam,” 2003). Home users do not get off without a high monetary cost either. AOL reports that it spends 15% of its users’ monthly fees on fighting spam and responding to complaints (Gaspar & Gaudin, 2001). It is obvious that spam costs the recipient a substantial amount of money, yet the cost to the sender is minimal. Many anti-spam advocates go so far as to say that spam is the equivalent of postage due advertising, since the largest part of the cost is borne by the recipient, not the sender.

One final consideration for why spam is a problem is that much of what is sold is offensive or fraudulent. While there has not yet been a reported case of a company being sued because an employee received offensive spam e-mail, many human resource managers worry that this could happen. Since people sending spam e-mail know nothing about their recipients, it is not uncommon for children to be the recipients of sexually explicit e-mail. This is obviously a concern for many parents. Finally, much spam advertises fraudulent merchandise. According to the Federal Trade Commission, two-thirds of all spam contains deceptive or false text (Cox & Dyrness, 2003).

Filtering Techniques

Considering the problems with spam it is not surprising that numerous different techniques have been developed to automate the filtering and deletion of spam. These techniques each have their relative strengths and weaknesses.

Static Black Lists

The oldest method of filtering spam is to use a blacklist. Blacklists are static lists made up of people, words or groups that have a high probability of being spam. At the simplest level a blacklist can be a list of specific e-mail addresses set up in an end user’s mail program.

Word Lists

The simplest form of blacklist is the word list. The idea is that certain words should never show up in legitimate e-mail, so any e-mail that contains one of those words must be spam. This type of filter is typically deployed at the single user or at most the single domain level. The choice of words and phrases is extremely important in this type of filtering because almost any word can conceivably end up in a legitimate e-mail. In my experience, the most effective method of using key word filters is filtering on domain names, e-mail addresses and carefully selected phrases found in existing spam. Because so much precision must be exercised in creating rules, this type of filter has a tendency to have relatively high levels of false positives. One reason for the high false positive rate is that a single use of a “bad” word can get a message that otherwise looks completely innocent blocked. Static key word filters also require an extensive amount of upkeep to add new spam words. Spam techniques and products change at a rapid rate, necessitating an equally rapid change in filtered words. A static key word filter that is reasonably successful will become nearly useless within a few months if the key words are not continually updated.
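The behavior described above can be illustrated with a minimal sketch. The phrase list, the function name is_blocked and the sample messages are all hypothetical and chosen only for illustration; they are not taken from any particular product.

```python
import re

# Hypothetical blacklist of words, phrases and a domain name.
# A real list would need continual upkeep, as noted above.
BLOCKED_PHRASES = ["viagra", "act now", "spamdomain.example"]

def is_blocked(message, phrases=BLOCKED_PHRASES):
    """Return True if any blocked phrase appears as a whole word or phrase."""
    text = message.lower()
    return any(
        re.search(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", text)
        for phrase in phrases
    )

spam = "ACT NOW for cheap pills from spamdomain.example"
legit = "Dr. Lee, the patient asked whether Viagra interacts with nitrates."

print(is_blocked(spam))              # True
print(is_blocked(legit))             # True: a single "bad" word blocks innocent mail
print(is_blocked("Lunch at noon?"))  # False
```

The second message shows the false positive problem: one listed word is enough to block an otherwise innocent message.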

Open Relays

In the early Internet it was not uncommon for e-mail administrators to allow anyone to send e-mail to anyone else regardless of whether either person had an account on the server that was relaying the message. This behavior is the definition of an open relay (“Open Relay Database FAQ,” 2004). Some of the least scrupulous senders of spam use open relays as a way to hide their tracks and offload most of the already low cost of sending their messages to a third party. Spam operators that try to maintain a façade of legitimacy typically avoid using open relays. There are a couple of reasons for this. First, the use of an open relay destroys any hope of seeming legitimate. Second, it is hard to claim that use of an open relay is not criminal computer trespass.

Blocking of open relays has certain advantages and disadvantages. Blocking open relays will cut down the amount of spam received in proportion to the amount of spam that is funneled through vulnerable systems. Unfortunately some legitimate e-mail may also be blocked if a legitimate correspondent is using an Internet Service Provider (ISP) or corporate mail server that has not been properly secured. In today’s environment responsible system administrators are very quick to fix any misconfiguration that might leave their servers exposed as an open relay, so the amount of legitimate e-mail blocked should be minimal.

Open Proxies

Open proxy blacklists are somewhat similar to open relay blacklists in that they try to stop spam operators that target misconfigured servers. An open proxy allows a spammer to send e-mail through a mail server that they would typically not have access to by making them appear to the mail server as if they were a local user (Farmer, 2003). Open proxy blacklists have similar advantages and disadvantages to open relay filtering.

Dial-up Blacklists

Dial-up blacklists are lists that are designed to block any traffic that comes from a network address that corresponds to a consumer oriented ISP. These may be actual dial-up accounts or high speed Internet accounts. The idea behind this type of list is that people in these networks should not be sending e-mail directly to other e-mail servers. All e-mail should be sent through their ISP’s e-mail server. Therefore there should not be any harm in blocking e-mail traffic from the portions of these networks assigned to end users. A great deal of spam has been sent using consumer ISP services through the years, so this does seem like a logical approach. Some of these messages are sent when spam mailing companies sign up for “throw away” Internet accounts. A relatively recent twist in the spam story is that some spam mailing companies have begun to hire virus writers to create viruses that allow them to send e-mail through infected home computers that act as either open relays or open proxies (Leyden, 2004). These infected computers are another reason that spam comes from these parts of the Internet, which should not typically contain servers.

Consumer oriented ISPs have been estimated to account for between 30% and 80% of all spam being sent today (Bray, 2004). This makes it fairly obvious that a large proportion of spam can be stopped by simply blocking anything that comes from a consumer ISP. The major problem with these lists is that some small companies and home computer enthusiasts may operate their own mail servers but not be large enough or well funded enough to be able to purchase ISP service from a company that is not listed on these lists. There is a simple solution to this problem. Individuals or companies using services that are predominately consumer oriented should simply relay all e-mail through their ISP’s mail server. Since some people do not understand the problem of operating mail servers on these networks there is a false positive issue that must be considered.
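As a rough sketch of how a mail server might consult such a list, the check reduces to testing whether the connecting address falls inside a listed network range. The ranges below are reserved documentation networks standing in for a consumer ISP’s end user address space, and the function name is hypothetical.

```python
import ipaddress

# Hypothetical end user ranges. These are reserved documentation networks,
# used here only as stand-ins for a consumer ISP's dial-up or broadband pools.
END_USER_RANGES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/25"),
]

def from_end_user_range(client_ip):
    """Return True if the connecting address is inside a listed range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in END_USER_RANGES)

print(from_end_user_range("198.51.100.42"))  # True: direct connection from an end user
print(from_end_user_range("203.0.113.200"))  # False: outside the listed pools
```

A small company running its own mail server inside one of the listed ranges would be blocked by this check, which is the false positive issue described above.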

Non Confirmed Mailing Lists

Some mailing lists on the Internet do not confirm the legitimacy of new subscriptions. These are typically referred to as single opt-in or non confirming mailing lists. Non confirmed mailing list signups can be abused by unscrupulous mailing companies who will add people to mailing lists and then claim they signed up. These lists can also be abused by malicious individuals who subscribe a target to numerous lists as an annoyance. Some black hole list operators consider these mailing lists spam regardless of whether there have been complaints or not (“Detailed End User Information for MAPS NML Listings,” 2004). These blacklist operators advocate double opt-in or confirmed mailing lists. The difference being that in a double opt-in list the person subscribes and then receives a message that they must respond to confirming that they really want to subscribe to the mailing list.

Most, but not all, companies that operate legitimate mailing lists have moved to double opt-in in an effort to stay off of blacklists. The disadvantage of using this type of black hole list is that there may be some legitimate mailing list e-mails that get dropped in the process of filtering out the spam. As a general rule a false positive on a mailing list is considered less serious than a false positive on a personal e-mail, but they can still be a problem.

Cooperative Spam Signatures

A method of filtering spam that is beginning to pick up popularity is cooperative sharing of spam signatures. This technique is similar to the method used by virus scanners in that a sample of a spam message is used to create a hash of the message. Unlike virus scanners the hash creation is automated as opposed to being a task undertaken by a human. Also unlike virus scanners all or most of the message is used for hash creation while virus scanners typically rely on finding unique signatures within virus programs. After a sufficient number of people report the message as spam future recipients of the message will be able to automatically filter the message (Mertz, 2002).

This method is by definition more reactive than some of the other systems for spam filtering in that it relies on several people receiving and reporting the same piece of spam before it will be filtered. There is a similar problem inherent in signature based virus scanners in that they cannot stop a new piece of malicious software until they have seen samples to create signatures from. Many spam mailers will use hash busters that make each message statistically unique, therefore creating a different hash. The cooperative spam lists all attempt to minimize the effect of hash busters by using only certain parts of the message that are less likely to contain hash busters.
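A minimal sketch of the report-and-lookup cycle follows. It is only illustrative: the report threshold, the class and function names, and the crude normalization (dropping any token that contains a digit as a stand-in for ignoring hash busters) are my assumptions, not the method of any real signature network.

```python
import hashlib
import re
from collections import defaultdict

REPORT_THRESHOLD = 3  # assumed number of distinct reporters required

def signature(body):
    """Hash a normalized body, skipping tokens that contain digits.

    Dropping digit-bearing tokens is a crude, assumed defense against
    random hash buster strings appended to each copy of a message.
    """
    words = re.findall(r"[a-z0-9]+", body.lower())
    stable = [w for w in words if not any(c.isdigit() for c in w)]
    return hashlib.sha256(" ".join(stable).encode("utf-8")).hexdigest()

class SignatureServer:
    """Hypothetical central server collecting spam reports."""

    def __init__(self, threshold=REPORT_THRESHOLD):
        self.threshold = threshold
        self.reporters = defaultdict(set)  # signature -> set of reporters

    def report(self, reporter, body):
        self.reporters[signature(body)].add(reporter)

    def is_spam(self, body):
        return len(self.reporters.get(signature(body), ())) >= self.threshold

server = SignatureServer()
server.report("alice", "Buy cheap watches today! ref 83hd72")
server.report("bob", "Buy cheap watches today! ref 19xk44")
print(server.is_spam("Buy cheap watches today! ref qq7z1"))  # False: only two reports
server.report("carol", "Buy cheap watches today! ref 5tt09")
print(server.is_spam("Buy cheap watches today! ref qq7z1"))  # True
```

The first lookup fails because too few people have reported the message, which is the reactive weakness described above.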

In theory there should be a near zero false positive rate because e-mail must be reported by multiple people and your legitimate e-mail should be impossible to report since only you receive it. False positives can slip into the system in three ways. People forget that they are subscribed to mailing lists and report them as spam. Secondly some current implementations of this method allow system administrators to configure other spam filters to send a copy of any e-mail that appears to be spam to the central signature server. This means that if a mailing list gets incorrectly identified by other filters it may be reported to the central server as well. Finally it is possible, although highly unlikely, that a legitimate e-mail and a spam e-mail could end up with the same hash if the hashing algorithm creates hashes that are not perfectly unique. In researching this I did not find any examples of this type of theoretical false positive. False positives should fall into the category of mailing lists meaning that while these false positives are problematic they are less of a problem than false positives on personal correspondence.

Known Spam Origin

The final type of static list I will discuss is the known spam origin blacklist. These are lists of IP addresses that have previously sent spam either to a user of the system or to a decoy address ( FAQ, 2004).

The major problem with the spam origin lists is that they are not particularly effective and have one of the highest false positive rates of any spam filtering technique. As an example, in research completed by Giga Information Group the black list provider Mail Abuse Prevention Systems, LLC (MAPS) was found to successfully block only 24% of spam but more worrying there was a 34% false positive rate (Gaspar & Gaudin, 2001).

One of the reasons for the high level of false positives by MAPS and some other known spam origin lists is that a vigilante mentality can grow in the groups that operate the lists. One common approach taken by these groups is to block “spam support” organizations. What this often means in implementation is blocking an entire ISP’s network space if they cannot get the ISP to drop a single spammer.

The policy of intentionally blocking innocent customers that happen to share network space with a spammer is called overblocking. As an example of how extreme the overblocking can be, in February of 2002 the Spam Prevention Early Warning System (SPEWS) added all of Interland’s 400,000 customers to its blacklist because Interland had not removed 100 customers that SPEWS accused of spamming (Wagner, 2002, May 23).

These techniques are effective. Many large ISPs have caved under the pressure of having their legitimate customers blocked because they were allowing a few spammers to operate using their network. While overblocking is effective for convincing ISPs to remove known spam operators from their network it also leads to very high false positive rates making these services unusable for anyone who considers false positive results to be a problem.

Statistical Filters

There are a few different statistical models that have been discussed in the academic literature but few methods are currently in production products that identify their filtering method. I will discuss Bayesian filtering and Markovian discriminators. It is possible that there are other statistical models that are in use in proprietary closed systems but since they are by definition closed it is impossible to consider them independently.

Bayesian filtering

Bayesian filtering is based on Bayes’ Theorem. The common implementation assumes that the words in a given message are unrelated to one another; thus the filter is intentionally naïve and is referred to as naïve Bayesian filtering. A corpus of both spam and legitimate e-mail, referred to as ham, is collected to base filtering on. The filter looks at each word in the corpus and gives it a score by comparing the probability of that word appearing in a spam message with the probability of it appearing in a ham message. When looking at a new message, the filter takes the scores of the words in the message that have the highest probability of being either spam or ham words and combines them into a score indicating whether the message is ham or spam.

Properly trained naïve Bayesian filters have reported very high filtering rates with some of the lowest false positive rates seen in any spam filtering method. One technique that is used to reduce the number of false positive results is the doubling of non-spam words. This means that a word found in a non-spam message is twice as important as the same word found in a spam message. This biases the filter toward slightly higher false negatives but substantially lower false positives. Since false positives are the more significant problem, this is a logical tradeoff.
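The scoring and the doubling of non-spam words described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any specific product: the class name, the tiny training corpus, the 0.4 score for unseen words, the clamping to the range 0.01 to 0.99 and the limit of 15 words per message are all assumed values.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Return the set of distinct lowercase words in a message."""
    return set(re.findall(r"[a-z']+", text.lower()))

class NaiveBayesFilter:
    def __init__(self, ham_weight=2, top_n=15):
        self.spam_counts = Counter()  # word -> number of spam messages containing it
        self.ham_counts = Counter()   # word -> number of ham messages containing it
        self.n_spam = 0
        self.n_ham = 0
        self.ham_weight = ham_weight  # doubling of non-spam words
        self.top_n = top_n            # assumed number of most telling words to use

    def train(self, text, is_spam):
        if is_spam:
            self.spam_counts.update(tokenize(text))
            self.n_spam += 1
        else:
            self.ham_counts.update(tokenize(text))
            self.n_ham += 1

    def word_probability(self, word):
        spam = self.spam_counts[word]
        ham = self.ham_weight * self.ham_counts[word]
        if spam + ham == 0:
            return 0.4  # assumed neutral-ish score for unseen words
        p_spam = min(1.0, spam / max(1, self.n_spam))
        p_ham = min(1.0, ham / max(1, self.n_ham))
        return min(0.99, max(0.01, p_spam / (p_spam + p_ham)))

    def spam_probability(self, text):
        probs = sorted(
            (self.word_probability(w) for w in tokenize(text)),
            key=lambda p: abs(p - 0.5),
            reverse=True,
        )[: self.top_n]
        if not probs:
            return 0.5
        spam_like = math.prod(probs)
        ham_like = math.prod(1 - p for p in probs)
        return spam_like / (spam_like + ham_like)

f = NaiveBayesFilter()
for text in ["cheap pills online now", "win money now", "cheap money offer"]:
    f.train(text, is_spam=True)
for text in ["meeting agenda for monday", "lunch on monday", "project agenda attached"]:
    f.train(text, is_spam=False)

print(f.spam_probability("cheap pills now") > 0.9)         # True
print(f.spam_probability("monday meeting agenda") < 0.1)   # True
```

A real filter would be trained on thousands of messages; six messages are used here only to keep the sketch short.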

One crucial issue for Bayesian filtering is the training of the filter. The more e-mail the filter sees, the more accurate its assumptions about words will become. The major weakness of Bayesian filtering is that it is ideally used at the individual user level instead of at the mail gateway level. Essentially, the filter is more capable of learning the quirks of a given user’s good and bad words than it is of learning numerous users’ good and bad words, since different people will have different requirements for what needs to make it through the filter. Even so, several products do successfully implement naïve Bayesian filtering at the gateway level, although the success rates do take a hit (Graham, 2003). The more similar the group being filtered, the more likely that naïve Bayesian filters will have results similar to those of a single user. As an example, a group of doctors will be more likely to receive drug names in their regular e-mail than the population as a whole. If those doctors are grouped together, the false positive rate, at least for those typically highly spam indicative words, will remain low. If you group those same doctors with the population as a whole, you will see a rise in the doctors’ false positive rate, since most of the population does not receive a great deal of legitimate e-mail containing large numbers of drug names. The other users of the system may also see a slight decrease in the effectiveness of the filters for drug related spam.

An approach that tries to improve on existing Bayesian filtering is looking at word groups and the number of times that words repeat. This is probably the future of spam filtering as spam marketers become more adept at circumventing the existing single word Bayesian statistical spam filters. This approach has many of the same advantages and disadvantages inherent in naïve Bayesian filtering. The hope is that as the techniques are improved, multiple word filtering will improve even further on accuracy. A disadvantage of this form of filtering is that it takes much more storage space to store all of the two word combinations seen and their probabilities (Burton, 2004).

Markovian Discrimination

The major problem with naïve Bayesian filtering is that it is by design naïve to the fact that word groupings are significant. Naïve Bayesian filtering looks at each word independently without regard for the words around it. This can lead to successful attacks on the system, such as adding random words in an attempt to include clean words and thus reduce the spam score. This is difficult for the spam mailer since each person will have different spam and ham words, but it is a popular method used by many spam mailing companies. Markovian discrimination looks at groups of words found in spam mail. The more closely a group of words matches known spam word groups, the higher the score given to the group. This group method has shown promise in testing. The overall improvements are small, but this is primarily due to the fact that existing naïve Bayesian filtering is already hovering around a 99.9% success rate, so improvements, while significant, may look minimal (Yerazunis, 2004). In existing implementations false positive rates seem to be similar to naïve Bayesian filtering.
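To illustrate why word groups carry more information than single words, here is a simplified sketch in the spirit of the approach described above. It is not the algorithm from the cited paper: the function names, the maximum group length of three and the weighting that makes longer groups count more are all assumptions made for illustration.

```python
def word_groups(text, max_len=3):
    """Return every run of 1 to max_len consecutive words as a tuple."""
    words = text.lower().split()
    groups = []
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length <= len(words):
                groups.append(tuple(words[start:start + length]))
    return groups

def group_score(text, spam_groups, max_len=3):
    """Weighted share of the message's word groups that match known spam groups.

    Longer groups get a larger (assumed) weight of 4 ** (length - 1).
    """
    total = 0
    matched = 0
    for group in word_groups(text, max_len):
        weight = 4 ** (len(group) - 1)
        total += weight
        if group in spam_groups:
            matched += weight
    return matched / total if total else 0.0

# Hypothetical "known spam" training phrase.
spam_groups = set(word_groups("buy cheap pills online"))

print(group_score("buy cheap pills online", spam_groups))  # 1.0
print(group_score("pills online cheap buy", spam_groups) < 0.5)  # True
```

Both test messages contain exactly the same words, so a single word filter could not tell them apart. Only the message that keeps the known word order scores highly here.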

Challenge Response

Challenge response works by sending a challenge message to the sender every time a message is received from a new e-mail address. The challenge can be anywhere from simply replying to the message to as difficult as following a link and entering a code along with personal information.

Challenge response was practically 100% effective against spam when it was first implemented. Some spam senders are now using fake addresses to send from that can include other known good addresses in a corporate domain or sending from the address of the person the e-mail is sent to. These types of attacks can bypass challenge response systems if the address sent from has previously sent e-mail to the recipient. Some spammers may have an auto reply system set up that will allow them to become part of the white list in challenge response systems that simply require a reply from the sender.

The biggest problem with challenge response is that it has a high rate of false positives and places a burden on legitimate senders of e-mail. False positives occur for several reasons. People forget to white list mailing lists that they have signed up for and then wonder why they quit receiving e-mail. Some people refuse to reply to these systems out of principle, stating that they feel that challenge response systems are akin to the recipient saying “My time is more important than yours. Fix my spam problem for me.” Finally, if two people who have not sent e-mail to each other are both using a challenge response system, they will never see each other’s challenges since their systems will continue to challenge each other. If challenge response becomes a sufficient annoyance to spam operators, they will eventually find ways to automate replying to the challenges.
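The basic flow of a challenge response system can be sketched as a white list plus a queue of held messages. The class and method names below are hypothetical, and the sketch trusts the sender address exactly as given, which is what makes the forged sender attack described above possible.

```python
import secrets

class ChallengeResponseFilter:
    """Hypothetical per-recipient challenge response filter."""

    def __init__(self):
        self.whitelist = set()  # sender addresses that answered a challenge
        self.pending = {}       # challenge token -> (sender, held message)

    def receive(self, sender, message):
        """Deliver mail from known senders; hold and challenge everyone else."""
        sender = sender.lower()
        if sender in self.whitelist:
            return ("deliver", None)
        token = secrets.token_hex(8)
        self.pending[token] = (sender, message)
        return ("challenge", token)

    def confirm(self, token):
        """Handle a challenge reply: white list the sender, release the message."""
        entry = self.pending.pop(token, None)
        if entry is None:
            return None
        sender, message = entry
        self.whitelist.add(sender)
        return message

cr = ChallengeResponseFilter()
action, token = cr.receive("friend@example.org", "Hello!")
print(action)             # challenge
print(cr.confirm(token))  # Hello!
print(cr.receive("friend@example.org", "Second note")[0])         # deliver
print(cr.receive("FRIEND@example.org", "Forged spam payload")[0])  # deliver
```

The last line shows the weakness: a spammer who forges an address already on the white list is delivered without any challenge.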

Gray Listing

Gray listing uses the fact that some spam operators use e-mail servers that blindly send e-mail but do not retry failed connections. Gray listing has a side effect of also stopping mass mailer worms that have their own mailing engine, since these mail engines typically blindly send e-mail. Even spam operators that run full featured e-mail servers may turn off mail retries in an effort to save resources. Gray listing sends out a temporary error message to the sending server. Legitimate mail servers receive these and interpret them as a need to wait a period of time and then try again. All legitimate e-mail should get through unless the sending server is massively misconfigured. The major downside is that during the gray listing period there is a communication delay that some end users may find unacceptable. Gray lists typically cache servers that have successfully resent e-mail. This means that until the cache expires there is only a delay the first time that someone sends from an unknown server. Gray listing also has an unintended consequence of producing some extra work for legitimate mail servers since they have to connect multiple times to send a message. Gray listing will not stop all spam because some spam operations will send retries. If gray listing becomes popular, it is likely that its effectiveness will be reduced significantly as spam mailing companies start to retry e-mail addresses when they get a temporary failure. Gray listing will have a lower success rate than most other spam filters, but it can be effective when used in combination with other spam filtering techniques.
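The retry logic described above can be sketched as follows. The class name, the five minute minimum delay and the 36 day cache lifetime are assumed values for illustration. The sketch keys only on the sending server’s address and returns SMTP style reply codes, with 451 standing for a temporary failure and 250 for acceptance.

```python
class Greylist:
    """Hypothetical gray list keyed on the sending server's IP address."""

    def __init__(self, min_delay=300, cache_ttl=36 * 24 * 3600):
        self.min_delay = min_delay  # assumed seconds a server must wait before retrying
        self.cache_ttl = cache_ttl  # assumed seconds an approved server stays cached
        self.first_seen = {}        # server -> time of first attempt
        self.approved = {}          # server -> time the approval expires

    def check(self, server_ip, now):
        """Return 250 to accept or 451 to ask the server to retry later."""
        expires = self.approved.get(server_ip)
        if expires is not None and now < expires:
            return 250  # cached: no delay for known servers
        first = self.first_seen.setdefault(server_ip, now)
        if now - first >= self.min_delay:
            # The server came back after the delay, so it retries like a real MTA.
            self.approved[server_ip] = now + self.cache_ttl
            del self.first_seen[server_ip]
            return 250
        return 451

g = Greylist()
print(g.check("192.0.2.10", now=0))    # 451: first contact, temporary failure
print(g.check("192.0.2.10", now=60))   # 451: retried too soon
print(g.check("192.0.2.10", now=400))  # 250: retried after the delay
print(g.check("192.0.2.10", now=500))  # 250: served from the cache
```

Passing the current time in as a parameter is a choice made to keep the sketch easy to follow; a real server would read the clock itself. A sender that never retries stays at 451 forever, which is how the blind senders are stopped.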

Problems with gray listing generally come down to systems not properly handling rejected messages. Improperly handled messages can include mailing lists removing a subscriber after a single bounced message. This is a fairly aggressive stance for the mailing list operator since a temporary failure is bound to happen from time to time with even the best maintained mail servers. More worryingly, some versions of Lotus Notes have been reported to not handle messages that have a temporary failure condition. It is possible that other mail transport agents have similar problems. Once again this is a problem that should be fixed by the sending mail admin, but it must be considered when evaluating gray listing. One final annoyance is that some sending servers may have extremely long retry times. This can lead to e-mail delivery being delayed for several hours.

Combined Filters

System administrators have a final option when deciding to implement a spam filter. This option is using a system that mixes the best of the different types of systems to create an overall solution for that organization’s spam problem. Most commercial packages and many of the open source solutions make a mixed approach an option. The only approach that I have seen commonly implemented without the use of any other methods as backup is statistical filtering at the individual user’s mailbox level. By mixing different approaches the administrator has the option to weight different filtering techniques with an appropriate level of trust.
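One way to picture weighting different techniques by trust is a simple weighted sum compared against a threshold. The filter names, weights and threshold below are hypothetical values chosen only to illustrate the idea; each filter is assumed to report a verdict between 0.0 (looks clean) and 1.0 (looks like spam).

```python
# Hypothetical trust weights. The known spam origin list gets a low weight
# here because of the high false positive rates discussed earlier.
WEIGHTS = {
    "open_relay": 3.0,
    "key_word": 1.5,
    "bayesian": 4.0,
    "spam_origin": 0.5,
}
SPAM_THRESHOLD = 5.0  # assumed cutoff

def combined_score(results, weights=WEIGHTS):
    """Sum each filter's verdict (0.0 to 1.0) multiplied by its trust weight."""
    return sum(weights[name] * verdict for name, verdict in results.items())

def classify(results, threshold=SPAM_THRESHOLD):
    return "spam" if combined_score(results) >= threshold else "ham"

strong_evidence = {"open_relay": 1.0, "key_word": 0.0, "bayesian": 0.9, "spam_origin": 1.0}
weak_evidence = {"open_relay": 0.0, "key_word": 0.0, "bayesian": 0.1, "spam_origin": 1.0}

print(classify(strong_evidence))  # spam
print(classify(weak_evidence))    # ham
```

In the second case a listing on a low trust blacklist is not enough on its own to reject the message, which is the point of weighting.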


Data Collection

The data collection method can have a significant impact on the reliability of the results returned by the filter. As such, this is an important consideration. Different methods are appropriate for different techniques.

Static blacklist filters have three major ways that they collect data. They either scan for servers, use decoy addresses or use a nomination system.

Scanning for servers works well for open relay and open proxy blacklists. Since these are both conditions created when a system administrator has incorrectly configured the system in question, it is easy for the blacklisting service to scan for these servers. Actually, what the blacklisting services that use server scanning do is fairly similar to what spammers looking for servers to exploit do. Both groups will scan large portions of IP space for any servers that are configured as open relays or open proxies. Only their motives differ. Scanning servers does have the downside that, regardless of the intent, it looks like an attack to monitoring systems on the scanned network. There have also been issues with scans actually causing servers to crash due to bugs inherent in the server software (Wagner, 2002, March 20).

Decoy addresses are addresses that are specifically set up to receive spam. A real user does not ever use these addresses, so there should not be any legitimate e-mail going to them. Typically these addresses will be included as hidden text inside of a web page or in other public forums where addresses are harvested for spam. This allows automated programs that look for e-mail addresses to find them without regular users getting snared in the spam trap. Anyone who sends to one of these addresses is assumed to be a spammer and added to the blacklist.

Nominations can be a blessing or a curse. They give real users who receive spam an outlet to report the spam to someone who can hopefully do something about it. Unfortunately there are issues with people sometimes forgetting about subscribing to a mailing list and then later reporting it as spam. I manage a small 30,000 user double opt-in list. I have subscribed to a service through AOL, so I see any messages sent to AOL users that generate a spam complaint. In the last few weeks I have had at least one person every week send a spam complaint about the mailing list confirmation message. These are, at least in theory, people who a matter of a few minutes earlier had put their e-mail address into a web form asking to receive e-mail from the list they are complaining about. There are also always a few complaints every time we send a message. I believe some of this may be because people read the subject and mistake it for spam, but I also feel that a great many of these incorrect classifications come from people forgetting that they signed up for the list. As this anecdotal evidence implies, individual end users may not always be the best way of choosing what is spam in a distributed system where many people may be affected if they misclassify mailing list messages as spam.

The last method of data collection is statistical analysis. At the simplest level these consist of a file that contains the probability of every word that had previously been seen in an e-mail message. Based on these probabilities new messages are assigned a probability of being spam.

More complex systems may use groupings of two or more words. This should help to improve accuracy by looking at the writing style of spam and legitimate messages. Multiple word statistical approaches will require a much larger corpus of training messages to give the filter the ability to see as many different combinations of word groups as possible.

How Spam Filtering Slows Spam

These approaches to spam filtering help fight the spam problem in two ways. First, by blocking spam, they leave end users with fewer garbage messages to go through. Most people are not concerned with having to delete a few garbage messages, but the amount of spam has reached a point where individuals manually deleting spam either have to take a productivity reduction in carefully scanning through their e-mail or will themselves start creating false positives by accidentally deleting legitimate messages as spam. Most estimates put human spam filtering at a lower success rate, for both false positives and false negatives, than highly trained statistical filters. Second, beyond removing the annoyance factor for most users, as filters become more effective more ISPs will be able to filter e-mail without having to worry about stopping their clients’ legitimate e-mail. As messages are tagged or deleted before they reach the end user, it will be more difficult for spam senders to get their message through to the very small minority that actually buy their products. This will lead to higher costs of operation and lower profits for spam mailing companies. If the filters are successful enough they may even remove the profit motive completely.

Spam has made e-mail less useful and more expensive, and the problem is only getting worse. In late 2004 we stand at a 65% spam rate, and the amount of spam doubles every 12-18 months. Now is the time for e-mail filtering to become prevalent. There are numerous different methods of filtering, and the goals of the person doing the filtering will to a great extent determine which method they choose to employ. The question is no longer whether to filter spam but what method to use in filtering e-mail.


Bray, H. (2004, June 9). Home PCs big source of spam. Retrieved November 16, 2004, from

Burton, B. (2004). SpamProbe - Bayesian spam filtering tweaks. Retrieved October 17, 2004, from

Cox, J. & Dyrness, C. (2003, May 28). Spam prevention may lead to filtering of legitimate messages [Electronic Version]. Knight Ridder Tribune Business News, 1.

Detailed end user information for MAPS NML listings. (n.d.). Retrieved November 16, 2004, from

Farmer, J. (2003, December 27). An FAQ for part 3: understanding NANAE. Retrieved October 3, 2004, from

Filtering technologies in Symantec Brightmail AntiSpam 6.0. (n.d.). Retrieved November 15, 2004, from

Gaspar, S. & Gaudin, S. (2001, September 10). Spam police. Network World, 18(37), 58-62.

Graham, P. (2003, January). Better Bayesian filtering. Retrieved October 17, 2004, from

Leyden, J. (2004, May 14). Spam fighters infiltrate spam clubs. Retrieved November 16, 2004, from

Mertz, D. (2002, August). Spam filtering techniques: Comparing a half-dozen approaches to eliminating unwanted email. Retrieved November 16, 2004, from

Moody, G. (2004). Spam's tenth birthday today. Retrieved November 16, 2004, from

Open relay database FAQ. (n.d.). Retrieved November 16, 2004, from

Schwartz, E. (2003, July/August). Spam wars. Technology Review, 106(6), 32-39.

FAQ. (n.d.). Retrieved November 13, 2004, from

The state of spam: Impact & solutions. (2003, January). Retrieved November 13, 2004, from

Vaughan-Nichols, S. (2003). Saving private e-mail. IEEE Spectrum, 40(8), 40-44.

Wagner, J. (2002, March 20). Facing legal challenge, blackhole list closes. Retrieved January 8, 2005, from

Wagner, J. (2002, May 23). When spam policing gets out of control. Retrieved October 17, 2004, from

Yerazunis, W. S. (2004). The spam filtering accuracy plateau at 99.9% accuracy and how to get past it. 2004 MIT Spam Conference, January 18, 2004. Retrieved December 22, 2004, from

Note for HTML version. If I refer to a print article that is what I used in my research. I have attempted to find web versions for print articles that I used. There may be some differences between print and online versions of articles.

Note on links in general. Unlike print publications web sites can change their articles so there may be some changes between when I originally looked at a web page and now. is a good way to look at web sites as they were at a previous date.