buryspam

A Bayesian Spam Filter

Introduction

This page describes how to use the buryspam.rb script. For an informative and informal discussion of the practice and theory behind Bayesian spam filtering, please see Paul Graham's A Plan For Spam web page, which provided much of the inspiration for the implementation of this program.

Note that the script, as well as this documentation, is very much a work in progress. Please report any problems to donald@cs.mun.ca.

Local Administrative Notes

The buryspam.rb script was developed using version 1.8.0 of ruby (Dated 2003-08-04). This version of ruby is currently installed on garfield. Other versions of ruby may or may not work as expected.

If the grad partition becomes inaccessible, mail delivery may be adversely affected if you call the script on the grad partition from your ~/.procmailrc file. procmail should try to recover messages that could not be processed due to a missing filter, but I've experienced mixed results in this respect.

To be safe, you may want to copy the buryspam.rb script to your own directory instead of invoking it directly from /users/cs/grad/donald/pub/bin. This way, if the grad partition becomes inaccessible, your copy of the buryspam.rb script will still be accessible.

Is buryspam.rb appropriate for you?

Note that if you use an e-mail client that transfers messages off of garfield (using, for example, POP or IMAP), then the buryspam.rb filtering script may not be appropriate for you. This is because the script relies on having a collection of good and bad messages stored on the server for initialization purposes. It may be possible to use buryspam.rb on the client side to initialize the filter but the file resulting from the initialization would have to be transferred manually to the server. This may be more trouble than it's worth. Also, note that the buryspam.rb filter requires that all messages be stored in Unix mbox format.

buryspam.rb is more suited to those who read their e-mail using client software directly on the mail server itself (e.g. pine, mutt, elm etc.). You also have to be willing to modify configuration files and your .procmailrc file.

Getting started

  1. Move legitimate and spam messages to separate directories: In order for Bayesian filtering to be effective, one must have an existing collection of legitimate e-mail (this includes messages from mailing lists and personal mailings -- both from you and to you) and also a collection of spam messages. Typically, a few thousand of each is required for accurate results, but be careful not to use too many, otherwise the generation of the word probability file may consume too much time and resources. The folders or mboxes containing legitimate messages and spam messages should be placed in separate directories. For example, it is common to keep all legitimate messages in your ~/mail directory. Spam messages can be kept in a subdirectory of this directory (e.g. ~/mail/junk).
  2. Create a ~/.buryspamrc file: This file contains all the configuration options related to the bayesian filtering. Only three parameters must be set explicitly. First, set the good_dirs parameter to the directory containing the legitimate mail. If you have more than one directory that contains legitimate messages, set good_dirs to a comma-separated list of directory names. Next, set the bad_dirs parameter to the directory containing all the spam mail. As with the good_dirs parameter, you can specify a comma-separated list of such directories. Last, set the word_file parameter to the name of the file that will be used to store the word probability database (which is generated automatically in the next step). Ideally, you should specify a file that is in a subdirectory of your ~/mail directory (e.g. ~/mail/lib). Make sure that you create this directory if it doesn't already exist. It is important to specify the full path name of the good/bad directories and the word database file. You may use the tilde ~ character to represent your home directory. All three of these parameters are strings and therefore must be delimited by double quotes in the rc file. For example, the following minimal ~/.buryspamrc file indicates that legitimate messages are stored in the directories ~/mail and ~/mail/old, that a collection of spam messages is stored in the ~/mail/junk directory, and that the word probability database file is ~/mail/lib/words:
    
              good_dirs = "~/mail,~/mail/old"
              bad_dirs  = "~/mail/junk"
              word_file = "~/mail/lib/words"
    
    
    Again, remember that the directory containing the word probability database file (~/mail/lib in the example above) must be created if it doesn't exist. The actual word file itself will be generated in the next step.
  3. Run buryspam.rb --init: When supplied with the --init option, the script will scan through all the legitimate message and spam directories specified in the configuration file and count the messages and word frequencies present in each. After all the messages have been processed, the bayesian probabilities are calculated for each word and the word probability database (specified by the word_file configuration parameter above) is populated. This process will take some time (up to several minutes). If it is taking too long, then you may want to terminate the process (by pressing Ctrl-C) and archive some of the older messages to other directories which will not be processed by the script.
  4. Create/update your ~/.procmailrc file: Make sure you have a ~/.procmailrc file whose contents resemble the following:
    	SHELL=/bin/sh
    	PATH=/bin:/usr/bin
    	LOGFILE=$HOME/mail/.procmail.log
    	MAILDIR=$HOME/mail
    
    	:0fw
    	| /users/cs/grad/donald/pub/bin/buryspam.rb --filter
    
    	:0:
    	* ^X-Bayesian-Spam: Yes
    	junk/spam
    
    Note that the junk directory specified above in the ~/.procmailrc file must be created in your ~/mail directory (if it doesn't already exist).

That's it. After this, all messages you receive will be sent through the bayesian filter. If it detects a spam message it will add the header line X-Bayesian-Spam: Yes to the message. Otherwise, the header line X-Bayesian-Spam: No is added. The ~/.procmailrc file will then deposit any messages with the spam header to the ~/mail/junk/spam folder and leave all your legitimate messages in your default mail spool. Extra header lines containing the actual bayesian value calculated for the message as well as the words that sparked the bayesian filter's "interest" can also be generated by setting the verbose_hdr parameter in your ~/.buryspamrc file to true.
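
As a rough illustration of the tagging step described above, the following Ruby sketch splits a message into header and body at the first blank line and appends the X-Bayesian-Spam header line. The method name tag_message is hypothetical and this is not the code used by buryspam.rb; only the header name and its Yes/No values come from this page.

```ruby
# Hypothetical helper (not taken from buryspam.rb): append an
# X-Bayesian-Spam header line to the header block of a raw message.
def tag_message(raw, is_spam)
  # The header ends at the first blank line; the rest is the body.
  header, separator, body = raw.partition("\n\n")
  verdict = is_spam ? "Yes" : "No"
  "#{header}\nX-Bayesian-Spam: #{verdict}#{separator}#{body}"
end

message = "From: someone@example.com\nSubject: hello\n\nBody text\n"
puts tag_message(message, true)
```

The procmail recipe shown in step 4 then only needs to test for the header line; it never has to look at the message body.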

Important Notes

Configuration File

The format of each line of the configuration file, ~/.buryspamrc, is:

parameter_name = parameter_value

Only the first three parameters given below are compulsory; the remainder are optional and can be used to fine-tune the filter. The defaults, many of which are from Paul Graham's web page described above, are usually sufficient.

The buryspam.rb script tries to be helpful in diagnosing errors in the rc file. If there are any errors in the configuration file, the script will simply pass any messages through unfiltered and then terminate. If the script is being run from your ~/.procmailrc file, errors will be logged to your ~/mail/.procmail.log file (provided LOGFILE is set appropriately in your ~/.procmailrc file).
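
To make the line format concrete, here is a minimal Ruby sketch that reads parameter_name = parameter_value lines into a hash and expands the ~ character in a directory list. The method parse_rc and the fallback home directory are assumptions made for illustration; the real script's parser and its error reporting may well differ.

```ruby
# Hypothetical parser for "name = value" lines; an illustration only,
# not the parser used by buryspam.rb.
def parse_rc(text)
  settings = {}
  text.each_line do |line|
    line = line.strip
    next if line.empty?
    name, value = line.split("=", 2).map(&:strip)
    raise ArgumentError, "malformed line: #{line}" if value.nil? || name.empty?
    # String values are delimited by double quotes in the rc file.
    value = value[1..-2] if value =~ /\A".*"\z/
    settings[name] = value
  end
  settings
end

rc = <<~RC
  good_dirs = "~/mail,~/mail/old"
  bad_dirs  = "~/mail/junk"
  word_file = "~/mail/lib/words"
RC

settings = parse_rc(rc)
home = ENV.fetch("HOME", "/home/user") # the fallback is only a placeholder
# Comma-separated directory lists are split, then ~ is expanded.
good_dirs = settings["good_dirs"].split(",").map { |dir| dir.sub(/\A~/) { home } }
puts good_dirs
```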

The following options are supported by the buryspam.rb script in the ~/.buryspamrc configuration file.

good_dirs
Compulsory? Yes
Type String
Default ""
Description Comma separated string of directories containing legitimate e-mail mboxes. The directories must be fully specified (i.e. no relative pathnames), but the ~ character may be used to represent a home directory.

bad_dirs
Compulsory? Yes
Type String
Default ""
Description Comma separated string of directories containing spam e-mail mboxes. The directories must be fully specified (i.e. no relative pathnames), but the ~ character may be used to represent a home directory.

word_file
Compulsory? Yes
Type String
Default ""
Description Full pathname of the word database to use. The ~ character may be used to represent a home directory.

cache_dir
Compulsory? No
Type String
Default ""
Description This parameter determines the directory for the cache files created during initialization of the filter. The full path name must be specified and the parent directory must exist. The ~ character may be used to represent a home directory.

good_init_weight
Compulsory? No
Type Integer
Default 2
Description Word counts occurring in the messages in the good_dirs directory are multiplied by this amount during initialization of the filter. Setting this too high may result in more false negatives.

bad_init_weight
Compulsory? No
Type Integer
Default 1
Description Word counts occurring in the messages in the bad_dirs directory are multiplied by this amount during initialization of the filter. Setting this too high may result in more false positives.

good_select_weight
Compulsory? No
Type Integer
Default 2
Description During filtering, good and bad words are extracted in the same ratio as the good_select_weight to bad_select_weight (described below) ratio. This parameter biases the filter in favour of not treating messages as spam. Setting this too high may result in more false negatives. Setting this parameter to zero will turn off weighted selection.

bad_select_weight
Compulsory? No
Type Integer
Default 1
Description During filtering, good and bad words are extracted in the same ratio as the good_select_weight (described above) to bad_select_weight ratio. This parameter biases the filter in favour of treating messages as spam. Setting this too high may result in more false positives. Setting this parameter to zero will turn off weighted selection.

ignore_probs
Compulsory? No
Type Float Range
Default 0.3..0.7
Description Do not store words in the probability database whose probabilities lie in this range. For example, the default setting, 0.3..0.7, will cause the initialization routine to only write words to the database whose probabilities are between 0 and 0.3 or between 0.7 and 1. This may help speed up the loading of the word database during filtering as the word database will be (slightly) smaller.

ignore_words
Compulsory? No
Type Regexp
Default //
Description This parameter is used to determine which words to ignore during initialization. If a word matches this regular expression, then it is not written to the probability database. This parameter can be useful if spam starts to creep into a mailing list and you want to ignore words in the message's header that erroneously push the message into the non-spam category. By default, this parameter matches nothing.

ignore_mboxes
Compulsory? No
Type Regexp
Default //
Description During initialization, this parameter is used to determine which mailboxes to ignore when doing word counting and initial probability calculations. If the name of the mailbox matches the regular expression stored in this parameter, then the mailbox is ignored during initialization of the probability database. By default we do not ignore any mboxes.

min_word_num
Compulsory? No
Type Integer
Default 5
Description This parameter determines how many times a word must be encountered before it can be considered for inclusion in the word database as either a good or bad word. This parameter takes into account the respective weightings above. If this parameter is set too small, the word database will grow larger, possibly negatively impacting run-time performance. If it is set too high, the number of words stored in the database will decrease, possibly negatively impacting accuracy.
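
The initialization parameters above line up closely with the word probability formula in Paul Graham's A Plan For Spam. The Ruby sketch below applies that formula using this page's parameter names and defaults (including bad_prob and good_prob, described further down). Whether buryspam.rb computes its probabilities in exactly this way is an assumption; the method name and its arguments are hypothetical.

```ruby
# Sketch of the word probability from Paul Graham's "A Plan for Spam",
# using this page's parameter names and defaults. Whether buryspam.rb
# computes it exactly this way is an assumption.
GOOD_INIT_WEIGHT = 2
BAD_INIT_WEIGHT  = 1
MIN_WORD_NUM     = 5
GOOD_PROB        = 0.01
BAD_PROB         = 0.99
IGNORE_PROBS     = 0.3..0.7

# good_count/bad_count: occurrences of the word in good/spam mail.
# ngood/nbad: number of good/spam messages scanned (assumed positive).
# Returns nil when the word would not be stored.
def word_probability(good_count, bad_count, ngood, nbad)
  g = GOOD_INIT_WEIGHT * good_count
  b = BAD_INIT_WEIGHT * bad_count
  return nil if g + b < MIN_WORD_NUM             # too rare to judge
  good_freq = [1.0, g.to_f / ngood].min
  bad_freq  = [1.0, b.to_f / nbad].min
  prob = bad_freq / (good_freq + bad_freq)
  prob = [GOOD_PROB, [BAD_PROB, prob].min].max   # clamp to good_prob..bad_prob
  IGNORE_PROBS.include?(prob) ? nil : prob       # drop uninformative words
end

p word_probability(0, 40, 1000, 1000)   # word seen only in spam
p word_probability(10, 10, 1000, 1000)  # word seen equally often in both
```

In this sketch a word seen equally often in both collections still leans towards "good" because of the good_init_weight multiplier, and it is then discarded because its probability falls inside the ignore_probs range.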

word_samples
Compulsory? No
Type Integer
Default 15
Description The number of words that the filter uses in calculating the probability that the message is spam.

verbose_hdr
Compulsory? No
Type Boolean
Default false
Description This parameter generates two extra header lines for each message that it processes. The X-Bayesian-Value: header line will indicate the probability that the message is spam and the X-Bayesian-Words: header line will contain the list of words that were used to judge the legitimacy of the message.

spam_threshold
Compulsory? No
Type Float
Default 0.9
Description Any message whose bayesian probability is greater than this number is classified as spam. Other messages are treated as non-spam.
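
For readers who want to see how word_samples and spam_threshold might interact, the following Ruby sketch combines individual word probabilities in the manner described in A Plan For Spam: take the probabilities farthest from 0.5 and combine them. This is an assumption about the general approach, not a copy of the script's logic; the method name and the tiny word database are made up, and the weighted selection controlled by good_select_weight and bad_select_weight is left out.

```ruby
# Sketch of Graham-style scoring; an assumption about the general
# approach, not the actual buryspam.rb implementation.
WORD_SAMPLES   = 15
DEFAULT_PROB   = 0.4   # used for words missing from the database
SPAM_THRESHOLD = 0.9

# words: tokens from the message; db: hash of word => spam probability.
def spam_probability(words, db)
  probs = words.map { |word| db.fetch(word, DEFAULT_PROB) }
  # Keep the probabilities farthest from the neutral value 0.5.
  interesting = probs.sort_by { |prob| -(prob - 0.5).abs }.first(WORD_SAMPLES)
  spam = interesting.inject(1.0) { |acc, prob| acc * prob }
  ham  = interesting.inject(1.0) { |acc, prob| acc * (1.0 - prob) }
  spam / (spam + ham)
end

db = { "viagra" => 0.99, "free" => 0.9, "meeting" => 0.01 }
score = spam_probability(%w[free viagra offer], db)
puts score > SPAM_THRESHOLD ? "X-Bayesian-Spam: Yes" : "X-Bayesian-Spam: No"
```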

poison_threshold
Compulsory? No
Type Float
Default 2.0
Description If Bayesian analysis determined that the message was not spam, but the ratio of bad words to good words at the extrema is greater than this number, the message will be classified as spam. This should reduce the false negatives that occur when spammers deliberately attempt to poison filters by including a lot of superfluous words. When calculating the ratio, the remove_dups parameter is consulted to determine whether duplicate good and/or bad words should be counted when determining the ratio. Also, we require that the number of words used to calculate the ratio be greater than or equal to word_samples. To disable this feature, set poison_threshold to 0.

bad_prob
Compulsory? No
Type Float
Default 0.99
Description This is the probability assigned to a word that has been deemed to be an extremely strong indicator of spam.

good_prob
Compulsory? No
Type Float
Default 0.01
Description This is the probability assigned to a word that has been deemed to be an extremely strong indicator of non-spam.

default_prob
Compulsory? No
Type Float
Default 0.4
Description This parameter determines what probability the filter assigns to words that it has not encountered before in incoming messages. By setting this parameter low, you are giving the message sender the benefit of the doubt.

archive_file
Compulsory? No
Type String
Default ""
Description The name of the file which should hold verbatim copies of all incoming messages. This parameter was useful only for debugging purposes and should generally not be used.

word_length
Compulsory? No
Type Integer Range
Default 3..20
Description Words outside this range are not considered for inclusion in the word probability database. For example, with the default setting of 3..20, words that consist of only one or two characters or words that are greater than twenty characters are rejected for inclusion in the word database.

word_regex
Compulsory? No
Type Regexp
Default /[-A-Z0-9$'\x92\\_\[!]+/i
Description The regular expression used to determine what constitutes a word in a message.
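
A small Ruby sketch may help show how word_regex and word_length work together. It uses the default pattern from this page with one change: the \x92 byte is omitted, because current versions of Ruby reject that escape in a regexp literal in a UTF-8 source file. The helper name extract_words is hypothetical.

```ruby
# The default word_regex from this page, minus the \x92 byte escape
# (left out so the literal is valid in a UTF-8 source file).
WORD_REGEX  = /[-A-Z0-9$'\\_\[!]+/i
WORD_LENGTH = 3..20

# Hypothetical helper: extract candidate words from a piece of text,
# keeping only those whose length lies inside word_length.
def extract_words(text)
  text.scan(WORD_REGEX).select { |word| WORD_LENGTH.include?(word.length) }
end

p extract_words("Buy V1agra now!!! It's a $$$ free-offer")
```

Note that punctuation such as ! and $ is part of a word under the default pattern, so "now!!!" and "$$$" survive as tokens while the one-letter word "a" is rejected.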

decode
Compulsory? No
Type Regexp
Default /\.(te?xt|rtf|html?|scr|pif|exe|com|wpd|doc|xls|ppt|zip?)$/i
Description During initialization and filtering, the script will attempt to decode encoded message attachments for word extraction. This can increase the detection of spam messages which, for example, are base64 encoded. This parameter is a regular expression which is tested against the filename of each attachment. If it matches, the attachment is decoded and words are extracted from the decoded attachment.

max_decode_part
Compulsory? No
Type Integer
Default 1048576
Description Encoded message parts that are larger than this configuration setting are not decoded. The specified size is in bytes, so the default setting will prevent encoded message bodies that are larger than one megabyte from being decoded. To decode ALL message parts, regardless of size, set this parameter to -1. To decode no message parts, set it to 0.
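
The decode and max_decode_part settings can be read together as a single yes/no decision per attachment. The Ruby sketch below shows one way to express that decision and then decodes a base64 sample. The method decode_part? is hypothetical, and the order in which the real script applies the two checks is an assumption.

```ruby
DECODE          = /\.(te?xt|rtf|html?|scr|pif|exe|com|wpd|doc|xls|ppt|zip?)$/i
MAX_DECODE_PART = 1_048_576

# Hypothetical check: should an encoded attachment be decoded?
# encoded_size is the size of the encoded part in bytes.
def decode_part?(filename, encoded_size, max = MAX_DECODE_PART)
  return false unless filename =~ DECODE
  return true  if max == -1   # -1 decodes parts of any size
  return false if max == 0    # 0 decodes nothing
  encoded_size <= max
end

encoded = ["Cheap pills, buy now"].pack("m")  # base64-encode a sample body
if decode_part?("offer.html", encoded.bytesize)
  puts encoded.unpack("m").first              # decode and show the text
end
```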

replace_8bit
Compulsory? No
Type Boolean
Default false
Description When set to true, this parameter will replace 8bit characters in (nonspam) text messages with their corresponding 7bit-clean counterparts during filtering. Such 8bit characters (e.g. 0x92 instead of ') are sent by Microsoft Windows MUAs apparently unbeknownst to the sender. This seems to be occurring more frequently.
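
As an illustration of the kind of substitution replace_8bit performs, this Ruby sketch maps a few Windows-1252 punctuation bytes to plain ASCII. Only the 0x92 case is mentioned on this page; the other three entries in the table, and the helper itself, are assumptions.

```ruby
# Assumed mapping from a few Windows-1252 punctuation bytes to ASCII
# byte values; only the 0x92 case is mentioned on this page.
REPLACEMENTS = {
  0x91 => 0x27, 0x92 => 0x27,  # curly single quotes become '
  0x93 => 0x22, 0x94 => 0x22   # curly double quotes become "
}

# Hypothetical helper: replace known 8-bit bytes, leave the rest alone.
def replace_8bit(text)
  text.bytes.map { |byte| REPLACEMENTS.fetch(byte, byte) }.pack("C*")
end

puts replace_8bit("It\x92s free".b)
```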

spam_file_size
Compulsory? No
Type Integer
Default 5242880
Description When the spam file grows larger than this size, the script attempts to rename it so that a new one is created. This parameter helps prevent spam files from growing without bound and thereby slowing down the initialization process. To prevent the renaming of the spam file, set this parameter to a value less than or equal to zero. The program attempts to determine the spam file based upon the contents of the ~/.procmailrc file.

strip_html_comments
Compulsory? No
Type String
Default limited
Description This parameter determines how HTML comments in messages are handled during initialization and filtering. Comments consist of text delimited by <!-- and -->. This parameter may be set to one of three modes:
all
In this mode, all HTML comments are stripped from the message.
none
In this mode, no HTML comments are stripped from the message.
limited
In this mode, if the word_regex regular expression contains the '-' character, then words in the comment that are directly adjacent to the beginning or ending -- HTML comment delimiters are preserved, while other words are dropped. Such words (in conjunction with the -- characters) can act as useful discriminators. All space-delimited words occurring in HTML comments are deleted.

strip_html_tags
Compulsory? No
Type String
Default invalid
Description During initialization and filtering, this parameter is consulted to determine how to handle HTML tags in HTML messages. The parameter can be set to one of three values:
all
In this mode, all HTML tags are stripped from the message.
none
In this mode, no HTML tags are stripped from the message.
invalid
In this mode, only HTML tags deemed to be invalid are stripped from the message. Using this value can help reduce false negatives by stripping "hidden" words inside tags which could poison bayesian filters.

strip_html_brackets
Compulsory? No
Type Boolean
Default false
Description When an HTML tag or comment is stripped using the strip_html_tags and/or strip_html_comments configuration parameters respectively, this parameter is consulted to determine whether or not the <...> brackets are stripped too. This parameter is set to false by default, because spammers use HTML comments/tags to break up words and the resulting fragmented words can be very helpful in isolating spam. By preserving the brackets, we ensure that the words will remain broken up during tokenization of the message.
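
The effect of strip_html_brackets is easiest to see on a word that a spammer has split with an HTML comment. The Ruby sketch below covers only the all mode of strip_html_comments. Leaving an empty <> behind when the brackets are kept is an assumption about how the script preserves them, and the word pattern used for the demonstration is deliberately simpler than the default word_regex.

```ruby
# Sketch of the "all" comment mode combined with strip_html_brackets.
# Leaving an empty "<>" behind when brackets are kept is an assumption.
def strip_comments(html, strip_brackets)
  html.gsub(/<!--.*?-->/m) { strip_brackets ? "" : "<>" }
end

text = "Buy vi<!-- xyz -->agra today"
p strip_comments(text, false).scan(/[A-Z]+/i)  # words stay fragmented
p strip_comments(text, true).scan(/[A-Z]+/i)   # fragments are rejoined
```

With the brackets preserved, the fragments "vi" and "agra" remain separate tokens, which is the behaviour the default setting aims for.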

remove_dups
Compulsory? No
Type String
Default "none"
Description During filtering, this parameter is used to determine how the filter should handle duplicate words during the evaluation of the message. The parameter can have one of six values:
all
Duplicate words are dropped during filtering.
none
No duplicate words are dropped during filtering. This means that every word in the message will be considered for evaluation even if it occurs several times.
good
During filtering, only duplicate good words are removed from the entire message. Using this value could decrease the false negatives at the expense of introducing more false positives.
bad
During filtering, only duplicate bad words are removed from the entire message. Using this value could decrease the false positives at the expense of introducing more false negatives.
header
Duplicate words (both good and bad) are removed only from the header of the message during filtering.
body
Duplicate words (both good and bad) are removed only from the body of the message during filtering.
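
The following Ruby sketch illustrates the all, none, good and bad modes on a list of words; the header and body modes are not covered. Treating a word as good when its probability is below 0.5 and as bad when it is above 0.5 is an assumption, as are the method name and the sample database.

```ruby
# Sketch of the all/none/good/bad modes only. Treating a word as "good"
# when its probability is below 0.5 is an assumption.
def remove_dups(words, db, mode)
  seen = {}
  words.select do |word|
    prob = db.fetch(word, 0.5)
    applies = case mode
              when "all"  then true
              when "good" then prob < 0.5
              when "bad"  then prob > 0.5
              else false  # "none": keep every occurrence
              end
    first_time = !seen.key?(word)
    seen[word] = true
    # Keep first occurrences; drop repeats only where the mode applies.
    first_time || !applies
  end
end

db = { "free" => 0.9, "meeting" => 0.1 }
words = %w[free meeting free meeting]
p remove_dups(words, db, "bad")
```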

epoch
Compulsory? No
Type Date
Default January 1, 1970
Description This parameter is used during automated diagnostic reporting and represents the start point of the filter testing.

seed_interval
Compulsory? No
Type Float
Default 200.0
Description This parameter is used during automated diagnostic reporting and represents how many days to seed the filter during its first initialization.

init_interval
Compulsory? No
Type Float
Default 14.0
Description This parameter is used during automated diagnostic reporting and represents the length of the initialization interval, in days, that should be used to initialize the filter starting from the first seeding period and for each test interval thereafter.

test_interval
Compulsory? No
Type Float
Default 14.0
Description This parameter is used during automated diagnostic reporting and represents the length of the test interval, in days, that should be used to test the filter. Each test interval is added from the end of the initialization period determined using the init_interval parameter, above.

slide_epoch
Compulsory? No
Type Float
Default 0.0
Description This parameter determines how many days the epoch should be moved ahead for each initialization during automated diagnostic generation. A value of 0.0 keeps the epoch constant.

uniq_serial
Compulsory? No
Type Integer
Default 4
Description During report generation and the renaming of the spam file, serial numbers are used in order to ensure unique file generation. This parameter determines how many digits to use in the serial numbers.

Usage

buryspam.rb <option> [mbox files]

For most (but not all) options, if no mboxes are specified on the command line, then input will be taken from stdin.

option can be specified as follows:

--initialize
Argument Options none
Short Form -i
Description Generate the word probability file. The filter must be initialized once in order to be used. Subsequent initializations may take place at the user's discretion (e.g. whenever a message is misclassified).

--filter
Argument Options none
Short Form -f
Description Filter the messages using Bayesian filtering. This option implies decoding of the message. The Bayesian filter has to be initialized (using --initialize) before filtering can be performed.

--decode-only
Argument Options none
Short Form -d
Description Decode messages only. No filtering is done and the resulting message is displayed to standard output. Useful for debugging purposes.

--colour
Argument Options none
Short Form -c
Description Show the messages with the 'interesting' words colour-coded. Bad words are shown in varying degrees of red, while good words are shown with varying degrees of green. This argument implies decoding.

--report
Argument Options none
Short Form -r
Description Batch filter all messages, then generate a summary report (in the file summary.rpt) of the filtering results. False positives and false negatives encountered are stored in files false_positives.* and false_negatives.* (where * represents a unique serial number) for subsequent examination by the user.

--grep
Argument Options <regex>
Short Form -g
Description Display all messages that contain the pattern regex. The regular expression must be delimited by /.../ and the i, m and x options may be specified at the end of the regular expression. If no mbox files are specified on the command line, then search through all mbox files in the good_dirs and bad_dirs directories. This option does not actually decode base64/quoted-printable messages when doing the grep.
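
To show what a /.../-delimited pattern with trailing options amounts to, here is a Ruby sketch that turns such an argument into a Regexp object. The helper parse_regex_arg is hypothetical; how buryspam.rb itself parses the argument is not documented here.

```ruby
# Hypothetical helper: turn a "/pattern/imx" argument into a Regexp.
def parse_regex_arg(arg)
  match = %r{\A/(.*)/([imx]*)\z}m.match(arg)
  raise ArgumentError, "pattern must be delimited by /.../" unless match
  flags = 0
  flags |= Regexp::IGNORECASE if match[2].include?("i")
  flags |= Regexp::MULTILINE  if match[2].include?("m")
  flags |= Regexp::EXTENDED   if match[2].include?("x")
  Regexp.new(match[1], flags)
end

pattern = parse_regex_arg("/cheap\\s+pills/i")
puts "match" if "Subject: CHEAP  pills" =~ pattern
```

On the command line the pattern would normally need to be quoted so that the shell does not interpret characters such as * or \.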

--auto
Argument Options none
Short Form -a
Description Conduct an automatic, iterative test of the filter to assess its effectiveness. See the ~/.buryspamrc configuration parameters epoch, seed_interval, init_interval, test_interval and slide_epoch for more information.

--stats
Argument Options none
Short Form -s
Description Display monthly volumes of good/bad mail as well as the percentage of spam received during each month.

--verbose
Argument Options none
Short Form -v
Description Display some diagnostic data during processing. This considerably clutters up the output.

--begin
Argument Options '[good|bad] <date|relative-time>'
Short Form -b
Description For --initialize, initialize the filter using the good or bad messages dated on or after date. The good/bad parameter values are optional.

For --decode-only, --colour, --filter, --grep, --stats and --report, process messages dated on or after date.

A relative-time can be specified instead of an absolute date. For example, 2 months will be translated into a timestamp two months in the past. Relative years or days may also be specified. All such times are relative to the present and occur in the past.

Often used with --end argument.

Cannot be used with --auto argument.


--end
Argument Options '[good|bad] <date|relative-time>'
Short Form -e
Description For --initialize, initialize the filter using the good or bad messages dated up to but not including date. The good/bad parameter values are optional.

For --decode-only, --colour, --filter, --grep, --stats and --report, process messages dated up to and including date.

A relative-time can be specified instead of an absolute date. For example, 2 months will be translated into a timestamp two months in the past. Relative years or days may also be specified. All such times are relative to the present and occur in the past.

Often used with --begin argument.

Cannot be used with --auto argument.
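
The relative-time values accepted by --begin and --end can be pictured with the Ruby sketch below, which converts strings such as 2 months into a date in the past. The helper relative_date and the exact syntax it accepts are assumptions; the script may accept other spellings.

```ruby
require "date"

# Hypothetical parser for relative times such as "2 months"; the exact
# syntax accepted by buryspam.rb may differ.
def relative_date(spec, today = Date.today)
  match = /\A(\d+)\s*(day|month|year)s?\z/i.match(spec.strip)
  return nil unless match
  count = match[1].to_i
  case match[2].downcase
  when "day"   then today - count
  when "month" then today << count         # Date#<< steps back by months
  when "year"  then today << (12 * count)
  end
end

puts relative_date("2 months", Date.new(2005, 3, 28))
```

Passing the reference date as a second argument keeps the sketch easy to test; when it is omitted, the current date is used.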


--parameter
Argument Options <attr val ...>
Short Form -p
Description Override the configuration parameters in ~/.buryspamrc using the supplied attribute/value pairs. Can be useful during automated testing.

If standard input is available to the running script, then ALL options are ignored and the script runs in --filter mode.

Exactly one operating mode of --initialize, --decode-only, --colour, --filter, --grep, --stats, --report xor --auto MUST be specified.

If no valid mbox files are given on the command line for the arguments --decode-only, --colour, --filter, --report or --grep, then ALL the mboxes in the good/bad directories (as specified in ~/.buryspamrc) are used.

Sample ~/.buryspamrc File

	good_dirs 	= "~/mail,~/mail/old"
	bad_dirs 	= "~/mail/junk,~/mail/more-junk,~/mail/even-more-junk"
	word_file	= "~/mail/bin/ruby/words"

	word_length	= 3..16
	ignore_probs	= 0.3..0.7
	verbose_hdr	= true
	spam_threshold  = 0.8

Potential Problems/Bugs

Potential Improvements

--
Donald Craig (donald@cs.mun.ca)
Mon Mar 28 12:32:47 NST 2005