This page describes how to use the buryspam.rb script. For an informative and informal discussion of the practice and theory behind Bayesian spam filtering, please see Paul Graham's A Plan For Spam web page, which provided much of the inspiration for the implementation of this program.
Note that the script as well as this documentation is very much a work in progress. Please report any problems to donald@cs.mun.ca
The buryspam.rb script was developed using version 1.8.0 of ruby (Dated 2003-08-04). This version of ruby is currently installed on garfield. Other versions of ruby may or may not work as expected.
If the grad partition becomes inaccessible, mail delivery may be adversely affected if you call the script on the grad partition from your ~/.procmailrc file. procmail should try to recover messages that could not be processed due to a missing filter, but I've experienced mixed results in this respect.
To be safe, you may want to copy the buryspam.rb script to your own directory instead of invoking it directly from /users/cs/grad/donald/pub/bin. This way, if the grad partition becomes inaccessible, your copy of the buryspam.rb script will still be accessible.
Note that if you use an e-mail client that transfers messages off of garfield (using, for example, POP or IMAP), then the buryspam.rb filtering script may not be appropriate for you. This is because the script relies on having a collection of good and bad messages stored on the server for initialization purposes. It may be possible to use buryspam.rb on the client side to initialize the filter but the file resulting from the initialization would have to be transferred manually to the server. This may be more trouble than it's worth. Also, note that the buryspam.rb filter requires that all messages be stored in Unix mbox format.
buryspam.rb is more suited to those who read their e-mail using client software directly on the mail server itself (e.g. pine, mutt, elm etc.). You also have to be willing to modify configuration files and your .procmailrc file.
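Since the filter depends on messages being stored in Unix mbox format, it may help to recall how that format separates messages: each message begins with an envelope line starting with "From ". The following is a minimal sketch of splitting mbox text on that boundary. The split_mbox helper is hypothetical and is not taken from buryspam.rb; it also ignores details such as the blank line that normally precedes an envelope line.

```ruby
# Hypothetical helper (not part of buryspam.rb): split the text of a
# Unix mbox file into individual messages. A new message starts at
# every line that begins with "From " (the envelope line).
def split_mbox(text)
  messages = []
  text.each_line do |line|
    if line.start_with?("From ")
      messages << [line]          # envelope line opens a new message
    elsif messages.any?
      messages.last << line       # continuation of the current message
    end
  end
  messages.map(&:join)
end

sample = "From alice@example.org Mon Mar 28 12:00:00 2005\n" \
         "Subject: hello\n\nBody of the first message.\n\n" \
         "From bob@example.org Mon Mar 28 12:05:00 2005\n" \
         "Subject: offer\n\nBody of the second message.\n"

puts split_mbox(sample).length
```

The addresses and dates in the sample are invented purely for illustration.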
good_dirs = "~/mail,~/mail/old"
bad_dirs = "~/mail/junk"
word_file = "~/mail/lib/words"

Again, remember that the directory containing the word probability database file (~/mail/lib in the example above) must be created if it doesn't exist. The actual word file itself will be generated in the next step.
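The directory can simply be created by hand. If you would rather script the step, the sketch below shows one way to do it in Ruby; ensure_parent_dir is a hypothetical helper name, not something provided by buryspam.rb.

```ruby
require "fileutils"

# Hypothetical helper (not part of buryspam.rb): make sure the directory
# that will hold the word file exists. File.expand_path resolves a
# leading "~" to the home directory.
def ensure_parent_dir(word_file)
  dir = File.dirname(File.expand_path(word_file))
  FileUtils.mkdir_p(dir)   # does nothing if the directory already exists
  dir
end

# Example call (commented out so that nothing is created by accident):
# ensure_parent_dir("~/mail/lib/words")
```

Only the directory is created; the word file itself is still generated by the initialization step.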
SHELL=/bin/sh
PATH=/bin:/usr/bin
LOGFILE=$HOME/mail/.procmail.log
MAILDIR=$HOME/mail

:0fw
| /users/cs/grad/donald/pub/bin/buryspam.rb --filter

:0:
* ^X-Bayesian-Spam: Yes
junk/spam

Note that the junk directory specified above in the ~/.procmailrc file must be created in your ~/mail directory (if it doesn't already exist).
That's it. After this, all messages you receive will be sent through the Bayesian filter. If it detects a spam message, it will add the header line X-Bayesian-Spam: Yes to the message; otherwise, it adds the header line X-Bayesian-Spam: No. The ~/.procmailrc file will then deposit any messages with the spam header in the ~/mail/junk/spam folder and leave all your legitimate messages in your default mail spool. Extra header lines containing the actual Bayesian value calculated for the message, as well as the words that sparked the filter's "interest", can also be generated by setting the verbose_hdr parameter in your ~/.buryspamrc file to true.
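To make the classification step more concrete, here is a minimal sketch of the combining rule described in Paul Graham's A Plan For Spam, followed by the choice of header line. It is an assumption about the general approach rather than a copy of buryspam.rb's code: the method names are hypothetical, duplicate-word handling is left out, and the default values mirror the word_samples, default_prob and spam_threshold parameters.

```ruby
# Sketch of the combining rule from Paul Graham's "A Plan For Spam".
# Method names are hypothetical; this is not buryspam.rb's implementation.
def spam_probability(words, word_probs, samples: 15, default_prob: 0.4)
  # Words missing from the database get the default probability.
  probs = words.map { |w| word_probs.fetch(w, default_prob) }
  # Keep the probabilities furthest from the neutral value 0.5.
  interesting = probs.sort_by { |p| -(p - 0.5).abs }.first(samples)
  spam = interesting.reduce(1.0) { |acc, p| acc * p }
  ham  = interesting.reduce(1.0) { |acc, p| acc * (1.0 - p) }
  spam / (spam + ham)
end

# Choose the header line from the combined probability.
def spam_header(probability, threshold: 0.9)
  "X-Bayesian-Spam: #{probability > threshold ? 'Yes' : 'No'}"
end

word_probs = { "viagra" => 0.99, "meeting" => 0.01 }  # invented sample data
puts spam_header(spam_probability(%w[viagra unknownword], word_probs))
```

Selecting only the most extreme probabilities is what lets a handful of strongly indicative words dominate the result.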
The format of each line of the configuration file, ~/.buryspamrc, is:
parameter_name = parameter_value
Only the first three parameters given below are compulsory; the remainder are optional and can be used to fine-tune the filter. The defaults, many of which are from Paul Graham's web page described above, are usually sufficient.
The buryspam.rb script tries to be helpful in diagnosing errors in the rc file. If there are any errors in the configuration file, the script will simply pass any messages through unfiltered and then terminate. If the script is being run from your ~/.procmailrc file, errors will be logged to your ~/mail/.procmail.log file (provided LOGFILE is set appropriately in your ~/.procmailrc file).
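As an illustration of the line format and of the kind of diagnosis described above, the following sketch reads "parameter_name = parameter_value" lines into a Hash and collects an error message for every line it cannot parse. The parse_rc helper is hypothetical; how buryspam.rb actually parses and validates the file is not shown here, and the values are kept as raw strings rather than converted to their types.

```ruby
# Hypothetical sketch (not buryspam.rb's actual parser): read lines of the
# form "parameter_name = parameter_value" into a Hash of raw strings.
# Blank lines are skipped; any other line that does not match is reported
# as an error together with its line number.
def parse_rc(text)
  params = {}
  errors = []
  text.each_line.with_index(1) do |line, number|
    line = line.strip
    next if line.empty?
    if (m = line.match(/\A(\w+)\s*=\s*(.+)\z/))
      params[m[1]] = m[2]
    else
      errors << "line #{number}: cannot parse #{line.inspect}"
    end
  end
  [params, errors]
end

params, errors = parse_rc(
  "good_dirs = \"~/mail,~/mail/old\"\n" \
  "spam_threshold = 0.8\n" \
  "this line is malformed\n"
)
puts params.inspect
puts errors.inspect
```

Returning the errors instead of raising mirrors the behaviour described above, where a bad configuration file causes messages to be passed through unfiltered rather than lost.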
The following options are supported by the buryspam.rb script in the ~/.buryspamrc configuration file.
good_dirs | |
Compulsory? | Yes |
Type | String |
Default | "" |
Description | Comma separated string of directories containing legitimate e-mail mboxes. The directories must be fully specified (i.e. no relative pathnames), but the ~ character may be used to represent a home directory. |
bad_dirs | |
Compulsory? | Yes |
Type | String |
Default | "" |
Description | Comma separated string of directories containing spam e-mail mboxes. The directories must be fully specified (i.e. no relative pathnames), but the ~ character may be used to represent a home directory. |
word_file | |
Compulsory? | Yes |
Type | String |
Default | "" |
Description | Full pathname of the word database to use. The ~ character may be used to represent a home directory. |
cache_dir | |
Compulsory? | No |
Type | String |
Default | "" |
Description | This parameter determines the directory for the cache files created during initialization of the filter. The full path name must be specified and the parent directory must exist. The ~ character may be used to represent a home directory. |
good_init_weight | |
Compulsory? | No |
Type | Integer |
Default | 2 |
Description | Word counts occurring in the messages in the good_dirs directory are multiplied by this amount during initialization of the filter. Setting this too high may result in more false negatives. |
bad_init_weight | |
Compulsory? | No |
Type | Integer |
Default | 1 |
Description | Word counts occurring in the messages in the bad_dirs directory are multiplied by this amount during initialization of the filter. Setting this too high may result in more false positives. |
good_select_weight | |
Compulsory? | No |
Type | Integer |
Default | 2 |
Description | During filtering, good and bad words are extracted in the same ratio as the good_select_weight to bad_select_weight (described below) ratio. This parameter biases the filter in favour of not treating messages as spam. Setting this too high may result in more false negatives. Setting this parameter to zero will turn off weighted selection. |
bad_select_weight | |
Compulsory? | No |
Type | Integer |
Default | 1 |
Description | During filtering, good and bad words are extracted in the same ratio as the good_select_weight (described above) to bad_select_weight ratio. This parameter biases the filter in favour of treating messages as spam. Setting this too high may result in more false positives. Setting this parameter to zero will turn off weighted selection. |
ignore_probs | |
Compulsory? | No |
Type | Float Range |
Default | 0.3..0.7 |
Description | Do not store words in the probability database whose probabilities lie in this range. For example, the default setting, 0.3..0.7, will cause the initialization routine to only write words to the database whose probabilities are between 0 and 0.3 or between 0.7 and 1. This may help speed up the loading of the word database during filtering as the word database will be (slightly) smaller. |
ignore_words | |
Compulsory? | No |
Type | Regexp |
Default | // |
Description | This parameter is used to determine which words to ignore during initialization. If a word matches this regular expression, then it is not written to the probability database. This parameter can be useful if spam starts to creep into a mailing list and you want to ignore words in the message's header that erroneously push the message into the non-spam category. By default, this parameter matches nothing. |
ignore_mboxes | |
Compulsory? | No |
Type | Regexp |
Default | // |
Description | During initialization, this parameter is used to determine which mailboxes to ignore when doing word counting and initial probability calculations. If the name of the mailbox matches the regular expression stored in this parameter, then the mailbox is ignored during initialization of the probability database. By default, no mboxes are ignored. |
min_word_num | |
Compulsory? | No |
Type | Integer |
Default | 5 |
Description | This parameter determines how many times a word must be encountered before it can be considered for inclusion in the word database as either a good or bad word. This parameter takes into account the respective weightings above. If this parameter is set too small, the word database will grow larger, possibly negatively impacting run-time performance. If it is set too high, the number of words stored in the database will decrease, possibly negatively impacting accuracy. |
word_samples | |
Compulsory? | No |
Type | Integer |
Default | 15 |
Description | The number of words that the filter uses in calculating the probability that the message is spam. |
verbose_hdr | |
Compulsory? | No |
Type | Boolean |
Default | false |
Description | This parameter generates two extra header lines for each message that it processes. The X-Bayesian-Value: header line will indicate the probability that the message is spam and the X-Bayesian-Words: header line will contain the list of words that were used to judge the legitimacy of the message. |
spam_threshold | |
Compulsory? | No |
Type | Float |
Default | 0.9 |
Description | Any message whose Bayesian probability is greater than this number is classified as spam. Other messages are treated as non-spam. |
poison_threshold | |
Compulsory? | No |
Type | Float |
Default | 2.0 |
Description | If Bayesian analysis determined that the message was not spam, but the ratio of bad words to good words at the extrema is greater than this number, the message will be classified as spam. This should reduce the false negatives that occur when spammers deliberately attempt to poison filters by including a lot of superfluous words. When calculating the ratio, the remove_dups parameter is consulted to determine whether duplicate good and/or bad words should be counted when determining the ratio. Also, we require that the number of words used to calculate the ratio be greater than or equal to word_samples. To disable this feature, set poison_threshold to 0. |
bad_prob | |
Compulsory? | No |
Type | Float |
Default | 0.99 |
Description | This is the probability assigned to a word that has been deemed to be an extremely strong indicator of spam. |
good_prob | |
Compulsory? | No |
Type | Float |
Default | 0.01 |
Description | This is the probability assigned to a word that has been deemed to be an extremely strong indicator of non-spam. |
default_prob | |
Compulsory? | No |
Type | Float |
Default | 0.4 |
Description | This parameter determines what probability the filter assigns to words that it has not encountered before in incoming messages. By setting this parameter low, you are giving the message sender the benefit of the doubt. |
archive_file | |
Compulsory? | No |
Type | String |
Default | "" |
Description | The name of the file which should hold verbatim copies of all incoming messages. This parameter was useful only for debugging purposes and should generally not be used. |
word_length | |
Compulsory? | No |
Type | Integer Range |
Default | 3..20 |
Description | Words outside this range are not considered for inclusion in the word probability database. For example, with the default setting of 3..20, words that consist of only one or two characters or words that are greater than twenty characters are rejected for inclusion in the word database. |
word_regex | |
Compulsory? | No |
Type | Regexp |
Default | /[-A-Z0-9$'\x92\\_\[!]+/i |
Description | The regular expression used to determine what constitutes a word in a message. |
decode | |
Compulsory? | No |
Type | Regexp |
Default | /\.(te?xt|rtf|html?|scr|pif|exe|com|wpd|doc|xls|ppt|zip?)$/i |
Description | During initialization and filtering, the script will attempt to decode encoded message attachments for word extraction. This can increase the detection of spam messages which, for example, are base64 encoded. This parameter is a regular expression which is tested against the filename of each attachment. If it matches, the attachment is decoded and word extraction is performed on the decoded attachment. |
max_decode_part | |
Compulsory? | No |
Type | Integer |
Default | 1048576 |
Description | Encoded message parts that are larger than this configuration setting are not decoded. The specified size is in bytes, so the default setting will prevent encoded message bodies that are larger than one megabyte from being decoded. To decode ALL message parts, regardless of size, set this parameter to -1. To decode no message parts, set it to 0. |
replace_8bit | |
Compulsory? | No |
Type | Boolean |
Default | false |
Description | When set to true, this parameter will replace 8bit characters in (nonspam) text messages with their corresponding 7bit-clean counterparts during filtering. Such 8bit characters (e.g. 0x92 instead of ') are sent by Microsoft Windows MUAs apparently unbeknownst to the sender. This seems to be occurring more frequently. |
spam_file_size | |
Compulsory? | No |
Type | Integer |
Default | 5242880 |
Description | When the spam file becomes greater than this size, the program attempts to rename it and lets a new one be created. This parameter helps prevent spam files from growing ever larger and thereby slowing down the initialization process. To prevent the renaming of the spam file, set this parameter to a value less than or equal to zero. The program attempts to determine the spam file based upon the contents of the ~/.procmailrc file. |
strip_html_comments | |
Compulsory? | No |
Type | String |
Default | limited |
Description |
This parameter determines how HTML comments in messages are handled
during initialization and filtering. Comments consist of text delimited
by <!-- and -->. This parameter may be set to
one of three modes:
|
strip_html_tags | |
Compulsory? | No |
Type | String |
Default | invalid |
Description |
During initialization and filtering, this parameter is consulted to
determine how to handle HTML tags in HTML messages. The parameter
can be set to one of three values:
|
strip_html_brackets | |
Compulsory? | No |
Type | Boolean |
Default | false |
Description | When an HTML tag or comment is stripped using the strip_html_tags and/or strip_html_comments configuration parameters respectively, this parameter is consulted to determine whether or not the <...> brackets are stripped too. This parameter is set to false by default, because spammers use HTML comments/tags to break up words and the resulting fragmented words can be very helpful in isolating spam. By preserving the brackets, we ensure that the words will remain broken up during tokenization of the message. |
remove_dups | |
Compulsory? | No |
Type | String |
Default | "none" |
Description |
During filtering, this parameter is used to determine how the
filter should handle duplicate words during the evaluation of the message.
The parameter can have four values:
|
epoch | |
Compulsory? | No |
Type | Date |
Default | January 1, 1970 |
Description | This parameter is used during automated diagnostic reporting and represents the start point of the filter testing. |
seed_interval | |
Compulsory? | No |
Type | Float |
Default | 200.0 |
Description | This parameter is used during automated diagnostic reporting and represents how many days to seed the filter during its first initialization. |
init_interval | |
Compulsory? | No |
Type | Float |
Default | 14.0 |
Description | This parameter is used during automated diagnostic reporting and represents the length of the initialization interval, in days, that should be used to initialize the filter starting from the first seeding period and for each test interval thereafter. |
test_interval | |
Compulsory? | No |
Type | Float |
Default | 14.0 |
Description | This parameter is used during automated diagnostic reporting and represents the length of the test interval, in days, that should be used to test the filter. Each test interval is added from the end of the initialization period determined using the init_interval parameter, above. |
slide_epoch | |
Compulsory? | No |
Type | Float |
Default | 0.0 |
Description | This parameter determines how many days the epoch should be moved ahead for each initialization during automated diagnostic generation. A value of 0.0 keeps the epoch constant. |
uniq_serial | |
Compulsory? | No |
Type | Integer |
Default | 4 |
Description | During report generation and the renaming of the spam file, serial numbers are used in order to ensure unique file generation. This parameter determines how many digits to use in the serial numbers. |
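Several of the initialization parameters above (good_init_weight, bad_init_weight, min_word_num, good_prob, bad_prob and ignore_probs) can be read against the per-word formula in Paul Graham's essay. The sketch below follows that formula with the defaults listed above. It is an assumption about how the parameters might interact, not buryspam.rb's actual code, and the method names word_probability and storable? are hypothetical.

```ruby
# Sketch following the per-word formula in Paul Graham's "A Plan For Spam".
# Keyword names mirror ~/.buryspamrc, but this is not buryspam.rb's code.
# good_count/bad_count: occurrences of the word in good/bad mail.
# ngood/nbad: number of good/bad messages examined (both assumed > 0).
def word_probability(good_count, bad_count, ngood, nbad,
                     good_init_weight: 2, bad_init_weight: 1,
                     min_word_num: 5, good_prob: 0.01, bad_prob: 0.99)
  g = good_init_weight * good_count
  b = bad_init_weight * bad_count
  # Words seen too rarely (after weighting) are not judged at all.
  return nil if g + b < min_word_num || g + b == 0
  good_freq = [1.0, g.to_f / ngood].min
  bad_freq  = [1.0, b.to_f / nbad].min
  # Clamp so that no word is ever treated as absolute proof.
  (bad_freq / (good_freq + bad_freq)).clamp(good_prob, bad_prob)
end

# A word is written to the database only if it was judged and its
# probability lies outside the ignore_probs range.
def storable?(prob, ignore_probs: 0.3..0.7)
  !prob.nil? && !ignore_probs.cover?(prob)
end

puts word_probability(0, 10, 100, 100)
puts storable?(word_probability(5, 10, 100, 100))
```

Doubling the good counts is what biases the filter against false positives: a word must appear noticeably more often in spam than in legitimate mail before it counts against a message.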
buryspam.rb <option> [mbox files]
For most (but not all) options, if no mboxes are specified on the command line, then input will be taken from stdin.
option can be specified as follows:
--initialize | |
Argument Options | none |
Short Form | -i |
Description | Generate the word probability file. The filter must be initialized once in order to be used. Subsequent initializations may take place at the user's discretion (e.g. whenever a message is misclassified). |
--filter | |
Argument Options | none |
Short Form | -f |
Description | Filter the messages using Bayesian filtering. This option implies decoding of the message. The Bayesian filter has to be initialized (using --initialize) before filtering can be performed. |
--decode-only | |
Argument Options | none |
Short Form | -d |
Description | Decode messages only. No filtering is done and the resulting message is displayed to standard output. Useful for debugging purposes. |
--colour | |
Argument Options | none |
Short Form | -c |
Description | Show the messages with the 'interesting' words colour-coded. Bad words are shown in varying degrees of red, while good words are shown with varying degrees of green. This argument implies decoding. |
--report | |
Argument Options | none |
Short Form | -r |
Description | Batch filter all messages, then generate a summary report (in the file summary.rpt) of the filtering results. False positives and false negatives encountered are stored in files false_positives.* and false_negatives.* (where * represents a unique serial number) for subsequent examination by the user. |
--grep | |
Argument Options | <regex> |
Short Form | -g |
Description | Display all messages that contain the pattern regex. The regular expression must be delimited by /.../ and the i, m and x options may be specified at the end of the regular expression. If no mbox files are specified on the command line, then search through all mbox files in the good_dirs and bad_dirs directories. This option does not actually decode base64/quoted-printable messages when doing the grep. |
--auto | |
Argument Options | none |
Short Form | -a |
Description | Conduct an automatic, iterative test of the filter to assess its effectiveness. See the ~/.buryspamrc configuration parameters epoch, seed_interval, init_interval, test_interval and slide_epoch for more information. |
--stats | |
Argument Options | none |
Short Form | -s |
Description | Display monthly volumes of good/bad mail as well as the percentage of spam received during each month. |
--verbose | |
Argument Options | none |
Short Form | -v |
Description | Display some diagnostic data during processing. This considerably clutters up the output. |
--begin | |
Argument Options | '[good|bad] <date|relative-time>' |
Short Form | -b |
Description |
For --initialize, initialize the filter using the good
or bad messages dated on or after date.
The good/bad parameter values are optional.
For --decode-only, --colour, --filter, --grep, --stats and --report, process messages dated on or after date.
A relative-time can be specified instead of an absolute date.
Often used with the --end argument. Cannot be used with the --auto argument. |
--end | |
Argument Options | '[good|bad] <date|relative-time>' |
Short Form | -e |
Description |
For --initialize, initialize the filter using the good
or bad messages dated up to but not including date.
The good/bad parameter values are optional.
For --decode-only, --colour, --filter, --grep, --stats and --report, process messages dated up to and including date.
A relative-time can be specified instead of an absolute date.
Often used with the --begin argument. Cannot be used with the --auto argument. |
--parameter | |
Argument Options | <attr val ...> |
Short Form | -p |
Description | Override the configuration parameters in ~/.buryspamrc using the supplied attribute/value pairs. Can be useful during automated testing. |
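The --grep option above expects its argument in the form /.../ with optional i, m and x letters at the end. As a rough illustration of that syntax, the sketch below turns such a string into a Ruby Regexp. The helper name parse_regex_argument and the OPTION_FLAGS table are hypothetical; buryspam.rb may well handle the argument differently.

```ruby
# Hypothetical helper (not buryspam.rb's code): turn an argument such as
# "/free money/i" into a Regexp. Only the i, m and x letters are accepted.
OPTION_FLAGS = {
  "i" => Regexp::IGNORECASE,
  "m" => Regexp::MULTILINE,
  "x" => Regexp::EXTENDED
}.freeze

def parse_regex_argument(arg)
  match = arg.match(%r{\A/(.*)/([imx]*)\z}m)
  raise ArgumentError, "expected /pattern/ with optional i, m, x" unless match
  # Combine the individual option bits into one flags value.
  flags = match[2].each_char.reduce(0) { |acc, c| acc | OPTION_FLAGS[c] }
  Regexp.new(match[1], flags)
end

re = parse_regex_argument("/free\\s+money/i")
puts re.match?("FREE   money inside")
```

On the command line the pattern would normally need quoting so that the shell does not interpret characters such as * or \ itself.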
If standard input is available to the running script, then ALL options are ignored and the script runs in --filter mode.
Exactly one of the operating modes --initialize, --decode-only, --colour, --filter, --grep, --stats, --report or --auto MUST be specified.
If no valid mbox files are given on the command line for the arguments --decode-only, --colour, --filter, --report or --grep, then ALL the mboxes in the good/bad directories (as specified in ~/.buryspamrc) are used.
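The rule that exactly one operating mode must be given can be sketched as a small check over the argument list. This is a hypothetical illustration rather than buryspam.rb's own option parser: it only looks at the long option names, and it does not model the special case in which standard input is available and the script falls back to --filter mode.

```ruby
# Long names of the operating modes listed in this document.
MODES = %w[--initialize --decode-only --colour --filter
           --grep --stats --report --auto].freeze

# Hypothetical check (not buryspam.rb's own option parser): return the
# single operating mode found in argv, or raise if there is none or more
# than one. Short forms such as -i are left out of this sketch.
def operating_mode(argv)
  chosen = argv.select { |arg| MODES.include?(arg) }
  unless chosen.length == 1
    raise ArgumentError,
          "exactly one operating mode is required, got #{chosen.length}"
  end
  chosen.first
end

puts operating_mode(%w[--filter --verbose inbox])
```

Modifier options such as --verbose, --begin, --end and --parameter are not modes, so they pass through the check untouched.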
good_dirs = "~/mail,~/mail/old"
bad_dirs = "~/mail/junk,~/mail/more-junk,~/mail/even-more-junk"
word_file = "~/mail/bin/ruby/words"
word_length = 3..16
ignore_probs = 0.3..0.7
verbose_hdr = true
spam_threshold = 0.8
--
Donald Craig (donald@cs.mun.ca)
Mon Mar 28 12:32:47 NST 2005