`buryspam`
Version 2

Introduction
Requirements
Upgrading, Installation and Setup
Using IMAP
Misclassified Messages
Displaying Statistics
Changes from Version 1
Command line options
Configuration Parameters: ~/.buryspamrc
Potential Problems
Technical Documentation

Introduction

This page describes Version 2 of the buryspam Bayesian spam filter. Like the first version, this filter was influenced by Paul Graham's A Plan for Spam essay from August 2002. Some modifications to the algorithm have been made to help improve the effectiveness of the filter (see the Changes section). Informal testing suggests that this new filter may identify about 70% to 80% of the spam that was missed by the previous version of buryspam. This new version should also be faster, but requires more memory during initialization (about a gigabyte of memory should be sufficient).

The original buryspam script was intended to be run on mail servers via procmail. While Version 2 can still be used in this manner, it also contains a simple IMAP client which allows it to transfer messages via IMAP prior to filtering. See the Using IMAP section for more details. The script is pre-configured to download messages from MUN's IMAP server (mail.mun.ca).

Note that buryspam is not intended to be used directly by mail user agents (MUA) such as pine, Mozilla Thunderbird or Microsoft Outlook. Instead, it is configured to be used on MUN's computer systems as a mail filtering program invoked by procmail. It may also be used as a transfer/filter client which retrieves messages from a remote IMAP server and sorts spam and non-spam messages into predefined mboxes on the local machine. In both cases, an e-mail client, such as pine, can then be used to read the filtered messages.

Requirements

buryspam has the following requirements:

A relatively recent version of the ruby scripting language must be installed (the current version of buryspam was developed using version 1.8.6).
If you receive messages via a mail transfer agent (MTA) such as postfix or sendmail, then procmail should also be installed on the machine and the MTA should be set up to invoke procmail automatically as messages are received. Modifications to your .procmailrc file will be necessary to call buryspam on received messages. If you intend to receive messages using buryspam's built-in IMAP client instead of an MTA, then there is no need to have procmail installed.
You must also have a collection of recent spam and non-spam messages for initialization purposes — several hundred of each may be sufficient to get you started. To increase the filter's accuracy, it can be reinitialized later as more spam/non-spam messages are received. The filter requires direct access to these messages for initialization. These messages will typically be stored in the mail directory of your CS account.
You should be comfortable using the command line and modifying configuration files manually. If you aren't, then buryspam is probably not for you.

Upgrading, Installation and Setup

Upgrading
If upgrading from a previous version of buryspam you should temporarily comment out any rules in your ~/.procmailrc that invoke the buryspam filter by placing a # character at the beginning of the lines. In particular, rules of the form:
```
:0fw
| $HOME/bin/buryspam.rb --filter
```
should be disabled while the new filter is being set up. This means, of course, that spam will not be filtered during setup of the new filter. The lines will be uncommented in a subsequent step.

It may also be a good idea to make backups of your buryspam.rb and ~/.buryspamrc files and your word file database before continuing so they can be restored if something goes wrong. Note that the word database file and cache files generated by the first version are incompatible with the corresponding files used by the new version of buryspam.
Copy the buryspam.rb file to a subdirectory of your home directory.
The buryspam script currently resides in the ~donald/pub/bin directory. This should be available from any machine with access to the grad partition. You should have your own copy of the script in case the grad partition becomes unavailable for some reason.
```
$ cp ~donald/pub/bin/buryspam.rb ~/bin/buryspam.rb
```
Separate your spam and non-spam mail

The buryspam script needs a collection of both spam and non-spam messages for initialization purposes. Separate your spam and non-spam messages into different directories. The spam and non-spam may be spread across several directories and subdirectories, but each directory should contain only spam or non-spam messages. Each file containing e-mail messages must be in mbox format. Files that do not appear to be in mbox format will be ignored during initialization.
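
As a rough way to check a file before initialization, the sketch below tests whether text looks like an mbox and counts its messages. It assumes the traditional mbox convention in which every message begins with a line starting with `From `; the helper names `looks_like_mbox?` and `count_messages` are hypothetical, and buryspam's own detection logic may differ.

```ruby
# Hypothetical check, not buryspam's own code: treat text as an mbox when its
# first non-empty line starts with the "From " separator.
def looks_like_mbox?(text)
  first = text.each_line.find { |line| !line.strip.empty? }
  !first.nil? && first.start_with?('From ')
end

# Count messages by counting separator lines. This relies on body lines that
# begin with "From " having been escaped (e.g. as ">From ") by the writer.
def count_messages(text)
  text.each_line.count { |line| line.start_with?('From ') }
end

sample = <<~MBOX
  From alice@example.com Mon Oct 27 20:08:52 2008
  Subject: hello

  First body.

  From bob@example.com Tue Oct 28 09:15:00 2008
  Subject: again

  Second body.
MBOX

puts looks_like_mbox?(sample)
puts count_messages(sample)
```

To check a real file, pass it the result of `File.read(path)`.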
Set up your ~/.buryspamrc configuration file.

At minimum, your ~/.buryspamrc file must set three configuration parameters: bad_dirs and good_dirs, each of which is a list of directories separated by commas that hold the spam and non-spam messages, respectively; and word_file which is the full pathname of the file which will be used to store word probabilities and other information necessary for filtering messages. All directories should be absolute (not relative) and may use the tilde (~) to represent your home directory. e.g.,
```
bad_dirs  = "~/mail/junk"
good_dirs = "/users/cs/grad/donald/mail, ~/mail/old"
word_file = "~/mail/lib/buryspam.data"
```
Note that these configuration parameters are the same as in the previous version of buryspam. Because other files and subdirectories (e.g., lock files, cache and log files) are stored in the same directory as the word file, the word file should be located in an empty directory. This directory will be created if it does not already exist.

Many other configuration parameters are also supported for fine-tuning the filter. Two configuration parameters that you may want to set are missed_spam_file and spam_file.
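
To illustrate how a comma-separated directory list such as the `good_dirs` value above can be read, here is a minimal sketch. The helper `expand_dirs` and the home directory `/home/example` are hypothetical, and the home directory is passed in explicitly so the result does not depend on the environment; buryspam's actual handling may differ.

```ruby
# Split a comma-separated directory list and expand a leading "~" using the
# given home directory.
def expand_dirs(value, home)
  value.split(',').map(&:strip).reject(&:empty?).map do |dir|
    # Only a "~" that stands alone or is followed by "/" is expanded.
    dir.sub(%r{\A~(?=/|\z)}, home)
  end
end

dirs = expand_dirs('/users/cs/grad/donald/mail, ~/mail/old', '/home/example')
puts dirs.inspect
```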
Initialize the filter

To initialize the filter, simply give the command:
```
$ ruby ~/bin/buryspam.rb --init
```
This will scan the good/bad directories indicated by the ~/.buryspamrc configuration file and process each of the mboxes in these directories. This may take several minutes to complete. After it is finished, the Bayesian database will be written to the file named by the word_file parameter in your configuration file.

By default, buryspam will use the last three years of good messages and the last one year of bad messages to initialize the filter. If this takes too long to initialize or if buryspam runs out of memory during initialization, the time periods can be shortened by using the bad_init_date_range and/or good_init_date_range configuration parameters in your ~/.buryspamrc file. e.g.,
```
# Use the last three months of spam for initialization.
bad_init_date_range   = 3m

# Use good messages received during the last year for initialization.
good_init_date_range  = 1y
```
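
The simple "number plus unit" form of a date range can be pictured with the sketch below, which converts a value such as `3m` into the earliest time it covers. It is an illustration only: the helper `range_start` is hypothetical, a month is approximated as 30 days and a year as 365 days, and the other date range forms that buryspam accepts are not handled.

```ruby
# Seconds per unit. The month and year lengths are assumptions of this sketch.
UNIT_SECONDS = {
  'd' => 86_400,
  'w' => 7 * 86_400,
  'm' => 30 * 86_400,
  'y' => 365 * 86_400
}.freeze

# Return the earliest Time covered by a range such as "3m", or nil for an
# empty string (meaning "no limit").
def range_start(range, now = Time.now)
  return nil if range.empty?
  match = /\A(\d+)([dwmy])\z/.match(range)
  raise ArgumentError, "unsupported date range: #{range}" unless match
  now - match[1].to_i * UNIT_SECONDS[match[2]]
end

now = Time.utc(2008, 10, 27)
puts range_start('3m', now)
puts range_start('1y', now)
```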
Set up your ~/.procmailrc file

The buryspam filter is executed only when it is invoked by a rule in your ~/.procmailrc file. If you are upgrading from a previous version of buryspam, you can now uncomment the lines that you commented out in Step 1, but please read the rest of this step to ensure that your .procmailrc file is set up correctly.

If you are using buryspam for the first time, here is a sample ~/.procmailrc file:
```
SHELL=/bin/sh
PATH=/bin:/usr/bin
MAILDIR=$HOME/mail
LOGFILE=$MAILDIR/.procmail.log

# Replace 'username' with your login name.
:0fw:/tmp/username.lock
| $HOME/bin/buryspam.rb --filter

:0:
* ^X-Buryspam-Spam: Yes
junk/spam
```
Replace username with your login name (or some other text for the lockfile name). Also make sure that you specify the correct path of your copy of the buryspam.rb script.

This file contains two rules. The first rule calls the buryspam filter which will place an X-Buryspam-Spam header line on messages as part of the filtering process. The second rule uses this header line to determine if the message should be stored in the spam folder of the junk directory. The junk directory is relative to the MAILDIR directory specified near the beginning of the ~/.procmailrc file and must already exist. The directory that stores all your spam (~/mail/junk in this case) should be one of the directories specified by the bad_dirs configuration parameter. A .procmail.log file is created in your MAILDIR directory by procmail. This file may contain helpful diagnostic messages if buryspam fails for some reason. Other rules to automatically sort messages from other sources (e.g., mailing lists, NEWSLINE announcements, etc.) can also be added to your .procmailrc file.

Note: The previous version of buryspam added an X-Bayesian-Spam header to messages. This has been replaced with the X-Buryspam-Spam header line in version 2.

At this point, all messages that you receive should pass through the buryspam filter. Any spam messages should be automatically deposited in your mail/junk/spam file. The rest of your messages should be stored according to the remainder of the rules in your .procmailrc file.

Using IMAP

If your messages are stored on a remote mail server, buryspam can still be used to transfer and filter these messages. The messages are converted to mbox format and stored in a file on your local file system. Note that after a message has been successfully transferred/filtered, it is deleted from the IMAP server. To transfer and filter messages from an IMAP server, use the --transfer command line option:

$ ruby buryspam.rb --transfer

After prompting you for your IMAP password, this will connect to the configured IMAP server and call procmail to process and store each message as it is transferred from the IMAP server. If your ~/.procmailrc file is set up as specified in the previous section, then this will result in another invocation of a buryspam process to filter each message transferred. Because two processes are created per message, this method of transferring/filtering messages is relatively slow. This should not be a problem if you transfer messages regularly, thereby preventing a large number of messages from accumulating on the server.

To automatically transfer messages at a regular time interval, the --poll command line option can be used:

$ ruby buryspam.rb --poll

This is identical to the --transfer option except that instead of terminating after transferring and filtering messages, buryspam will remain running in the background (as a daemon) and periodically connect to the IMAP server to transfer messages and filter them using procmail. See the poll_timer configuration variable for information on how to set the time interval.

If you want to transfer several hundred or more messages from the IMAP server, then the --bulk command line option will likely be considerably faster:

$ ruby buryspam.rb --bulk

This will transfer and filter the messages from the IMAP server within a single buryspam process. All non-spam messages will be saved to your spool file as indicated by the mail_spool configuration parameter and all spam will be saved to the file specified by the spam_file configuration variable. Because your ~/.procmailrc file is ignored, no other sorting of messages will occur. This option is recommended if you don't have a .procmailrc file or if you have a very simple .procmailrc file which does no other sorting of messages beyond depositing spam into a spam folder.

Bulk filtering will be used if the procmail executable named by the configuration is not found on your system, even if the --transfer or --poll options are given on the command line.

There are several configuration parameters related to transferring and filtering messages via IMAP, including: imap_client, imap_inbox, imap_port, imap_server, imap_use_ssl, imap_username, mail_spool, poll_timer, procmail and spam_file.
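
The transfer-then-delete behaviour described in this section can be sketched as follows. This is not buryspam's implementation: `transfer_all` is a hypothetical helper that works with any object offering the `select`, `search`, `fetch`, `store` and `expunge` methods of Ruby's `Net::IMAP`, and it flags a message as deleted only after the caller's block has handled it without raising.

```ruby
# Hand every message in `inbox` to the block, then flag it as deleted.
# If the block raises, the exception propagates and the current message is
# left untouched on the server.
def transfer_all(imap, inbox = 'INBOX')
  imap.select(inbox)
  transferred = 0
  imap.search(['ALL']).each do |seqno|
    raw = imap.fetch(seqno, 'RFC822').first.attr['RFC822']
    yield raw                                # filter and store the message
    imap.store(seqno, '+FLAGS', [:Deleted])  # reached only on success
    transferred += 1
  end
  imap.expunge                               # permanently remove flagged messages
  transferred
end

# Possible usage against a real server (host, user and password are
# placeholders; requires the net/imap library):
#
#   require 'net/imap'
#   imap = Net::IMAP.new('mail.example.com', port: 993, ssl: true)
#   imap.login('jdoe', password)
#   transfer_all(imap) { |raw| File.open('inbox.raw', 'a') { |f| f << raw } }
#   imap.logout
#   imap.disconnect
```

Deleting only after the block returns means a failure while filtering leaves the message on the server for a later attempt.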

Misclassified Messages

False Positives

As the filter is used, false positives (non-spam messages incorrectly identified as spam) may occur, but they should be very rare. If you haven't received a message you were expecting, then the message may have been identified as a spam message and sent to your spam folder. Check the spam files in the directories specified by the bad_dirs configuration parameter. You can use the --grep option of buryspam to search spam files for regular expressions, which makes finding false positives easier, e.g.,

$ ruby buryspam.rb --grep '/Name|Thing/' spam* |less

Non-spam messages that have been incorrectly identified as spam should be moved out of the spam directory and into one of the good_dirs directories. Note that spam files are automatically renamed before they grow beyond a certain size (see the spam_file_size configuration parameter), so there may be many spam files.
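
The `/pattern/` argument form used with --grep (optionally followed by `i` for a case-insensitive search) can be illustrated with the sketch below, which turns such a string into a Ruby `Regexp`. The helper `parse_pattern` is hypothetical and is not taken from buryspam itself.

```ruby
# Turn an argument such as "/Name|Thing/" or "/name/i" into a Regexp.
def parse_pattern(arg)
  match = %r{\A/(.*)/(i?)\z}m.match(arg)
  raise ArgumentError, 'pattern must look like /.../ or /.../i' unless match
  options = match[2] == 'i' ? Regexp::IGNORECASE : 0
  Regexp.new(match[1], options)
end

re = parse_pattern('/Name|Thing/')
puts re.match?('Subject: Thing for sale')
puts re.match?('Subject: nothing here')
```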

False Negatives

If the spam filter incorrectly identifies a spam message as non-spam (false negative), don't delete the message. Instead, either save it to a file in the spam messages directory or save it to a file which will be moved to a spam message directory during a subsequent initialization of the filter with --init (see the missed_spam_file configuration parameter). If the filter is repeatedly misclassifying messages that have the same content, then the filter should probably be reinitialized.

Leaving spam messages in a directory specified in the good_dirs directory list or leaving non-spam messages in a directory specified by the bad_dirs configuration parameter may decrease the effectiveness of the filter the next time it is initialized.

If the filter seems to be generating a lot of false positives or false negatives, then it is possible that you may have initialized the filter with some spam and/or non-spam messages in the wrong directories. Turn on the verbose_hdr configuration parameter and wait for subsequent misclassifications to occur. Then study the X-Buryspam-Words: header lines of messages which have been misclassified to see what discriminating words (and probabilities) the filter used to classify the message. If a lot of words with high probabilities are appearing in your non-spam messages or if a lot of words with low probabilities are appearing in your spam messages, then you can --grep for these words in the messages that were used to initialize the filter to determine if your initialization corpus was tainted.

Troubleshooting and Log files.

If there are any problems, you should check your ~/mail/.procmail.log file (specified in the .procmailrc file, above). You can also check the log files in the buryspam log directory. Searching the log files for lines that begin with E may be helpful.

By default, the log directory is named log and is located in the same directory as the word_file. More information can be generated in the log file by setting the log_level configuration variable to debug.

If buryspam encounters a message from the IMAP server that it has trouble processing, any messages following the troublesome message may not get processed. As a result, buryspam will become wedged and will not download any messages from the IMAP server. To fix this problem, it may be necessary to log in to the IMAP server directly (via a webmail interface, for example) and delete the message that is causing the problem.

Displaying Statistics

Various statistics about the rate and amount of spam you receive as well as the filter's effectiveness in blocking it can be obtained using the --stats option.

$ ruby buryspam.rb --stats

Buryspam v2.0.0 (hostname)
Statistics...

Last reinitialized:   2008-10-27 20:08:52 (3.8 days ago)
Unprocessed messages: 0
Total spams:          3657
Spam frequency:       971.2 spams/day
Spam period:          89.0 secs/spam
False negatives:      13
False negatives/day:  3.5
Accuracy:             99.64%

If it has been a long time since the filter was initialized, then this command may take a while to run — be patient.
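
The derived figures in the sample output above appear to follow from simple arithmetic on the raw counts. The sketch below reproduces two of them; the formulas are inferred from the sample numbers rather than taken from buryspam's source, and the method names are hypothetical.

```ruby
SECONDS_PER_DAY = 86_400.0

# Spam messages per day over the period since initialization.
def spams_per_day(total_spams, days)
  total_spams / days.to_f
end

# Average number of seconds between spam messages.
def secs_per_spam(per_day)
  SECONDS_PER_DAY / per_day
end

# Percentage of spam that was caught, counting false negatives as misses.
def accuracy_percent(total_spams, false_negatives)
  100.0 * (total_spams - false_negatives) / total_spams
end

printf("Spam period: %.1f secs/spam\n", secs_per_spam(971.2))  # 89.0
printf("Accuracy:    %.2f%%\n", accuracy_percent(3657, 13))    # 99.64
```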

Changes from Version 1

Although buryspam has been completely rewritten, its basic usage hasn't been changed considerably in Version 2. The filter is initialized and used the same way as the previous version. Internally, the code is tighter and in some cases faster than the previous version. Some of the major changes include:

When selecting discriminating word samples during filtering, proportional subsampling is used when the number of samples exceeds the sample size. This should reduce the number of poisoned spam messages, which try to get through the filter by including a lot of random words.
The filter maintains lists of IP addresses from which non-spam and spam messages are received. When a new message is received, the octets of its IP address are compared with the octets of the IP addresses from which spam was received. Each matching octet causes a good word to be shifted out of the discriminating word samples and a fictional bad "word" to be introduced into the word samples.
There are now two cache files for each mbox. Also, the cache files consist of serialized hashes and store more information. This should speed up initialization.
In order to conserve memory, the previous version of buryspam would only load parts of an mbox into memory at one time during initialization. This version loads entire mboxes into memory as they are processed. This simplifies the code considerably but requires more memory than the previous version of buryspam.
In the previous version, new messages had to be filtered by calling buryspam from a rule in the .procmailrc file. Now, messages can be retrieved from an IMAP server and subsequently filtered. Periodically polling for messages on an IMAP server can also be performed by running buryspam as a daemon. See the Using IMAP section for more details.
Configuration parameters and command line options of questionable usefulness have been eliminated, particularly those related to HTML message parsing and to automated testing and report generation.
A rotating log file (provided by ruby's Logger library) is now used which records actions taken by buryspam as it runs. This log file may be useful in helping to diagnose problems encountered by buryspam. New configuration parameters related to logging are: log_file, log_file_count, log_file_size and log_level.
More comprehensive help is available with the --help command line option. When given a command line option or a configuration parameter as an argument to the --help option, documentation related to that command line option or configuration parameter is displayed. For example:
```
$ ruby buryspam.rb --help word_file
```
will display help on the word_file configuration parameter. HTML documentation for all the command line options and configuration parameters can also be generated.
This version of buryspam now makes use of lockfiles to prevent unsynchronized access to files and to prevent two related buryspam processes from being run simultaneously.
The probabilities assigned to words have their precision limited to three digits after the decimal point by default, but this can be changed via the precision parameter.
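
For readers unfamiliar with the underlying approach, the sketch below shows the combining formula from Paul Graham's A Plan for Spam together with the precision limit mentioned in the last item. It is only an illustration: buryspam modifies Graham's algorithm, so its actual calculation may differ, and the method names here are hypothetical.

```ruby
# Combine per-word spam probabilities as in "A Plan for Spam":
#   P = prod(p) / (prod(p) + prod(1 - p))
# Assumes no probability is exactly 0.0 or 1.0.
def combined_probability(word_probs)
  spam = word_probs.reduce(1.0) { |acc, p| acc * p }
  ham  = word_probs.reduce(1.0) { |acc, p| acc * (1.0 - p) }
  spam / (spam + ham)
end

# Limit a probability to a number of digits after the decimal point
# (three here, mirroring the documented default).
def limit_precision(prob, digits = 3)
  prob.round(digits)
end

probs = [0.99912, 0.95, 0.2].map { |p| limit_precision(p) }
puts probs.inspect
puts combined_probability(probs)
```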

Command line options

The available command line options are displayed in the usage message which is provided when the script is executed without any command line options.

$ ruby buryspam.rb

Buryspam v2.2.1 (garfield)
Help...

Usage: buryspam.rb <option> [<mboxes>]

Available command line option modes:

--init                 Initialize the word probability and IP address database.
--filter               Filter mboxes, testing for spam.
--transfer             Transfer/filter mail from IMAP server using procmail.
--poll                 Do --transfer periodically.
--bulk                 Transfer/filter mail from IMAP server without procmail.
--decode               Decode messages in mboxes (i.e., base64/qp encodings).
--grep '/pattern/'     Display messages that contain 'pattern' (implies decode).
--colour               Assign probability colours (in HTML) to messages' words.
--stats                Display various statistics regarding filter performance.
--log [severity]       Display log messages with specified severity (d,i,w,e,f).
--help [option|param]  Display this help message or display help for the
                       specified command line option or configuration parameter.

For --filter, --decode and --grep, mbox input is taken from either stdin
or from command line filenames.  The other modes do not use mboxes.
NOTE: Only one option mode can be specified at a time.

Other options:

--override <params>    Override configuration parameters.

--override can be used in conjunction with the other modes above.
<params> is string of the form: 'param = value; param = value; ...'
Run 'ruby buryspam.rb --help override' for more details

A more detailed description of each of the command line options is given below:

Option:	--bulk (-b)
Arguments:	None
Description:	Like `--transfer` and `--poll`, this option requests your IMAP password and connects to the IMAP server to transfer all messages from the IMAP server's inbox to the local system. Messages are deleted from the IMAP server as they are transferred and filtered successfully. Unlike `--transfer` (and `--poll`), `procmail` is not used to filter the messages. Instead, all filtering is done in-process. While this significantly speeds up the filtering of messages as they are transferred from the IMAP server, filtering is limited to storing non-spam messages in your mail spool and storing all spam messages in the spam file determined by the `spam_file` configuration parameter. Any filtering normally done by `procmail` and your `.procmailrc` will not be performed. If the `procmail` executable is not found on your system, bulk filtering will be performed even if `--transfer` or `--poll` is specified on the command line. This option cannot be used if another `buryspam` process is running with the `--transfer`, `--poll` or `--bulk` option. To use this option, you must run `buryspam` on the machine specified by the `imap_client` configuration parameter.

Option:	--colour (-c)
Arguments:	None
Description:	Decode the given messages and colourize the words according to their Bayesian probabilities. Good words are assigned various shades of green and bad words are assigned various shades of red. The further the probability from 0.5, the more intense the colour. Colours that match the `good_prob` or `bad_prob` values are assigned a blinking attribute in the generated HTML output.

Option:	--decode (-d)
Arguments:	None
Description:	Decode the messages in the supplied mboxes. Any base64 and quoted-printable message parts will be decoded and displayed along with the rest of the message. Message parts whose `Content-type` matches the `undecodable` configuration parameter will not be decoded. The mboxes will be read either from standard input or from the file names specified on the command line. This mode is used mainly for debugging purposes.

Option:	--filter (-f)
Arguments:	None
Description:	Use Bayesian analysis to filter the messages in the supplied mboxes. The Bayesian database must first have been initialized using `--init`. The mboxes will be read either from standard input or from the file names specified on the command line. This mode is typically used when running buryspam as the filter in the `.procmailrc` file.

Option:	--grep (-g)
Arguments:	Required
Description:	Search the given mboxes for messages that contain the given regular expression argument. The regular expression must be enclosed by slashes (`/.../`). Case insensitive searches may be performed by appending an "`i`" after the last slash. The regular expression should also be enclosed by `'...'` to prevent the shell from mangling it. For example: `$ buryspam.rb -g '/search (text\|string)/i' mbox_file` Messages are decoded before being searched. This implies that `undecodable` message parts will be removed from the results. Message headers and bodies are both searched. To search for strings that may be broken across lines, replace spaces in the search string with `\s+`. Searching may take a long time if many large mboxes are specified on the command line.

Option:	--help (-h)
Arguments:	Optional
Description:	Display a help message on stdout that briefly describes all the command line options of `buryspam`. If you supply the name of a configuration parameter or a command line option (without the leading `--`), a related help message will be displayed. Redirecting the output to a file or a pipe will result in the help message being converted to HTML. Supplying an invalid command line option or configuration parameter will show you a list of all command line options and configuration parameters. If the argument to this option is '`*`' (this should be quoted to prevent shell expansion), then help will be generated for all configuration parameters and command line options in HTML format. This can be used to generate documentation for web pages.

Option:	--init (-i)
Arguments:	None
Description:	Initialize the Bayesian database by scanning all the directories that contain good/bad mbox files (see `good_dirs` and `bad_dirs`) and assigning Bayesian probabilities to all the words. Also extract the IP addresses from which both non-spam and spam messages were sent. The words, probabilities and IP addresses are stored in the Bayesian database which will be consulted during subsequent filtering of new messages. The initialization process may take several minutes the first time, but should be faster during subsequent initializations because of caching.

Option:	--log (-l)
Arguments:	Optional
Description:	Show log messages. The optional argument is one or more of the following letters representing the severity of the log messages to display: `d` (debug), `i` (info), `w` (warn), `e` (error), `f` (fatal). By default, only (e)rror and (f)atal messages are displayed.

Option:	--override (-o)
Arguments:	Required
Description:	Override the configuration parameters in `~/.buryspamrc` using a string argument made up of parameter/value pairs. The string argument must be of the form: `'param1 = value1; param2 = value2; ...'` (To inhibit shell interpretation, you may want to enclose the entire argument with quotes.) This option allows one to temporarily change the configuration of `buryspam` without having to actually modify the `~/.buryspamrc` file. For example, the Bayesian database can be initialized faster by turning off logging, as follows: `$ buryspam.rb --init -o log_file=` To transfer messages from a different IMAP server, try: `$ buryspam.rb -t -o 'imap_server = mail.server.com; imap_username=jdoe'`

Option:	--poll (-p)
Arguments:	None
Description:	After obtaining the password for the IMAP account, this option will periodically connect to the IMAP server to transfer messages to the local system and filter them. Messages are deleted from the server as they are transferred and filtered. Like the `--transfer` option, filtering will be done by `procmail`, if it is available, in accordance with the rules in your `.procmailrc` file. If `procmail` is not available, then filtering will be performed as described by the `--bulk` option. When polling has been started, you cannot re-run `buryspam` using the `--poll`, `--transfer` or `--bulk` options. The `buryspam` process running with the `--poll` option must be terminated (via `kill`, for example) first. As with `--transfer` and `--bulk`, when using the `--poll` option, you must run `buryspam` on the machine specified by the `imap_client` configuration parameter.

Option:	--stats (-s)
Arguments:	None
Description:	Display statistics regarding the performance of the Bayesian filter since it was last initialized. Stats related to the number of spam messages received as well as the frequency and period are displayed. Postmarks of any spam messages dated after the initialization that were not processed by buryspam are listed. The accuracy of the filter in terms of percentage of spam messages identified is also displayed. This information can be used to determine when the filter should be reinitialized. Generation of statistics could take a long time if it has been a while since the spam filter was initialized.

Option:	--transfer (-t)
Arguments:	None
Description:	This option will query for your IMAP password and connect to the IMAP server to transfer and filter messages. Messages are deleted from the server as they are transferred and filtered. If `procmail` is available, then it will be used to filter messages after they are transferred from the server in accordance with the rules in your `.procmailrc` file. Note that if there are a lot of messages on the server, it may take a long time to transfer and filter all the messages. Consider using the `--bulk` option in this case. If `procmail` is not available, bulk filtering will be employed: all spam messages will be saved to the spam file determined by the `spam_file` configuration parameter and all non-spam messages will be saved to your mail spool. (Any rules in your `.procmailrc` file are ignored.) You must run `--transfer` requests on the same machine as that specified by the `imap_client` configuration parameter. This option cannot be used if another `buryspam` process using the `--poll` or `--bulk` option is currently running.

Configuration Parameters: `~/.buryspamrc`

Configuration parameters are stored in the ~/.buryspamrc configuration file which is read during startup. The syntax for each line is:

parameter_name = value

Spaces and tabs are ignored during parsing. Lines beginning with '#' are ignored. Leading and trailing quotes around the value are ignored. The buryspam.rb script will terminate prematurely if there are any errors in the ~/.buryspamrc file. Configuration parameters may be overridden on the command line with the --override command line option. A complete description of all available configuration parameters is given below. Note that the word_file, bad_dirs and good_dirs configuration parameters must be specified.
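
The parsing rules above can be made concrete with a small sketch. It is not buryspam's parser: `parse_config` is a hypothetical helper, and it assumes that only the whitespace surrounding the name and the value is dropped and that one pair of surrounding quotes is removed.

```ruby
# Parse text in the "parameter_name = value" form into a Hash.
def parse_config(text)
  params = {}
  text.each_line do |line|
    line = line.strip
    next if line.empty? || line.start_with?('#')   # skip blanks and comments
    name, value = line.split('=', 2)
    raise ArgumentError, "malformed line: #{line}" if value.nil?
    # Drop surrounding whitespace and one leading/trailing quote character.
    params[name.strip] = value.strip.gsub(/\A["']|["']\z/, '')
  end
  params
end

config = parse_config(<<~RC)
  # Directories used for initialization.
  bad_dirs  = "~/mail/junk"
  word_file = "~/mail/lib/buryspam.data"
RC
puts config['word_file']
```

Raising on a malformed line mirrors the documented behaviour of stopping when the configuration file contains errors.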

Parameter:	archive_file
Type:	file
Default:	""
Description:	All messages received by `buryspam` are stored in this file prior to filtering. If the pathname isn't absolute, then it will be relative to the directory of the `word_file` parameter. If this is set to an empty string, then archiving is disabled.

Parameter:	archive_file_size
Type:	size
Default:	0
Description:	The maximum size of the archive file. If a new message pushes the archive file beyond this size, the archive file will be renamed with a current date suffix and a new archive file will be created. If the size is less than or equal to zero, then archiving is disabled. Values for this parameter can either be an integer or an integer followed by an IEC/SI suffix: `KB`, `MB`, `GB`, `KiB`, `MiB` and `GiB`. e.g., `20MB` and `1GiB`.

Parameter:	bad_dirs (Compulsory)
Type:	dirs
Default:	""
Description:	This is a list of comma-separated directory names that contain spam mboxes to be read when initializing the filter. The directory names may begin with `"~"` to represent a home directory. Directory names should be absolute. (Nonabsolute directory names are assumed to be relative to the `word_file` directory.) This parameter must be specified.

Parameter:	bad_init_date_range
Type:	date_rng
Default:	"1y"
Description:	The Bayesian filter will be initialized using all spam messages that occurred over this specified range. If the range is `""` (empty string), then all spam messages in the `bad_dirs` directories will be used for initialization. Date ranges are typically specified with a number followed by a unit of time (`d` = days, `w` = weeks, `m` = months, `y` = years). e.g., `3m` will examine all messages received during the past three months during initialization. Other date ranges are possible. See the DateRange class in the technical documentation below for details.

Parameter:	bad_init_weight
Type:	int
Default:	1
Description:	Specifies the factor by which to multiply the counts of all words encountered in spam messages when calculating each word's probability. Increasing this value may reduce false negatives.

Parameter:	bad_prob
Type:	float
Default:	0.999
Restrictions:	must be between 0.0 and 1.0
Description:	The maximum bad probability that can be assigned to a word.

Parameter:	cache_dir
Type:	dir
Default:	"cache"
Restrictions:	cannot be blank
Description:	All the cache files used during initialization of the Bayesian database are stored in this directory. The directory structure of the cache directory will mirror the directory structure of the `good_dirs` and `bad_dirs` directories. If a full pathname is not given, `cache_dir` will be relative to the `word_file` directory.

Parameter:	default_prob
Type:	float
Default:	0.5
Restrictions:	must be between 0.0 and 1.0, inclusive
Description:	The default probability to assign to words that do not have a probability assigned to them during filtering. This is used when a message does not contain enough discriminating words.

Parameter:	filter_date_range
Type:	date_rng
Default:	""
Description:	If a message is not within this date range, then don't filter it — just pass it through without adding any spam headers. An empty string will cause all messages to be processed.

Parameter:	fwd_from
Type:	str
Default:	""
Restrictions:	must contain an '@' symbol
Description:	The email address from which forwarded messages should appear to originate. If specified, the address must contain an `@` symbol. This address will be placed in the `Return-Path:` header line when the message is received at the forwarded destination and will be used as the recipient of any bounce messages should the forwarding be unsuccessful.

Parameter:	fwd_inhibit
Type:	regex
Default:	""
Description:	Don't forward messages that are filtered to a folder matching the regular expression assigned to this variable. The regex is matched against the folder name saved in the procmail log file.

Parameter:	fwd_smtp_port
Type:	int
Default:	25
Description:	The port of the server used for forwarding messages as they are received from the IMAP server. See `fwd_smtp_server`.

Parameter:	fwd_smtp_server
Type:	str
Default:	"smtp.mun.ca"
Description:	The hostname of the machine which is to be used for forwarding messages as they are received from the IMAP server. This server is used as a last resort should attempts to deliver messages to the MX server(s) of the `fwd_to` domain fail.

Parameter:	fwd_to
Type:	str
Default:	""
Restrictions:	must contain an '@' symbol
Description:	The email address to which all messages should be forwarded as they are received from the IMAP server. If this parameter is empty, then no forwarding is performed. If specified, the address must contain an `@` symbol. Both spam and non-spam messages are forwarded to the address. If an unrecoverable exception occurs while forwarding to the remote machine, the message is left on the IMAP server without being transferred to the local machine. If forwarding is successful, then the message is filtered and stored on the local machine, and deleted from the IMAP server, as usual.

Parameter:	good_dirs (Compulsory)
Type:	dirs
Default:	""
Description:	This is a list of comma-separated directory names that contain non-spam mboxes to be used when initializing the filter. The directory names may begin with `"~"` to represent a home directory. Directory names should be absolute. (Nonabsolute directory names are assumed to be relative to the `word_file` directory.) This parameter must be specified.

Parameter:	good_init_date_range
Type:	date_rng
Default:	""
Description:	The Bayesian filter will be initialized using all non-spam messages that occurred over this specified range. If the range is `""` (empty string), then all non-spam messages in the `good_dirs` directories will be used for initialization. See the `bad_init_date_range` parameter for the format of a date range.

Parameter:	good_init_weight
Type:	int
Default:	1
Description:	Specifies the factor to multiply the counts of all words encountered in non-spam messages when calculating each of the words' probabilities. Increasing this value may reduce false positives.

Parameter:	good_prob
Type:	float
Default:	0.001
Restrictions:	must be between 0.0 and 1.0
Description:	The minimum good probability that can be assigned to a word.

Parameter:	ignore_mboxes
Type:	regex
Default:	""
Description:	All spam/non-spam mboxes that match this regular expression are ignored during initialization. The full pathname of each mbox is tested against this regular expression. A regular expression of `//` causes all mbox files to be ignored, while `""` causes no mbox file to be ignored.

Parameter:	ignore_probs
Type:	float_rng
Default:	0.3..0.7
Restrictions:	both ends of the range must be between 0.0 and 1.0
Description:	Words whose probabilities lie in this range are not stored in the Bayesian database during initialization. If the Bayesian database (`word_file`) gets too large, try widening this range.
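
A minimal sketch of this pruning step follows. The helper name and sample words are made up, and treating both ends of the range as inclusive is an assumption.

```ruby
# Sketch: drop words whose probability falls inside the ignore_probs range.
# Assumption: both endpoints of the range are inclusive.
def prune_neutral_words(word_probs, ignore_probs = 0.3..0.7)
  word_probs.reject { |_word, prob| ignore_probs.include?(prob) }
end

sample = { 'viagra' => 0.999, 'meeting' => 0.02, 'today' => 0.48, 'offer' => 0.71 }
p prune_neutral_words(sample).keys   # ["viagra", "meeting", "offer"]
```

Widening the range (for example to `0.2..0.8`) removes more near-neutral words and so shrinks the database.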

Parameter:	ignore_words
Type:	regex
Default:	/^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|N[DS]T|-?\d{4})$|:/x
Restrictions:	Cannot be //
Description:	All good/bad words that match this regular expression are ignored during initialization and not stored in the Bayesian database. Note that by default, we ignore month names and local timezones, since they could affect the accuracy of the filter, especially if less than a year of spam/non-spam is used to initialize the filter. Also, we ignore all words that have a colon since message headers and CSS parameters in HTML message bodies can mislead the filter. Setting this parameter to `""` will cause all words to be used during initialization (i.e., no words will be ignored).
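
The effect of the default pattern (written here with plain `|` alternation) can be checked directly in Ruby. The helper `ignored?` and the sample words are illustrative only.

```ruby
# The default ignore_words pattern, with plain alternation.
IGNORE_WORDS = /^(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|N[DS]T|-?\d{4})$|:/x

# Hypothetical helper: true if the word would be skipped during initialization.
def ignored?(word)
  !!(word =~ IGNORE_WORDS)
end

%w[Nov NST 2011 -0330 font-size: November lottery].each do |word|
  puts "#{word}: #{ignored?(word) ? 'ignored' : 'kept'}"
end
```

Month abbreviations, the `NST`/`NDT` timezones, four-digit numbers (optionally preceded by a minus sign) and any word containing a colon are ignored; `November` and `lottery` are kept.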

Parameter:	imap_client
Type:	str
Default:	"garfield"
Description:	The name of the machine from which IMAP transfers should take place. Specify the name of the machine on which your default mail spool resides. This prevents you from accidentally transferring non-spam messages to another machine's mail spool. If all your incoming mail is redirected (by `procmail`, for example) to an inbox folder accessible from any machine (e.g., somewhere in your home directory) or if you wish to connect to the IMAP server from any machine, set this parameter to `""`.

Parameter:	imap_inbox
Type:	str
Default:	"inbox"
Description:	The name of the IMAP inbox from which to retrieve messages during transfer/polling.

Parameter:	imap_port
Type:	int
Default:	993
Description:	The port of the IMAP server from which to retrieve messages during transfer/polling. If TLS/SSL encryption is not supported on the IMAP server, then change this to `143`.

Parameter:	imap_server
Type:	str
Default:	"mail.mun.ca"
Description:	The name of the IMAP server from which to retrieve messages during transfer/polling.

Parameter:	imap_timeout
Type:	int
Default:	20
Restrictions:	must be greater than zero
Description:	The length of time, in seconds, to wait for an IMAP connection.

Parameter:	imap_use_ssl
Type:	boolean
Default:	true
Description:	If `true`, use TLS/SSL when connecting to the IMAP server. Otherwise, no encryption will be used.

Parameter:	imap_username
Type:	str
Default:	ENV['LOGNAME']
Description:	The username to use when connecting to the IMAP server.

Parameter:	lock_timeout
Type:	int
Default:	30
Restrictions:	must be greater than zero
Description:	The maximum amount of time, in seconds, to wait for a lock on a file and the maximum amount of time that a lock can be held (unless a non-blocking lock is requested). It's probably best to leave this parameter alone.

Parameter:	log_file
Type:	file
Default:	"log/buryspam.log"
Description:	The name of the log file to which diagnostic messages are written during execution. The log file will be relative to the Bayesian database directory (see `word_file`) if the given file name is not absolute. The log files will be rotated as they reach a certain size (specified by the `log_file_size` configuration parameter). To disable logging, set this parameter to `""` (empty string).

Parameter:	log_file_count
Type:	int
Default:	10
Description:	The number of log files to keep around. Log files are rotated as they reach their maximum size. Increase this parameter to prevent older log files from being deleted.

Parameter:	log_file_size
Type:	size
Default:	"1MiB"
Description:	The maximum size of the log file. If a log file becomes larger it will be rotated by the `Logger` class. See `archive_file_size` for the format of the size parameter.

Parameter:	log_level
Type:	level
Default:	"info"
Description:	The logging threshold. The threshold can be (in increasing order) `debug`, `info`, `warn`, `error` or `fatal`. Diagnostic messages whose levels are at or above the specified threshold will be written to the log file.
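
Ruby's standard `Logger` class takes a file count and a maximum size when it is created, so the four logging parameters could plausibly map onto it as sketched below. The helper `build_logger` and the temporary path are hypothetical; how `buryspam` actually wires these values together is not shown here.

```ruby
require 'logger'
require 'tmpdir'

# Hypothetical helper mapping log_file, log_file_count, log_file_size and
# log_level onto Ruby's standard Logger.
def build_logger(path, count, size_bytes, level_name)
  logger = Logger.new(path, count, size_bytes)   # rotate at size_bytes, keep count files
  logger.level = Logger.const_get(level_name.upcase)
  logger
end

Dir.mktmpdir do |dir|
  logger = build_logger(File.join(dir, 'buryspam.log'), 10, 1024 * 1024, 'info')
  logger.debug('below the threshold, so not written')
  logger.warn('at or above the threshold, so written')
  logger.close
end
```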

Parameter:	mail_spool
Type:	file
Default:	ENV['MAIL']
Description:	The name of the local spool file. Non-spam messages are stored in this file during `--bulk` IMAP message transfers. If `procmail` is not available, this file is also used to store non-spam messages during `--transfer` and `--poll` IMAP connections. The file should be specified with an absolute pathname. A non-absolute spool file is assumed to be relative to the `word_file` directory.

Parameter:	max_msg_transfer
Type:	int
Default:	-1
Description:	The maximum number of messages to transfer from the IMAP server during a single IMAP session. Any remaining messages will be left on the IMAP server until the next connection. The parameter is used on the right-hand side of a range operator, so `-1` (the default) means that all messages are transferred. Applies to the `--transfer`, `--bulk` and `--poll` modes.
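
The range behaviour can be sketched as follows. In Ruby's `Net::IMAP`, an end value of `-1` in a message range is sent as `*` (the last message), so `1..-1` selects every message. The helpers below are hypothetical and only render the range as an IMAP sequence set; that `buryspam` builds its range starting at message 1 is an assumption.

```ruby
# Sketch: an inclusive message range starting at 1 (assumed starting point).
def transfer_range(max_msg_transfer)
  1..max_msg_transfer
end

# Render the range as an IMAP sequence set, where "*" means the last message.
def sequence_set(range)
  last = range.end == -1 ? '*' : range.end.to_s
  "#{range.begin}:#{last}"
end

puts sequence_set(transfer_range(-1))   # 1:*
puts sequence_set(transfer_range(50))   # 1:50
```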

Parameter:	min_disk_free
Type:	size
Default:	"1MiB"
Description:	The minimum amount of disk space that should be free before proceeding. Anything less than this will cause termination immediately after startup. Fetching messages during polling will not occur if there is not sufficient disk space available. See `archive_file_size` for the format of this parameter.

Parameter:	min_word_num
Type:	int
Default:	5
Restrictions:	must be greater than or equal to zero
Description:	The minimum number of times a word must be encountered before being stored in the Bayesian database during initialization. Bad/Good words are multiplied by their respective weights before this test is performed.
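
To show how the weights, the minimum count and the probability limits might interact, here is a sketch based on the per-word formula from Paul Graham's essay. It is not taken from the `buryspam` source: the helper name, the argument names and the choice to compare the sum of the weighted counts against `min_word_num` are all assumptions, and Version 2 modifies the original algorithm.

```ruby
# Documented defaults for the parameters used in this sketch.
DEFAULTS = {
  good_init_weight: 1, bad_init_weight: 1,
  good_prob: 0.001, bad_prob: 0.999,
  min_word_num: 5, precision: 3
}

# Sketch only: Graham-style word probability with weights and clamping.
# Returns nil when the word is too rare to be stored.
def word_probability(good_count, bad_count, n_good_msgs, n_bad_msgs, cfg = DEFAULTS)
  g = good_count * cfg[:good_init_weight]
  b = bad_count * cfg[:bad_init_weight]
  total = g + b
  return nil if total.zero? || total < cfg[:min_word_num]

  good_frac = [1.0, g.to_f / n_good_msgs].min
  bad_frac  = [1.0, b.to_f / n_bad_msgs].min
  prob = bad_frac / (good_frac + bad_frac)

  # Clamp to [good_prob, bad_prob], then round to `precision` digits.
  prob = [[prob, cfg[:good_prob]].max, cfg[:bad_prob]].min
  prob.round(cfg[:precision])
end

p word_probability(0, 50, 100, 100)   # 0.999
p word_probability(1, 1, 100, 100)    # nil
```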

Parameter:	missed_spam_base
Type:	str
Default:	"missed-spam"
Description:	The destination base name to use when moving a file containing spam not caught by the filter (see `missed_spam_file`). Note: `missed_spam_base` is a string, not a file — during initialization the missed spam file will be moved to the first directory in the `bad_dirs` directory list.

Parameter:	missed_spam_file
Type:	file
Default:	""
Description:	Any spam messages that were missed by the filter can be saved to this file using a mail user agent. During (re)initialization of the filter, this file will automatically be moved to the first spam directory (indicated by the `bad_dirs` directory list) with an appropriate date suffix appended to its base name (see `missed_spam_base`). The pathname should be absolute. If it isn't, then it is assumed to be located in the `word_file` directory.

Parameter:	num_word_samples
Type:	int
Default:	15
Restrictions:	must be greater than zero
Description:	Number of discriminating word samples to extract from a message during filtering. Discriminating words are those whose probabilities are furthest from 0.5 in the Bayesian database: good (non-spam) words are closer to zero and bad (spam) words are closer to one. If the number of discriminating word samples exceeds this parameter's value, then good/bad words are proportionally selected to satisfy the sample selection.
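
The idea of "furthest from 0.5" can be sketched in a few lines. This simplified version just takes the most extreme probabilities and leaves out the proportional good/bad selection described above; the helper name is hypothetical.

```ruby
# Sketch: pick the n probabilities furthest from 0.5.
# The proportional balancing of good and bad words is omitted.
def discriminating_samples(probs, n = 15)
  probs.sort_by { |prob| -(prob - 0.5).abs }.first(n)
end

p discriminating_samples([0.5, 0.99, 0.4, 0.01, 0.7], 3).sort   # [0.01, 0.7, 0.99]
```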

Parameter:	poll_timer
Type:	timer
Default:	"15min"
Restrictions:	must be greater than zero
Description:	During `--poll` mode, the `poll_timer` configuration parameter specifies how long to wait between checking the IMAP server for new messages. The same length of time will be used between retries if the IMAP host is unreachable during connects. The format is an integer followed by "hr" or "min", e.g., `"20min"`.
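
A small sketch of how a timer value could be converted to seconds. The helper `timer_seconds` is hypothetical and accepts only the integer-plus-unit form described above.

```ruby
# Hypothetical helper: convert a timer such as "15min" or "1hr" to seconds.
def timer_seconds(spec)
  match = /\A(\d+)(hr|min)\z/.match(spec)
  raise ArgumentError, "invalid timer: #{spec.inspect}" unless match
  seconds = match[1].to_i * (match[2] == 'hr' ? 3600 : 60)
  raise ArgumentError, 'timer must be greater than zero' unless seconds > 0
  seconds
end

puts timer_seconds('15min')   # 900
puts timer_seconds('1hr')     # 3600
```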

Parameter:	precision
Type:	int
Default:	3
Restrictions:	must be greater than zero
Description:	The number of digits after the decimal place to use/store when processing floating point number probabilities during initialization and filtering.

Parameter:	procmail
Type:	file
Default:	'/usr/bin/procmail'
Description:	The full pathname of the `procmail` executable. The `procmail` executable is used to filter mail that is received via IMAP during the `--transfer` and `--poll` modes. If this parameter is set to an empty string or to a non-executable filename, then procmail will not be used during filtering of messages. This means that all messages will essentially be `--bulk` filtered, with spam going to the spam file specified by the `spam_file` configuration parameter, non-spam being stored in the spool and no other filtering taking place. Note that the full path must end in '`procmail`', otherwise error recovery in the event of misconfiguration may not be successful.

Parameter:	spam_file
Type:	file
Default:	""
Description:	This is the name of the file in which all spam is stored during `--bulk` IMAP message transfers. The full pathname of the spam file should be specified; otherwise, it will be relative to the `word_file` directory. This spam file is also used for `--transfer` and `--poll` IMAP message transfers if the `procmail` configuration parameter is set to a non-executable.

Parameter:	spam_file_size
Type:	size
Default:	"10MiB"
Description:	The maximum size of a spam file (the name of the spam file is deduced from the `.procmailrc` file). If a new spam message would cause this maximum size to be exceeded, the spam file is renamed with a date suffix and a new spam file will be created by `procmail`. If the size is less than or equal to zero, each spam message will be saved in its own file by `procmail`. See `archive_file_size` for the format of the size parameter.

Parameter:	spam_threshold
Type:	float
Default:	0.5
Restrictions:	must be between 0.0 and 1.0, inclusive
Description:	If the combined Bayesian probability of a message's word samples exceeds this probability, then the message is spam. Otherwise, it is non-spam.
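
For reference, the combining formula from Paul Graham's essay is sketched below. Whether `buryspam` Version 2 uses exactly this formula is an assumption; the helper names are hypothetical.

```ruby
# Graham's combined probability: prod(p) / (prod(p) + prod(1 - p)).
def combined_probability(probs)
  spam     = probs.inject(1.0) { |acc, prob| acc * prob }
  non_spam = probs.inject(1.0) { |acc, prob| acc * (1.0 - prob) }
  spam / (spam + non_spam)
end

# A message is spam when the combined probability exceeds spam_threshold.
def spam?(probs, spam_threshold = 0.5)
  combined_probability(probs) > spam_threshold
end

p spam?([0.999, 0.98, 0.2])    # true
p spam?([0.001, 0.02, 0.6])    # false
```

Raising `spam_threshold` makes the filter more conservative: fewer messages are marked as spam, at the cost of letting more spam through.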

Parameter:	test_msg_urls
Type:	boolean
Default:	false
Description:	If this parameter is set to `true` and the incoming message passes all other spam tests during filtering, then visit websites indicated by URLs found in incoming messages. The words on these websites are then added to the collection of words used for Bayesian analysis. This parameter is `false` by default because it can be a bit of a drain on network resources. Also, visiting URLs in messages occurs only during filtering and not during initialization. As a result, turning on this feature may result in false positives if the URLs in legitimate messages represent websites which have a lot of spammy words.

Parameter:	test_msg_urls_timeout
Type:	int
Default:	5
Restrictions:	must be greater than zero
Description:	This parameter specifies the maximum time in seconds to take when visiting the URLs in a message (see `test_msg_urls`). If it takes longer than this time to visit all the websites indicated by all the URLs in the message, then only the web content downloaded before the timeout occurred will be used.

Parameter:	test_octet_samples
Type:	boolean
Default:	true
Description:	Determine whether or not to test messages' IP addresses for blacklisted octets during filtering. During initialization, the IP addresses from which good and bad messages are received are stored in the `word_file`. During filtering, if the incoming message passes the traditional Bayesian test and `test_octet_samples` is `true`, then IP addresses of the message will be analyzed. Each octet that matches the IP octets from which a previously received spam message originated will cause one good discriminating word to be removed and a fictitious bad word to be introduced into the discriminating word samples. The Bayesian calculation is redone on the new list of word probabilities. This feature can help catch spam that would normally slip through the filter.

Parameter:	undecodable
Type:	regex
Default:	/^(image|application)/i
Description:	Message parts whose `Content-type:` matches this regular expression are not decoded during the internal processing of MIME messages. Instead, these message parts will be replaced with the string "`<...>`". Note that the original message is not altered; a copy of the message is modified instead. Setting this parameter to "`//`" inhibits decoding entirely (essentially suppressing all message bodies, including `text/plain` types), while setting it to `""` will cause all content types to be decoded. For example, to decode all message parts (images, attachments, etc.) in `file`, use: `$ ruby buryspam.rb -o undecodable= --decode file` Note that this may generate a lot of binary output.
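
The placeholder substitution can be sketched as follows, using the default pattern with plain `|` alternation. The helper `part_body` is hypothetical and stands in for whatever MIME handling the script performs.

```ruby
# The default undecodable pattern.
UNDECODABLE = /^(image|application)/i

# Hypothetical helper: return the placeholder for undecodable content types,
# otherwise return the (already decoded) body unchanged.
def part_body(content_type, body)
  content_type =~ UNDECODABLE ? '<...>' : body
end

puts part_body('image/png', "\x89PNG")        # <...>
puts part_body('text/plain', 'Hello there')   # Hello there
```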

Parameter:	verbose_hdr
Type:	boolean
Default:	false
Description:	By default, `buryspam` will just add an `X-Buryspam-Spam:` header to messages it filters. If this parameter is `true`, then additional diagnostic header lines are appended to each filtered message header.

Parameter:	word_file(Compulsory)
Type:	file
Default:	""
Description:	The file name of the Bayesian database. This database is created during initialization and read from during filtering. In addition to word probabilities and the initialization timestamp, this file stores the IP addresses from which spam and non-spam messages were delivered. This parameter is compulsory and must be specified with a full (absolute) pathname. A leading `"~"` character may be used to signify a home directory.

Parameter:	word_length
Type:	int_rng
Default:	3..25
Restrictions:	both ends of the range must be greater than zero
Description:	Words whose lengths lie outside this range will not be included in the Bayesian database during initialization. If the Bayesian database (`word_file`) is too large, consider narrowing this range.

Parameter:	word_regex
Type:	regex
Default:	%r{[-a-z0-9$_'.!:]+}i
Restrictions:	cannot be empty
Description:	This regular expression is used to break a message into its constituent words during initialization of the Bayesian database and during spam filtering. The regular expression should not contain any capturing parentheses.
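
The sketch below applies the default `word_regex` together with the default `word_length` range. The helper `tokenize` is hypothetical. It also suggests why capturing parentheses should be avoided: Ruby's `String#scan` returns the captured groups rather than the whole match when the pattern contains groups.

```ruby
WORD_REGEX  = %r{[-a-z0-9$_'.!:]+}i   # default word_regex
WORD_LENGTH = 3..25                    # default word_length

# Hypothetical helper: split text into words and drop those whose length
# falls outside the word_length range.
def tokenize(text)
  text.scan(WORD_REGEX).select { |word| WORD_LENGTH.include?(word.length) }
end

p tokenize("Go to example.com, it's FREE!")   # ["example.com", "it's", "FREE!"]
```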

Potential Problems

In the interest of full disclosure, here are some issues associated with buryspam:

During --poll mode, the IMAP password is always stored in memory. This may present a security risk. Note that the password is never intentionally written to disk (swap partitions notwithstanding).
If buryspam dies while polling or if the machine on which it is running is restarted, then no incoming mail will be received until buryspam is restarted with the --poll option.
Because of the use of lock files, there is a possibility of deadlock occurring. Hopefully, that possibility should be very remote. If the word_file cannot be saved during initialization due to a timeout waiting for a lock and no other buryspam processes are running, then it is possible that an errant lock may still be held on the word file. Renaming or removing the word file and then re-initializing should correct the problem (but make sure that no other buryspam process will attempt to access the missing word file while it is being regenerated).
Log files may roll over a little early when more than one buryspam process is run. As a result, a log file will sometimes be rolled over before it reaches the size specified by the log_file_size configuration parameter. Also, two concurrent buryspam processes may step on each other's log entries.
The memory requirements for buryspam are a bit excessive during initialization and filtering. Having multiple copies of buryspam running on a single machine simultaneously may tax its memory resources.
It is not possible to run two transfer-related buryspam processes simultaneously. This means, for example, that if you are running a buryspam polling daemon, you cannot subsequently run another polling daemon or invoke buryspam using --transfer or --bulk. This is to prevent the competing IMAP clients from putting the IMAP server's inbox in an inconsistent state. Ideally, concurrent buryspam transfer processes should be permitted if they are accessing different mailboxes and/or different IMAP servers.
After transferring messages from an IMAP server, a "Connection reset by peer during disconnect (warning ignored)" warning may sometimes occur. This only appears to happen when connecting to garfield's IMAP server from garfield itself. I'm not sure what the ramifications of this are; at worst, it seems, messages may be left unexpunged from the IMAP mail server. Running a transfer mode (--transfer, --poll or --bulk) on the IMAP server itself is therefore prohibited.

Technical Documentation

Technical documentation is available in rdoc format. (This documentation is intended for those who wish to modify the buryspam code and will likely not be useful to end users.)

Donald Craig (donald@mun.ca)

Last modified: November 18, 2011 17:50:51 NST
