`buryspam`

A Bayesian Spam Filter

Introduction

This page describes how to use the buryspam.rb script. For an informative and informal discussion of the practice and theory behind Bayesian spam filtering, please see Paul Graham's A Plan For Spam web page, which provided much of the inspiration for the implementation of this program.

Note that both the script and this documentation are very much a work in progress. Please report any problems to donald@cs.mun.ca.

Local Administrative Notes

The buryspam.rb script was developed using version 1.8.0 of ruby (Dated 2003-08-04). This version of ruby is currently installed on garfield. Other versions of ruby may or may not work as expected.

If the grad partition becomes inaccessible, mail delivery may be adversely affected if you call the script on the grad partition from your ~/.procmailrc file. procmail should try to recover messages that could not be processed due to a missing filter, but I've experienced mixed results in this respect.

To be safe, you may want to copy the buryspam.rb script to your own directory instead of invoking it directly from /users/cs/grad/donald/pub/bin. This way, if the grad partition becomes inaccessible, your copy of the buryspam.rb script will still be accessible.

Is `buryspam.rb` appropriate for you?

Note that if you use an e-mail client that transfers messages off of garfield (using, for example, POP or IMAP), then the buryspam.rb filtering script may not be appropriate for you. This is because the script relies on having a collection of good and bad messages stored on the server for initialization purposes. It may be possible to use buryspam.rb on the client side to initialize the filter but the file resulting from the initialization would have to be transferred manually to the server. This may be more trouble than it's worth. Also, note that the buryspam.rb filter requires that all messages be stored in Unix mbox format.

buryspam.rb is more suited to those who read their e-mail using client software directly on the mail server itself (e.g. pine, mutt, elm etc.). You also have to be willing to modify configuration files and your .procmailrc file.

Getting started

Move legitimate and spam messages to separate directories: In order for Bayesian filtering to be effective, one must have an existing collection of legitimate e-mail (this includes messages from mailing lists and personal mailings -- both from you and to you) and also a collection of spam messages. Typically, a few thousand of each is required for accurate results, but be careful not to use too many, otherwise the generation of the word probability file may consume too much time and resources. The folders or mboxes containing legitimate messages and spam messages should be placed in separate directories. For example, it is common to keep all legitimate messages in your ~/mail directory. Spam messages can be kept in a subdirectory of this directory (e.g. ~/mail/junk).
Create a ~/.buryspamrc file: This file contains all the configuration options related to the bayesian filtering. Only three parameters must be set explicitly. You must set the good_dirs parameter to the directory containing the legitimate mail. If you have more than one directory that contains legitimate messages, set good_dirs to a comma-delimited list of directory names. Next, you must set the bad_dirs parameter to the directory containing all the spam mail. As with the good_dirs parameter, you can specify a comma-separated list of such directories. Last, you must set the word_file parameter to the name of the file that will be used to store the word probability database (which is generated automatically below). Ideally, you should specify a file that is in a subdirectory of your ~/mail directory (e.g. ~/mail/lib). Make sure that you create this directory if it doesn't already exist. It is important to specify the full path name of the good/bad directories and the word database file. You may use the tilde ~ character to represent your home directory. All three of these parameters are strings and therefore must be delimited by double-quotes in the rc file. For example, the following minimal ~/.buryspamrc file indicates that legitimate messages are stored in the ~/mail and ~/mail/old directories, a collection of spam messages is stored in the ~/mail/junk directory, and the word probability database file is ~/mail/lib/words:
```
          good_dirs = "~/mail,~/mail/old"
          bad_dirs  = "~/mail/junk"
          word_file = "~/mail/lib/words"
```
Again, remember that the directory containing the word probability database file (~/mail/lib in the example above) must be created if it doesn't exist. The actual word file itself will be generated in the next step.
Run buryspam.rb --init: When supplied with the --init option, the script will scan through all the legitimate message and spam directories specified in the configuration file and count the messages and word frequencies present in each. Once all the messages have been processed, the bayesian probabilities are calculated for each word and the word probability database (specified by the word_file configuration parameter above) is populated. This process will take some time (up to several minutes). If it is taking too long, you may want to terminate the process (by pressing Ctrl-C) and archive some of the older messages to other directories that will not be processed by the script.
Create/update your ~/.procmailrc file: Make sure you have a ~/.procmailrc file whose contents resemble the following:
```
	SHELL=/bin/sh
	PATH=/bin:/usr/bin
	LOGFILE=$HOME/mail/.procmail.log
	MAILDIR=$HOME/mail

	:0fw
	| /users/cs/grad/donald/pub/bin/buryspam.rb --filter

	:0:
	* ^X-Bayesian-Spam: Yes
	junk/spam
```
Note that the junk directory specified above in the ~/.procmailrc file must be created in your ~/mail directory (if it doesn't already exist).

That's it. After this, all messages you receive will be sent through the bayesian filter. If it detects a spam message it will add the header line X-Bayesian-Spam: Yes to the message. Otherwise, the header line X-Bayesian-Spam: No is added. The ~/.procmailrc file will then deposit any messages with the spam header to the ~/mail/junk/spam folder and leave all your legitimate messages in your default mail spool. Extra header lines containing the actual bayesian value calculated for the message as well as the words that sparked the bayesian filter's "interest" can also be generated by setting the verbose_hdr parameter in your ~/.buryspamrc file to true.
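
To see how a particular message was classified, the header can be read back out of the message. The following Ruby sketch is only an illustration and is not part of buryspam.rb; the helper name `bayesian_verdict` and the sample message are made up for the example.

```ruby
# Return the value of the X-Bayesian-Spam header ("Yes" or "No"),
# or nil when the message carries no such header.
# NOTE: bayesian_verdict is a hypothetical helper, not part of buryspam.rb.
def bayesian_verdict(raw_message)
  # The header block ends at the first blank line; the body is ignored.
  headers = raw_message.split(/\r?\n\r?\n/, 2).first.to_s
  match = headers.match(/^X-Bayesian-Spam:[ \t]*(\S+)/i)
  match && match[1]
end

sample = "From: someone@example.com\nX-Bayesian-Spam: Yes\n\nBuy now!\n"
puts bayesian_verdict(sample)
```

Only the header block is searched, so a message body that happens to quote the header line does not affect the answer.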

Important Notes

False positives (legitimate messages which get filtered as spam), while very, very rare, can still occur. As a result, it is a good idea to occasionally go through your spam folder and see if any legitimate messages were mis-filtered. If you do find such mis-filtered messages, move them back to a folder in one of your legitimate mail directories and re-initialize the word probability database (i.e. re-run buryspam.rb --init).
If you had a reasonable collection of spam messages when you started bayesian filtering, it should not be necessary to regenerate the word probability database too often. As a rule of thumb you may want to regenerate the database whenever new variants of spam messages start to seep through the filter. Because the regeneration of the word probability database can consume significant resources, it is a good idea to do so only during off-peak hours.
Generally speaking, you should not delete spam messages. They can be used again the next time you regenerate your word database. The more spam you have, the more resilient the filter becomes the next time you regenerate the word probability database.
In order to make subsequent regenerations of the word database fast, the program generates a cache of word counts for each mail folder that it encounters during its initial scan. By default, these files are located in a subdirectory of the directory containing the word database file. During subsequent reinitializations of the word probability database, these caches are consulted and used if the corresponding mail folder has not changed.
The filter will automatically attempt to rename spam files that have accumulated about five megabytes of data. Doing so makes re-initialization of the word database more efficient, since the word count of the newly renamed spam file only has to be done once more. After that, the cached results can be used as long as the contents of the renamed spam file remain the same. New spam messages will automatically be deposited in a new spam file.
If "bad things" are happening to your messages (i.e. legitimate messages are not getting through), you may want to disable the bayesian filter by temporarily commenting out the appropriate lines in the ~/.procmailrc file (you may also temporarily rename your ~/.procmailrc file). You should then check your ~/mail/.procmail.log file for any errors. It's always a good idea to check your ~/mail/.procmail.log for errors and warning messages.
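
The cache check mentioned in the notes above can be pictured as follows. This is a minimal sketch, assuming that a digest of each mail folder is recorded alongside its cached word counts (the Potential Improvements section mentions md5sums). The helper name `cache_fresh?` and the way the digest is kept are hypothetical and not taken from the script.

```ruby
require "digest/md5"
require "tmpdir"

# True when the mail folder still matches the digest recorded with its cache.
# NOTE: cache_fresh? is a hypothetical helper used only for illustration.
def cache_fresh?(mbox_path, recorded_digest)
  File.exist?(mbox_path) &&
    Digest::MD5.file(mbox_path).hexdigest == recorded_digest
end

Dir.mktmpdir do |dir|
  mbox = File.join(dir, "spam")
  File.write(mbox, "From a@example.com Mon Mar 28 12:00:00 2005\n\nhello\n")
  digest = Digest::MD5.file(mbox).hexdigest

  puts cache_fresh?(mbox, digest)   # folder unchanged, cache can be reused

  File.write(mbox, "one more message\n", mode: "a")
  puts cache_fresh?(mbox, digest)   # folder changed, words must be recounted
end
```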

Configuration File

The format of each line of the configuration file, ~/.buryspamrc, is:

parameter_name = parameter_value

Only the first three parameters given below are compulsory; the remainder are optional and can be used to fine-tune the filter. The defaults, many of which are from Paul Graham's web page mentioned above, are usually sufficient.

The buryspam.rb script tries to be helpful in diagnosing errors in the rc file. If there are any errors in the configuration file, the script will simply pass any messages through unfiltered and then terminate. If the script is being run from your ~/.procmailrc file, errors will be logged to your ~/mail/.procmail.log file (provided LOGFILE is set appropriately in your ~/.procmailrc file).
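
To make the line format concrete, here is a small Ruby sketch that splits `parameter_name = parameter_value` lines into a hash of raw strings. It is not how buryspam.rb itself reads the file (the script evaluates the values as ruby, see Potential Problems/Bugs); the method name `parse_rc` and the error handling are assumptions made for the example.

```ruby
# Split "name = value" lines into a Hash of raw (unevaluated) strings.
# NOTE: parse_rc is a hypothetical helper; buryspam.rb does not work this way.
def parse_rc(text)
  text.each_line.with_object({}) do |line, params|
    line = line.strip
    next if line.empty?                       # blank lines separate groups
    name, value = line.split("=", 2).map(&:strip)
    if value.nil? || name.empty?
      raise ArgumentError, "malformed line: #{line}"
    end
    params[name] = value
  end
end

rc = <<~RC
  good_dirs = "~/mail,~/mail/old"
  bad_dirs  = "~/mail/junk"
  word_file = "~/mail/lib/words"
RC

params = parse_rc(rc)
# Drop the surrounding double quotes, then split the comma separated list.
p params["good_dirs"].delete('"').split(",")
```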

The following options are supported by the buryspam.rb script in the ~/.buryspamrc configuration file.

	`good_dirs`
Compulsory?	Yes
Type	String
Default	`""`
Description	Comma separated string of directories containing legitimate e-mail mboxes. The directories must be fully specified (i.e. no relative pathnames), but the `~` character may be used to represent a home directory.

	`bad_dirs`
Compulsory?	Yes
Type	String
Default	`""`
Description	Comma separated string of directories containing spam e-mail mboxes. The directories must be fully specified (i.e. no relative pathnames), but the `~` character may be used to represent a home directory.

	`word_file`
Compulsory?	Yes
Type	String
Default	`""`
Description	Full pathname of the word database to use. The `~` character may be used to represent a home directory.

	`cache_dir`
Compulsory?	No
Type	String
Default	`""`
Description	This parameter determines the directory for the cache files created during initialization of the filter. The full path name must be specified and the parent directory must exist. The `~` character may be used to represent a home directory.

	`good_init_weight`
Compulsory?	No
Type	Integer
Default	`2`
Description	Word counts occurring in the messages in the `good_dirs` directory are multiplied by this amount during initialization of the filter. Setting this too high may result in more false negatives.

	`bad_init_weight`
Compulsory?	No
Type	Integer
Default	`1`
Description	Word counts occurring in the messages in the `bad_dirs` directory are multiplied by this amount during initialization of the filter. Setting this too high may result in more false positives.

	`good_select_weight`
Compulsory?	No
Type	Integer
Default	`2`
Description	During filtering, good and bad words are extracted in the same ratio as the good_select_weight to bad_select_weight (described below) ratio. This parameter biases the filter in favour of not treating messages as spam. Setting this too high may result in more false negatives. Setting this parameter to zero will turn off weighted selection.

	`bad_select_weight`
Compulsory?	No
Type	Integer
Default	`1`
Description	During filtering, good and bad words are extracted in the same ratio as the good_select_weight (described above) to bad_select_weight ratio. This parameter biases the filter in favour of treating messages as spam. Setting this too high may result in more false positives. Setting this parameter to zero will turn off weighted selection.

	`ignore_probs`
Compulsory?	No
Type	Float Range
Default	`0.3..0.7`
Description	Do not store words in the probability database whose probabilities lie in this range. For example, the default setting, 0.3..0.7, will cause the initialization routine to only write words to the database whose probabilities are between 0 and 0.3 or between 0.7 and 1. This may help speed up the loading of the word database during filtering as the word database will be (slightly) smaller.

	`ignore_words`
Compulsory?	No
Type	Regexp
Default	`//`
Description	This parameter is used to determine which words to ignore during initialization. If a word matches this regular expression, then it is not written to the probability database. This parameter can be useful if spam starts to creep into a mailing list and you want to ignore words in the message's header that erroneously push the message into the non-spam category. By default, this parameter matches nothing.

	`ignore_mboxes`
Compulsory?	No
Type	Regexp
Default	`//`
Description	During initialization, this parameter is used to determine which mailboxes to ignore when doing word counting and initial probability calculations. If the name of the mailbox matches the regular expression stored in this parameter, then the mailbox is ignored during initialization of the probability database. By default we do not ignore any mboxes.

	`min_word_num`
Compulsory?	No
Type	Integer
Default	`5`
Description	This parameter determines how many times a word must be encountered before it can be considered for inclusion in the word database as either a good or bad word. This parameter takes into account the respective weightings above. If this parameter is set too small, the word database will grow larger, possibly negatively impacting run-time performance. If it is set too high, the number of words stored in the database will decrease, possibly negatively impacting accuracy.
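
As an illustration of how the init weights and `min_word_num` interact, here is a sketch of the word probability calculation from Paul Graham's A Plan For Spam, with his constants replaced by the parameters in this section (`good_prob` and `bad_prob` are described further below). Whether buryspam.rb computes its probabilities in exactly this way is an assumption; the method name and keyword arguments are hypothetical.

```ruby
# Probability that a word indicates spam, following the formula in
# Paul Graham's "A Plan For Spam". Returns nil for words seen too rarely.
# ngood and nbad are the numbers of good and bad messages (both > 0).
# NOTE: a sketch only; buryspam.rb may differ in its details.
def word_probability(good_count, bad_count, ngood, nbad,
                     good_init_weight: 2, bad_init_weight: 1,
                     min_word_num: 5, good_prob: 0.01, bad_prob: 0.99)
  g = good_init_weight * good_count
  b = bad_init_weight * bad_count
  return nil if g + b < min_word_num          # not enough evidence yet

  good_ratio = [1.0, g.to_f / ngood].min
  bad_ratio  = [1.0, b.to_f / nbad].min
  (bad_ratio / (good_ratio + bad_ratio)).clamp(good_prob, bad_prob)
end

p word_probability(0, 10, 1000, 1000)   # seen only in spam
p word_probability(3, 0, 1000, 1000)    # seen only in legitimate mail
p word_probability(1, 1, 1000, 1000)    # weighted count below min_word_num
```

With the default weights, a word seen three times in legitimate mail already reaches the `min_word_num` limit of five, while a word seen only in spam needs five occurrences.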

	`word_samples`
Compulsory?	No
Type	Integer
Default	`15`
Description	The number of words that the filter uses in calculating the probability that the message is spam.

	`verbose_hdr`
Compulsory?	No
Type	Boolean
Default	`false`
Description	This parameter generates two extra header lines for each message that it processes. The X-Bayesian-Value: header line will indicate the probability that the message is spam and the X-Bayesian-Words: header line will contain the list of words that were used to judge the legitimacy of the message.

	`spam_threshold`
Compulsory?	No
Type	Float
Default	`0.9`
Description	Any message whose bayesian probability is greater than this number is classified as spam. All other messages are treated as non-spam.
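
The way `word_samples` and `spam_threshold` fit together can be sketched with the combining rule from Paul Graham's A Plan For Spam: pick the words whose probabilities lie furthest from 0.5, then combine them. That buryspam.rb uses this exact rule is an assumption, and the method name `spam_probability` is hypothetical.

```ruby
# Combine per-word probabilities into one message probability using the
# rule from "A Plan For Spam". word_probs holds values strictly between
# 0 and 1 (unknown words would contribute default_prob).
# NOTE: a sketch under stated assumptions, not the script's own code.
def spam_probability(word_probs, word_samples: 15)
  # Keep only the most "interesting" words: those furthest from 0.5.
  interesting = word_probs.sort_by { |prob| -(prob - 0.5).abs }
                          .first(word_samples)
  spam = interesting.inject(1.0) { |acc, prob| acc * prob }
  ham  = interesting.inject(1.0) { |acc, prob| acc * (1.0 - prob) }
  spam / (spam + ham)
end

spam_threshold = 0.9
score = spam_probability([0.99, 0.99, 0.01, 0.4])
puts score > spam_threshold ? "X-Bayesian-Spam: Yes" : "X-Bayesian-Spam: No"
```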

	`poison_threshold`
Compulsory?	No
Type	Float
Default	`2.0`
Description	If Bayesian analysis determined that the message was not spam, but the ratio of bad words to good words at the extrema is greater than this number, the message will be classified as spam. This should reduce the false negatives that occur when spammers deliberately attempt to poison filters by including a lot of superfluous words. When calculating the ratio, the `remove_dups` parameter is consulted to determine whether duplicate good and/or bad words should be counted when determining the ratio. Also, we require that the number of words used to calculate the ratio be greater than or equal to `word_samples`. To disable this feature, set `poison_threshold` to 0.

	`bad_prob`
Compulsory?	No
Type	Float
Default	`0.99`
Description	This is the probability assigned to a word that has been deemed to be an extremely strong indicator of spam.

	`good_prob`
Compulsory?	No
Type	Float
Default	`0.01`
Description	This is the probability assigned to a word that has been deemed to be an extremely strong indicator of non-spam.

	`default_prob`
Compulsory?	No
Type	Float
Default	`0.4`
Description	This parameter determines what probability the filter assigns to words that it has not encountered before in incoming messages. By setting this parameter low, you are giving the message sender the benefit of the doubt.

	`archive_file`
Compulsory?	No
Type	String
Default	`""`
Description	The name of the file which should hold verbatim copies of all incoming messages. This parameter was useful only for debugging purposes and should generally not be used.

	`word_length`
Compulsory?	No
Type	Integer Range
Default	`3..20`
Description	Words outside this range are not considered for inclusion in the word probability database. For example, with the default setting of 3..20, words that consist of only one or two characters or words that are greater than twenty characters are rejected for inclusion in the word database.

	`word_regex`
Compulsory?	No
Type	Regexp
Default	`/[-A-Z0-9$'\x92\\_\[!]+/i`
Description	The regular expression used to determine what constitutes a word in a message.
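
The combined effect of `word_regex` and `word_length` can be sketched as follows. The regular expression here is a simplified version of the default: it leaves out the `\x92` byte, which current versions of ruby reject in a UTF-8 source file. The method name `tokenize` is hypothetical, and the script's real tokenizer may do more than this.

```ruby
# Extract candidate words from text: everything matching word_regex
# whose length falls inside word_length.
# NOTE: a simplified sketch; the default regex also contains \x92.
def tokenize(text, word_regex: /[-A-Z0-9$'_\[!]+/i, word_length: 3..20)
  text.scan(word_regex).select { |word| word_length.cover?(word.length) }
end

p tokenize("FREE!!! Click here: it is $99 only")
```

In this example the words "it" and "is" are dropped because they are shorter than three characters, while "FREE!!!" and "$99" survive as single words because `!` and `$` are part of the character class.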

	`decode`
Compulsory?	No
Type	Regexp
Default	`/\.(te?xt|rtf|html?|scr|pif|exe|com|wpd|doc|xls|ppt|zip?)$/i`
Description	During initialization and filtering, the script will attempt to decode encoded message attachments for word extraction. This can increase the detection of spam messages which, for example, are base64 encoded. This parameter is a regular expression which is tested against the filename of each attachment. If it matches, the attachment is decoded and word extraction is performed on the decoded attachment.

	`max_decode_part`
Compulsory?	No
Type	Integer
Default	`1048576`
Description	Encoded message parts that are larger than this configuration setting are not decoded. The specified size is in bytes, so the default setting will prevent encoded message bodies that are larger than one megabyte from being decoded. To decode ALL message parts, regardless of size, set this parameter to -1. To decode no message parts, set it to 0.
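
The interplay of `decode` and `max_decode_part` can be sketched like this for a base64 encoded attachment. The method name `decode_part` is hypothetical, and measuring the size on the encoded text (rather than the decoded result) is an assumption.

```ruby
DECODE = /\.(te?xt|rtf|html?|scr|pif|exe|com|wpd|doc|xls|ppt|zip?)$/i

# Decode a base64 attachment when its filename matches `decode` and it
# is not larger than max_decode_part bytes. Returns nil when skipped.
# NOTE: hypothetical helper; the size is checked on the encoded text.
def decode_part(filename, encoded, decode: DECODE, max_decode_part: 1_048_576)
  return nil unless filename =~ decode
  # -1 means "no limit"; 0 rejects every non-empty part.
  return nil if max_decode_part >= 0 && encoded.bytesize > max_decode_part
  encoded.unpack1("m")                      # "m" is the base64 directive
end

encoded = ["Cheap pills, click now"].pack("m")
p decode_part("offer.txt", encoded)
p decode_part("photo.png", encoded)         # filename does not match
```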

	`replace_8bit`
Compulsory?	No
Type	Boolean
Default	`false`
Description	When set to true, this parameter will replace 8bit characters in (nonspam) text messages with their corresponding 7bit-clean counterparts during filtering. Such 8bit characters (e.g. 0x92 instead of ') are sent by Microsoft Windows MUAs apparently unbeknownst to the sender. This seems to be occurring more frequently.

	`spam_file_size`
Compulsory?	No
Type	Integer
Default	`5242880`
Description	When the spam file becomes greater than this size, the filter attempts to rename it and lets a new one be created. This parameter helps prevent spam files from becoming increasingly larger, thereby slowing down the initialization process. To prevent the renaming of the spam file, set this parameter to a value less than or equal to zero. The program attempts to determine the spam file based upon the contents of the `~/.procmailrc` file.

	`strip_html_comments`
Compulsory?	No
Type	String
Default	`limited`
Description	This parameter determines how HTML comments in messages are handled during initialization and filtering. Comments consist of text delimited by `<!--` and `-->`. This parameter may be set to one of three modes: `all` In this mode, all HTML comments are stripped from the message. `none` In this mode, no HTML comments are stripped from the message. `limited` In this mode, if the `word_regex` regular expression contains the `'-'` character, then words in the comment that are directly adjacent to the beginning or ending `--` HTML comment delimiters are preserved, while other words are dropped. Such words (in conjunction with the `--` characters) can act as useful discriminators. All space-delimited words occurring in HTML comments are deleted.

	`strip_html_tags`
Compulsory?	No
Type	String
Default	`invalid`
Description	During initialization and filtering, this parameter is consulted to determine how to handle HTML tags in HTML messages. The parameter can be set to one of three values: `all` In this mode, all HTML tags are stripped from the message. `none` In this mode, no HTML tags are stripped from the message. `invalid` In this mode, only HTML tags deemed to be invalid are stripped from the message. Using this value can help reduce false negatives by stripping "hidden" words inside tags which could poison bayesian filters.

	`strip_html_brackets`
Compulsory?	No
Type	Boolean
Default	`false`
Description	When an HTML tag or comment is stripped using the `strip_html_tags` and/or `strip_html_comments` configuration parameters respectively, this parameter is consulted to determine whether or not the `<...>` brackets are stripped too. This parameter is set to `false` by default, because spammers use HTML comments/tags to break up words and the resulting fragmented words can be very helpful in isolating spam. By preserving the brackets, we ensure that the words will remain broken up during tokenization of the message.
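
The reasoning behind leaving `strip_html_brackets` off can be seen in a small sketch. It assumes that a stripped comment leaves an empty `<>` pair behind when brackets are preserved; that detail, and the method name `strip_comments`, are assumptions rather than the script's documented behaviour.

```ruby
# Strip HTML comments. When strip_brackets is false, an empty "<>" is
# left in place so that a word split by a comment stays split.
# NOTE: the "<>" placeholder is an assumption made for this sketch.
def strip_comments(html, strip_brackets: false)
  html.gsub(/<!--.*?-->/m) { strip_brackets ? "" : "<>" }
end

text = "V<!-- hidden filler -->iagra"
p strip_comments(text).scan(/[a-z]+/i)                        # fragments
p strip_comments(text, strip_brackets: true).scan(/[a-z]+/i)  # rejoined word
```

With the brackets kept, the tokenizer sees the unusual fragments "V" and "iagra", which rarely occur in legitimate mail and so can act as strong spam indicators.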

	`remove_dups`
Compulsory?	No
Type	String
Default	`"none"`
Description	During filtering, this parameter is used to determine how the filter should handle duplicate words during the evaluation of the message. The parameter can have six values: `all` Duplicate words are dropped during filtering. `none` No duplicate words are dropped during filtering. This means that every word in the message will be considered for evaluation even if it occurs several times. `good` During filtering, only duplicate good words are removed from the entire message. Using this value could decrease the false negatives at the expense of introducing more false positives. `bad` During filtering, only duplicate bad words are removed from the entire message. Using this value could decrease the false positives at the expense of introducing more false negatives. `header` Duplicate words (both good and bad) are removed only from the header of the message during filtering. `body` Duplicate words (both good and bad) are removed only from the body of the message during filtering.

	`epoch`
Compulsory?	No
Type	Date
Default	`January 1, 1970`
Description	This parameter is used during automated diagnostic reporting and represents the start point of the filter testing.

	`seed_interval`
Compulsory?	No
Type	Float
Default	`200.0`
Description	This parameter is used during automated diagnostic reporting and represents how many days to seed the filter during its first initialization.

	`init_interval`
Compulsory?	No
Type	Float
Default	`14.0`
Description	This parameter is used during automated diagnostic reporting and represents the length of the initialization interval, in days, that should be used to initialize the filter starting from the first seeding period and for each test interval thereafter.

	`test_interval`
Compulsory?	No
Type	Float
Default	`14.0`
Description	This parameter is used during automated diagnostic reporting and represents the length of the test interval, in days, that should be used to test the filter. Each test interval is added from the end of the initialization period determined using the `init_interval` parameter, above.

	`slide_epoch`
Compulsory?	No
Type	Float
Default	`0.0`
Description	This parameter determines how many days the epoch should be moved ahead for each initialization during automated diagnostic generation. A value of `0.0` keeps the epoch constant.

	`uniq_serial`
Compulsory?	No
Type	Integer
Default	`4`
Description	During report generation and the renaming of the spam file, serial numbers are used in order to ensure unique file generation. This parameter determines how many digits to use in the serial numbers.
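
How `spam_file_size` and `uniq_serial` might work together when the spam file is renamed can be sketched as follows. The naming scheme (the original name followed by a dot and a zero-padded serial number) and the method name `rotate_spam_file` are assumptions; the script's actual file names may differ.

```ruby
require "tmpdir"

# Rename the spam file once it exceeds spam_file_size bytes, using the
# first unused zero-padded serial number. Returns the new name, or nil
# when nothing was renamed.
# NOTE: the "<name>.<serial>" scheme is an assumption for this sketch.
def rotate_spam_file(path, spam_file_size: 5_242_880, uniq_serial: 4)
  return nil if spam_file_size <= 0           # renaming disabled
  return nil unless File.exist?(path) && File.size(path) > spam_file_size

  serial = 1
  target = nil
  loop do
    target = "#{path}.#{serial.to_s.rjust(uniq_serial, '0')}"
    break unless File.exist?(target)
    serial += 1
  end
  File.rename(path, target)
  target
end

Dir.mktmpdir do |dir|
  spam = File.join(dir, "spam")
  File.write(spam, "x" * 100)
  # A tiny limit is used here so that the example triggers a rename.
  puts File.basename(rotate_spam_file(spam, spam_file_size: 50))
end
```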

Usage

buryspam.rb <option> [mbox files]

For most (but not all) options, if no mboxes are specified on the command line, then input will be taken from stdin.

option can be specified as follows:

	`--initialize`
Argument Options	`none`
Short Form	`-i`
Description	Generate the word probability file. The filter must be initialized once in order to be used. Subsequent initializations may take place at the user's discretion (e.g. whenever a message is misclassified).

	`--filter`
Argument Options	`none`
Short Form	`-f`
Description	Filter the messages using Bayesian filtering. This option implies decoding of the message. The Bayesian filter has to be initialized (using `--initialize`) before filtering can be performed.

	`--decode-only`
Argument Options	`none`
Short Form	`-d`
Description	Decode messages only. No filtering is done and the resulting message is displayed to standard output. Useful for debugging purposes.

	`--colour`
Argument Options	`none`
Short Form	`-c`
Description	Show the messages with the 'interesting' words colour-coded. Bad words are shown in varying degrees of red, while good words are shown with varying degrees of green. This argument implies decoding.

	`--report`
Argument Options	`none`
Short Form	`-r`
Description	Batch filter all messages, then generate a summary report of the filtering results (in the file `summary.rpt`). False positives and false negatives encountered are stored in files `false_positives.*` and `false_negatives.*` (where `*` represents a unique serial number) for subsequent examination by the user.

	`--grep`
Argument Options	`<regex>`
Short Form	`-g`
Description	Display all messages that contain the pattern `regex`. The regular expression must be delimited by `/.../` and `i` `m` and `x` options may be specified at the end of the regular expression. If no mbox files are specified on the command line, then search through all mbox files in the `good_dirs` and `bad_dirs` directories. This option does not actually decode base64/quoted-printable messages when doing the grep.
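
The `/.../` syntax with trailing `i`, `m` and `x` options maps directly onto ruby's own regular expression flags. The following sketch shows one way such an argument could be turned into a `Regexp`; the method name `parse_regex` is hypothetical and the script's own parsing may differ.

```ruby
# Turn a command line argument such as "/free money/i" into a Regexp.
# NOTE: hypothetical helper, shown only to illustrate the argument format.
def parse_regex(arg)
  m = %r{\A/(.*)/([imx]*)\z}m.match(arg)
  raise ArgumentError, "expected /.../ followed by optional i, m, x" unless m

  flags = 0
  flags |= Regexp::IGNORECASE if m[2].include?("i")
  flags |= Regexp::MULTILINE  if m[2].include?("m")
  flags |= Regexp::EXTENDED   if m[2].include?("x")
  Regexp.new(m[1], flags)
end

pattern = parse_regex("/free money/i")
puts pattern.match?("Get FREE MONEY today")
```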

	`--auto`
Argument Options	`none`
Short Form	`-a`
Description	Conduct an automatic, iterative test of the filter to assess its effectiveness. See the `~/.buryspamrc` configuration parameters `epoch`, `seed_interval`, `init_interval`, `test_interval` and `slide_epoch` for more information.

	`--stats`
Argument Options	`none`
Short Form	`-s`
Description	Display monthly volumes of good/bad mail as well as the percentage of spam received during each month.

	`--verbose`
Argument Options	`none`
Short Form	`-v`
Description	Display some diagnostic data during processing. This considerably clutters up the output.

	`--begin`
Argument Options	`'[good|bad] <date|relative-time>'`
Short Form	`-b`
Description	For `--initialize`, initialize the filter using the `good` or `bad` messages dated on or after `date`. The `good`/`bad` parameter values are optional. For `--decode-only`, `--colour`, `--filter`, `--grep`, `--stats` and `--report`, process messages dated on or after `date`. A `relative-time` can be specified instead of an absolute date. For example, `2 months` will be translated into a timestamp two months in the past. Relative `years` or `days` may also be specified. All such times are relative to the present and occur in the past. Often used with the `--end` argument. Cannot be used with the `--auto` argument.

	`--end`
Argument Options	`'[good|bad] <date|relative-time>'`
Short Form	`-e`
Description	For `--initialize`, initialize the filter using the `good` or `bad` messages dated up to but not including `date`. The `good`/`bad` parameter values are optional. For `--decode-only`, `--colour`, `--filter`, `--grep`, `--stats` and `--report`, process messages dated up to and including `date`. A `relative-time` can be specified instead of an absolute date. For example, `2 months` will be translated into a timestamp two months in the past. Relative `years` or `days` may also be specified. All such times are relative to the present and occur in the past. Often used with the `--begin` argument. Cannot be used with the `--auto` argument.
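
To illustrate the `relative-time` form accepted by `--begin` and `--end`, here is a sketch that converts strings such as `2 months` into a date in the past. The method name `relative_date`, the accepted spellings and the use of calendar months are assumptions; the script may resolve relative times differently.

```ruby
require "date"

# Convert "N days", "N months" or "N years" into a date that far in the
# past, counted back from `today`. Returns nil for anything else.
# NOTE: hypothetical helper; calendar arithmetic is an assumption.
def relative_date(text, today = Date.today)
  m = /\A(\d+)\s*(day|month|year)s?\z/i.match(text.strip)
  return nil unless m

  n = m[1].to_i
  case m[2].downcase
  when "day"   then today - n
  when "month" then today << n          # Date#<< steps back n months
  when "year"  then today << (12 * n)
  end
end

puts relative_date("2 months", Date.new(2005, 3, 28))
```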

	`--parameter`
Argument Options	`<attr val ...>`
Short Form	`-p`
Description	Override the configuration parameters in `~/.buryspamrc` using the supplied attribute/value pairs. Can be useful during automated testing.

If standard input is available to the running script, then ALL options are ignored and the script runs in --filter mode.

Exactly one operating mode of --initialize, --decode-only, --colour, --filter, --grep, --stats, --report xor --auto MUST be specified.

If no valid mbox files are given on the command line for the arguments --decode-only, --colour, --filter, --report or --grep then ALL the mboxes in the good/bad directories (as specified in ~/.buryspamrc) are used.

Sample `~/.buryspamrc` File

	good_dirs 	= "~/mail,~/mail/old"
	bad_dirs 	= "~/mail/junk,~/mail/more-junk,~/mail/even-more-junk"
	word_file	= "~/mail/bin/ruby/words"

	word_length	= 3..16
	ignore_probs	= 0.3..0.7
	verbose_hdr	= true
	spam_threshold  = 0.8

Potential Problems/Bugs

Because the ruby binary and the script reside on the grad partition, mail will not be filtered if this partition is down.
If a message is received while the word database is being reinitialized, then there is a (very) small window of time that the filter may read a corrupt database. This issue will be resolved when I find a reliable way to do file locking over NFS with ruby.
A cache file becomes wasted space whenever its corresponding mail folder is removed. Periodically removing all the files in the cache directory and regenerating the word database from scratch will correct this (but that is kind of expensive).
Unfortunately, the processing of the ~/.buryspamrc file makes liberal use of the eval command in ruby. This allows for robust type checking of the various configuration options. However, any ruby commands present in the rc file will be executed with the same permissions as the person running the script. Needless to say, this is a potential security hole if your rc file is ever compromised.
I'm not sure if this method is suitable for widespread deployment. The script can take a few seconds to process each incoming message (most of the time is spent reading the word probabilities file). If a site receives hundreds of messages a second, then the filter will very quickly become a bottleneck if it is used to process every message for every user at the site.
If you rename an mbox file (either spam or non-spam) you may have to manually delete its corresponding cache file if the contents of the mbox also change.
The program consumes quite a bit of memory during the initialization of the word probability database. Many improvements were made since previous versions, but there is probably still room for further improvements.
Because the filter reads the contents of the entire message before processing it, there is a potential for a DoS attack if a sufficiently large message is sent to the filter.
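
For the word database corruption window mentioned above, one common way to narrow such a window without file locking is to write the new database to a temporary file and then rename it over the old one. The sketch below shows the idea; it is not what buryspam.rb currently does, the method name `write_atomically` is hypothetical, and whether the rename behaves atomically over NFS on a given setup is something that would need to be checked.

```ruby
require "tmpdir"

# Write contents to a temporary file in the same directory, then rename
# it over the destination so readers see either the old or the new file.
# NOTE: a sketch of a general technique, not part of buryspam.rb.
def write_atomically(path, contents)
  tmp = "#{path}.tmp.#{Process.pid}"
  File.open(tmp, "w") do |file|
    file.write(contents)
    file.flush
  end
  File.rename(tmp, path)
end

Dir.mktmpdir do |dir|
  words = File.join(dir, "words")
  write_atomically(words, "viagra 0.99\nmeeting 0.01\n")
  puts File.read(words)
end
```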

Potential Improvements

One improvement may be to allow spam folders to be deleted and instead rely solely on the contents of their corresponding cache files during re-initialization. Unfortunately, this would mean that generating a new word database using a different word_regex or word_length, for example, would not be possible.
Compressing the cache files would help to save space. But filtering would be slightly slower.
Rather than using md5sums to determine if an mbox was changed, it would be simpler (and faster) just to compare the timestamps of the mbox and the corresponding word cache file during word file initialization. Unfortunately, if the mbox is somehow touched (even without otherwise being changed), then the word cache would have to be regenerated. Worse, if the word cache file was accidentally touched after an mbox was updated, the updated mbox would not be processed during re-initialization.
The ability to incrementally add/remove messages from an mbox and have the corresponding word counts/probabilities updated may be a good idea.
The program would likely be faster if it were written in C.

--
Donald Craig (donald@cs.mun.ca)
Mon Mar 28 12:32:47 NST 2005
