The Mbox class contains a collection of Message objects. Class methods manage the mail archive and the rotation of the spam file. Other methods extract words and IP addresses from the messages in the Mbox for subsequent Bayesian analysis, and filter all the messages in the Mbox.
The items we wish to count in the mbox messages. The Message class should have a method corresponding to each of these symbols that counts that item in the message.
The date regular expression that follows the FROM string, above, in an e-mail message postmark.
The 'From ' line pattern denotes the start of a new message in a standard mbox. Note that we allow for the email address to have a space.
The entire regex representing the postmark is captured in $1 so that the 'From ' lines will be included in the array returned by split in Mbox#initialize.
Use this regexp when extracting the date from the postmark (see Buryspam::Message.extract_time). $1 captures the date.
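The effect of capturing the whole postmark can be illustrated with a small sketch. The pattern below is a hypothetical, much simpler stand-in for POSTMARK (whose real definition is not shown here), and the sample addresses are made up. It relies only on the documented behaviour of String#split, which includes captured groups in the returned array.

```ruby
# Hypothetical, simplified postmark pattern (NOT the real POSTMARK).
# The outer parentheses capture the entire 'From ' line so split keeps it.
simple_postmark = /^(From \S+ .*\n)/

contents = "From alice@example.com Thu Jul  3 08:36:15 2008\n" \
           "Subject: hello\n\nFirst body\n" \
           "From bob@example.com Wed Jul  2 17:50:34 2008\n" \
           "Subject: again\n\nSecond body\n"

parts = contents.split(simple_postmark)
parts.shift   # drop the empty string that precedes the first postmark

# The remaining elements alternate postmark, body, postmark, body, ...
messages = parts.each_slice(2).map { |postmark, body| postmark + body }
```

Without the capturing group, split would discard the 'From ' lines and each message would lose its postmark.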
Write the given message to the archive mbox file, if configured. Rotate the archive file if necessary.
    # File buryspam.rb, line 3577
    def archive(msg)
      Logger.debug("Testing archiving...")
      unless Startup.new_messages?
        Logger.debug("Messages are not new. Not archiving.")
        return
      end
      begin
        filename = Config.archive_file
        max_size = Config.archive_file_size
        if filename.nil? || filename.empty? || max_size <= 0
          Logger.debug("Not configured for archiving.")
          return
        end
        rotate(filename, msg.size, max_size, Time.now)
        Lockfile.open(filename, File::LOCK_EX) { |f|
          Logger.debug("Archiving message to '#{filename}'...")
          f.print(msg)
        }
        Logger.debug("Archiving complete.")
      rescue
        Status.error($!)
      end
    end
Determine the count type that the Mbox should employ.
If all the mbox messages lie inside the initialization time range, then determine total counts (:total).
If some of the mbox messages overlap with the initialization time range, then time-index counts should be generated (:times).
Otherwise, no messages occur during the initialization time range, so there is no need to count anything (nil).
This is defined as a class method so as to avoid the (expensive) creation of an Mbox object when it's not necessary.
    # File buryspam.rb, line 3654
    def count_type(init_date_range, mbox_date_range, num_msgs)
      return nil if num_msgs.zero?
      return mbox_date_range.within?(init_date_range)   ? :total :
             mbox_date_range.overlaps?(init_date_range) ? :times : nil
    end
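The three outcomes can be tried out with a self-contained sketch. The within? and overlaps? helpers are not defined in this section, so the version below substitutes plain comparisons on built-in Range objects and assumes inclusive ranges; sketch_count_type is a hypothetical name, not part of buryspam.

```ruby
# Stand-in for Mbox.count_type using only built-in Range objects.
# Assumes both ranges are inclusive and non-empty.
def sketch_count_type(init_range, mbox_range, num_msgs)
  return nil if num_msgs.zero?

  if init_range.cover?(mbox_range.begin) && init_range.cover?(mbox_range.end)
    # Every message falls inside the initialization range.
    :total
  elsif mbox_range.begin <= init_range.end && init_range.begin <= mbox_range.end
    # The two ranges share at least one instant.
    :times
  end
  # Falls through to nil when the ranges are disjoint.
end

init    = Time.utc(2008, 7, 1)..Time.utc(2008, 7, 31)
inside  = sketch_count_type(init, Time.utc(2008, 7, 2)..Time.utc(2008, 7, 3), 5)
partial = sketch_count_type(init, Time.utc(2008, 6, 20)..Time.utc(2008, 7, 3), 5)
outside = sketch_count_type(init, Time.utc(2008, 5, 1)..Time.utc(2008, 5, 2), 5)
empty   = sketch_count_type(init, Time.utc(2008, 7, 2)..Time.utc(2008, 7, 3), 0)
```

Here inside is :total, partial is :times, and both outside and empty are nil.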
Returns an initialized messages/word/ip-address counter hash.
    # File buryspam.rb, line 3678
    def init_counts(num_msgs = 0)
      counts = { :num_msgs => num_msgs }
      COUNTERS.each { |counter| counts[counter] = Hash.new(0) }
      counts
    end
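The reason for Hash.new(0) is that unseen items can be incremented without first checking for their presence. A minimal sketch follows; the contents of COUNTERS are not shown in this section, so the list below is an assumption based on the :words and :ipaddrs keys that appear in the example hashes for counts.

```ruby
# Assumed counter names, inferred from the example hashes; the real
# COUNTERS constant may differ.
counters = [:words, :ipaddrs]

counts = { :num_msgs => 0 }
counters.each { |counter| counts[counter] = Hash.new(0) }

# Missing keys default to 0, so += works on first sight of an item.
counts[:words]["Return-Path:"] += 1
counts[:words]["Return-Path:"] += 1
counts[:ipaddrs]["127.0.0.1"] += 1

# Reading a missing key returns 0 without storing it.
unseen = counts[:words]["never-seen"]
```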
Return true if the given filename represents a valid mbox. A filename represents a valid mbox if the name is not nil, it is an existing file, and its first line is a valid mbox 'From ' line (see POSTMARK).
    # File buryspam.rb, line 3566
    def is_valid?(filename)
      if filename.nil? || ! File.file?(filename)
        return false
      end
      Lockfile.open(filename) { |f| POSTMARK.match(f.gets) }
    end
Create a new mbox given a string representing the contents of an mbox file. Spam-related X- headers will be stripped from the messages' headers unless the opts parameter has :strip_buryspam_hdrs set to false.
    # File buryspam.rb, line 3719
    def initialize(contents, opts = {})
      params = { :strip_buryspam_hdrs => true }.merge(opts)
      raise "Mbox has no contents." if contents.nil? || contents.empty?
      first_line = contents[/.*\n/] || "<no input lines>"
      unless POSTMARK.match(first_line)
        raise "Invalid mbox: '#{first_line.chomp}...'"
      end
      # Shift out the first (empty) element.
      (msgs = contents.split(POSTMARK)).shift
      unless msgs.size % 2 == 0
        raise "Bad mbox format (missing a msg body?)"
      end
      @msgs = []
      @date_range = oldest = newest = nil
      Logger.debug("Separating into messages...")
      until msgs.empty? do
        msg = Message.new(msgs.shift + msgs.shift, params)
        next if msg.is_admin?
        newest = msg.time if newest.nil? || newest < msg.time
        oldest = msg.time if oldest.nil? || oldest > msg.time
        @msgs << msg
      end
      Logger.debug("Finished separation.")
      unless oldest.nil? || newest.nil?
        # Note: Use a builtin +Range+ object and not a +DateRange+
        # because this range is stored in the Bayesian database and
        # would therefore require always including +DateRange+ when
        # loading that database.
        @date_range = Range.new(oldest, newest)
      end
    end
If there is input on standard input, then read it and create/yield a corresponding mbox. Otherwise, read the contents of each file on the command line and create/yield an mbox for each one. (Yielding is done to reduce memory consumption if a long list of mboxes is specified on the command line, for example.)
    # File buryspam.rb, line 3609
    def read(opts = {})
      # If any input is available on stdin, then let that take priority.
      if $stdin.has_input? || ARGV.empty?
        Logger.debug("Reading from stdin...")
        input = $stdin.binread
        Logger.debug("Read %d byte%s from stdin.".pluralize(input.size))
        yield Mbox.new(input, opts)
      else
        # Read mboxes specified on the command line.
        ARGV.each { |file|
          mbox = read_file(file, opts)
          # Raising an exception seems too drastic if the file is not a valid
          # mbox, especially if there are other valid filenames later on the
          # command line. Just log a message and continue to the next file.
          if mbox.nil?
            Status.error("'#{file}' not a valid mbox.")
            next
          end
          yield mbox
        }
      end
    end
Read the contents of an mbox from the specified file. Returns an Mbox object, or nil if the specified file is not a valid mbox.
    # File buryspam.rb, line 3634
    def read_file(filename, opts = {})
      unless Mbox.is_valid?(filename)
        Status.error("'#{filename}' not a valid mbox.")
        return nil
      end
      contents = nil
      Lockfile.open(filename) { |f| contents = f.read }
      Logger.debug("Read %d byte%s from '#{filename}'.".pluralize(contents.size))
      return Mbox.new(contents, opts)
    end
Rename the mbox represented by filename if adding a message of size msg_size would make it larger than max_size. A date derived from the time parameter is appended to the rotated file name; if time is nil, the date is extracted from the postmark on the first line of the file. This method is used to rename the archive and spam mboxes.
    # File buryspam.rb, line 3690
    def rotate(filename, msg_size, max_size, time = nil)
      Logger.debug("About to rotate '#{filename}'...")
      unless File.file?(filename)
        Logger.warn("'#{filename}' not a file. No rotation done.")
        return
      end
      file_size = File.size(filename)
      Logger.debug("File size : %9d" % file_size)
      Logger.debug("Message size : %9d" % msg_size)
      Logger.debug("New file size : %9d" % (file_size + msg_size))
      Logger.debug("Maximum size : %9d" % max_size)
      if file_size + msg_size <= max_size
        Logger.debug("'#{filename}' not yet large enough to rotate.")
        return
      end
      if time.nil?
        postmark = Lockfile.open(filename) { |f| f.gets }
        time = Message.extract_time(postmark)
      end
      new_filename = "%s-%s" % [filename, time.strftime("%Y-%m-%d")]
      FileUtils.rename_file_uniq(filename, new_filename)
    end
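The size test and the naming scheme can be isolated in a small sketch that touches no files. The helper rotated_name is hypothetical and leaves out the actual rename (FileUtils.rename_file_uniq is a buryspam helper that is not defined in this section); the file name and sizes are made-up values.

```ruby
# Returns the name the file would be rotated to, or nil when the file
# would still fit within max_size after the message is appended.
def rotated_name(filename, file_size, msg_size, max_size, time)
  return nil if file_size + msg_size <= max_size
  "%s-%s" % [filename, time.strftime("%Y-%m-%d")]
end

stamp = Time.utc(2008, 7, 3)

# 900 + 50 bytes still fits in 1000, so no rotation.
no_rotation = rotated_name("archive.mbox", 900, 50, 1000, stamp)

# 900 + 200 bytes exceeds 1000, so the file is due for rotation.
rotation = rotated_name("archive.mbox", 900, 200, 1000, stamp)
```

Note that the comparison uses <=, so a file that would land exactly on max_size is not rotated.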
Convert a time-indexed count to a total count.
    # File buryspam.rb, line 3662
    def total_count(time_counts)
      return if time_counts.nil?
      new_counts = { :total => init_counts }
      total_count = new_counts[:total]
      time_counts.each { |time, counts|
        total_count[:num_msgs] += counts.delete(:num_msgs)
        counts.each { |item, cnt|
          total_count[item].merge!(cnt) { |k, n1, n2| n1 + n2 }
        }
      }
      new_counts
    end
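The merge step can be followed on a tiny worked example. This sketch uses made-up counts and only a :words counter, and, unlike the method above, it reads :num_msgs without deleting it, so the input hash is left unchanged.

```ruby
# Two time-indexed entries with made-up word counts.
time_counts = {
  Time.utc(2008, 7, 2) => { :num_msgs => 3, :words => { "To:" => 3, "koi8-r" => 1 } },
  Time.utc(2008, 7, 3) => { :num_msgs => 1, :words => { "To:" => 1, "H" => 1 } }
}

total = { :num_msgs => 0, :words => Hash.new(0) }
time_counts.each do |_time, counts|
  total[:num_msgs] += counts[:num_msgs]
  # The block resolves key collisions by adding the two counts.
  total[:words].merge!(counts[:words]) { |_key, n1, n2| n1 + n2 }
end
```

After the loop, total[:num_msgs] is 4 and "To:" has a combined count of 4, while "koi8-r" and "H" keep their single counts.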
Return a hash containing the counts of messages, words and IP addresses in the Mbox.
The structure of the cache is slightly different depending upon whether the ::count_type is :total,

    {:total=>
      {:num_msgs=>2223,
       :ipaddrs=>
        {"201.239.80.116"=>1,
         "77.246.104.147"=>3,
         ...
         "83.96.36.53"=>1,
         "81.23.122.146"=>6},
       :words=>
        {"85C3210E77A"=>1,
         "4A1E3100DAE"=>1,
         ...
         "x5SJ"=>1,
         "129D310054D"=>1}}}

or :times,

    {Thu Jul 03 08:36:15 -0230 2008=>
      {:num_msgs=>1,
       :ipaddrs=>
        {"127.0.0.1"=>3,
         "134.153.48.18"=>1,
         ...
         "134.153.232.77"=>1},
       :words=>
        {"H"=>1,
         "m63B2uKZ012666"=>1,
         ...
         "1.0"=>1,
         "To:"=>1}},
     ...
     Wed Jul 02 17:50:34 -0230 2008=>
      {:ipaddrs=>
        {"127.0.0.1"=>9,
         ...
         "78.183.15.215"=>1,
         "89.137.220.134"=>1},
       :num_msgs=>3,
       :words=>
        {"8EB0B100F30"=>1,
         "Return-Path:"=>3,
         ...
         "koi8-r"=>1,
         "amavisd-new"=>6}}}
    # File buryspam.rb, line 3829
    def counts(count_type)
      return if count_type.nil?
      counts = {}
      if count_type == :total
        count = counts[count_type] = Mbox.init_counts(@msgs.size)
      end
      Progress.new(@msgs, :each) { |msg|
        if count_type == :times
          count = counts[msg.time] ||= Mbox.init_counts
          count[:num_msgs] += 1
        end
        COUNTERS.each { |item|
          cnt = msg.send(item)
          count[item].merge!(cnt) { |k, n1, n2| n1 + n2 }
        }
      }
      counts
    end
Loop over each message in the Mbox and yield it to the calling code.
    # File buryspam.rb, line 3855
    def each_msg
      @msgs.each { |msg| yield(msg) }
    end
Test all the messages in the mbox for spam. If a block is passed, then yield each processed message to the block along with a boolean indicating whether or not the message was spam. Returns a list of all the filtered messages.
    # File buryspam.rb, line 3877
    def filter(&block)
      Logger.debug("Filtering %d message%s...".pluralize(num_msgs))
      # Accumulate the filtered messages and return them as an
      # array when they've all been filtered.
      msgs = []
      @msgs.each { |msg|
        Logger.debug(msg.postmark)
        Logger.debug(msg.hdr_field('Subject') || "(no subject)")
        Logger.debug("Message size: %d byte%s.".pluralize(msg.size))
        is_spam = false
        unless Config.filter_date_range.cover?(msg.time)
          Logger.debug("Message not in time range -- not filtered.")
        else
          begin
            Logger.debug("Starting spam test.")
            is_spam = msg.is_spam?
            # If the message is spam, see if it's time to rotate spam mbox.
            Spam.instance.rotate(msg.size) if is_spam
            Logger.info("#{is_spam ? " " : "NOT"} SPAM: #{msg.postmark}")
          rescue Bayesian::DatabaseError
            # If we can't find the Bayesian database, then assume non-spam.
            Status.error($!.message)
            is_spam = false
          end
        end
        msgs << msg
        yield(msg, is_spam) if block_given?
        Logger.debug("---")
      }
      Logger.debug("Finished filtering.")
      msgs
    end
Returns a hash of metadata about the mbox, which is stored in the meta cache. This metadata is used to speed up initialization of the filter.
    # File buryspam.rb, line 3760
    def metadata(gb, init_date_range, mbox_mtime)
      {
        :type       => gb,
        :ver        => VER,
        :mtime      => mbox_mtime,
        :date_range => @date_range,
        :num_msgs   => num_msgs,
        :count_type => Mbox.count_type(init_date_range, @date_range, num_msgs)
      }.merge(
        # Fold in the relevant configuration parameter settings, too.
        Hash[
          *Config::META_CACHE.map { |key|
            [key.to_sym, Config.send(key)]
          }.flatten
        ]
      )
    end
Returns the number of messages in the Mbox.
    # File buryspam.rb, line 3862
    def num_msgs
      @msgs.size
    end
Convert the mbox to a giant string.
    # File buryspam.rb, line 3867
    def to_s
      @msgs.map { |msg| msg.to_s }.join("")
    end