class Buryspam::Mbox

The Mbox class contains a collection of Message objects. Various class methods manage the mail archive and the rotation of the spam file. The class also provides methods to extract words and IP addresses from the messages in the Mbox for subsequent Bayesian analysis, and to filter all the messages in the Mbox.

Constants

COUNTERS

The items we wish to count in the mbox messages. The Message class should have a method corresponding to each of these symbols that counts that item in the message.

DATE

The date regular expression that follows the FROM string in an e-mail message postmark.

FROM

The 'From ' line pattern denotes the start of a new message in a standard mbox. Note that we allow the email address to contain a space.

POSTMARK

The entire regex representing the postmark is captured in $1 so that the 'From ' lines will be included in the array returned by split in Mbox#initialize.
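
The effect of the capture group can be seen with a simplified pattern. The regex below is a hypothetical stand-in, not the actual POSTMARK definition, and the addresses are made up; the sketch only illustrates how String#split keeps captured delimiters in its result.

```ruby
# simple_postmark is a hypothetical, much-simplified stand-in for POSTMARK.
simple_postmark = /^(From .*\n)/

contents = "From alice@example.com Thu Jul  3 08:36:15 2008\n" \
           "Subject: one\n\nfirst body\n" \
           "From bob@example.com Wed Jul  2 17:50:34 2008\n" \
           "Subject: two\n\nsecond body\n"

# Because the whole pattern is wrapped in a capture group, split returns
# the 'From ' lines as well as the text between them.
parts = contents.split(simple_postmark)

# The text before the first postmark is an empty string; discard it.
parts.shift

# What remains alternates postmark, body, postmark, body, ...
messages = parts.each_slice(2).map { |postmark, body| postmark + body }
```

Without the parentheses, split would drop the 'From ' lines and only the message bodies would survive.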

POSTMARK_DATE

Use this regexp when extracting the date from the postmark (see Buryspam::Message.extract_time). $1 captures the date.

Public Class Methods

archive(msg)

Write the given message to the archive mbox file, if configured. Rotate the archive file if necessary.

# File buryspam.rb, line 3577
def archive(msg)
  Logger.debug("Testing archiving...")
  unless Startup.new_messages?
    Logger.debug("Messages are not new.  Not archiving.")
    return
  end

  begin
    filename = Config.archive_file
    max_size = Config.archive_file_size
    if filename.nil? || filename.empty? || max_size <= 0
      Logger.debug("Not configured for archiving.")
      return
    end

    rotate(filename, msg.size, max_size, Time.now)

    Lockfile.open(filename, File::LOCK_EX) { |f|
      Logger.debug("Archiving message to '#{filename}'...")
      f.print(msg)
    }
    Logger.debug("Archiving complete.")
  rescue
    Status.error($!)
  end
end
count_type(init_date_range, mbox_date_range, num_msgs)

Determine the count type that the Mbox should employ.

  • If all the mbox messages lie inside the initialization time range, then determine total counts (:total).

  • If some of the mbox messages overlap with the initialization time range, then time-indexed counts should be generated (:times).

  • Otherwise, no messages occur during the initialization time range, so there is no need to count anything (nil).

This is defined as a class method so as to avoid the (expensive) creation of an Mbox object when it's not necessary.

# File buryspam.rb, line 3654
def count_type(init_date_range, mbox_date_range, num_msgs)
  return nil if num_msgs.zero?
  return mbox_date_range.within?(init_date_range) ? :total :
         mbox_date_range.overlaps?(init_date_range) ? :times :
         nil
end
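
The three cases can be sketched with plain Range objects. The within? and overlaps? helpers below are hypothetical stand-ins written for this example; the real range helpers that count_type calls are defined elsewhere and may behave differently at the boundaries.

```ruby
# Hypothetical stand-ins for the within?/overlaps? range helpers.
def within?(inner, outer)
  outer.begin <= inner.begin && inner.end <= outer.end
end

def overlaps?(a, b)
  a.begin <= b.end && b.begin <= a.end
end

# Same decision order as count_type: empty mbox, fully inside, partial
# overlap, no overlap.
def count_type_sketch(init_range, mbox_range, num_msgs)
  return nil if num_msgs.zero?
  return :total if within?(mbox_range, init_range)
  return :times if overlaps?(mbox_range, init_range)
  nil
end

init     = Time.utc(2008, 7, 1)..Time.utc(2008, 7, 31)
inside   = Time.utc(2008, 7, 2)..Time.utc(2008, 7, 3)
straddle = Time.utc(2008, 6, 20)..Time.utc(2008, 7, 5)
outside  = Time.utc(2008, 5, 1)..Time.utc(2008, 5, 31)

inside_type   = count_type_sketch(init, inside, 10)
straddle_type = count_type_sketch(init, straddle, 10)
outside_type  = count_type_sketch(init, outside, 10)
empty_type    = count_type_sketch(init, inside, 0)
```

The num_msgs check comes first, so an mbox without messages is never counted regardless of its date range.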
init_counts(num_msgs = 0)

Returns an initialized message/word/IP-address counter hash.

# File buryspam.rb, line 3678
def init_counts(num_msgs = 0)
  counts = { :num_msgs => num_msgs }
  COUNTERS.each { |counter|
    counts[counter] = Hash.new(0)
  }
  counts
end
is_valid?(filename)

Return a true value if the given filename represents a valid mbox. A filename represents a valid mbox if the name is not nil, it names an existing file, and the first line of that file is a valid mbox 'From ' line (see POSTMARK).

# File buryspam.rb, line 3566
def is_valid?(filename)
  if filename.nil? || ! File.file?(filename)
    return false
  end
  Lockfile.open(filename) { |f|
    POSTMARK.match(f.gets)
  }
end
new(contents, opts = {})

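Create a new mbox given a string representing the contents of an mbox file. Spam-related X- headers will be stripped from the messages' headers unless the opts parameter has :strip_buryspam_hdrs set to false. An exception is raised if contents is nil or empty, if its first line is not a valid postmark, or if a postmark has no matching message body.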

# File buryspam.rb, line 3719
def initialize(contents, opts = {})
  params = { :strip_buryspam_hdrs => true }.merge(opts)

  raise "Mbox has no contents." if contents.nil? || contents.empty?

  first_line = contents[/.*\n/] || "<no input lines>"
  unless POSTMARK.match(first_line)
    raise "Invalid mbox: '#{first_line.chomp}...'"
  end

  # Shift out the first (empty) element.
  (msgs = contents.split(POSTMARK)).shift

  unless msgs.size % 2 == 0
    raise "Bad mbox format (missing a msg body?)"
  end

  @msgs = []
  @date_range = oldest = newest = nil

  Logger.debug("Separating into messages...")
  until msgs.empty? do
    msg = Message.new(msgs.shift + msgs.shift, params)
    next if msg.is_admin?
    newest = msg.time if newest.nil? || newest < msg.time
    oldest = msg.time if oldest.nil? || oldest > msg.time
    @msgs << msg
  end
  Logger.debug("Finished separation.")
  unless oldest.nil? || newest.nil?
    # Note: Use a builtin +Range+ object and not a +DateRange+
    # because this range is stored in the Bayesian database and
    # would therefore require always including +DateRange+ when
    # loading that database.
    @date_range = Range.new(oldest, newest)
  end
end
read(opts = {}) { |mbox| ... }

If there is input on standard input, then read it and create/yield a corresponding mbox. Otherwise, read the contents of each file on the command line and create/yield an mbox for each one. (Yielding is done to reduce memory consumption if a long list of mboxes is specified on the command line, for example.)

# File buryspam.rb, line 3609
def read(opts = {})
  # If any input is available on stdin, then let that take priority.
  if $stdin.has_input? || ARGV.empty?
    Logger.debug("Reading from stdin...");
    input = $stdin.binread
    Logger.debug("Read %d byte%s from stdin.".pluralize(input.size))
    yield Mbox.new(input, opts)
  else
    # Read mboxes specified on the command line.
    ARGV.each { |file|
      mbox = read_file(file, opts)
      # Raising an exception seems too drastic if the file is not a valid
      # mbox, especially if there are other valid filenames later on the
      # command line.  Just log a message and continue to the next file.
      if mbox.nil?
        Status.error("'#{file}' not a valid mbox.")
        next
      end
      yield mbox
    }
  end
end
read_file(filename, opts = {})

Read the contents of an mbox from the specified file. Returns an Mbox object, or nil if the specified file is not a valid mbox.

# File buryspam.rb, line 3634
def read_file(filename, opts = {})
  unless Mbox.is_valid?(filename)
    Status.error("'#{filename}' not a valid mbox.")
    return nil
  end
  contents = nil
  Lockfile.open(filename) { |f| contents = f.read }
  Logger.debug("Read %d byte%s from '#{filename}'.".pluralize(contents.size))
  return Mbox.new(contents, opts)
end
rotate(filename, msg_size, max_size, time = nil)

Rename the mbox represented by filename if adding a message of size msg_size would make it larger than max_size. The date represented by the time parameter is appended to the rotated file name; if time is nil, the date is taken from the postmark on the first line of the file. This method is used to rename the archive and spam mboxes.

# File buryspam.rb, line 3690
def rotate(filename, msg_size, max_size, time = nil)
  Logger.debug("About to rotate '#{filename}'...")
  unless File.file?(filename)
    Logger.warn("'#{filename}' not a file.  No rotation done.")
    return
  end
  file_size = File.size(filename)
  Logger.debug("File size     : %9d" % file_size)
  Logger.debug("Message size  : %9d" % msg_size)
  Logger.debug("New file size : %9d" % (file_size + msg_size))
  Logger.debug("Maximum size  : %9d" % max_size)
  if file_size + msg_size <= max_size
    Logger.debug("'#{filename}' not yet large enough to rotate.")
    return
  end
  if time.nil?
    postmark = Lockfile.open(filename) { |f| f.gets }
    time     = Message.extract_time(postmark)
  end

  new_filename = "%s-%s" % [filename, time.strftime("%Y-%m-%d")]
  FileUtils.rename_file_uniq(filename, new_filename)
end
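
The size test and the rotated file name can be sketched on their own, without touching the filesystem. The rotated_name helper and the file name below are hypothetical and exist only for this example; the actual rename is done by the method above.

```ruby
# Hypothetical helper: returns the name the file would be rotated to,
# or nil when the file can absorb the message without exceeding max_size.
def rotated_name(filename, file_size, msg_size, max_size, time)
  return nil if file_size + msg_size <= max_size
  "%s-%s" % [filename, time.strftime("%Y-%m-%d")]
end

stamp = Time.utc(2008, 7, 3)

# 900 + 100 is exactly the limit, so no rotation is needed yet.
fits = rotated_name("archive.mbox", 900, 100, 1000, stamp)

# One more byte pushes the file over the limit.
too_big = rotated_name("archive.mbox", 900, 101, 1000, stamp)
```

Note that the comparison is inclusive: a file that would land exactly on max_size is left in place.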
total_count(time_counts)

Convert a time-indexed count to a total count.

# File buryspam.rb, line 3662
def total_count(time_counts)
  return if time_counts.nil?
  new_counts = { :total => init_counts }
  total_count = new_counts[:total]
  time_counts.each { |time, counts|
    total_count[:num_msgs] += counts.delete(:num_msgs)
    counts.each { |item, cnt|
      total_count[item].merge!(cnt) { |k, n1, n2|
        n1 + n2
      }
    }
  }
  new_counts
end
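
The summing step relies on the block form of Hash#merge!, which is called only for keys present in both hashes. A minimal sketch, using made-up word counts:

```ruby
# Two hypothetical per-time counter hashes, keyed by word.
first  = { "viagra" => 2, "hello" => 1 }
second = { "hello" => 3, "meeting" => 1 }

# Hash.new(0) mirrors the counters created by init_counts.
total = Hash.new(0)
[first, second].each do |cnt|
  # When a key exists on both sides, the block decides the merged value.
  total.merge!(cnt) { |_key, n1, n2| n1 + n2 }
end
```

Keys that appear in only one hash are copied over unchanged, so the result holds the sum of every counter across all time slots.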

Public Instance Methods

counts(count_type)

Return a hash containing the counts of messages, words and IP addresses in the Mbox.

The structure of the returned hash is slightly different depending upon whether the count type (see ::count_type) is :total,

{:total=>
  {:num_msgs=>2223,
   :ipaddrs=>
    {"201.239.80.116"=>1,
     "77.246.104.147"=>3,
     ...
     "83.96.36.53"=>1,
     "81.23.122.146"=>6},
   :words=>
    {"85C3210E77A"=>1,
     "4A1E3100DAE"=>1,
     ...
     "x5SJ"=>1,
     "129D310054D"=>1}}}

or :times,

{Thu Jul 03 08:36:15 -0230 2008=>
  {:num_msgs=>1,
   :ipaddrs=>
    {"127.0.0.1"=>3,
     "134.153.48.18"=>1,
     ...
     "134.153.232.77"=>1},
   :words=>
    {"H"=>1,
     "m63B2uKZ012666"=>1,
     ...
     "1.0"=>1,
     "To:"=>1}},
 ...
 Wed Jul 02 17:50:34 -0230 2008=>
  {:ipaddrs=>
    {"127.0.0.1"=>9,
     ...
     "78.183.15.215"=>1,
     "89.137.220.134"=>1},
   :num_msgs=>3,
   :words=>
    {"8EB0B100F30"=>1,
     "Return-Path:"=>3,
     ...
     "koi8-r"=>1,
     "amavisd-new"=>6}}}
# File buryspam.rb, line 3829
def counts(count_type)
  return if count_type.nil?

  counts = {}
  if count_type == :total
    count = counts[count_type] = Mbox.init_counts(@msgs.size)
  end

  Progress.new(@msgs, :each) { |msg|
    if count_type == :times
      count = counts[msg.time] ||= Mbox.init_counts
      count[:num_msgs] += 1
    end

    COUNTERS.each { |item|
      cnt = msg.send(item)
      count[item].merge!(cnt) { |k, n1, n2|
        n1 + n2
      }
    }
  }
  counts
end
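
The difference between the two shapes can be reproduced with a small, self-contained sketch. Msg, COUNTER_NAMES, new_counts and counts_sketch are hypothetical stand-ins for Buryspam::Message, COUNTERS, ::init_counts and this method; the counter names are assumed from the keys shown in the example hashes above.

```ruby
# Msg is a hypothetical stand-in for Buryspam::Message.
Msg = Struct.new(:time, :words, :ipaddrs)
COUNTER_NAMES = [:words, :ipaddrs]  # assumed from the example hashes

def new_counts(num_msgs = 0)
  counts = { :num_msgs => num_msgs }
  COUNTER_NAMES.each { |name| counts[name] = Hash.new(0) }
  counts
end

def counts_sketch(msgs, count_type)
  return if count_type.nil?
  result = {}
  result[:total] = new_counts(msgs.size) if count_type == :total
  msgs.each do |msg|
    if count_type == :times
      # One bucket per distinct message time.
      bucket = (result[msg.time] ||= new_counts)
      bucket[:num_msgs] += 1
    else
      bucket = result[:total]
    end
    COUNTER_NAMES.each do |name|
      bucket[name].merge!(msg.send(name)) { |_k, n1, n2| n1 + n2 }
    end
  end
  result
end

t1 = Time.utc(2008, 7, 2, 17, 50, 34)
t2 = Time.utc(2008, 7, 3, 8, 36, 15)
msgs = [
  Msg.new(t1, { "To:" => 1 }, { "127.0.0.1" => 2 }),
  Msg.new(t1, { "To:" => 1, "koi8-r" => 1 }, { "127.0.0.1" => 1 }),
  Msg.new(t2, { "To:" => 1 }, { "134.153.48.18" => 1 }),
]

total = counts_sketch(msgs, :total)
times = counts_sketch(msgs, :times)
```

With :total the result has a single :total key; with :times it has one key per distinct message time, and messages sharing a time are merged into the same bucket.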
each_msg() { |msg| ... }

Loop over each message in the Mbox and yield it to the calling code.

# File buryspam.rb, line 3855
def each_msg
  @msgs.each { |msg|
     yield(msg)
  }
end
filter() { |msg, is_spam| ... }

Test all the messages in the mbox for spam. If a block is passed, then yield each processed message to the block along with a boolean indicating whether or not the message was spam. Returns a list of all the filtered messages.

# File buryspam.rb, line 3877
def filter(&block)
  Logger.debug("Filtering %d message%s...".pluralize(num_msgs))

  # Accumulate the filtered messages and return them as an
  # array when they've all been filtered.
  msgs = []

  @msgs.each { |msg|
    Logger.debug(msg.postmark)
    Logger.debug(msg.hdr_field('Subject') || "(no subject)")
    Logger.debug("Message size: %d byte%s.".pluralize(msg.size))

    is_spam = false
    unless Config.filter_date_range.cover?(msg.time)
      Logger.debug("Message not in time range -- not filtered.")
    else
      begin
        Logger.debug("Starting spam test.")
        is_spam = msg.is_spam?
        # If the message is spam, see if it's time to rotate spam mbox.
        Spam.instance.rotate(msg.size) if is_spam
        Logger.info("#{is_spam ? "   " : "NOT"} SPAM: #{msg.postmark}")
      rescue Bayesian::DatabaseError
        # If we can't find the Bayesian database, then assume non-spam.
        Status.error($!.message)
        is_spam = false
      end
    end

    msgs << msg
    yield(msg, is_spam) if block_given?
    Logger.debug("---")
  }
  Logger.debug("Finished filtering.")

  msgs
end
metadata(gb, init_date_range, mbox_mtime)

Returns a hash containing various metadata about the mbox which is stored in the meta cache. This metadata is used to speed up initialization of the filter.

# File buryspam.rb, line 3760
def metadata(gb, init_date_range, mbox_mtime)
  {
    :type            => gb,
    :ver             => VER,
    :mtime           => mbox_mtime,
    :date_range      => @date_range,
    :num_msgs        => num_msgs,
    :count_type      => Mbox.count_type(init_date_range,
                                        @date_range, num_msgs)
  }.merge(  # Fold in the relevant configuration parameter settings, too.
    Hash[
     *Config::META_CACHE.map { |key|
        [key.to_sym, Config.send(key)]
      }.flatten
    ]
  )
end
num_msgs()

Returns the number of messages in the Mbox.

# File buryspam.rb, line 3862
def num_msgs
  @msgs.size
end
to_s()

Convert the mbox to a giant string.

# File buryspam.rb, line 3867
def to_s
  @msgs.map { |msg|
    msg.to_s
  }.join("")
end