class Buryspam::Message

Represents a message from an mbox file. This class contains various methods for processing MIME messages with base64/quoted-printable parts. It also contains methods for extracting and analyzing words and IP addresses for Bayesian filtering of the message.

Constants

ADMIN_BODY
ADMIN_SUBJECT: 'Administrative' messages are messages that have been created by Pine inside an mbox. These messages should be ignored.
DECODERS: Map Content-Transfer-Encodings to their corresponding decoders. NOTE: String#unpack("M") isn't aggressive enough for QP decoding, so we borrow from elsewhere.
ENC_WORD: Regular expression used to help decode encoded headers (RFC 2047).
HTML_FTR: HTML footer appended to colourized message output. See colourize.
HTML_HDR: HTML header used to prefix colourized message output. See colourize.
HTTP_HDRS: When visiting URLs in messages during filtering, some sites require a non-empty User-Agent header before they'll let you in.
HttpResults: Simple structure to store the html text and content type retrieved by visiting URIs in a message.
INTERESTING_FIELDS
IP_ADDR
IP_NONROUTABLE
IS_MULTIPART
KEY_VAL
NEUTRAL
OCTET
RECEIVED
RECEIVED_LEN
SPAM_HDR: The header line added to all messages filtered by this script. (The X_HDR_PFX prefix will also be added.)
STRIP_HDRS: Strip other spam filter header lines which can mislead our filter. (To avoid accidentally destroying the double newline at the end of the header, the newline is removed from the beginning of each line instead.) Handles header lines that have been folded.
URI_RE: Regular expression to identify URIs in a message body.
X_HDR_PFX: The prefix to prepend to all header lines added by buryspam.

Attributes

postmark[R]

time[R]

Public Class Methods

extract_time(postmark)

Extract the date/time from the message's postmark ('From ') line. Return the current time if we couldn't parse a time from the postmark.

# File buryspam.rb, line 2824
def self.extract_time(postmark)
  begin
    raise unless Mbox::POSTMARK_DATE.match(postmark)
    return Time.parse($1)
  rescue
    Status.warn("Bad time in message postmark:\n '#{postmark}'")
    # If parse failed or no date in postmark, then default to current time.
    Time.now
  end
end
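
A minimal sketch of the fallback behaviour described above, assuming a simplified postmark pattern. POSTMARK_DATE_SKETCH and extract_time_sketch are hypothetical names; the real Mbox::POSTMARK_DATE is not shown on this page and may be stricter.

```ruby
require 'time'

# Hypothetical stand-in for Mbox::POSTMARK_DATE (the real pattern is not shown).
POSTMARK_DATE_SKETCH = /\AFrom \S+\s+(.+)\z/

# Parse the date from a 'From ' line; fall back to the current time.
def extract_time_sketch(postmark)
  md = POSTMARK_DATE_SKETCH.match(postmark)
  raise ArgumentError, "no date in postmark" unless md
  Time.parse(md[1])
rescue ArgumentError
  # Time.parse raises ArgumentError when it finds no usable date.
  Time.now
end

t = extract_time_sketch("From alice@example.com Mon Jan  5 10:15:30 2009")
puts t.year
puts extract_time_sketch("not a postmark").class
```

Unlike the listing, the sketch rescues only ArgumentError and does not log a warning.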

new(contents, opts = {})

Create a new message object from the contents parameter, which is assumed to contain an RFC 2822 message in string format. If the opts hash parameter has :strip_buryspam_hdrs set to true, then X- spam related header lines will be stripped from the message header.

# File buryspam.rb, line 2785
def initialize(contents, opts = {})
  # Assume we are processing a top-level message (i.e., not a message
  # subpart), unless opts[:top_level] is false.
  @params = { :top_level => true }.merge(opts)

  if @params[:top_level]
    # Note: @postmark is an instance variable because the Mbox class
    #       uses it for logging purposes during filtering.
    @postmark = contents[/.*/]
    if Mbox::POSTMARK.match(@postmark)
      @time = Message.extract_time(@postmark)
    else
      raise "'From ' line invalid:\n\t'#{@postmark}'"
    end
  end

  if contents.strip.empty?
    raise "Empty message.\n#{contents.inspect}"
  end

  # Don't check for header/body separator
  # (i.e., contents[/\n\n/].nil?) because not
  # all messages have it!

  @hdr, @bdy = split_msg(contents)
  if @params[:strip_buryspam_hdrs]
    @hdr.gsub!(STRIP_HDRS, "")
  end

  @unf_hdr = @hdr.gsub(/\n\s+/, " ")
end

Public Instance Methods

add_hdrs(*args)

Add header lines to the message. Each argument is a header line ("hdr-field: hdr-value") to be added to the header. The X_HDR_PFX prefix is prepended to each header field. If Config.verbose_hdr is false, only the X-Buryspam-Spam: header is added.

# File buryspam.rb, line 2946
def add_hdrs(*args)
  hdrs = args.map{ |hdr|
    hdr.strip!
    SPAM_HDR.match(hdr) || Config.verbose_hdr ? X_HDR_PFX + hdr : nil
  }.compact.join("\n")
  # Insert the new header lines before the double-newline.
  @hdr[@hdr.index("\n\n"), 0] = "\n#{hdrs}"
end
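
The insertion step can be sketched in isolation. This is an illustration only: X_HDR_PFX_SKETCH is an assumed value (inferred from the "X-Buryspam-Spam:" header named above), insert_hdrs_sketch is a hypothetical helper, and the verbose_hdr filtering is left out.

```ruby
# Assumed value of X_HDR_PFX, inferred from the "X-Buryspam-Spam:" header name.
X_HDR_PFX_SKETCH = "X-Buryspam-"

# Return a copy of hdr with the prefixed lines inserted just before the
# double newline that terminates the header.
def insert_hdrs_sketch(hdr, *lines)
  new_lines = lines.map { |l| X_HDR_PFX_SKETCH + l.strip }.join("\n")
  out = hdr.dup
  # A zero-length slice assignment inserts text without replacing anything.
  out[out.index("\n\n"), 0] = "\n#{new_lines}"
  out
end

puts insert_hdrs_sketch("From: a\nTo: b\n\n", "Spam: No")
```

The sketch relies on the header ending in a double newline, which split_msg guarantees for @hdr.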

colourize()

Return an HTML version of the message with words colourized according to their probability. Good words are green and bad words are red. The further away the word probabilities are from neutral, the more intense the colour.

# File buryspam.rb, line 2997
def colourize
  Logger.debug(postmark)
  Logger.debug(hdr_field('Subject'))
  # (re)run the spam test so that the probabilities listed in the
  # X-Buryspam-Words: line are consistent.  This is a bit wasteful,
  # since it does a decode and the decoding is repeated below...
  # Perhaps we should cache the decoded messages?
  is_spam?

  html = ""
  # After decoding the message, fold lines that are too long, but allow line
  # breaks only between non-printable or white-space characters.
  contents = decode.gsub(/(.{1,80})([^[:print:]]|\s)/) do
    "#$1\n" + ($2 == "\n" ? "" : " #$2")
  end
  cache = {}
  # Create a new regex based upon the configuration's word_regex
  # that will capture the non-word elements, too.
  re = Config.word_regex
  nre = Regexp.new("(" + re.source + ")", re.options)

  max_dist = (0.5 - Config.bad_prob).abs

  contents.split(nre).each { |w|
    if cache.has_key?(w)
      html << cache[w]
      next
    end

    p = Bayesian.db[:word_probs].fetch(w, nil)

    # Don't colourize a word if its probability hasn't already been
    # converted to a float by the get_slots method during filtering.
    # This prevents colourizing words that were not in the original
    # message (e.g., new X-Buryspam-* header lines added by filter).
    if p.nil? || p.class != Float
      cache[w] = conv_html_chars(w)
      html << cache[w]
      next
    end

    pcnt = ((0.5 - p).abs / max_dist * 100).round
    style = "color: %s; background-color: rgb(%s%%, %s%%, 0%%)" %
              (p < 0.5 ? ["black", 0, pcnt] :
                         ["white", pcnt, 0])

    if (p - Config.bad_prob).abs  < 1e-6 ||
       (p - Config.good_prob).abs < 1e-6
      style << "; text-decoration: blink"
    end

    cache[w] = %Q{<a href="#" title="#{w}(#{p})"><span style="#{style}">} +
               conv_html_chars(w) + "</span></a>"
    html << cache[w]
  }
  html
end
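
The colour intensity calculation can be tried on its own. In this sketch, style_sketch is a hypothetical helper and the default bad_prob of 0.99 is an assumed configuration value, not one taken from the source.

```ruby
# Build the CSS style for a word with spam probability p.
# bad_prob stands in for Config.bad_prob (0.99 is an assumption).
def style_sketch(p, bad_prob: 0.99)
  max_dist = (0.5 - bad_prob).abs
  # Distance from neutral as a percentage of the maximum distance.
  pcnt = ((0.5 - p).abs / max_dist * 100).round
  "color: %s; background-color: rgb(%s%%, %s%%, 0%%)" %
    (p < 0.5 ? ["black", 0, pcnt] : ["white", pcnt, 0])
end

puts style_sketch(0.99)  # strongest red background
puts style_sketch(0.01)  # strongest green background
puts style_sketch(0.5)   # neutral: no colour component
```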

decode()

Returns the message with all quoted-printable and base64 parts and header lines decoded. This does not affect the underlying message itself (a copy of the message is created).

# File buryspam.rb, line 2925
def decode
  @hdr_attr = parse_header
  # Make copies of the header and body so that we do NOT modify
  # @hdr/@bdy members during decoding.
  hdr, bdy = decode_hdr(@hdr.dup), @bdy.dup
  hdr, bdy = is_multipart? ?
               decode_multi_part(hdr, bdy) :
               decode_single_part(hdr, bdy)
  msg = hdr + bdy
  if @params[:top_level]
    # Ensure message has at least two newlines at the end.
    nls = msg[/\n{0,2}\z/]
    msg << "\n" * (2 - nls.length) if nls.length < 2
  end
  msg
end
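
The trailing-newline normalization at the end of decode can be sketched as follows. ensure_two_newlines_sketch is a hypothetical helper that returns a new string rather than appending in place.

```ruby
# Pad msg so that it ends with at least two newlines.
def ensure_two_newlines_sketch(msg)
  # Matches the (possibly empty) run of up to two newlines at the very end.
  nls = msg[/\n{0,2}\z/]
  nls.length < 2 ? msg + "\n" * (2 - nls.length) : msg
end

p ensure_two_newlines_sketch("body")
p ensure_two_newlines_sketch("body\n")
p ensure_two_newlines_sketch("body\n\n\n")
```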

get_uri(uri_str)

Return the HTML and Content-Type of the URI represented by uri_str.

# File buryspam.rb, line 2902
def get_uri(uri_str)
  # Suppress 'warning: using default DH parameters' when getting https://...
  v, $VERBOSE = $VERBOSE, nil
  # Try open-uri first because it handles redirection.
  uri = open(uri_str, HTTP_HDRS)
  return nil if uri.nil?
  return HttpResults.new(uri.binread, uri.content_type)
  rescue RuntimeError
    # Fall-back to net/http on redirection loops...
    raise unless /^HTTP redirection loop: (.*)/.match($!.message)
    uri = URI.parse($1)
    req = Net::HTTP::Get.new(uri.path, HTTP_HDRS)
    res = Net::HTTP.start(uri.host, uri.port) { |http|
      http.request(req)
    }
    return HttpResults.new(res.body, res.content_type)
  ensure
    $VERBOSE = v
end

hdr_field(field)

If the message has a header line with the specified field, then return the entire (unfolded) line. Otherwise, return nil.

# File buryspam.rb, line 2957
def hdr_field(field)
  md = @unf_hdr.match(/^#{field}\s*:.*/)
  return nil if md.nil?
  return md[0]
end
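
A small sketch of how the lookup behaves on a folded header, assuming the same unfolding expression that initialize uses to build @unf_hdr. unfold_sketch and hdr_field_sketch are hypothetical helper names.

```ruby
# Join folded continuation lines onto the preceding header line.
def unfold_sketch(hdr)
  hdr.gsub(/\n\s+/, " ")
end

# Return the whole header line for field, or nil if it is absent.
def hdr_field_sketch(unfolded, field)
  md = unfolded.match(/^#{field}\s*:.*/)
  md && md[0]
end

unfolded = unfold_sketch("Subject: a long\n  folded subject\nTo: bob@example.com\n\n")
p hdr_field_sketch(unfolded, "Subject")
p hdr_field_sketch(unfolded, "Cc")
```

As in the listing, the field name is interpolated into the pattern unescaped, so callers are expected to pass plain field names.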

ipaddrs()

Extract all the IP addresses from the message's Received header lines and return them in a hash that maps each address to its number of occurrences. Non-routable addresses are omitted.

{
  "12.34.56.78" => 2,
  "98.76.54.32" => 1,
  ...
}

# File buryspam.rb, line 2842
def ipaddrs
  ips = Hash.new(0)
  @unf_hdr.each_line { |line|
    # For efficiency, avoid using regexp.
    next unless line[0,RECEIVED_LEN].casecmp(RECEIVED) == 0
    line.scan(IP_ADDR) {
      ips[$1] += 1
    }
  }
  ips.delete_if { |ip, count| ip.match(IP_NONROUTABLE) }
end
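
The counting and filtering can be sketched with stand-in patterns. IP_ADDR_SKETCH and IP_NONROUTABLE_SKETCH are assumptions made for illustration; the actual IP_ADDR and IP_NONROUTABLE constants are not shown on this page.

```ruby
# Hypothetical stand-ins for the IP_ADDR and IP_NONROUTABLE constants.
IP_ADDR_SKETCH = /\b(\d{1,3}(?:\.\d{1,3}){3})\b/
IP_NONROUTABLE_SKETCH = /\A(?:10\.|127\.|192\.168\.|172\.(?:1[6-9]|2\d|3[01])\.)/

# Count the IP addresses found on Received: lines of an unfolded header.
def count_received_ips_sketch(unfolded_hdr)
  ips = Hash.new(0)
  unfolded_hdr.each_line do |line|
    # Cheap case-insensitive prefix test instead of a regexp.
    next unless line[0, 9].casecmp("Received:") == 0
    line.scan(IP_ADDR_SKETCH) { |(ip)| ips[ip] += 1 }
  end
  # Drop private and loopback addresses.
  ips.delete_if { |ip, _count| ip.match(IP_NONROUTABLE_SKETCH) }
end

hdr = "Received: from a [12.34.56.78] by b (10.0.0.1)\n" \
      "received: from c [12.34.56.78]\n" \
      "X-Note: 98.76.54.32\n" \
      "Received: from d [98.76.54.32]\n"
p count_received_ips_sketch(hdr)
```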

is_admin?()

Return a true value if this message is the Pine administrative message.

# File buryspam.rb, line 2818
def is_admin?
  ADMIN_SUBJECT.match(@hdr) && ADMIN_BODY.match(@bdy)
end

is_spam?()

Determine if the message is spam by extracting the interesting words from the message and performing Bayesian/ip-octet tests on it. Header lines are attached to the message as appropriate. Returns true if message is spam, false otherwise.

# File buryspam.rb, line 2977
def is_spam?
  Logger.debug("Extracting word samples...")
  # NOTE: 'words' is a method that returns a hash.
  w = words.keys
  @samples = get_samples(w)
  wp = words_hdr
  Logger.debug("Selected message word samples: #{wp}")
  add_hdrs("Words: #{wp}")

  spam = test_bayesian_spam || test_ip_octets_spam || test_website_spam(w)
  add_hdrs("Spam: " + (spam ? 'Yes' : 'No'))
  # No longer needed.  Let it be freed.
  @samples = nil
  spam
end

size()

Return the number of bytes in the (undecoded) message.

# File buryspam.rb, line 2969
def size
  to_s.size
end

to_s()

Return the message in string form.

# File buryspam.rb, line 2964
def to_s
  @hdr + @bdy
end

visit_urls()

Extract the URLs in the (decoded) message, retrieve their webpage contents and break them into their constituent words, according to Config.word_regex. Ignore links to content-types that are undecodable according to the Config. Return the list of words without duplicates.

# File buryspam.rb, line 2875
def visit_urls
  contents = ""
  visited = {}
  begin
    Timeout.timeout(Config.test_msg_urls_timeout) {
      decode.scan(URI_RE) { |uri_str|
        next if visited.include?(uri_str)
        visited[uri_str] = true
        Logger.debug("URI: #{uri_str}")
        result = get_uri(uri_str)
        if result.nil?
          Logger.warn("Cannot open URI: '#{uri_str}'")
        elsif Config.undecodable =~ result.content_type
          Logger.debug("Ignoring '#{result.content_type}' uri.")
        else
          Logger.debug("Read #{result.html.size} bytes")
          contents << result.html
        end
      }
    }
  rescue Exception
    Logger.warn($!.message)
  end
  contents.scan(Config.word_regex).uniq
end

words()

Extract all the words from the decoded message that match the word_regex configuration parameter and return them in a hash that maps each word to its number of occurrences.

{
  "Hello" => 1,
  "VIAGRA" => 5,
  "$500.00" => 3,
  ...
}

# File buryspam.rb, line 2862
def words
  wrds = Hash.new(0)
  decode.scan(Config.word_regex) { |wrd|
    wrds[wrd] += 1
  }
  wrds
end

Private Instance Methods

backlisted_octets()

Returns a list containing the IP addresses that have blacklisted octets and the number of octets that were blacklisted. e.g.,

[["212.113.174.31", 3], ["10.137.130.49", 3]]

# File buryspam.rb, line 3368
def backlisted_octets
  bad_ips = ipaddrs.keys - Bayesian.db[:whitelist]
  bad_ips.map { |ip|
    num_octets = matching_octets(ip)
    num_octets.zero? ? nil : [ip, num_octets]
  }.compact
end

conv_html_chars(str)

Convert special HTML markup characters to their corresponding entities.

# File buryspam.rb, line 3059
def conv_html_chars(str)
  str.gsub('&', '&amp;').gsub('>', '&gt;').gsub('<', '&lt;')
end

decode_hdr(hdr)

Decode header lines (e.g., Subject:, From:) that have been encoded (RFC 2047).

# File buryspam.rb, line 3134
def decode_hdr(hdr)
  hdr.gsub(/#{ENC_WORD}(?:\s*\n\s+#{ENC_WORD})*/) { |enc_words|
    dec_words = ""
    enc_words.scan(/#{ENC_WORD}/) { ||
      dec_words << ($1.downcase == "q" ?
        $2.gsub("_", " ").unpack("M").first :
        $2.unpack("m").first)
    }
    # If all of the characters are printable, then use
    # the decoded version.  Otherwise, fallback to the
    # original encoded form.
    #
    #/\A[[:print:]]+\z/.match(dec_words) ? dec_words : enc_words
    dec_words
  }.force_encoding(Encoding::BINARY)
end
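
A simplified sketch of RFC 2047 encoded-word decoding. ENC_WORD_SKETCH is a hypothetical pattern (the real ENC_WORD is not shown here) with the encoding letter in group 1 and the encoded text in group 2. Unlike the listing, the sketch does not join adjacent encoded words that span folded lines.

```ruby
# Hypothetical stand-in for ENC_WORD: =?charset?encoding?text?=
ENC_WORD_SKETCH = /=\?[^?\s]+\?([bqBQ])\?([^?\s]*)\?=/

# Replace each encoded word with its decoded bytes.
def decode_words_sketch(hdr)
  hdr.gsub(ENC_WORD_SKETCH) do
    enc, text = $1, $2
    if enc.downcase == "q"
      # In the Q encoding, "_" represents a space.
      text.gsub("_", " ").unpack1("M")
    else
      text.unpack1("m")
    end
  end
end

puts decode_words_sketch("Subject: =?ISO-8859-1?Q?Hello_World?=")
puts decode_words_sketch("Subject: =?UTF-8?B?SGVsbG8=?=")
```

The charset is ignored and only ASCII payloads are shown; handling other charsets would need the force_encoding step that the listing applies.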

decode_multi_part(hdr, bdy)

Returns a decoded multipart message header and body.

# File buryspam.rb, line 3152
def decode_multi_part(hdr, bdy)
  boundary = Regexp.escape(@hdr_attr[:boundary])
  parts = bdy.split(/((?:\n|^)--#{boundary}(?:--)?(?:\n|$))/)
  return hdr, bdy if parts.empty?  # Should never happen

  # First part is stuff between header and first boundary
  # (may be empty)
  bdy = parts.shift

  until parts.empty?
    boundary = parts.shift
    contents = ""
    unless parts.empty?
      part = parts.shift
      if part.strip.empty? || part[0] == "\n"
        contents = part
      else
        msg = Message.new(part, :top_level => false)
        contents = msg.decode
      end
    end
    bdy << boundary + contents
  end
  return hdr, bdy
end
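
The boundary split relies on String#split keeping captured delimiters in its result, which is why the parts can be reassembled in order afterwards. A small sketch with a made-up boundary string; split_parts_sketch is a hypothetical helper.

```ruby
# Split a multipart body on its boundary lines, keeping the boundary
# lines themselves (the capture group) in the result.
def split_parts_sketch(bdy, boundary)
  b = Regexp.escape(boundary)
  bdy.split(/((?:\n|^)--#{b}(?:--)?(?:\n|$))/)
end

body = "preamble\n--XYZ\nContent-Type: text/plain\n\npart one\n--XYZ--\n"
parts = split_parts_sketch(body, "XYZ")
p parts
p parts.join == body
```

Elements alternate between content and boundary lines, starting with the preamble.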

decode_single_part(hdr, bdy)

Returns a decoded single part message header and body. Do not decode the message if the Content-Type is undecodable as determined by the configuration.

# File buryspam.rb, line 3192
def decode_single_part(hdr, bdy)
  return hdr, "<...>" if Config.undecodable =~ @hdr_attr[:content_type]
  enc = @hdr_attr[:content_transfer_encoding]
  if DECODERS.has_key?(enc)
    hdr = hdr.sub(/^(Content-Transfer-Encoding:) .*?#{enc}$/i,
              '\1 8bit')
    bdy = DECODERS[enc][bdy]
    bdy = bdy.force_encoding(Encoding::BINARY)
  end
  return hdr, decode_urls(bdy)
end

decode_urls(bdy)

Decode URLs present in a message body. %XX strings are replaced with their character equivalents.

# File buryspam.rb, line 3180
def decode_urls(bdy)
  return "" if bdy.nil?
  bdy.gsub(URI_RE) { |url|
    url.gsub(/%([A-F\d]{2})/) {
      $1.hex.chr(Encoding::BINARY)
    }
  }
end
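
The inner substitution can be tried on a single URL. decode_percent_sketch is a hypothetical helper; it uses the same pattern as the listing, which, as written, matches only upper-case hex digits.

```ruby
# Replace %XX escapes (upper-case hex only) with the corresponding byte.
def decode_percent_sketch(url)
  url.gsub(/%([A-F\d]{2})/) { $1.hex.chr(Encoding::BINARY) }
end

puts decode_percent_sketch("http://example.com/a%20b%2Fc")
puts decode_percent_sketch("http://example.com/a%2fb")  # lower-case escape is left alone
```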

get_good_bad(slot)

Return a list containing the good probability, good words, bad probability and bad words from the given slot.

# File buryspam.rb, line 3359
def get_good_bad(slot)
  dist, words = slot
  return NEUTRAL - dist.to_f, words[:good] || [],
         NEUTRAL + dist.to_f, words[:bad]  || []
end

get_samples(words)

Collect the good/bad words in the message that are furthest from neutrality. Return them in a list where each element is a sublist:

[[word, prob], [word, prob], ...]

The list will contain at most Config.num_word_samples elements.

# File buryspam.rb, line 3280
def get_samples(words)
  samples = []
  get_slots(words).each { |slot|
    num_remaining = Config.num_word_samples - samples.size

    good_prob, good_words, bad_prob, bad_words = get_good_bad(slot)

    # If the number of good/bad words will exceed the configured
    # number of word samples, then start subsampling.
    gws, bws = good_words.size, bad_words.size
    if gws + bws > num_remaining
      if good_prob == bad_prob || gws == 0
        # The only discriminating words left are either all
        # bad or neutral (i.e. 0.5 probability) -- pick enough
        # words to fill out the rest of the samples.
        bad_words = bad_words[0, num_remaining]
      elsif bws == 0
        # No bad words left, just pick out enough good words
        # to fill out the samples.
        good_words = good_words[0, num_remaining]
      else
        # Add a proportional number of good/bad words to the samples list.
        # Round the number of good words so as to reduce false positives.
        Logger.debug("Sampling proportional number of good/bad words")
        ng = (gws.to_f / (gws + bws) * num_remaining).round
        nb = num_remaining - ng
        good_words = good_words[0, ng]
        bad_words = bad_words[0, nb]
        add_hdrs("Subsampling: %s(%s):%s(%s) => %s:%s" %
          [bws, bad_prob, gws, good_prob, bad_words.size, good_words.size])
      end
    end

    good_words.each { |word|
      samples << [word, good_prob]
    }
    bad_words.each { |word|
      samples << [word, bad_prob]
    }
    break if samples.size >= Config.num_word_samples
  }
  samples.sort_by { |word, prob| prob }
end
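
The proportional subsampling arithmetic can be checked in isolation. proportional_split_sketch is a hypothetical helper that returns how many good and bad words would be kept from a slot.

```ruby
# Given gws good words and bws bad words in a slot, and num_remaining
# free sample positions, return [good_kept, bad_kept].
def proportional_split_sketch(gws, bws, num_remaining)
  # Rounding is applied to the good-word share; ties go to the good words.
  ng = (gws.to_f / (gws + bws) * num_remaining).round
  [ng, num_remaining - ng]
end

p proportional_split_sketch(6, 2, 4)
p proportional_split_sketch(1, 1, 3)
```

In the second call the exact share is 1.5 good words, and rounding up keeps two good words and one bad word, which matches the stated aim of reducing false positives.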

get_slots(words)

Given an array of words, return a sorted array with the list of words slotted according to their distance from neutrality. The good/bad words furthest from neutrality will be at the front of the list:

[
  [distance_from_neutral_prob1, {
      :good => [w1, w2, ...],
      :bad =>  [w1, w2, ...]
    }
  ],
  [distance_from_neutral_prob2, {
      :good => [w1, w2, ...],
      :bad  => [w1, w2, ...]
    }
  ],
  ...
]

# File buryspam.rb, line 3342
def get_slots(words)
  slots = {}
  words.each { |word|
    p1 = Bayesian.db[:word_probs][word] || Config.default_prob
    unless p1.class == Float
      Bayesian.db[:word_probs][word] = p1 = p1.to_f
    end
    gb = p1 < NEUTRAL ? :good : :bad
    d1 = "%.*f" % [Config.precision, (NEUTRAL - p1).abs]
    slots[d1] ||= {}
    (slots[d1][gb] ||= []) << word
  }
  slots.sort.reverse
end
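
A self-contained sketch of the slotting step, using a plain hash in place of Bayesian.db[:word_probs]. NEUTRAL_SKETCH (0.5), the default probability of 0.4 and the precision of 2 are all assumed values chosen for illustration, and slots_sketch is a hypothetical name.

```ruby
NEUTRAL_SKETCH = 0.5  # assumed value of NEUTRAL

# Group words by their formatted distance from neutral, furthest first.
def slots_sketch(words, probs, default_prob: 0.4, precision: 2)
  slots = {}
  words.each do |word|
    p1 = (probs[word] || default_prob).to_f
    side = p1 < NEUTRAL_SKETCH ? :good : :bad
    # The distance is formatted as a string so nearby values share a slot.
    dist = "%.*f" % [precision, (NEUTRAL_SKETCH - p1).abs]
    ((slots[dist] ||= {})[side] ||= []) << word
  end
  slots.sort.reverse
end

probs = { "viagra" => 0.99, "meeting" => 0.01, "hello" => 0.3 }
p slots_sketch(%w[viagra meeting hello unknown], probs)
```

Because the slot keys are strings of equal width, sorting them lexically gives the same order as sorting by numeric distance.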

is_multipart?()

Return a true value if the message is a multipart message.

# File buryspam.rb, line 3074
def is_multipart?
  return IS_MULTIPART.match(@hdr_attr[:content_type]) &&
         @hdr_attr[:boundary]
end

matching_octets(ip)

Given an IP address, determine the maximum number of leading octets it has in common with any blacklisted IP address.

# File buryspam.rb, line 3378
def matching_octets(ip)
  octets = ip.split(".")
  blacklisted = Bayesian.db[:blacklist]
  num_octs = octets.each_with_index { |oct, idx|
    o = oct.to_i
    break idx unless blacklisted.has_key?(o)
    blacklisted = blacklisted[o]
    break idx+1 if blacklisted.class != Hash
  }
  num_octs
end
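
The walk through the blacklist can be sketched with a small nested hash. The structure used here (hashes keyed by integer octets, ending in a non-hash leaf) is inferred from the listing, not confirmed by it, and matching_octets_sketch is a hypothetical name. The sketch returns the octet count if the loop runs to completion, which is an assumption about the intended result in that case.

```ruby
# Count how many leading octets of ip appear along a path in blacklist.
def matching_octets_sketch(ip, blacklist)
  node = blacklist
  octets = ip.split(".")
  octets.each_with_index do |oct, idx|
    o = oct.to_i
    return idx unless node.key?(o)        # path ends: idx octets matched
    node = node[o]
    return idx + 1 unless node.is_a?(Hash) # reached a leaf
  end
  octets.size
end

blacklist = { 212 => { 113 => { 174 => true } } }
p matching_octets_sketch("212.113.174.31", blacklist)
p matching_octets_sketch("212.113.9.9", blacklist)
p matching_octets_sketch("8.8.8.8", blacklist)
```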

parse_header()

Extract lines from the header that are required for decoding the message. In particular, Content-Type, Content-Transfer-Encoding and Content-Disposition fields are extracted and returned in a hash. For example:

Content-Type: application/msword; name=CONFIDENTIAL.doc
Content-Transfer-Encoding: base64
X-Attachment-Id: f_fl566abb
Content-Disposition: attachment; filename=CONFIDENTIAL.doc

is extracted as:

{
   :content_type              => "application/msword",
   :name                      => "CONFIDENTIAL.doc",
   :content_transfer_encoding => "base64",
   :filename                  => "CONFIDENTIAL.doc",
   :content_disposition       => "attachment"
}

# File buryspam.rb, line 3115
def parse_header
  hdr_fields = {}
  @unf_hdr.each_line { |line|
    next unless INTERESTING_FIELDS.match(line)
    field, body = $1.downcase, $'.strip
    field = field.gsub(/-/, "_").to_sym
    next if body.strip.empty?
    value, parameters = body.split(/\s*;\s*/, 2)
    hdr_fields[field] = value.downcase
    params = parse_parameters(parameters)
    params.each { |attr, val|
      hdr_fields[attr.to_sym] = val
    }
  }
  hdr_fields
end

parse_parameters(param_str)

Convert strings of the form

key1="val1";key2="val2"...

into an appropriate hash. Used by the parse_header method.

# File buryspam.rb, line 3082
def parse_parameters(param_str)
  if param_str.nil? || param_str.strip.empty?
    return {}
  end
  params_list = param_str.split(/\s*;\s*/)
  params = params_list.map { |param|
    next unless KEY_VAL.match(param)
    [$1.downcase, $2]
  }
  params.compact!
  Hash[*params.flatten]
end
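
A sketch of the parameter parsing with a stand-in pattern. KEY_VAL_SKETCH is hypothetical (the real KEY_VAL is not shown here); it captures the key in group 1 and the value, without optional double quotes, in group 2.

```ruby
# Hypothetical stand-in for KEY_VAL: key=value or key="value".
KEY_VAL_SKETCH = /\A([^=\s]+)\s*=\s*"?([^"]*)"?\z/

# Convert 'key1="val1"; key2=val2' into a hash with lower-cased keys.
def parse_parameters_sketch(param_str)
  return {} if param_str.nil? || param_str.strip.empty?
  pairs = param_str.split(/\s*;\s*/).map do |param|
    md = KEY_VAL_SKETCH.match(param.strip)
    md && [md[1].downcase, md[2]]
  end
  # Parameters that do not look like key=value are dropped.
  pairs.compact.to_h
end

p parse_parameters_sketch('name="CONFIDENTIAL.doc"; Charset=utf-8')
p parse_parameters_sketch(nil)
```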

split_msg(contents)

Split the message into its header and body (i.e., at the first double newline).

# File buryspam.rb, line 3065
def split_msg(contents)
  hdr, bdy = contents.split(/\n\n/, 2)
  # Append double newline to the header so that we can add the body
  # later to make a complete message.
  (hdr ||= "") << "\n\n"
  return hdr, bdy || ""
end
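
A short sketch showing the split and its behaviour when no separator is present. split_msg_sketch is a hypothetical, non-mutating variant of the method above.

```ruby
# Split contents at the first blank line; the header keeps a trailing
# double newline so that hdr + bdy reassembles the message.
def split_msg_sketch(contents)
  hdr, bdy = contents.split(/\n\n/, 2)
  [(hdr || "") + "\n\n", bdy || ""]
end

p split_msg_sketch("A: 1\nB: 2\n\nbody\n\nmore")
p split_msg_sketch("A: 1")
```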

test_bayesian_spam()

Determine if the message is spam by computing the Bayesian probability for the collected word sample probabilities. Returns true if the message tests positive for spam.

# File buryspam.rb, line 3207
def test_bayesian_spam
  bayesian_prob = Bayesian.value(@samples)
  add_hdrs("Bayesian-Value: #{"%g" % bayesian_prob}")
  Logger.debug("bayesian probability: #{bayesian_prob}")

  bayesian_prob > Config.spam_threshold
end

test_ip_octets_spam()

Use blacklisted IPs to determine if the message may have come from a spammer. Use the maximum number of octet matches to remove the good words and inject 'bad' words in the word sample list. Returns true if the message tests positive for spam, and false if the message was not deemed to be spam or if the octet testing feature was turned off in the configuration (as indicated by the test_octet_samples configuration parameter).

# File buryspam.rb, line 3222
def test_ip_octets_spam
  return false unless Config.test_octet_samples
  ip_octets = backlisted_octets
  if ip_octets.size <= 0
    Logger.debug("No blacklisted octets.")
    return false
  end

  ip_octets.inject(0) { |sum, ip_octs|
    sum + ip_octs.last
  }
  ip_oct_str = ip_octets.map { |ip, octs|
    "#{ip}(#{octs})"
  }.join(" ")

  Logger.debug("ips/(octets): #{ip_oct_str}")
  max_octs = ip_octets.collect { |ip, octs| octs }.max
  Logger.debug("max_octs: #{max_octs}")

  bad_sample = ["", Config.bad_prob]

  samples_max = (@samples + [bad_sample] * max_octs)[max_octs..-1]

  blacklisted_prob_max = Bayesian.value(samples_max)

  add_hdrs("Blacklisted-IP-Octets: #{ip_oct_str}",
     "Blacklisted-Value: #{"%g" % blacklisted_prob_max}")
  Logger.debug("blacklisted probability max: #{blacklisted_prob_max}")

  blacklisted_prob_max > Config.spam_threshold
end
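
The sample replacement step can be sketched on a small list. inject_bad_samples_sketch is a hypothetical helper, and the bad probability of 0.99 is an assumed value for Config.bad_prob. Since get_samples returns samples sorted by ascending probability, the entries dropped from the front are the best (lowest probability) words.

```ruby
# Drop the first max_octs samples and append max_octs placeholder
# samples carrying the bad probability, keeping the list length unchanged.
def inject_bad_samples_sketch(samples, max_octs, bad_prob)
  bad_sample = ["", bad_prob]
  (samples + [bad_sample] * max_octs)[max_octs..-1]
end

samples = [["meeting", 0.01], ["hello", 0.3], ["viagra", 0.99]]
p inject_bad_samples_sketch(samples, 2, 0.99)
```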

test_website_spam(msg_words)

Extract all the URLs in a message, get the URLs' webpage contents, add those words to the collection of words and redo the Bayesian test. Return true if the message is spam, or false if this feature is disabled (via the test_msg_urls boolean configuration parameter), if no web content could be retrieved, or if the message was deemed not to be spam.

# File buryspam.rb, line 3259
def test_website_spam(msg_words)
  return false unless Config.test_msg_urls
  Logger.debug("Visiting URLs...")
  web_page_words = visit_urls
  if web_page_words.empty?
    Logger.debug("No URLs or no web content available.")
    return false
  end
  @samples = get_samples(msg_words + web_page_words)
  wp = words_hdr
  Logger.debug("Selected word samples from message and website:\n#{wp}")
  add_hdrs("Message-Website-Words: #{wp}")

  Logger.debug("Retesting with web page words.")
  test_bayesian_spam || test_ip_octets_spam
end

words_hdr()

Convert the list of [word, probability] samples into a space-separated string of word(probability) entries suitable for use in a message header.

# File buryspam.rb, line 3392
def words_hdr
  @samples.map { |word, prob|
    "%s(%.*f)" % [word, Config.precision, prob]
  }.join(" ")
end