Represents a message from an mbox file. This class contains various methods for processing MIME messages with base64/quoted-printable parts. It also contains methods for extracting and analyzing the words and IP addresses used for Bayesian filtering of the message.
'Administrative' messages are messages that have been created by Pine inside an mbox. These messages should be ignored.
Map Content-Transfer-Encodings to their corresponding decoders. NOTE: String#unpack("M") isn't aggressive enough for QP decoding, so we borrow from elsewhere.
Regular expression used to help decode encoded headers (RFC 2047).
HTML header appended to colourized message output. See colourize.
HTML header used to prefix colourized message output. See colourize.
When visiting URLs in messages during filtering, some sites require a non-empty User-Agent header before they'll let you in.
Simple structure to store the html text and content type retrieved by visiting URIs in a message.
The header line added to all messages filtered by this script. (The X_HDR_PFX prefix will also be added.)
Strip other spam filter header lines which can mislead our filter. (Don't accidentally destroy the double newline at the end of the header; remove the newline from the beginning of each line instead.) Handles header lines that have been folded.
Regular expression to identify URIs in a message body.
The prefix to prepend to all header lines added by buryspam.
Extract the date/time from the message's postmark ('From ') line. Return the current time if we couldn't parse a time from the postmark.
# File buryspam.rb, line 2824
def self.extract_time(postmark)
  begin
    raise unless Mbox::POSTMARK_DATE.match(postmark)
    return Time.parse($1)
  rescue
    Status.warn("Bad time in message postmark:\n '#{postmark}'")
    # If parse failed or no date in postmark, then default to current time.
    Time.now
  end
end
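The parse-or-fall-back behaviour described above can be sketched in isolation. This is only an illustration: SKETCH_POSTMARK_DATE and sketch_extract_time are hypothetical names, and the pattern is an assumed stand-in for Mbox::POSTMARK_DATE, which is not shown on this page.

```ruby
require "time"

# Hypothetical pattern: "From <sender> <date>". The real
# Mbox::POSTMARK_DATE is not shown in this document.
SKETCH_POSTMARK_DATE = /\AFrom \S+\s+(.+)\z/

# Return the time found in the postmark, or the fallback when the
# postmark has no date or the date cannot be parsed.
def sketch_extract_time(postmark, fallback = Time.now)
  md = SKETCH_POSTMARK_DATE.match(postmark.to_s.strip)
  return fallback unless md
  Time.parse(md[1])
rescue ArgumentError
  # Time.parse raises ArgumentError when it finds no time information.
  fallback
end
```

Passing the fallback as a parameter makes the "current time" default easy to replace in tests.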
Create a new message object from the contents parameter, which is assumed to contain an RFC 2822 message in string format. If the opts hash parameter has :strip_buryspam_hdrs set to true, then spam-related X- header lines will be stripped from the message header.
# File buryspam.rb, line 2785
def initialize(contents, opts = {})
  # Assume we are processing a top-level message (i.e., not a message
  # subpart), unless opts[:top_level] is false.
  @params = { :top_level => true }.merge(opts)

  if @params[:top_level]
    # Note: @postmark is an instance variable because the Mbox class
    # uses it for logging purposes during filtering.
    @postmark = contents[/.*/]
    if Mbox::POSTMARK.match(@postmark)
      @time = Message.extract_time(@postmark)
    else
      raise "'From ' line invalid:\n\t'#{@postmark}'"
    end
  end

  if contents.strip.empty?
    raise "Empty message.\n#{contents.inspect}"
  end

  # Don't check for header/body separator
  # (i.e., contents[/\n\n/].nil?) because not
  # all messages have it!
  @hdr, @bdy = split_msg(contents)

  if @params[:strip_buryspam_hdrs]
    @hdr.gsub!(STRIP_HDRS, "")
  end
  @unf_hdr = @hdr.gsub(/\n\s+/, " ")
end
Add header lines to the message. Each argument is a header line ("hdr-field: hdr-value") to be added to the header. The X_HDR_PFX prefix is prepended to each header field. If verbose_hdr is false, then only let the X-Buryspam-Spam: header be added.
# File buryspam.rb, line 2946
def add_hdrs(*args)
  hdrs = args.map{ |hdr|
    hdr.strip!
    SPAM_HDR.match(hdr) || Config.verbose_hdr ? X_HDR_PFX + hdr : nil
  }.compact.join("\n")
  # Insert the new header lines before the double-newline.
  @hdr[@hdr.index("\n\n"), 0] = "\n#{hdrs}"
end
Return an HTML version of the message with words colourized according to their probability. Good words are green and bad words are red. The further away the word probabilities are from neutral, the more intense the colour.
# File buryspam.rb, line 2997
def colourize
  Logger.debug(postmark)
  Logger.debug(hdr_field('Subject'))
  # (re)run the spam test so that the probabilities listed in the
  # X-Buryspam-Words: line are consistent. This is a bit wasteful,
  # since it does a decode and the decoding is repeated below...
  # Perhaps we should cache the decoded messages?
  is_spam?
  html = ""
  # After decoding the message, fold lines that are too long, but allow line
  # breaks only between non-printable or white-space characters.
  contents = decode.gsub(/(.{1,80})([^[:print:]]|\s)/) do
    "#$1\n" + ($2 == "\n" ? "" : " #$2")
  end
  cache = {}
  # Create a new regex based upon the configuration's word_regex
  # that will capture the non-word elements, too.
  re = Config.word_regex
  nre = Regexp.new("(" + re.source + ")", re.options)
  max_dist = (0.5 - Config.bad_prob).abs
  contents.split(nre).each { |w|
    if cache.has_key?(w)
      html << cache[w]
      next
    end
    p = Bayesian.db[:word_probs].fetch(w, nil)
    # Don't colourize a word if its probability hasn't already been
    # converted to a float by the get_slots method during filtering.
    # This prevents colourizing words that were not in the original
    # message (e.g., new X-Buryspam-* header lines added by filter).
    if p.nil? || p.class != Float
      cache[w] = conv_html_chars(w)
      html << cache[w]
      next
    end
    pcnt = ((0.5 - p).abs / max_dist * 100).round
    style = "color: %s; background-color: rgb(%s%%, %s%%, 0%%)" %
            (p < 0.5 ? ["black", 0, pcnt] : ["white", pcnt, 0])
    if (p - Config.bad_prob).abs < 1e-6 || (p - Config.good_prob).abs < 1e-6
      style << "; text-decoration: blink"
    end
    cache[w] = %Q{<a href="#" title="#{w}(#{p})"><span style="#{style}">} +
               conv_html_chars(w) + "</span></a>"
    html << cache[w]
  }
  html
end
Returns the message with all quoted-printable and base64 parts and header lines decoded. This does not affect the underlying message itself (a copy of the message is created).
# File buryspam.rb, line 2925
def decode
  @hdr_attr = parse_header
  # Make copies of the header and body so that we do NOT modify
  # @hdr/@bdy members during decoding.
  hdr, bdy = decode_hdr(@hdr.dup), @bdy.dup
  hdr, bdy = is_multipart? ?
    decode_multi_part(hdr, bdy) : decode_single_part(hdr, bdy)
  msg = hdr + bdy
  if @params[:top_level]
    # Ensure message has at least two newlines at the end.
    nls = msg[/\n{0,2}\z/]
    msg << "\n" * (2 - nls.length) if nls.length < 2
  end
  msg
end
Return the HTML and Content-Type of the URI represented by uri_str.
# File buryspam.rb, line 2902
def get_uri(uri_str)
  # Suppress 'warning: using default DH parameters' when getting https://...
  v, $VERBOSE = $VERBOSE, nil
  # Try open-uri first because it handles redirection.
  uri = open(uri_str, HTTP_HDRS)
  return nil if uri.nil?
  return HttpResults.new(uri.binread, uri.content_type)
rescue RuntimeError
  # Fall-back to net/http on redirection loops...
  raise unless /^HTTP redirection loop: (.*)/.match($!.message)
  uri = URI.parse($1)
  req = Net::HTTP::Get.new(uri.path, HTTP_HDRS)
  res = Net::HTTP.start(uri.host, uri.port) { |http|
    http.request(req)
  }
  return HttpResults.new(res.body, res.content_type)
ensure
  $VERBOSE = v
end
If the message has a header line with the specified field, then return the entire (unfolded) line. Otherwise, return nil.
# File buryspam.rb, line 2957
def hdr_field(field)
  md = @unf_hdr.match(/^#{field}\s*:.*/)
  return nil if md.nil?
  return md[0]
end
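A standalone sketch of the same lookup, starting from a raw (still folded) header. The names sketch_unfold and sketch_hdr_field are hypothetical; the unfolding step mirrors how the constructor builds @unf_hdr. Unlike the method above, the sketch escapes the field name before interpolating it into the regular expression.

```ruby
# A newline followed by whitespace continues the previous header line,
# so replace it with a single space to get one line per field.
def sketch_unfold(hdr)
  hdr.gsub(/\n\s+/, " ")
end

# Return the whole unfolded line for the given field, or nil.
# The match is case-sensitive, as in the method above.
def sketch_hdr_field(hdr, field)
  md = sketch_unfold(hdr).match(/^#{Regexp.escape(field)}\s*:.*/)
  md && md[0]
end
```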
Extract all the IP addresses from the message's Received header lines and return them in a hash that maps each address to the number of times it occurs. Non-routable addresses are excluded.
{ "12.34.56.78" => 2, "98.76.54.32" => 1, ... }
# File buryspam.rb, line 2842
def ipaddrs
  ips = Hash.new(0)
  @unf_hdr.each_line { |line|
    # For efficiency, avoid using regexp.
    next unless line[0,RECEIVED_LEN].casecmp(RECEIVED) == 0
    line.scan(IP_ADDR) { ips[$1] += 1 }
  }
  ips.delete_if { |ip, count| ip.match(IP_NONROUTABLE) }
end
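The counting step can be tried outside the class with the sketch below. SKETCH_IP_ADDR and SKETCH_NONROUTABLE are assumed patterns written for this example only; the real IP_ADDR and IP_NONROUTABLE constants are not shown on this page and may differ.

```ruby
# Hypothetical patterns, not the ones defined in buryspam.rb.
SKETCH_IP_ADDR = /\b(\d{1,3}(?:\.\d{1,3}){3})\b/
SKETCH_NONROUTABLE = /\A(?:10\.|127\.|192\.168\.|172\.(?:1[6-9]|2\d|3[01])\.)/

# Count dotted-quad addresses on Received: lines of an unfolded header,
# then drop the addresses that look private or loopback.
def sketch_ipaddrs(unfolded_hdr)
  ips = Hash.new(0)
  unfolded_hdr.each_line do |line|
    next unless line[0, 9].casecmp("Received:") == 0
    line.scan(SKETCH_IP_ADDR) { |(ip)| ips[ip] += 1 }
  end
  ips.reject { |ip, _count| ip.match?(SKETCH_NONROUTABLE) }
end
```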
Return true if this message is the Pine administrative message.
# File buryspam.rb, line 2818
def is_admin?
  ADMIN_SUBJECT.match(@hdr) && ADMIN_BODY.match(@bdy)
end
Determine if the message is spam by extracting the interesting words from the message and performing Bayesian/ip-octet tests on it. Header lines are attached to the message as appropriate. Returns true if message is spam, false otherwise.
# File buryspam.rb, line 2977
def is_spam?
  Logger.debug("Extracting word samples...")
  # NOTE: 'words' is a method that returns a hash.
  w = words.keys
  @samples = get_samples(w)
  wp = words_hdr
  Logger.debug("Selected message word samples: #{wp}")
  add_hdrs("Words: #{wp}")
  spam = test_bayesian_spam || test_ip_octets_spam || test_website_spam(w)
  add_hdrs("Spam: " + (spam ? 'Yes' : 'No'))
  # No longer needed. Let it be freed.
  @samples = nil
  spam
end
Return the number of bytes in the (undecoded) message.
# File buryspam.rb, line 2969
def size
  to_s.size
end
Return the message in string form.
# File buryspam.rb, line 2964
def to_s
  @hdr + @bdy
end
Extract the URLs in the (decoded) message, retrieve their webpage contents and break them into their constituent words, according to Config.word_regex. Ignore links to content-types that are undecodable according to the Config. Return the list of words without duplicates.
# File buryspam.rb, line 2875
def visit_urls
  contents = ""
  visited = {}
  begin
    Timeout.timeout(Config.test_msg_urls_timeout) {
      decode.scan(URI_RE) { |uri_str|
        next if visited.include?(uri_str)
        visited[uri_str] = true
        Logger.debug("URI: #{uri_str}")
        result = get_uri(uri_str)
        if result.nil?
          Logger.warn("Cannot open URI: '#{uri_str}'")
        elsif Config.undecodable =~ result.content_type
          Logger.debug("Ignoring '#{result.content_type}' uri.")
        else
          Logger.debug("Read #{result.html.size} bytes")
          contents << result.html
        end
      }
    }
  rescue Exception
    Logger.warn($!.message)
  end
  contents.scan(Config.word_regex).uniq
end
Extract all the words from the decoded message that match the word_regex configuration parameter and return them in a hash that maps each word to its number of occurrences.
{ "Hello" => 1, "VIAGRA" => 5, "$500.00" => 3, ... }
# File buryspam.rb, line 2862
def words
  wrds = Hash.new(0)
  decode.scan(Config.word_regex) { |wrd| wrds[wrd] += 1 }
  wrds
end
Returns a list containing the IP addresses that have blacklisted octets and the number of octets that were blacklisted. e.g.,
[["212.113.174.31", 3], ["10.137.130.49", 3]]
# File buryspam.rb, line 3368
def backlisted_octets
  bad_ips = ipaddrs.keys - Bayesian.db[:whitelist]
  bad_ips.map { |ip|
    num_octets = matching_octets(ip)
    num_octets.zero? ? nil : [ip, num_octets]
  }.compact
end
Convert special HTML markup characters to their corresponding entities.
# File buryspam.rb, line 3059
def conv_html_chars(str)
  str.gsub('&', '&amp;').gsub('>', '&gt;').gsub('<', '&lt;')
end
Decode header lines (e.g., Subject:, From:) that have been encoded (RFC 2047).
# File buryspam.rb, line 3134
def decode_hdr(hdr)
  hdr.gsub(/#{ENC_WORD}(?:\s*\n\s+#{ENC_WORD})*/) { |enc_words|
    dec_words = ""
    enc_words.scan(/#{ENC_WORD}/) { ||
      dec_words << ($1.downcase == "q" ?
                    $2.gsub("_", " ").unpack("M").first :
                    $2.unpack("m").first)
    }
    # If all of the characters are printable, then use
    # the decoded version. Otherwise, fallback to the
    # original encoded form.
    #
    #/\A[[:print:]]+\z/.match(dec_words) ? dec_words : enc_words
    dec_words
  }.force_encoding(Encoding::BINARY)
end
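To see the RFC 2047 decoding step on its own, here is a minimal sketch. SKETCH_ENC_WORD is an assumed pattern (the real ENC_WORD constant is not shown on this page), and the sketch decodes each encoded-word separately rather than joining adjacent ones across folded lines as the method above does.

```ruby
# Hypothetical pattern for an encoded-word: =?charset?Q|B?payload?=
# Group 1 is the encoding letter, group 2 the encoded payload.
SKETCH_ENC_WORD = /=\?[^?\s]+\?([QqBb])\?([^?\s]*)\?=/

def sketch_decode_hdr(hdr)
  hdr.gsub(SKETCH_ENC_WORD) do
    # Copy the captures first; later regexp calls would overwrite $1/$2.
    kind, payload = $1, $2
    if kind.downcase == "q"
      # In the Q encoding an underscore stands for a space.
      payload.gsub("_", " ").unpack("M").first
    else
      payload.unpack("m").first
    end
  end
end
```

The charset is ignored here, so the result should be treated as raw bytes, which is consistent with the method above forcing a binary encoding.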
Returns a decoded multipart message header and body.
# File buryspam.rb, line 3152
def decode_multi_part(hdr, bdy)
  boundary = Regexp.escape(@hdr_attr[:boundary])
  parts = bdy.split(/((?:\n|^)--#{boundary}(?:--)?(?:\n|$))/)
  return hdr, bdy if parts.empty? # Should never happen
  # First part is stuff between header and first boundary
  # (may be empty)
  bdy = parts.shift
  until parts.empty?
    boundary = parts.shift
    contents = ""
    unless parts.empty?
      part = parts.shift
      if part.strip.empty? || part[0] == "\n"
        contents = part
      else
        msg = Message.new(part, :top_level => false)
        contents = msg.decode
      end
    end
    bdy << boundary + contents
  end
  return hdr, bdy
end
Returns a decoded single part message header and body. Do not decode the message if the Content-Type is undecodable as determined by the configuration.
# File buryspam.rb, line 3192
def decode_single_part(hdr, bdy)
  return hdr, "<...>" if Config.undecodable =~ @hdr_attr[:content_type]
  enc = @hdr_attr[:content_transfer_encoding]
  if DECODERS.has_key?(enc)
    hdr = hdr.sub(/^(Content-Transfer-Encoding:) .*?#{enc}$/i, '\1 8bit')
    bdy = DECODERS[enc][bdy]
    bdy = bdy.force_encoding(Encoding::BINARY)
  end
  return hdr, decode_urls(bdy)
end
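The DECODERS[enc][bdy] call suggests a hash of callables keyed by transfer encoding. The sketch below illustrates that lookup under this assumption; SKETCH_DECODERS and sketch_decode_body are hypothetical names. The quoted-printable entry uses String#unpack("M") purely for brevity, although the constant's description says the real decoder is more aggressive than that.

```ruby
# Hypothetical stand-in for DECODERS: each value is a callable, so
# SKETCH_DECODERS[enc][body] (Proc#[]) runs the decoder.
SKETCH_DECODERS = {
  "base64"           => lambda { |s| s.unpack("m").first },
  "quoted-printable" => lambda { |s| s.unpack("M").first },
}

# Decode the body when the encoding is known; otherwise return it as is.
def sketch_decode_body(enc, bdy)
  decoder = SKETCH_DECODERS[enc.to_s.downcase]
  return bdy unless decoder
  decoder[bdy].force_encoding(Encoding::BINARY)
end
```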
Decode URLs present in a message body. %XX strings are replaced with their character equivalents.
# File buryspam.rb, line 3180
def decode_urls(bdy)
  return "" if bdy.nil?
  bdy.gsub(URI_RE) { |url|
    url.gsub(/%([A-F\d]{2})/) { $1.hex.chr(Encoding::BINARY) }
  }
end
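The inner substitution can be exercised on a single URL as follows. sketch_percent_decode is a hypothetical helper; unlike the method above it also accepts lowercase hex digits, which is a deliberate difference and not the documented behaviour.

```ruby
# Replace each %XX escape with the byte it encodes.
# Assumption: lowercase hex digits (e.g. %2f) are accepted too.
def sketch_percent_decode(url)
  url.gsub(/%([A-Fa-f\d]{2})/) { $1.hex.chr(Encoding::BINARY) }
end
```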
Return a list containing the good probability, good words, bad probability and bad words from the given slot.
# File buryspam.rb, line 3359
def get_good_bad(slot)
  dist, words = slot
  return NEUTRAL - dist.to_f, words[:good] || [],
         NEUTRAL + dist.to_f, words[:bad] || []
end
Collect the good/bad words in the message that are furthest from neutrality. Return them in a list where each element is a sublist:
[[word, prob], [word, prob], ...]
The list will contain Config.num_word_samples elements.
# File buryspam.rb, line 3280
def get_samples(words)
  samples = []
  get_slots(words).each { |slot|
    num_remaining = Config.num_word_samples - samples.size
    good_prob, good_words, bad_prob, bad_words = get_good_bad(slot)
    # If the number of good/bad words will exceed the configured
    # number of word samples, then start subsampling.
    gws, bws = good_words.size, bad_words.size
    if gws + bws > num_remaining
      if good_prob == bad_prob || gws == 0
        # The only discriminating words left are either all
        # bad or neutral (i.e. 0.5 probability) -- pick enough
        # words to fill out the rest of the samples.
        bad_words = bad_words[0, num_remaining]
      elsif bws == 0
        # No bad words left, just pick out enough good words
        # to fill out the samples.
        good_words = good_words[0, num_remaining]
      else
        # Add a proportional number of good/bad words to the samples list.
        # Round the number of good words so as to reduce false positives.
        Logger.debug("Sampling proportional number of good/bad words")
        ng = (gws.to_f / (gws + bws) * num_remaining).round
        nb = num_remaining - ng
        good_words = good_words[0, ng]
        bad_words = bad_words[0, nb]
        add_hdrs("Subsampling: %s(%s):%s(%s) => %s:%s" %
                 [bws, bad_prob, gws, good_prob,
                  bad_words.size, good_words.size])
      end
    end
    good_words.each { |word| samples << [word, good_prob] }
    bad_words.each { |word| samples << [word, bad_prob] }
    break if samples.size >= Config.num_word_samples
  }
  samples.sort_by { |word, prob| prob }
end
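The proportional subsampling branch reduces to a small piece of arithmetic, sketched here with a hypothetical helper name. It only reproduces the ng/nb computation from the method above, for the case where a slot holds both good and bad words and more of them than there are sample positions left.

```ruby
# Split the remaining sample positions between good and bad words in
# proportion to how many of each the slot contains. The good share is
# rounded and the bad words receive whatever is left over.
def sketch_split_remaining(gws, bws, num_remaining)
  ng = (gws.to_f / (gws + bws) * num_remaining).round
  [ng, num_remaining - ng]
end
```

For example, a slot with 6 good and 4 bad words and 5 positions left yields 3 good and 2 bad samples.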
Given an array of words, return a sorted array with the list of words slotted according to their distance from neutrality. The good/bad words furthest from neutrality will be at the front of the list:
[
  [distance_from_neutral_prob1, { :good => [w1, w2, ...], :bad => [w1, w2, ...] } ],
  [distance_from_neutral_prob2, { :good => [w1, w2, ...], :bad => [w1, w2, ...] } ],
  ...
]
# File buryspam.rb, line 3342
def get_slots(words)
  slots = {}
  words.each { |word|
    p1 = Bayesian.db[:word_probs][word] || Config.default_prob
    unless p1.class == Float
      Bayesian.db[:word_probs][word] = p1 = p1.to_f
    end
    gb = p1 < NEUTRAL ? :good : :bad
    d1 = "%.*f" % [Config.precision, (NEUTRAL - p1).abs]
    slots[d1] ||= {}
    (slots[d1][gb] ||= []) << word
  }
  slots.sort.reverse
end
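A self-contained sketch of the slotting step follows. The probability table is passed in as a plain hash, and the default_prob and precision defaults (0.4 and 2) are assumptions chosen for the example, not the values from Config; SKETCH_NEUTRAL likewise assumes that NEUTRAL is 0.5.

```ruby
SKETCH_NEUTRAL = 0.5  # assumed value of NEUTRAL

# Group words by their distance from the neutral probability, formatted
# to a fixed precision, and return the slots with the largest distance
# first. default_prob and precision are illustrative defaults.
def sketch_get_slots(words, probs, default_prob: 0.4, precision: 2)
  slots = {}
  words.each do |word|
    p1 = (probs[word] || default_prob).to_f
    side = p1 < SKETCH_NEUTRAL ? :good : :bad
    dist = "%.*f" % [precision, (SKETCH_NEUTRAL - p1).abs]
    ((slots[dist] ||= {})[side] ||= []) << word
  end
  # The keys are fixed-width strings, so string order is numeric order.
  slots.sort.reverse
end
```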
Return a true value if the message is a multipart message.
# File buryspam.rb, line 3074
def is_multipart?
  return IS_MULTIPART.match(@hdr_attr[:content_type]) &&
         @hdr_attr[:boundary]
end
Given an IP address, determine the maximum number of octets it has in common with any blacklisted IP address.
# File buryspam.rb, line 3378
def matching_octets(ip)
  octets = ip.split(".")
  blacklisted = Bayesian.db[:blacklist]
  num_octs = octets.each_with_index { |oct, idx|
    o = oct.to_i
    break idx unless blacklisted.has_key?(o)
    blacklisted = blacklisted[o]
    break idx+1 if blacklisted.class != Hash
  }
  num_octs
end
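The lookup walks a tree of nested hashes keyed by integer octets. The sketch below assumes that shape for the blacklist (inferred from the has_key? and class checks above, not from the database format itself) and uses a hypothetical helper name; any non-Hash value is treated as the end of a blacklisted prefix.

```ruby
# Count how many leading octets of ip follow a path in the nested
# blacklist hash. Assumed shape: { 212 => { 113 => { 174 => true } } }.
def sketch_matching_octets(ip, blacklist)
  node = blacklist
  count = 0
  ip.split(".").each do |oct|
    key = oct.to_i
    break unless node.is_a?(Hash) && node.key?(key)
    node = node[key]
    count += 1
  end
  count
end
```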
Extract lines from the header that are required for decoding the message. In particular, Content-Type, Content-Transfer-Encoding and Content-Disposition fields are extracted and returned in a hash. For example:
Content-Type: application/msword; name=CONFIDENTIAL.doc
Content-Transfer-Encoding: base64
X-Attachment-Id: f_fl566abb
Content-Disposition: attachment; filename=CONFIDENTIAL.doc
is extracted as:
{ :content_type => "application/msword",
  :name => "CONFIDENTIAL.doc",
  :content_transfer_encoding => "base64",
  :filename => "CONFIDENTIAL.doc",
  :content_disposition => "attachment" }
# File buryspam.rb, line 3115
def parse_header
  hdr_fields = {}
  @unf_hdr.each_line { |line|
    next unless INTERESTING_FIELDS.match(line)
    field, body = $1.downcase, $'.strip
    field = field.gsub(/-/, "_").to_sym
    next if body.strip.empty?
    value, parameters = body.split(/\s*;\s*/, 2)
    hdr_fields[field] = value.downcase
    params = parse_parameters(parameters)
    params.each { |attr, val|
      hdr_fields[attr.to_sym] = val
    }
  }
  hdr_fields
end
Convert strings of the form key1="val1";key2="val2"... into an appropriate hash. Used by the parse_header method.
# File buryspam.rb, line 3082
def parse_parameters(param_str)
  if param_str.nil? || param_str.strip.empty?
    return {}
  end
  params_list = param_str.split(/\s*;\s*/)
  params = params_list.map { |param|
    next unless KEY_VAL.match(param)
    [$1.downcase, $2]
  }
  params.compact!
  Hash[*params.flatten]
end
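A minimal sketch of the same conversion is shown below. SKETCH_KEY_VAL is an assumed pattern standing in for KEY_VAL, which is not shown on this page, and sketch_parse_parameters is a hypothetical name. Like the method above, it splits naively on semicolons, so a quoted value that itself contains a semicolon would be cut in two.

```ruby
# Hypothetical pattern for key=value or key="value".
SKETCH_KEY_VAL = /\A\s*([^=\s]+)\s*=\s*"?([^"]*)"?\s*\z/

# Turn 'name="a.doc"; charset=utf-8' into a hash with lowercase keys.
# Parameters that do not look like key=value are skipped.
def sketch_parse_parameters(param_str)
  return {} if param_str.nil? || param_str.strip.empty?
  param_str.split(/\s*;\s*/).each_with_object({}) do |param, params|
    md = SKETCH_KEY_VAL.match(param)
    params[md[1].downcase] = md[2] if md
  end
end
```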
Split the message into its header and body (i.e., at the first double newline).
# File buryspam.rb, line 3065
def split_msg(contents)
  hdr, bdy = contents.split(/\n\n/, 2)
  # Append double newline to the header so that we can add the body
  # later to make a complete message.
  (hdr ||= "") << "\n\n"
  return hdr, bdy || ""
end
Determine if the message is spam by computing the Bayesian probability for the collected word sample probabilities. Returns true if message tests positive for spam.
# File buryspam.rb, line 3207
def test_bayesian_spam
  bayesian_prob = Bayesian.value(@samples)
  add_hdrs("Bayesian-Value: #{"%g" % bayesian_prob}")
  Logger.debug("bayesian probability: #{bayesian_prob}")
  bayesian_prob > Config.spam_threshold
end
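Bayesian.value is not shown on this page, so the following is only an assumption about what such a combining step can look like: a common way to merge per-word probabilities, popularised by Paul Graham's "A Plan for Spam", is prod(p) / (prod(p) + prod(1 - p)). The helper name is hypothetical and the real implementation may differ.

```ruby
# Combine [word, probability] samples into one spam probability using
# prod(p) / (prod(p) + prod(1 - p)). This is an illustrative formula,
# not necessarily the one used by Bayesian.value.
def sketch_combined_prob(samples)
  probs = samples.map { |_word, prob| prob.to_f }
  spam = probs.reduce(1.0) { |acc, p| acc * p }
  ham  = probs.reduce(1.0) { |acc, p| acc * (1.0 - p) }
  spam / (spam + ham)
end
```

With this formula, neutral words (0.5) leave the result unchanged, while a few words near 0 or 1 dominate it.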
Use blacklisted IPs to determine if the message may have come from a spammer. Use the maximum number of octet matches to remove the good words and inject 'bad' words in the word sample list. Returns true if message tests positive for spam and false if the message was not deemed to be spam or if the octet testing feature was turned off in the configuration (as indicated by the test_octet_samples configuration parameter).
# File buryspam.rb, line 3222
def test_ip_octets_spam
  return false unless Config.test_octet_samples
  ip_octets = backlisted_octets
  if ip_octets.size <= 0
    Logger.debug("No blacklisted octets.")
    return false
  end
  ip_octets.inject(0) { |sum, ip_octs| sum + ip_octs.last }
  ip_oct_str = ip_octets.map { |ip, octs| "#{ip}(#{octs})" }.join(" ")
  Logger.debug("ips/(octets): #{ip_oct_str}")
  max_octs = ip_octets.collect { |ip, octs| octs }.max
  Logger.debug("max_octs: #{max_octs}")
  bad_sample = ["", Config.bad_prob]
  samples_max = (@samples + [bad_sample] * max_octs)[max_octs..-1]
  blacklisted_prob_max = Bayesian.value(samples_max)
  add_hdrs("Blacklisted-IP-Octets: #{ip_oct_str}",
           "Blacklisted-Value: #{"%g" % blacklisted_prob_max}")
  Logger.debug("blacklisted probability max: #{blacklisted_prob_max}")
  blacklisted_prob_max > Config.spam_threshold
end
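The sample replacement relies on the samples being sorted by ascending probability (get_samples sorts them that way), so slicing from max_octs drops the "best" words first. A small sketch of that single expression, with a hypothetical helper name:

```ruby
# Drop the max_octs lowest-probability samples from the front of the
# sorted list and append the same number of empty-word samples carrying
# bad_prob, so the list keeps its length.
def sketch_inject_bad(samples, max_octs, bad_prob)
  (samples + [["", bad_prob]] * max_octs)[max_octs..-1]
end
```

For example, with two matching octets the two lowest-probability samples are replaced by two bad ones.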
Extract all the URLs in a message, get the URLs' webpage contents, add those words to the collection of words and re-do the bayesian test. Return true if the message is spam or false if this feature (denoted by the test_msg_urls boolean configuration parameter) is false or if the message was deemed not to be spam.
# File buryspam.rb, line 3259
def test_website_spam(msg_words)
  return false unless Config.test_msg_urls
  Logger.debug("Visiting URLs...")
  web_page_words = visit_urls
  if web_page_words.empty?
    Logger.debug("No URLs or no web content available.")
    return false
  end
  @samples = get_samples(msg_words + web_page_words)
  wp = words_hdr
  Logger.debug("Selected word samples from message and website:\n#{wp}")
  add_hdrs("Message-Website-Words: #{wp}")
  Logger.debug("Retesting with web page words.")
  test_bayesian_spam || test_ip_octets_spam
end
Convert the list of [word, probability] samples into a space-separated string of words/probabilities suitable for use in a message header.
# File buryspam.rb, line 3392
def words_hdr
  @samples.map { |word, prob|
    "%s(%.*f)" % [word, Config.precision, prob]
  }.join(" ")
end