March 26, 2004 (Friday)

Regular Expressions (S&P — Chapters 7/8/9, cont'd)

Quantifiers

Other meta-characters can be used to represent quantifiers (or repetitions) of patterns. The three most common quantifiers are:

Quantifier   Meaning
*            Zero or more occurrences
+            One or more occurrences
?            Zero or one occurrence

For example, the fairly popular regular expression /.*/ represents any number (including zero) of characters (excluding newline, which is not, by default, matched by the dot meta-character). This regular expression can be used to match an arbitrary number of characters. For example, if you wish to match lines in a file that have the character sequences hello and world on the same line (and in that order), you can use the regular expression /hello.*world/. This regular expression would match strings such as:

  hello world
  hello there, world
  helloworld

The last string matches because .* allows for the possibility of no characters.
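The following short script is a minimal sketch of this behaviour. The helper name matches_hello_world and the sample strings are assumptions made for illustration; they are not part of the original notes.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: returns 1 if hello and world both occur
# on the same line, in that order, and 0 otherwise.
sub matches_hello_world {
	my ($line) = @_;
	return $line =~ /hello.*world/ ? 1 : 0;
}

# Illustrative sample strings (assumed for this sketch).
foreach my $line ("hello world", "hello there, world", "helloworld", "world hello") {
	printf "%-20s %s\n", $line, matches_hello_world($line) ? "matches" : "no match";
}
```

Only the last sample fails, because world appears before hello in it.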

The + quantifier will match one or more occurrences of the preceding regular expression. For example, the regular expression /hello +world/ would match any line that contains the string hello followed by at least one space followed by world.

The ? quantifier will match exactly zero or one occurrence of the previous regular expression. For example, the regular expression /hello world!?/ would match any string containing the phrase hello world or hello world!. The exclamation mark is optional. The ? quantifier is sometimes referred to as the optional quantifier because of this.
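To see what the + and ? quantifiers actually consume, the sketch below prints the text that each pattern matched. It wraps each pattern in parentheses only so that the matched text can be displayed (capturing is covered under grouping in these notes). The helper names and sample strings are assumptions made for illustration.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: returns the text matched by /hello +world/,
# or undef if the string does not match.
sub spaced_greeting {
	my ($string) = @_;
	return $string =~ /(hello +world)/ ? $1 : undef;
}

# Hypothetical helper: returns the text matched by /hello world!?/,
# or undef if the string does not match.
sub optional_bang {
	my ($string) = @_;
	return $string =~ /(hello world!?)/ ? $1 : undef;
}

foreach my $string ("hello world", "hello    world", "helloworld") {
	my $found = spaced_greeting($string);
	print "'$string': ", (defined($found) ? "matched '$found'" : "no match"), "\n";
}

foreach my $string ("hello world", "hello world!", "hello world?") {
	my $found = optional_bang($string);
	print "'$string': ", (defined($found) ? "matched '$found'" : "no match"), "\n";
}
```

Note that the last sample still matches: the pattern finds hello world and simply leaves the trailing question mark alone.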

Again, if you wish to match one of these quantifiers literally, use a backslash before them. For example, the regular expression: /\*.*\?.\+/ will match any string containing an asterisk followed by any number of characters, followed by a question mark followed by exactly one character followed by a plus sign.
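A small sketch can confirm how the escaped quantifiers behave. The helper name has_star_question_plus and the sample strings are assumptions made for illustration.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: returns 1 if the string contains a literal asterisk,
# then any number of characters, then a literal question mark, then exactly
# one character, then a literal plus sign.
sub has_star_question_plus {
	my ($string) = @_;
	return $string =~ /\*.*\?.\+/ ? 1 : 0;
}

# Illustrative sample strings (assumed for this sketch).
foreach my $string ("2 * 3 = ?6+", "*?x+", "*?+", "a+b?c*") {
	printf "%-12s %s\n", $string, has_star_question_plus($string) ? "matches" : "no match";
}
```

The third sample fails because the single + is used up by the dot, leaving nothing for the escaped plus sign; the fourth fails because the three characters appear in the wrong order.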

Grouping using parentheses

We can use parentheses to group regular expressions together. This serves two purposes. The first purpose is to change precedence within regular expressions. For example, as we saw above, the regular expression hello +world would match strings containing hello followed by one or more spaces followed by world. What if we wanted to match a string containing one or more hellos followed by a space followed by world (e.g. hellohellohello world)? An incorrect way to do this would be to use the regular expression hello+ world. This is not right because the + quantifier binds to the entity immediately to the left of it. In this regular expression, that entity is the letter o. Therefore, this regular expression would match all of helloooooooo world, but in hellohello world it would match only the final hello world and not the repeated hellos, which is what we wanted. To fix this we must group the entire sequence hello inside parentheses and then use the + quantifier: (hello)+ world.
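The difference between the two patterns is easiest to see by printing the text each one matched. The following is a minimal sketch; the helper name matched_text, the variable names and the sample strings are assumptions made for illustration.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: returns the part of $string matched by $re,
# or undef if there is no match. The outer parentheses capture the
# whole match as $1.
sub matched_text {
	my ($re, $string) = @_;
	return $string =~ /($re)/ ? $1 : undef;
}

my $without_group = qr/hello+ world/;    # + applies to the letter o only
my $with_group    = qr/(hello)+ world/;  # + applies to the whole word hello

foreach my $string ("helloooo world", "hellohello world") {
	foreach my $re ($without_group, $with_group) {
		my $found = matched_text($re, $string);
		print "$re on '$string': ", (defined($found) ? "matched '$found'" : "no match"), "\n";
	}
}
```

On hellohello world the ungrouped pattern reports only hello world, while the grouped pattern reports the whole string.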

For more details on the regular expression precedence, see page 111 of S&P.

The second, and very useful, reason for using parentheses in regular expressions is to extract (or remember) interesting parts of the string that matched a regular expression. This is especially useful when we use a regular expression match in list context, as we will see later on.
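As a brief preview of extraction in list context, the sketch below assigns the result of a match to a list of variables. The helper name split_short_date and the sample strings are assumptions made for illustration.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: splits a string such as "Tue Mar 23" into its parts.
# In list context a successful match returns the captured substrings;
# a failed match returns an empty list.
sub split_short_date {
	my ($string) = @_;
	my @parts = ($string =~ /(...) (...) (..)/);
	return @parts;
}

my ($day_name, $month, $day) = split_short_date("Tue Mar 23");
print "Day Name: $day_name, Month: $month, Day: $day\n";

my @nothing = split_short_date("nodatehere");
print "Fields found in the second string: ", scalar(@nothing), "\n";
```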

Alternation

Alternation allows us to specify a regular expression that can identify one of a number of possible patterns inside a string. For example, let's say we wanted to identify lines in a file that contained dates of the form:

Tue Mar 23 17:15:13 NST 2004

We can generalize this into the following regular expression (a line break has been introduced so that it fits on the page; ideally, the entire regular expression should be on a single line):

/(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|
May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (..) ..:..:.. (....)/

This regular expression uses the alternation symbol, |, in order to match the day of the week and the name of the month. Note that the parentheses around the days of the week and the month names are required because the precedence of the alternation symbol is very low. Had we omitted the parentheses, then the above regular expression would have been interpreted as (again, assume that the entire regular expression is contained on a single line):

/(Sun)|(Mon)|(Tue)|(Wed)|(Thu)|(Fri)|(Sat Jan)|(Feb)|(Mar)|(Apr)|
(May)|(Jun)|(Jul)|(Aug)|(Sep)|(Oct)|(Nov)|(Dec .. ..:..:.. ....)/

This regular expression would match any string containing Sun, or Mon, or Tue, and so on up to Fri, or Sat Jan, or Feb, and so on. This is clearly not what we intended.
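The effect of omitting the parentheses can be checked directly. In the sketch below, the helper name is_match, the variable names and the sample strings are assumptions made for illustration; the two patterns are the grouped and ungrouped forms discussed above, each written on a single line.

```perl
#!/usr/bin/perl -w

use strict;

# The grouped pattern: alternation is confined to the parentheses.
my $grouped = qr/(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (..) ..:..:.. (....)/;

# The same pattern with the parentheses removed: each | now splits
# the whole expression into independent alternatives.
my $ungrouped = qr/Sun|Mon|Tue|Wed|Thu|Fri|Sat Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec .. ..:..:.. ..../;

# Hypothetical helper: returns 1 if $string matches $re, 0 otherwise.
sub is_match {
	my ($re, $string) = @_;
	return $string =~ $re ? 1 : 0;
}

# Illustrative sample strings (assumed for this sketch).
foreach my $string ("Mon Mar 31 15:10:16 2004", "The Sun rises in Feb") {
	print "'$string': grouped=", is_match($grouped, $string),
	      " ungrouped=", is_match($ungrouped, $string), "\n";
}
```

The second sample is not a date at all, yet the ungrouped pattern accepts it because Sun on its own is one of its alternatives.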

Our original regular expression is rather long. The following Perl script demonstrates how we can build up the regular expression incrementally:

#!/usr/bin/perl -w

use strict;

my @days = qw/Sun Mon Tue Wed Thu Fri Sat/;
my @months = qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;

my $day_re = join "|", @days;
my $month_re = join "|", @months;

while (<>) {
	if (/($day_re) ($month_re) (..) ..:..:.. (....)/) {
		print "Day Name: $1, Month: $2, Day: $3, Year: $4\n";
	}
}
date.pl

This script demonstrates a trivial way to build an alternation regular expression using an array and Perl's join function. When the while loop starts, the value of the $day_re variable will be Sun|Mon|Tue|Wed|Thu|Fri|Sat and the $month_re variable will be Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec. We can then have Perl interpolate these variables directly into the regular expression.

By grouping these regular expressions, Perl will remember the parts of the string that matched the parts of the regular expressions that were enclosed in parentheses. For example, if the target string being matched against the regular expression was:

  Last modified: Mon Mar 31 15:10:16 2004

Then Perl would remember that:

  the day name was Mon
  the month was Mar
  the day of the month was 31
  the year was 2004

Perl will then store these remembered substrings in the special variables $1, $2, $3 and $4 respectively. We can then examine or use these variables inside the body of our conditional. These special variables are only set when a regular expression successfully matches; after a failed match they still hold whatever an earlier successful match left in them. Therefore, you should not use these variables unless you have first tested that the regular expression matched. Also, because these variables are overwritten on each successful match, introducing another regular expression match between your original match and your use of the special variables may cause unexpected results.
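One common way to guard against both problems is to test the match and then copy the special variables into named variables straight away. The following is a minimal sketch of that habit; the helper name day_name_and_later_dollar_one and the second match on the year are assumptions made purely to show $1 being overwritten.

```perl
#!/usr/bin/perl -w

use strict;

my $day_re = join "|", qw/Sun Mon Tue Wed Thu Fri Sat/;
my $month_re = join "|", qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;

# Hypothetical helper: returns the day name saved from the first match and
# the value that $1 holds after a second, unrelated match.
sub day_name_and_later_dollar_one {
	my ($line) = @_;

	# Only look at $1..$4 once the match is known to have succeeded.
	return unless $line =~ /($day_re) ($month_re) (..) ..:..:.. (....)/;

	# Copy the captures into named variables immediately.
	my ($day_name, $month, $day, $year) = ($1, $2, $3, $4);

	# A second successful match overwrites $1 (here with the first two
	# characters of the year), so $1 no longer holds the day name.
	if ($year =~ /(..)/) {
		return ($day_name, $1);
	}
	return ($day_name, undef);
}

my ($saved, $later) = day_name_and_later_dollar_one("Last modified: Mon Mar 31 15:10:16 2004");
print "Saved day name: $saved, \$1 after the second match: $later\n";
```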

The regular expression in the above program lets some invalid dates through. For example, if a non-numeric year is specified, the regular expression will still match it, just as long as four characters follow the time and a space. This is in fact what happens with the sample date shown at the start of this section: the pattern has no field for the time zone, so (....) captures NST and the space after it instead of the year. We can tighten up this regular expression to reject these non-date strings by using character classes.

Character Classes

A character class provides a way for a regular expression to match one of a collection of characters. Character classes are denoted by the meta-characters [ and ]. For example, to represent the lowercase vowels, you can use the character class [aeiou] inside a regular expression. To represent the digits, you can use the character class [0123456789]. Perl allows you to specify a range of characters using the hyphen inside a character class. For example, [A-Za-z] represents all alphabetic characters. If you wish to use a literal hyphen inside a character class, then escape it with a backslash or put the hyphen first in the character class. For example, [-az] and [a\-z] both represent the character class containing the three characters a, - and z. Note that most meta-characters lose their special meaning inside a character class; the exceptions are the backslash, the closing bracket ], the hyphen as described above, and ^ when it is the first character. Therefore the character class [?*+.] will match any one of those characters literally.

It is possible to negate the characters in a class by using the ^ symbol as the very first character inside a character class. For example, the class [^aeiou] will match any character that is not a lowercase vowel.

Perl has several predefined character classes for us to use. For example, the sequence \s represents the character class [ \t\r\n\f], that is, any whitespace character. The sequence \d represents any digit and the sequence \w represents any word character, which includes alphabetic characters (upper and lower case), digits and the underscore character. We can represent negations of these classes by capitalizing the letter after the backslash. For example, the character class \S can be used to represent any non-whitespace character. We can also use these character classes inside other character classes. For example, [\d ] will match either a digit or a space.
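The sketch below tries a few of these classes against single characters. The helper name classify_char, the labels it returns and the sample characters are assumptions made for illustration.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: labels a single character using character classes.
# The tests are tried in order, so the first class that matches wins.
sub classify_char {
	my ($char) = @_;
	return "lowercase vowel"     if $char =~ /[aeiou]/;
	return "digit"               if $char =~ /\d/;
	return "whitespace"          if $char =~ /\s/;
	return "literal ?, *, + or ." if $char =~ /[?*+.]/;
	return "hyphen or z"         if $char =~ /[-z]/;
	return "other";
}

# Illustrative sample characters (assumed for this sketch).
foreach my $char ("e", "E", "7", "\t", "+", "-", "q") {
	my $shown = $char eq "\t" ? "\\t" : $char;
	printf "%-3s %s\n", $shown, classify_char($char);
}
```

The uppercase E falls through to other because [aeiou] lists only the lowercase vowels.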

Returning to our date parsing script, we can use the digit character class to represent days of the month and the year:

#!/usr/bin/perl -w

use strict;

my @days = qw/Sun Mon Tue Wed Thu Fri Sat/;
my @months = qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;

my $day_re = join "|", @days;
my $month_re = join "|", @months;

while (<>) {
	if (/($day_re) ($month_re) ([\d ]\d) ..:..:.. (\d{4})/) {
		print "Day Name: $1, Month: $2, Day: $3, Year: $4\n";
	}
}
date2.pl

This script would now reject dates unless numeric days of the month and numeric years were specified. This script also demonstrates another, less commonly used quantifier, {}. In the context of the above script, this quantifier requires four digits for the year. This quantifier can also specify a range (e.g. \w{3,10} would match the first three to ten characters in a word; it would fail to match if the word contained fewer than three characters). The upper limit can be omitted: \d{3,} matches three or more digits. Omitting the lower limit is less portable: \d{,3} is treated as a quantifier meaning up to three digits only in Perl 5.34 and later, while earlier versions match a digit followed by the literal characters {,3}. Writing \d{0,3} matches up to three digits in a target string in any version.
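The counted quantifier can be explored with a few short matches. In the sketch below, the helper names first_match and show and the sample strings are assumptions made for illustration. The lower bound is written out explicitly as {0,3}, which behaves the same way across Perl versions.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: returns the first part of $string matched by $re,
# or undef if there is no match.
sub first_match {
	my ($re, $string) = @_;
	return $string =~ /($re)/ ? $1 : undef;
}

# Hypothetical helper: prints the result of one match attempt.
sub show {
	my ($re, $string) = @_;
	my $found = first_match($re, $string);
	print "$re on '$string': ", (defined($found) ? "matched '$found'" : "no match"), "\n";
}

show(qr/\d{4}/,    "Year 2004");         # exactly four digits
show(qr/\w{3,10}/, "abcdefghijklmnop");  # at most the first ten characters
show(qr/\w{3,10}/, "ab");                # too short, so no match
show(qr/\d{3,}/,   "12345");             # three or more digits
show(qr/\d{0,3}/,  "12345");             # at most three digits
```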


Last modified: March 27, 2004 00:04:20 NST (Saturday)