March 24 (Wednesday) March 29 (Monday)
Other meta-characters can be used to represent quantifiers (or repetitions) of patterns. The three most common quantifiers are
Quantifier | Meaning |
---|---|
* | Zero or more occurrences |
+ | One or more occurrences |
? | Zero or one occurrence |
For example, the fairly popular regular expression /.*/ represents any
number (including zero) of characters (excluding newline, which is not,
by default, matched by the dot meta-character). This regular expression
can be used to match an arbitrary number of characters. For example, if
you wish to match lines in a file that have the character sequences
hello and world on the same line (and in that order), you can use the
regular expression /hello.*world/. This regular expression would match
the strings:
hello, world
This line has "hello" and "world"
helloworld
The last string matches because .* allows for the possibility of no
characters.
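The match above can be tried out with a minimal sketch along these lines. The array @lines is a hypothetical stand-in for lines read from a file; it holds the three sample strings plus one extra line, added here as an assumption, with the two words in the wrong order:

```perl
#!/usr/bin/perl -w
use strict;

# Sample lines; the last one has the words in the wrong order.
my @lines = (
    'hello, world',
    'This line has "hello" and "world"',
    'helloworld',
    'world, hello',
);

# Keep only the lines where hello is later followed by world.
my @matched = grep { /hello.*world/ } @lines;

print "$_\n" for @matched;
```

Only the first three lines should be printed, since .* cannot make world appear before hello.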
The + quantifier will match one or more occurrences of the preceding
regular expression. For example, the regular expression /hello +world/
would match any line that had the string hello followed by at least one
space followed by world.
The ? quantifier will match exactly zero or one occurrence of the
preceding regular expression. For example, the regular expression
/hello world!?/ would match any string containing the phrase
hello world or hello world!. The exclamation mark is optional. The ?
quantifier is sometimes referred to as the optional quantifier because
of this.
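As a rough illustration of these two quantifiers, the sketch below runs them against a few made-up strings. The sample strings and variable names are assumptions for this example, and the parentheses are there only to capture the matched text so that it can be printed:

```perl
#!/usr/bin/perl -w
use strict;

# + needs at least one space between the two words.
my @samples   = ('hello world', 'hello     world', 'helloworld');
my @plus_hits = grep { /hello +world/ } @samples;

# ? makes the exclamation mark optional. The parentheses only capture
# the matched text so that it can be inspected.
my ($with)    = 'Well, hello world!' =~ /(hello world!?)/;
my ($without) = 'Well, hello world.' =~ /(hello world!?)/;

print "matched by +: $_\n" for @plus_hits;
print "with exclamation mark:    $with\n";
print "without exclamation mark: $without\n";
```

The string helloworld is not matched by the first pattern because + requires at least one space, while the second pattern matches with or without the exclamation mark.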
Again, if you wish to match one of these quantifiers literally, use a
backslash before it. For example, the regular expression /\*.*\?.\+/
will match any string containing an asterisk followed by any number of
characters, followed by a question mark, followed by exactly one
character, followed by a plus sign.
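The escaped pattern can be checked with a small sketch like the following. The sample strings are invented for illustration, and the pattern is stored with Perl's qr// operator purely as a convenience:

```perl
#!/usr/bin/perl -w
use strict;

# A literal *, any characters, a literal ?, exactly one character,
# then a literal +.
my $re = qr/\*.*\?.\+/;

my @samples = ('* then ?x+', '*?x+', '*?+', 'no metacharacters here');
my @hits    = grep { $_ =~ $re } @samples;

print "$_\n" for @hits;
```

The third sample fails because the unescaped dot needs one character between the question mark and the plus sign.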
We can use parentheses to group regular expressions together. This
serves two purposes. The first purpose is to change precedence within
regular expressions. For example, as we saw above, the regular
expression /hello +world/ would match strings containing hello followed
by one or more spaces followed by world. What if we wanted to match a
string containing one or more hellos followed by a space followed by
world (e.g. hellohellohello world)?
An incorrect way to do this would be to use the regular expression
/hello+ world/. This is not right because the + quantifier binds to the
entity immediately to the left of it. In this regular expression, that
entity is the letter o. Therefore, this regular expression would match
all of helloooooooo world, but in hellohello world it would match only
the trailing hello world rather than the whole repeated sequence, which
is what we wanted.
To fix this we must group the entire sequence hello inside parentheses
and then use the + quantifier: /(hello)+ world/.
For more details on the regular expression precedence, see page 111 of S&P.
The second, and very useful, reason for using parentheses in regular expressions is to extract (or remember) interesting parts of the string that matched a regular expression. This is especially useful when we use a regular expression match in list context, as we will see later on.
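Both uses of parentheses can be seen in a short sketch. It assumes a made-up target string and wraps each pattern in an extra pair of parentheses so that the text matched by the whole pattern is returned by the match in list context:

```perl
#!/usr/bin/perl -w
use strict;

my $s = 'hellohello world';

# The outer parentheses capture whatever the whole pattern matched.
# In list context the match returns the captured substrings, and the
# first of them is assigned to the variable on the left.
my ($ungrouped) = $s =~ /(hello+ world)/;
my ($grouped)   = $s =~ /((hello)+ world)/;

print "hello+ world   matched: $ungrouped\n";
print "(hello)+ world matched: $grouped\n";
```

Without grouping, the + applies only to the letter o, so the match covers just the final hello world. With grouping, the match covers both repetitions of hello.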
Alternation allows us to specify a regular expression that can identify one of a number of possible patterns inside a string. For example, let's say we wanted to identify lines in a file that contained dates of the form:
Tue Mar 23 17:15:13 2004
We can generalize this into the following regular expression (a line break was introduced so that it will fit on the page. Ideally, the entire regular expression should be on a single line):
/(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|
May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (..) ..:..:.. (....)/
This regular expression uses the alternation symbol, |, in order to
match the day of the week and the name of the month. Note that the
parentheses around the days of the week and the month names are required
because the precedence of the alternation symbol is very low. Had we
omitted the parentheses, then the above regular expression would have
been interpreted as (again, assume that the entire regular expression is
contained on a single line):
/(Sun)|(Mon)|(Tue)|(Wed)|(Thu)|(Fri)|(Sat Jan)|(Feb)|(Mar)|(Apr)|
(May)|(Jun)|(Jul)|(Aug)|(Sep)|(Oct)|(Nov)|(Dec .. ..:..:.. ....)/
This regular expression would match any string containing Sun or Mon or
Tue, and so on, or Sat Jan or Feb, etc. This is clearly not what we
intended.
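The effect of the low precedence can be seen with a cut-down sketch. To keep it short, it assumes only two day names and two month names, and the target line is made up:

```perl
#!/usr/bin/perl -w
use strict;

# A line that mentions a month name but contains no date.
my $line = 'Feb is the shortest month';

# Shortened versions of the grouped and ungrouped patterns.
my $grouped   = ($line =~ /(Sun|Mon) (Jan|Feb)/) ? 1 : 0;
my $ungrouped = ($line =~ /Sun|Mon Jan|Feb/)     ? 1 : 0;

print "grouped: $grouped, ungrouped: $ungrouped\n";
```

The ungrouped pattern matches because Feb on its own is one of its alternatives, whereas the grouped pattern requires a day name, a space and then a month name.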
Our original regular expression is rather long. The following Perl script demonstrates how we can build up the regular expression incrementally:
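#!/usr/bin/perl -w
use strict;

# Day and month abbreviations as they appear in the date string.
my @days = qw/Sun Mon Tue Wed Thu Fri Sat/;
my @months = qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;

# Join each list with "|" to form the body of an alternation.
my $day_re = join "|", @days;
my $month_re = join "|", @months;

# Print the captured fields of every input line that contains a date.
while (<>) {
    if (/($day_re) ($month_re) (..) ..:..:.. (....)/) {
        print "Day Name: $1, Month $2, Day: $3, Year $4\n";
    }
}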
date.pl
This script demonstrates a simple way to build an alternation regular
expression using an array and Perl's join function. When the while loop
starts, the value of the $day_re variable will be
Sun|Mon|Tue|Wed|Thu|Fri|Sat and the value of the $month_re variable will
be Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec. We can then have
Perl interpolate these variables directly into the regular expression.
By grouping these regular expressions, Perl will remember the parts of the string that matched the parts of the regular expressions that were enclosed in parentheses. For example, if the target string being matched against the regular expression was:
Last modified: Mon Mar 31 15:10:16 2004
Then Perl would remember that:
($day_re) in the regular expression matched the substring Mon in our target string
($month_re) in the regular expression matched the substring Mar in our target string
(..) in the regular expression matched the substring 31 in our target string
(....) in the regular expression matched the substring 2004 in our target string
Perl will then store these remembered substrings in the special
variables $1, $2, $3 and $4 respectively. We can then examine or use
these variables inside the body of our conditional. These special
variables are set whenever a regular expression successfully matches.
Therefore, you should not use these variables unless you have actually
tested that the regular expression matched first. Also, remember that
because these variables will be overwritten on each successful match of
a regular expression, introducing another regular expression match
between your original regular expression match and your use of the
special variables may cause unexpected results.
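The overwriting problem can be sketched as follows. The variable names $saved_day and $after_second_match are hypothetical, and the second pattern is just an arbitrary example of a later match:

```perl
#!/usr/bin/perl -w
use strict;

my $line = 'Last modified: Mon Mar 31 15:10:16 2004';
my ($saved_day, $after_second_match);

if ($line =~ /(Sun|Mon|Tue|Wed|Thu|Fri|Sat)/) {
    # Copy $1 straight away, while it still refers to this match.
    $saved_day = $1;

    # A second successful match overwrites $1 with its own capture.
    if ($line =~ /(..):(..):(..)/) {
        $after_second_match = $1;
    }
}

print "saved copy: $saved_day\n";
print "\$1 after the second match: $after_second_match\n";
```

Copying the special variables into ordinary variables immediately after the match is one way to avoid this; here the saved copy still holds Mon, while $1 has moved on to the hour field.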
The regular expression in the above program lets some invalid dates through. For example, if a non-numeric year is specified, the regular expression will still match it, as long as any four characters follow the space after the time. We can tighten up this regular expression to reject these non-date strings by using character classes.
A character class provides a way for a regular expression to match one
of a collection of characters. Character classes are denoted by the
meta-characters [ and ]. For example, to represent the lowercase vowels,
you can use the character class [aeiou] inside a regular expression. To
represent the digits, you can use the character class [0123456789].
Perl allows you to specify a range of characters using the hyphen inside
a character class. For example, [A-Za-z] represents all alphabetic
characters. If you wish to use a hyphen inside a character class, then
escape it with a backslash or use the hyphen first in the character
class. For example, [-az] and [a\-z] represent the character class
containing the three characters a, - and z. Note that most of the
special meta-characters lose their meaning inside a character class (the
exceptions are the backslash, the closing bracket and, depending on
position, the hyphen and ^). Therefore the character class [?*+.] will
match any one of those characters literally.
It is possible to negate the characters in a class by using the ^ symbol
as the very first character inside a character class. For example, the
class [^aeiou] will match any character that is not a lowercase vowel.
Perl has several predefined character classes for us to use. For
example, the sequence \s represents the character class [ \t\r\n\f],
that is, any whitespace character. The sequence \d represents any digit
and the sequence \w represents any word character, which includes
alphabetic characters (upper and lower case), digits and the underscore
character. We can represent negations of these classes by capitalizing
the letter after the backslash. For example, the character class \S can
be used to represent any non-whitespace character. We can also use these
character classes inside other character classes. For example, [\d ]
will match either a digit or a space.
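A few of the character classes above can be exercised with a sketch like this one. The target strings are invented for illustration, and the patterns are stored with Perl's qr// operator only so that they can be kept in a list:

```perl
#!/usr/bin/perl -w
use strict;

# Each entry pairs a pattern with a target string.
my @checks = (
    [ qr/[aeiou]/,  'rhythm' ],   # no lowercase vowel: no match
    [ qr/[^aeiou]/, 'aeiou!' ],   # ! is not a vowel: match
    [ qr/[-az]/,    'x-y'    ],   # leading hyphen is literal: match
    [ qr/[?*+.]/,   'a+b'    ],   # + is literal inside a class: match
    [ qr/\S/,       " \t\n"  ],   # whitespace only: no match
    [ qr/[\d ]\d/,  'Mar  7' ],   # a space and then a digit: match
);

# 1 where the pattern matched its target string, 0 where it did not.
my @results = map { $_->[1] =~ $_->[0] ? 1 : 0 } @checks;

print join(' ', @results), "\n";
```

The script prints one flag per entry, in the same order as the list.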
Returning to our date parsing script, we can use the digit character class to represent days of the month and the year:
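#!/usr/bin/perl -w
use strict;

# Day and month abbreviations as they appear in the date string.
my @days = qw/Sun Mon Tue Wed Thu Fri Sat/;
my @months = qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;

# Join each list with "|" to form the body of an alternation.
my $day_re = join "|", @days;
my $month_re = join "|", @months;

# The day is a digit or a space followed by a digit; the year is
# exactly four digits.
while (<>) {
    if (/($day_re) ($month_re) ([\d ]\d) ..:..:.. (\d{4})/) {
        print "Day Name: $1, Month $2, Day: $3, Year $4\n";
    }
}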
date2.pl
This script would now reject dates unless numeric days of the month and
numeric years were specified. This script also demonstrates another,
less commonly used quantifier, {}. In the context of the above script,
this quantifier ensures that there are four digits in the year. This
quantifier can also specify a range (e.g. \w{3,10} would match the first
three to ten characters in a word, and would fail to match if the word
contained fewer than three characters). The quantifier can also be used
with no upper limit. For example, the regular expression \d{3,} would
match three or more digits in a target string. Omitting the lower limit
is less portable: versions of Perl before 5.34 treat \d{,3} as a digit
followed by the literal text {,3}, so write \d{0,3} to match up to three
digits.
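The counted quantifier can be tried out with a brief sketch. The target strings are made up, and explicit lower bounds are used throughout on the assumption that the script may run on a Perl older than 5.34, where an omitted lower bound is not treated as a quantifier:

```perl
#!/usr/bin/perl -w
use strict;

# Exactly four digits after a space.
my ($year) = 'Mon Mar 31 15:10:16 2004' =~ / (\d{4})/;

# Between three and ten word characters; the quantifier takes as many
# as it can, so only the first ten characters are captured.
my ($word) = 'internationalization' =~ /(\w{3,10})/;

# Fewer than three word characters: the match fails.
my $too_short = ('ab' =~ /\w{3,10}/) ? 1 : 0;

# Between one and three digits, and two or more digits.
my ($up_to_three) = 'id 12345' =~ /(\d{1,3})/;
my ($two_or_more) = 'id 12345' =~ /(\d{2,})/;

print "year: $year\n";
print "word: $word\n";
print "too short matched: $too_short\n";
print "one to three digits: $up_to_three\n";
print "two or more digits: $two_or_more\n";
```

A lower bound of one is used in the digit example because a pattern such as /(\d{0,3})/ can succeed by matching nothing at the very start of the string.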
Last modified: March 27, 2004 00:04:20 NST (Saturday)