There is a 'gotcha' in perl related to functions that take lists
as parameters (especially the print function). If you use parentheses
to enclose the arguments of a function that takes a list, make sure
that the parentheses enclose all the arguments; otherwise things may
not behave as you expect. For example, let's say we wanted to add two
numbers, multiply the result by a third number and display the result:
print (2+3)*4;
This code does not quite do what one would expect. Because parentheses
are used, the print function assumes that all of its arguments are
inside the parentheses. The expression 2+3 is evaluated and the result,
5, is passed to print, which displays the value 5. Because print is a
function, it returns a value (normally 1, if everything was displayed
okay). This return value is then multiplied by 4 and the result of the
multiplication is not used anywhere (i.e. it is evaluated in a void
context). To correctly print the result of the arithmetic expression,
put parentheses around everything that print is to display:
print ((2+3)*4);
Again, the -w
perl option is very helpful in finding
these sorts of errors.
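The behaviour described above can be observed directly. The following is a small sketch rather than part of the original notes; the variable $product is introduced only to make print's return value visible, and the unary plus form is an additional idiom for telling perl that the parentheses do not enclose the argument list.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parsed as (print(2+3)) * 4: displays 5, then multiplies the
# return value of print by 4. $product is a hypothetical name used
# here only to capture that otherwise discarded result.
my $product = print(2+3) * 4;
print "\n";

# Parenthesize the whole argument list to display the intended value.
print((2+3)*4);
print "\n";

# A leading unary plus stops perl from treating the parentheses
# as the function's argument list.
print +(2+3)*4;
print "\n";
```

When print succeeds, $product holds 4 (the return value 1 multiplied by 4), while the last two statements both display 20.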
Regular expressions are one of the most important concepts in perl. Quite simply, a regular expression is a pattern that either matches or doesn't match a target string. Regular expressions can be used to do elementary parsing of strings and for identifying and extracting relevant information from files, among other things.
In perl, regular expressions are placed between forward slashes (but,
as with qw, any pair of matching delimiters is okay). By default, the
regular expression is tested against the default variable $_.
Typically, regular expressions are used in a scalar boolean context,
therefore it is quite common to see them used in an if conditional
statement or as the condition in a while loop.
The characters inside a regular expression can be divided into two categories, literal characters and meta-characters. The literal characters will, of course, literally match themselves. For example, the regular expression:
/hello/
when matched against a string will return true if the string contains
the character sequence hello. We can write a simple program
that will display all lines that contain a regular expression specified
on the command line as follows:
#!/usr/bin/perl -w
use strict;

my $search = shift @ARGV;
die "No search pattern specified!\n" if ! defined $search;

print "The following lines contain the string '$search':\n";
while (<>) {
    print if /$search/;
}
Note that variable interpolation takes place inside the slashes
denoting the regular expression. This enables us to use the variable
$search to represent our search expression. When we run this script,
specifying the regular expression search on the command line and
using the perl script itself as the input, we get:
$ ./regex1.pl search regex1.pl
The following lines contain the string 'search':
my $search = shift @ARGV;
die "No search pattern specified!\n" if ! defined $search;
print "The following lines contain the string '$search':\n";
print if /$search/;
All lines containing the string search are displayed by
the script. Note that you do have to be careful with this script.
If you specify an invalid regular expression, perl will terminate
when it tries to parse it. This program also demonstrates the
use of the die function, which takes a string argument, displays
the string on standard error and then causes the program to terminate
with a non-zero exit status (perl programs terminate with a zero
status, unless told otherwise). Regular expression matching is
case sensitive by default (although there is an easy way to
change this).
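The caution about invalid patterns could be handled along the following lines. This is a minimal sketch and not part of the original script: compile_pattern is a hypothetical helper that uses eval together with the qr// operator to compile a pattern supplied at run time, so that a malformed pattern yields undef instead of terminating the program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compile a user-supplied pattern. Returns the
# compiled regular expression, or undef if perl rejects the pattern.
sub compile_pattern {
    my ($pattern) = @_;
    my $re = eval { qr/$pattern/ };
    return $re;
}

for my $pattern ('search', 'he.lo', '(unclosed') {
    if (defined compile_pattern($pattern)) {
        print "'$pattern' is a valid pattern\n";
    } else {
        print "'$pattern' is not a valid pattern\n";
    }
}
```

A script such as the one above could call a helper like this once, before the while loop, and die with a friendlier message when it returns undef.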
Matching literal characters is usually not very interesting. The
true power of regular expressions lies in their ability to represent
more sophisticated patterns of characters. To do this, regular
expressions employ meta-characters which can be used to represent
classes of characters or classes of character sequences. One
of the most common meta-characters is the period, which matches
any character (except newline, \n). For example, the regular
expression he.lo would match the strings "hello", "heLlo" and
"After he looked at the perl script, his brain imploded." (in the
last string, the match is the "he lo" of "he looked").
To match arbitrary strings (instead of the default variable $_)
against regular expressions, we can use the binding operator =~
in perl. For example, the perl statements:
my $string = "This string has 'hello' in it.";
print "Found the regular expression!\n" if $string =~ /.e..o/;
will cause the regular expression .e..o to be matched
against the variable $string. The regular expression
goes on the right hand side of the =~ operator. Do
not confuse this operator with the equality relational
operator (==) -- the two are quite different.
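A short sketch may help to keep the operators apart. The variable names below are illustrative, and the negated binding operator !~ is an addition that the text above does not cover.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "This string has 'hello' in it.";

# =~ tests the pattern against $string instead of $_.
my $found = ($string =~ /.e..o/) ? 1 : 0;

# !~ is the negated form: true when the pattern does NOT match.
my $missing = ($string !~ /goodbye/) ? 1 : 0;

# == compares two numbers; no pattern matching is involved.
my $equal = (5 == 5.0) ? 1 : 0;

print "found=$found missing=$missing equal=$equal\n";
```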
Another popular meta-character is the backslash, which can be used to
turn a meta-character into a literal character. For example, to match a
literal forward slash, followed by a period, followed by a backslash, we
can use the regular expression /\/\.\\/. Note that because
we are using forward slashes as our delimiter, we need to escape the
forward slash inside the regular expression. We can make the regular
expression slightly more readable by using a different delimiter:
m%/\.\\%. In this regular expression, which uses percent signs as the
delimiter, we only have to escape the dot and the backslash.
However, because we are using a delimiter pair other than forward slashes,
we must use m (for match) in front of the first percent delimiter.
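The two spellings of this pattern can be compared in a short test. This is an illustrative sketch; the target strings and variable names are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $target = 'path/.\\file';   # the characters: path/.\file

# The same pattern written with two different delimiters.
my $with_slashes = ($target =~ /\/\.\\/) ? 1 : 0;
my $with_percent = ($target =~ m%/\.\\%) ? 1 : 0;

# Without its backslash the dot is a meta-character again and matches
# any character, so the string a/b\ matches only the unescaped form.
my $escaped   = ('a/b\\' =~ m%/\.\\%) ? 1 : 0;
my $unescaped = ('a/b\\' =~ m%/.\\%)  ? 1 : 0;

print "slashes=$with_slashes percent=$with_percent ",
      "escaped=$escaped unescaped=$unescaped\n";
```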
Other meta-characters can be used to represent quantifiers (or repetitions) of patterns. The three most common quantifiers are:
| Quantifier | Meaning |
|---|---|
| * | Zero or more occurrences |
| + | One or more occurrences |
| ? | Zero or one occurrence |
For example, the fairly popular regular expression .*
represents any number (including zero) of characters (excluding
newline, which is not, by default, matched by the dot meta-character).
This regular expression can be used to match an arbitrary number of
characters. For example, if you wish to match lines in a file that have
the character sequences hello and world on
the same line (and in that order), you can use the regular expression
hello.*world. This regular expression would match the
strings:
hello, world
This line has "hello" and "world"
helloworld
The last string matches because .*
allows for the possibility
of no characters.
The + quantifier will match one or more occurrences
of the preceding regular expression. For example, the regular
expression hello +world would match any line that
had the string hello followed by at least one
space followed by world.
The ? quantifier will match exactly zero or one occurrence
of the previous regular expression. For example, the regular expression
hello world!? would match any string containing the phrase
hello world or hello world!. The exclamation
mark is optional. The ? quantifier is sometimes referred
to as the optional quantifier because of this.
Again, if you wish to match one of these quantifiers literally,
use a backslash before them. For example, the regular expression:
\*.*\?.\+
will match any string containing an asterisk
followed by any number of characters, followed by a question mark followed
by exactly one character followed by a plus sign.
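The quantifier examples from this section can be checked with a small table-driven script. This is a sketch; the test strings and the @cases layout are chosen for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each entry is [pattern, string, whether a match is expected].
my @cases = (
    ['hello.*world',  'helloworld',     1],
    ['hello.*world',  'world, hello',   0],
    ['hello +world',  'hello    world', 1],
    ['hello +world',  'helloworld',     0],
    ['hello world!?', 'hello world',    1],
    ['\*.*\?.\+',     'a*b?c+',         1],
    ['\*.*\?.\+',     'a*b?+',          0],
);

my $failures = 0;
for my $case (@cases) {
    my ($pattern, $string, $expected) = @$case;
    my $matched = ($string =~ /$pattern/) ? 1 : 0;
    $failures++ if $matched != $expected;
    printf "%-14s %-18s %s\n", $pattern, "'$string'",
        $matched ? 'match' : 'no match';
}
print "$failures unexpected result(s)\n";
```

The last case fails to match because the escaped question mark must be followed by exactly one character and then a plus sign, and the string ends too early.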
We can use parentheses to group regular expressions together.
This serves two purposes. The first purpose is to change precedence
within regular expressions. For example, as we saw above, the
regular expression hello +world would match strings
containing hello followed by one or more spaces followed by
world. What if we wanted to match a string containing
one or more hellos followed by a space followed by
world (e.g. hellohellohello world)?
An incorrect way to do this would be to use the regular expression
hello+ world. This is not right because the +
quantifier binds to the entity immediately to the left of it. In this
regular expression, that entity is the letter o. Therefore,
this regular expression would match all of helloooooooo world, but
in hellohello world it would match only the final hello world and not
the repeated hellos, which is what we wanted.
To fix this we must group the entire sequence hello inside
parentheses and then apply the + quantifier: (hello)+ world.
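The difference between the grouped and ungrouped forms can be seen in the sketch below. It relies on the ^ and $ anchors, which the text above does not cover: they tie the pattern to the start and end of the string, so each pattern has to account for every character rather than just a substring.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# ^ and $ anchor the match to the start and end of the string
# (an addition made for this example).
my $ungrouped = qr/^hello+ world$/;
my $grouped   = qr/^(hello)+ world$/;

for my $string ('helloooooooo world', 'hellohello world') {
    my $u = ($string =~ $ungrouped) ? 'match' : 'no match';
    my $g = ($string =~ $grouped)   ? 'match' : 'no match';
    print "$string: ungrouped $u, grouped $g\n";
}
```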
For more details on the regular expression precedence, see page 111 of S&P.
The second, and very useful, purpose of parentheses in regular expressions is to extract (or remember) interesting parts of the string that matched a regular expression. This is especially useful when we use a regular expression match in list context, as we will see in the next section.
Alternation allows us to specify a regular expression that can identify one of a number of possible patterns inside a string. For example, let's say we wanted to identify lines in a file that contained dates of the form:
Mon Mar 31 12:30:15 2003
We can generalize this into the following regular expression:
/(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (..) ..:..:.. (....)/
This regular expression uses the alternation symbol | in
order to match the day of the week and the name of the month. Note that
the parentheses around the days of the week and the month names are
required because the precedence of the alternation symbol is very low.
Had we omitted the parentheses, then the above regular expression would
have been interpreted as:
/(Sun)|(Mon)|(Tue)|(Wed)|(Thu)|(Fri)|(Sat Jan)|(Feb)|(Mar)|(Apr)|(May)|(Jun)|(Jul)|(Aug)|(Sep)|(Oct)|(Nov)|(Dec .. ..:..:.. ....)/
This regular expression would match any string containing Sun
or Mon ... or Sat Jan or Feb etc. This
is clearly not what we intended.
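A cut-down version of the date pattern shows the effect of the low precedence. This is a sketch using only two day names and two month names to keep it short.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $grouped   = qr/(Sat|Sun) (Jan|Feb)/;
my $ungrouped = qr/Sat|Sun Jan|Feb/;   # means: Sat, or "Sun Jan", or Feb

for my $string ('Sat Jan 4', 'Saturday', 'February') {
    my $g = ($string =~ $grouped)   ? 'match' : 'no match';
    my $u = ($string =~ $ungrouped) ? 'match' : 'no match';
    print "$string: grouped $g, ungrouped $u\n";
}
```

Only the first string contains a day name followed by a month name, yet the ungrouped pattern accepts all three.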
Our original regular expression is rather long. The following perl script demonstrates how we can build up the regular expression incrementally:
#!/usr/bin/perl -w
use strict;

my @days = qw/Sun Mon Tue Wed Thu Fri Sat/;
my @months = qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;

my $day_re = join "|", @days;
my $month_re = join "|", @months;

while (<>) {
    if (/($day_re) ($month_re) (..) ..:..:.. (....)/) {
        print "Day Name: $1, Month $2, Day: $3, Year $4\n";
    }
}
This script demonstrates a trivial way to build an alternation regular
expression using an array and perl's join function. When the
while loop starts, the value of the $day_re variable will be
Sun|Mon|Tue|Wed|Thu|Fri|Sat and the value of the $month_re variable
will be Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec. We can then
have perl interpolate these variables directly into the regular expression.
By grouping these regular expressions, perl will remember the parts of the string that matched the parts of the regular expression that were enclosed in parentheses. For example, if the target string being matched against the regular expression was:
Last modified: Mon Mar 31 15:10:16 2003
Then perl would remember that:
($day_re) in the regular expression matched the substring Mon in our target string
($month_re) in the regular expression matched the substring Mar in our target string
(..) in the regular expression matched the substring 31 in our target string
(....) in the regular expression matched the substring 2003 in our target string
Perl will then store these remembered substrings in the special variables
$1, $2, $3 and $4 respectively. We can then examine or use these
variables inside the body of our conditional. These special variables
are set whenever a regular expression successfully matches. Therefore,
you should not use these variables unless you have first tested that
the regular expression actually matched. Also, remember that because
these variables will be overwritten on each successful match of a
regular expression, introducing another regular expression match
between your original regular expression match and your use of the
special variables may cause unexpected results.
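One way to guard against the overwriting described above is to copy the special variables into named variables as soon as the match succeeds. The sketch below illustrates this; the variable names and the second match are made up for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $day_re   = join "|", qw/Sun Mon Tue Wed Thu Fri Sat/;
my $month_re = join "|", qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;
my $line     = 'Last modified: Mon Mar 31 15:10:16 2003';

my ($day_name, $month, $day, $year);
if ($line =~ /($day_re) ($month_re) (..) ..:..:.. (....)/) {
    # Copy the captures before any other match can replace them.
    ($day_name, $month, $day, $year) = ($1, $2, $3, $4);

    # A later successful match sets $1 afresh.
    if ('another match' =~ /(another)/) {
        print "\$1 is now '$1'\n";
    }

    # The named copies are unaffected by the second match.
    print "Day Name: $day_name, Month $month, Day: $day, Year $year\n";
}
```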
Perl also supports the use of backreferences, which allow you to use the remembered string itself inside a regular expression. See page 110 of S&P for details.
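As a brief illustration (a sketch, not taken from S&P), the backreference \1 in the pattern below stands for whatever the first pair of parentheses remembered, so the pattern matches any character that is immediately repeated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# (.) remembers one character; \1 requires that same character again.
my $doubled = qr/(.)\1/;

for my $word ('hello', 'world') {
    if ($word =~ $doubled) {
        print "$word contains the doubled character '$1'\n";
    } else {
        print "$word has no doubled character\n";
    }
}
```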
The regular expression in the above program lets some invalid dates through. For example, if a non-numeric year is specified, the regular expression will still match it, as long as any four characters follow the time and a space. We can tighten up this regular expression to reject these non-date strings by using character classes.
A character class provides a way for a regular expression to match
one of a collection of characters. Character classes are denoted by
the meta-characters [ and ]. For example,
to represent the lowercase vowels, you can use the character class
[aeiou] inside a regular expression. To represent the
digits, you can use the character class [0123456789].
Perl allows you to specify a range of characters using the hyphen inside
a character class. For example, [A-Za-z] represents
all alphabetic characters. If you wish to use a literal hyphen inside a
character class, then escape it with a backslash or put the hyphen
first in the character class. For example, [-az] and
[a\-z] both represent the character class containing the three
characters a, - and z. Note that the quantifier and dot
meta-characters lose their special meaning inside a character class
(the backslash keeps its meaning). Therefore the character class
[?*+.] will match any one of those characters literally.
It is possible to negate the characters in a class by using the
^ symbol as the very first character inside a character
class. For example, the class [^aeiou] will match
any character that is not a lowercase vowel.
Perl has several predefined character classes for use. For example,
the sequence \s represents the character class
[ \t\r\n\f], that is, any white space character.
The sequence \d represents any digit and the sequence
\w represents any word character, which includes alphabetic
characters (upper and lower case), digits and the underscore character.
We can represent negations of these classes by capitalizing the letter
after the backslash. For example, the character class \S
can be used to represent any non-whitespace character. We can also
use these character classes inside other character classes. For example,
[\d ] will match either a digit or a space.
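The classes discussed above can be tried against a handful of single characters. The script below is an illustrative sketch; the @classes table and the %matches hash are names chosen for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Each entry pairs the way a class is written with its compiled form.
my @classes = (
    ['[aeiou]',  qr/[aeiou]/],
    ['[^aeiou]', qr/[^aeiou]/],
    ['[-az]',    qr/[-az]/],
    ['[?*+.]',   qr/[?*+.]/],
    ['\d',       qr/\d/],
    ['\s',       qr/\s/],
    ['\w',       qr/\w/],
);

my %matches;   # character => names of the classes that match it
for my $char ('a', 'z', '-', '7', ' ', '?') {
    $matches{$char} =
        [ map { $_->[0] } grep { $char =~ $_->[1] } @classes ];
    print "'$char' matches: @{ $matches{$char} }\n";
}
```

The hyphen, for example, is matched by [^aeiou] and [-az] only, and the question mark by [^aeiou] and [?*+.] only.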
Returning to our date parsing script, we can use the digit character class to represent days of the month and the year:
#!/usr/bin/perl -w
use strict;

my @days = qw/Sun Mon Tue Wed Thu Fri Sat/;
my @months = qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;

my $day_re = join "|", @days;
my $month_re = join "|", @months;

while (<>) {
    if (/($day_re) ($month_re) (\d\d) ..:..:.. (\d{4})/) {
        print "Day Name: $1, Month $2, Day: $3, Year $4\n";
    }
}
This script would now reject dates unless numeric days of the month
and numeric years were specified. This script also demonstrates
another, less commonly used quantifier, {}. In the context
of the above script, this quantifier will ensure that there
are four digits in the year. This quantifier can also specify
a range (e.g. \w{3,10} would match the first
three to ten characters in a word; it would fail to match if the
word contained fewer than three characters). The upper limit can
be omitted: for example, \d{3,} would match three or more digits.
To match up to three digits, use an explicit lower limit of zero,
as in \d{0,3}. The form \d{,3} is treated as a quantifier only by
perl 5.34 and later; older versions match it as literal text.
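The counted quantifier can be exercised as follows. This is a sketch with a few assumptions: the ^ and $ anchors (not covered above) are used so that the year pattern must span the whole string, a match in list context is used to retrieve the captured text, and an explicit lower bound of 0 is written because it behaves the same way on every version of perl.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# \d{4} demands exactly four digits; the anchors stop longer or
# shorter strings from matching on a substring.
my %is_year;
for my $candidate ('2003', '203', 'abcd') {
    $is_year{$candidate} = ($candidate =~ /^\d{4}$/) ? 1 : 0;
    print "$candidate: ", ($is_year{$candidate} ? 'year' : 'not a year'), "\n";
}

# \w{3,10} takes at most the first ten word characters.
my ($word) = 'internationalization' =~ /(\w{3,10})/;
print "word prefix: $word\n";

# \d{0,3} takes at most three digits.
my ($digits) = '12345' =~ /(\d{0,3})/;
print "digits: $digits\n";
```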
Last modified: Tue Apr 1 00:20:12 2003