March 26 (Friday) March 31 (Wednesday)
Perl regular expressions also support anchors which allow you to match
regular expressions that occur at certain places inside a string.
The two most common anchors are ^
and $
,
which match the beginning and end of a line, respectively. For example,
the regular expression /^hello/
would match any string that
had hello
at the very beginning of the string. Likewise,
the regular expression /world$/
would match any string that
had world
at the end.
We can also represent a word boundary with the \b
anchor. Therefore, the regular expression /\bhello\b/
will match the strings hello!
and hello,world
,
but not the string othello
.
The following script will display the (alphabetic) words in a
file that do not contain the upper or lower case vowels (y
doesn't count as a vowel in this script).
#!/usr/bin/perl -w
use strict;
while (<>) {
for (split) {
next unless /^[a-z]+$/i;
print "$_\n" if /^[^aeiou]+$/i;
}
}
nonvowels.pl
On each iteration through the while
loop, we
split
the line into its constituent words. We then move
immediately onto the next word if the word we are currently looking at
does not consist entirely of alphabetic characters (each string must
consist entirely of alphabetic characters due to the anchoring of the
regular expression at both ends of the string). The next
key word is similar to the continue
keyword in C and C++
— it brings control immediately back to the top of the innermost
loop (which is the for
loop in this case) and starts the
next iteration.
The i
after the terminating forward slash is an example of a
regular expression modifier in Perl. The i
modifier tells
the regular expression to be case insensitive. Therefore, the regular
expression given will match words which consist entirely of upper and/or
lower case characters. The unless ...
keyword is analogous
to saying if !...
. Therefore, saying
next unless /^[a-z]+$/i;
is identical to:
next if ! /^[a-z]+$/i;
Note that we can do a negation on the regular expression by using the
!
operator. The resulting conditional will be true
if the regular expression does not match $_
.
Finally, in the second statement of the for
loop we
will display the word if none of the characters inside the word
are equal to any of the (upper- or lower-case) vowels. Note that
if we had just used the regular expression without the two anchors,
(i.e. /[^aeiou]+/
), then this would have matched
any string that had at least one non-vowel character.
In regular expressions, it is possible to use backreferences to match a substring that was remembered (by using the parentheses) earlier in the regular expression. For example, consider the following script:
#!/usr/bin/perl -w
use strict;
my $sentence = "This is is a string of text.";
print "Repeated word!: $`->$&<-$'\n" if $sentence =~ /\b(\w+)\s+\1/;
br.pl
The regular expression attempts to find the first occurrence of
a repeated word in the sentence. In the specific example above,
the regular expression will match the $sentence
scalar
— the (\w+)
will match the first is
and the \1
will match the second is
.
The \1
refers to the string matched by the regular
expression in the first pair of paretheses, which was (\w+)
.
Note that we use a word boundary anchor, \b
; otherwise,
the sentence "This is a string of text."
would match the
regular expression because the is
of This
would match (\w+)
and the \1
would match the
word is
that follows This
.
The above example also demonstrates the use of the $`
,
$'
and $&
special Perl variables. After
a regular expression match is performed, $`
matches
everything in the string that occurred before the match, $'
represents everything in the string that occurred after the match
and $&
represents the match itself. Therefore, the
above program will display:
Repeated word!: This ->is is<- a string of text.
Note that backreferences \2
, \3
etc.
can also be used, just as long as there are an appropriate number of
sub-expressions grouped by parentheses. Attempting to use a backreference
for which there isn't a corresponding grouping will cause an error.
Regular expressions such as /.*/
are greedy by default.
That is, they try to match as many characters as possible. For example,
consider the following incorrect expression which attempts to break a
string of characters in two strings: the first one consisting of alphabetic
characters and the second consisting of digits:
print "$1/$2\n" if "name12346789" =~ m/(.*)(\d+)/
This expression will display name12345678/9
. We have
two greedy regular expression in competition — the first
subexpression, .*
, will match nearly all the characters
in the string, leaving only the last (digit) character to be matched by
the second subexpression \d+
. We can fix this by using
a non-greedy quantifier:
print "$1/$2\n" if "name12346789" =~ m/(.*?)(\d+)/
Note the question mark in the subexpression .*?
—
this makes the .*
regular expression non-greedy. In other
words, it will match only a minimal number of characters in its attempt
to satisfy the match. This will leave the digits in the string to be
matched by the \d+
subexpression.
As another example, consider the following Perl script that displays
all the HTML tags found in an HTML file:
#!/usr/bin/perl -w
use strict;
while (<>) {
print "$1\n" while /<(.*?)>/g;
}
tags.pl
Again, we use a non-greedy regular expression in order to match the characters in between the angle brackets that denote an HTML tag. Had we used the expression:
/<(.*)>/
then this greedy regular expression would match too much. For example, if
the HTML file consisted of <html>hello</html>
,
then the greedy regular expression would match:
<
(
.*
)
>
;<
html>hello</html
>
That is, $1
would be html>hello</html
.
For the non-greedy regular expression, the inner while
will run twice setting $1
to html
the first
time and /html
the second time.
Note that the regular expression is being used in a scalar context (the
regular expression is used in the conditional of the while
).
When used in this context, the result of the regular expression will be
true if the regular expression succesfully matched, or false otherwise.
With the g
regular expression modifier, this regular
expression will attempt to match the pattern on the remainder of the
string on each subsequent evaluation. When the regular expression can no
longer be matched, it returns false, thereby terminating the while loop.
Perl also supports substitutions with the s///
operator. The general format of this operator, as
given in the perlop
man
page is
s/
PATTERN/
REPLACEMENT/
.
By default, the substitution will take place on the default variable
$_
. However, as with regular expression matching using
/.../
, we can also use the binding operator, =~
,
to perform substitutions on any string.
For example, to compress all the spaces in the string denoted
by the variable $str
, we can write $str =~
s/ +/ /
. This will search the string $str
looking for an occurrence of one or more contiguous spaces. It will
then replace them with a single space. Unfortunately, this will only
compress the first occurrence of one or more spaces in $str
.
To compress them all we must use the g
modifier on the
end of the substitution: $str =~ s/ +/ /g
.
Better still, we can use the \s
character class to
compress all whitespace characters: $str =~ s/\s+/ /g;
.
The following short Perl script:
$_ = "Hello World\n"; s/(..)/<$1>/g; print;
will display:
<He><ll><o ><Wo><rl>d
(Remember that .
does not match \n
.)
If we wanted to do case-insensitive substitutions, we
can use the i
modifier which will make the matching of
PATTERN case-insensitive. For example, the Perl expression
s/\bmun\b/MUN/ig;
will search the string $_
and replace all words such as such as mun
, Mun
and MuN
with MUN
.
As another example of substitution, consider a Perl script which reads a list of student numbers names and term marks as demonstrated by the following test file:
366533091 Cole Kent 68
402545697 Andrew West 99
544149893 Angela Johnston 93
642776563 Monique Epps 83
257622129 Darko Peter 100
033221495 Gregory Salutue 55
582335451 Ola Svallmark 64
211817030 Gina Simpson 97
951569403 Ela Whiteside 85
899563658 Brian Garrett 92
433097365 Georgett Lott 57
168213321 Candi Lilly 92
051534180 Linda Smith 61
715231817 Sara Rossy 64
183995480 Kimberlee Thomson 53
834110872 Nazmeen Gorzoch 61
276976781 Vic Melvin 56
017101413 Jack Snede 48
389869517 Hank Thomas 73
826916025 Andrew Harkin 50
marks.txt
The below script processes each line of the file, making sure that each is
valid. As part of this processing, it also extracts the ID, name and
mark from each line. It will then change the order of the first and
last names and capitalize the last name. Note that we can remember the
matches in the PATTERN part of the substitution operation and
refer to them in the REPLACEMENT part by using $1
,
$2
etc. The \U
sequence will cause all letters
that occur after it to be uppercased until the \E
sequence
is encountered.
The line is then formatted for output. However, instead of displaying
the line immediately, it is stored in the @lines
array.
This allows us to sort the output in a variety of ways. By default, just
using the sort
function will sort each line lexicographically
(which gives us an ordering by student number, because all the student
numbers have the same number of digits at the beginning of each of the
strings in the @lines
array). We can also sort by grade as
well by making use of the split
function inside an anonymous
compare subroutine and extracting the last element from the
array returned by split
by using -1
as the
index variable. Note that we must enclose the entire split
operation in parentheses; otherwise, the indexing operation will attempt
to take place on the $a
and $b
variables in
the anonymous subroutine.
#!/usr/bin/perl -w
use strict;
my @lines;
while (<>) {
die "Line $.: Invalid line\n" unless /^(\d{9})\s+(.*)\s+(\d+)$/;
my ($num, $name, $mark) = ($1, $2, $3);
$name =~ s/(\w+) (\w+)/\U$2\E, $1/;
push @lines, sprintf "%09d %-25s %3d\n", $num, $name, $mark;
}
print "Students sorted by number:\n";
print sort @lines;
print "\nStudents sorted by decreasing mark:\n";
print sort { (split ' ', $b)[-1] <=> (split ' ', $a)[-1] } @lines;
marks.pl
The special variable $.
in Perl represents the current
line number of the file being read by the Perl script. The variable
is useful when printing diagnostic information about an input file
being parsed by Perl.
Here is the output from the above program when run on the input data given above:
Here is a Perl script that demonstrates a way to parse files which have a format similar to the following example:
(The parser below actually lets things through that it shouldn't but it's okay for demonstration purposes.)
#!/usr/bin/perl -w
use strict;
sub trim_spaces {
my ($str) = @_;
$str =~ s/\s+$//;
$str =~ s/^\s+//;
return $str;
}
while (<>) {
chomp;
next if /^\s*#/; # Ignore lines with comments.
next if /^\s*$/; # Ignore empty lines.
s/#.*//; # Remove comments
if (/^\s*\[(.*)\]\s*$/) {
my $sec = &trim_spaces($1);
print "section name: '$sec'\n";
} elsif (/^\s*(.*)\s*=\s*(.*)\s*$/) {
my $attr = &trim_spaces($1);
my $val = &trim_spaces($2);
print "attribute '$attr' equals '$val'\n";
} else {
print "Line $. invalid: '$_'\n";
}
}
parse.pl
The parser examines each line in the file and skips over lines that consist of only a comment or are empty. It then strips off comments that appear on non-empty lines. The script then tests the line against a couple of regular expressions searching for a match. When it finds a match, it strips any leading or trailing spaces from the relevant substrings that were matched by the regular expression and displays them.
Last modified: March 30, 2004 15:12:28 NST (Tuesday)