Perl regular expressions also support anchors, which restrict a match to
particular positions inside a string. The two most common anchors are ^ and $,
which match at the beginning and end of the string, respectively (by default,
$ also matches just before a trailing newline). For example, the regular
expression ^hello matches any string that has hello at the very beginning.
Likewise, the regular expression world$ matches any string that has world at
the end.

A word boundary is represented by the \b anchor. Therefore, the regular
expression \bhello\b will match the strings hello! and hello,world, but not
the string othello.

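
The behaviour of the three patterns above can be checked with a short script. This is only a sketch: the helper name anchors_matched and the test string "hello world" are made up for illustration.

```perl
#!/usr/bin/perl -w
use strict;

# Report which of the patterns discussed above match a given string.
# (anchors_matched is a hypothetical helper, not part of any library.)
sub anchors_matched {
    my ($str) = @_;
    my @hits;
    push @hits, '^hello'    if $str =~ /^hello/;
    push @hits, 'world$'    if $str =~ /world$/;
    push @hits, '\bhello\b' if $str =~ /\bhello\b/;
    return join ', ', @hits;
}

for my $str ('hello!', 'hello,world', 'othello', 'hello world') {
    printf "%-12s => %s\n", $str, anchors_matched($str) || '(no match)';
}
```

The string othello should be reported as matching none of the patterns, because its hello is neither at the start of the string nor preceded by a word boundary.
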
The following script will display the (alphabetic) words in a file that do not contain the traditional upper or lower case vowels.

    #!/usr/bin/perl -w
    use strict;

    while (<>) {
        for (split) {
            next unless /^[a-z]+$/i;
            print "$_\n" if /^[^aeiou]+$/i;
        }
    }

On each iteration of the while loop, we split the line into its constituent
words. We then move immediately on to the next word if the current word does
not consist entirely of alphabetic characters; because the regular expression
is anchored at both ends, the whole word must be alphabetic for it to match.
The next keyword is similar to the continue keyword in C and C++: it returns
control immediately to the top of the innermost loop (the for loop in this
case) and starts the next iteration.

The i after the closing forward slash is an example of a regular expression
modifier in perl. The i modifier makes the match case insensitive. Therefore,
the regular expression given will match words that consist entirely of upper-
and lower-case letters.

The unless keyword is the negation of if: unless ... is analogous to saying
if !... Therefore, saying

    next unless /^[a-z]+$/i;

is identical to:

    next if ! /^[a-z]+$/i;

(Note that we can negate the result of a regular expression match by using
the ! operator.)

Finally, the second statement of the for loop displays the word if none of
its characters is an (upper- or lower-case) vowel. Note that if we had used
the regular expression without the two anchors (i.e. [^aeiou]+), it would
have matched any string that had at least one non-vowel character.

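
To see the effect of the anchors, the filter can be applied to a string held in memory rather than to a file. The following is a sketch only; the function name vowelless_words and the sample sentence are invented for this illustration.

```perl
use strict;
use warnings;

# Return the purely alphabetic words in $line that contain no vowels.
# (vowelless_words is a hypothetical name used only for this sketch.)
sub vowelless_words {
    my ($line) = @_;
    my @found;
    for my $word (split ' ', $line) {
        next if $word !~ /^[a-z]+$/i;    # same test as: next unless /^[a-z]+$/i
        push @found, $word if $word =~ /^[^aeiou]+$/i;
    }
    return @found;
}

my $line = 'My rhythm and myth, try 42 nth TV';
print join(' ', vowelless_words($line)), "\n";

# Without the anchors, one non-vowel character anywhere is enough to match.
print "unanchored pattern matches 'audio'\n" if 'audio' =~ /[^aeiou]+/i;
print "anchored pattern rejects 'audio'\n"   if 'audio' !~ /^[^aeiou]+$/i;
```

Note that myth, is skipped by the first test because of its trailing comma, and 42 because it is not alphabetic.
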
Perl also supports substitutions with the s/// operator. The general format
of this operator, as given in the perlop man page, is s/PATTERN/REPLACEMENT/.
By default, the substitution takes place on the default variable $_. However,
as with regular expression matching using /.../, we can also use the binding
operator (=~) to perform substitutions on any string.

For example, to compress all the spaces in the string held in the variable
$str, we can write $str =~ s/ +/ /. This searches $str for an occurrence of
one or more contiguous spaces and replaces it with a single space.
Unfortunately, this only compresses the first such run of spaces in $str. To
compress them all we must use the g modifier on the end of the substitution:
$str =~ s/ +/ /g. The i modifier is also supported, so that the PATTERN match
will be case insensitive.

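
The difference the g and i modifiers make can be sketched as follows. The helper compress_spaces and the sample strings are assumptions made for this example.

```perl
use strict;
use warnings;

# Collapse every run of spaces in a copy of the string to a single space.
# (compress_spaces is a hypothetical helper for this sketch.)
sub compress_spaces {
    my ($str) = @_;
    $str =~ s/ +/ /g;
    return $str;
}

my $str = 'too   many    spaces   here';

(my $first_only = $str) =~ s/ +/ /;      # without g: only the first run shrinks
my $all_runs = compress_spaces($str);    # with g: every run shrinks

print "first only: '$first_only'\n";
print "all runs:   '$all_runs'\n";

# The i modifier makes the PATTERN case insensitive; in scalar context s///
# also yields the number of substitutions it made.
my $text  = 'Perl is fun, perl is terse';
my $count = ($text =~ s/perl/PERL/gi);
print "$count replacements: $text\n";
```

Copying the string before substituting, as in (my $first_only = $str) =~ s/ +/ /, leaves the original $str untouched so that both variants can be compared.
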
As another example of substitution, consider the perl script below, which reads a list of student numbers, names and term marks as demonstrated by the following test file:

    366533091 Cole Kent 68
    402545697 Andrew West 99
    544149893 Angela Johnston 93
    642776563 Monique Epps 83
    257622129 Darko Peter 100
    033221495 Gregory Salutue 55
    582335451 Ola Svallmark 64
    211817030 Gina Simpson 97
    951569403 Ela Whiteside 85
    899563658 Brian Garrett 92
    433097365 Georgett Lott 57
    168213321 Candi Lilly 92
    051534180 Linda Smith 61
    715231817 Sara Rossy 64
    183995480 Kimberlee Thomson 53
    834110872 Nazmeen Gorzoch 61
    276976781 Vic Melvin 56
    017101413 Jack Snede 48
    389869517 Hank Thomas 73
    826916025 Andrew Harkin 50

The script processes each line of the file, making sure that it is valid. It
then swaps the order of the first and last names and capitalizes the last
name. Note that we can remember the matches in the PATTERN part of the
substitution operation (by enclosing them in parentheses) and refer to them
in the REPLACEMENT part by using $1, $2, etc. The \U sequence causes all
letters that occur after it to be uppercased until the \E sequence is
encountered.

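
The name-swapping substitution can be tried in isolation. In this sketch the wrapper function last_name_first is a made-up name; the substitution itself is the one the script uses.

```perl
use strict;
use warnings;

# Turn "First Last" into "LAST, First" using the captured groups $1 and $2.
# (last_name_first is a hypothetical wrapper for this sketch.)
sub last_name_first {
    my ($name) = @_;
    $name =~ s/(\w+) (\w+)/\U$2\E, $1/;
    return $name;
}

print last_name_first('Cole Kent'), "\n";          # prints "KENT, Cole"
print last_name_first('Angela Johnston'), "\n";    # prints "JOHNSTON, Angela"
```

If the pattern does not match (for example, a name consisting of a single word), the string is returned unchanged.
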
The line is then formatted for output. However, instead of displaying the
line immediately, it is stored in the @lines array. This allows us to sort
the output in a variety of ways. By default, the sort function sorts the
lines lexicographically (which gives us an ordering by student number,
because all the student numbers have the same number of digits). We can also
sort by mark by calling the split function inside an anonymous comparison
subroutine and extracting the last element of the list returned by split,
using -1 as the index. Note that we must enclose the entire split operation
in parentheses; otherwise, the index would be applied to the $a and $b
variables in the anonymous subroutine rather than to the list that split
returns.


    #!/usr/bin/perl -w
    use strict;

    my @lines;

    while (<>) {
        die "Line $.: Invalid line\n"
            unless /^(\d{9})\s+(.*)\s+(\d+)$/;

        # Student number, full name and mark captured by the pattern above.
        my ($num, $name, $mark) = ($1, $2, $3);

        # Swap "First Last" to "LAST, First".
        $name =~ s/(\w+) (\w+)/\U$2\E, $1/;

        push @lines, sprintf "%09d %-25s %3d\n", $num, $name, $mark;
    }

    print "Students sorted by number:\n";
    for (sort @lines) {
        print;
    }

    print "\nStudents sorted by decreasing mark:\n";
    for (sort { (split ' ', $b)[-1] <=> (split ' ', $a)[-1] } @lines) {
        print;
    }

The special variable $. holds the current line number of the file being read
by the perl script. It is useful when printing diagnostic information about
an input file being parsed by perl.

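
The sort-by-last-field idiom used in the script can also be tried on a few lines held in memory. This is a sketch; the function name by_last_field_desc is an assumption, and the three sample lines are taken from the test data above.

```perl
use strict;
use warnings;

# Sort lines numerically by their last whitespace-separated field,
# largest first. (by_last_field_desc is a hypothetical name.)
sub by_last_field_desc {
    my @lines = @_;
    return sort { (split ' ', $b)[-1] <=> (split ' ', $a)[-1] } @lines;
}

my @lines = (
    "366533091 KENT, Cole 68\n",
    "257622129 PETER, Darko 100\n",
    "017101413 SNEDE, Jack 48\n",
);

print "By number:\n", sort @lines;
print "By decreasing mark:\n", by_last_field_desc(@lines);
```

The parentheses around each split call are what make [-1] a list slice; the slice picks out the mark at the end of each line.
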
Here is the output from the above program when run on the input data given above:

    Students sorted by number:
    017101413 SNEDE, Jack                48
    033221495 SALUTUE, Gregory           55
    051534180 SMITH, Linda               61
    168213321 LILLY, Candi               92
    183995480 THOMSON, Kimberlee         53
    211817030 SIMPSON, Gina              97
    257622129 PETER, Darko              100
    276976781 MELVIN, Vic                56
    366533091 KENT, Cole                 68
    389869517 THOMAS, Hank               73
    402545697 WEST, Andrew               99
    433097365 LOTT, Georgett             57
    544149893 JOHNSTON, Angela           93
    582335451 SVALLMARK, Ola             64
    642776563 EPPS, Monique              83
    715231817 ROSSY, Sara                64
    826916025 HARKIN, Andrew             50
    834110872 GORZOCH, Nazmeen           61
    899563658 GARRETT, Brian             92
    951569403 WHITESIDE, Ela             85

    Students sorted by decreasing mark:
    257622129 PETER, Darko              100
    402545697 WEST, Andrew               99
    211817030 SIMPSON, Gina              97
    544149893 JOHNSTON, Angela           93
    168213321 LILLY, Candi               92
    899563658 GARRETT, Brian             92
    951569403 WHITESIDE, Ela             85
    642776563 EPPS, Monique              83
    389869517 THOMAS, Hank               73
    366533091 KENT, Cole                 68
    715231817 ROSSY, Sara                64
    582335451 SVALLMARK, Ola             64
    834110872 GORZOCH, Nazmeen           61
    051534180 SMITH, Linda               61
    433097365 LOTT, Georgett             57
    276976781 MELVIN, Vic                56
    033221495 SALUTUE, Gregory           55
    183995480 THOMSON, Kimberlee         53
    826916025 HARKIN, Andrew             50
    017101413 SNEDE, Jack                48

Here is a perl script that demonstrates a way to parse files which have a format similar to the following example:

    # This is a test file.
    [Startup]
    directory = /users/cs/study # Testing comment
    printer = linuxlj
    groupid = 9002

    [Shutdown]
    # Comment test.
    confirm = true
    reboot = false

(The parser below actually lets things through that it shouldn't, but that is acceptable for demonstration purposes.)

    #!/usr/bin/perl -w
    use strict;

    # Remove leading and trailing whitespace from a string.
    sub trim_spaces {
        my ($str) = @_;
        $str =~ s/\s+$//;
        $str =~ s/^\s+//;
        return $str;
    }

    while (<>) {
        chomp;
        next if /^\s*#/;    # Ignore lines that contain only a comment.
        next if /^\s*$/;    # Ignore empty lines.
        s/#.*//;            # Remove trailing comments.

        if (/^\s*\[(.*)\]\s*$/) {
            my $sec = &trim_spaces($1);
            print "section name: '$sec'\n";
            next;
        } elsif (/^\s*(.*)\s*=\s*(.*)\s*$/) {
            my $attr = &trim_spaces($1);
            my $val = &trim_spaces($2);
            print "attribute '$attr' equals '$val'\n";
            next;
        } else {
            print "Line $. invalid: '$_'\n";
        }
    }

The parser examines each line in the file and skips over lines that consist of only a comment or are empty. It then strips off comments that appear on non-empty lines. The script then tests the line against a couple of regular expressions searching for a match. When it finds a match, it strips any leading or trailing spaces from the relevant substrings that were matched by the regular expression and displays them.
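
As a variation on the parser above, the matched substrings could be stored rather than printed. The sketch below is not the script's own method: it assumes a hypothetical function parse_config that takes the lines as a list, keeps the values in a hash of hashes, and uses non-greedy captures in place of the trim_spaces helper.

```perl
use strict;
use warnings;

# Parse configuration lines into a hash of hashes: $config{section}{attribute}.
# (parse_config is a hypothetical name; lines are passed in as a list.)
sub parse_config {
    my @lines = @_;
    my (%config, $section);
    for my $line (@lines) {
        $line =~ s/#.*//;            # strip comments
        next if $line =~ /^\s*$/;    # skip lines that are now empty
        if ($line =~ /^\s*\[\s*(.*?)\s*\]\s*$/) {
            $section = $1;
        } elsif (defined $section && $line =~ /^\s*(.*?)\s*=\s*(.*?)\s*$/) {
            $config{$section}{$1} = $2;
        }
    }
    return %config;
}

my %config = parse_config(
    '# This is a test file.',
    '[Startup]',
    'directory = /users/cs/study # Testing comment',
    'groupid = 9002',
    '[Shutdown]',
    'confirm = true',
);

for my $sec (sort keys %config) {
    for my $attr (sort keys %{ $config{$sec} }) {
        print "$sec.$attr = '$config{$sec}{$attr}'\n";
    }
}
```

Like the original, this sketch is lenient: lines that match neither pattern are silently ignored rather than reported.
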

Like C with its fopen() and fclose() functions, perl supports a means of
doing input from and output to a file. To demonstrate file I/O in perl,
consider the following script:

    #!/usr/bin/perl -w
    use strict;

    my ($passwd, $results) = qw< /etc/passwd results.out >;
    my %shells;

    open FILE, $passwd
        or die "Cannot open password file: $!\n";
    while (<FILE>) {
        chomp;
        next if /ppp/;
        $shells{(split /:/)[-1]} ++;
    }
    close FILE
        or die "Cannot close password file: $!\n";

    open RES, "> $results"
        or die "Cannot open '$results' for write: $!\n";
    for (sort { $shells{$b} <=> $shells{$a} } keys %shells ) {
        print RES "$_: $shells{$_}\n";
    }
    close RES
        or die "Cannot close '$results': $!\n";

There are several things to note about the above script:

The file is opened with the open function. This function typically takes two
arguments: a file handle and a scalar holding the name of the file in the
file system to open. The file handle represents the connection between your
perl script and the file itself and is typically written in all capital
letters. We have already seen one file handle: STDIN. By default, open will
open the file for read access. We'll see opening for write access later.

If the open call fails, it returns a false value. Because the open call is
the left operand of an or logical operation, the expression to the right of
the or is then evaluated. That expression terminates the program. This is a
very common idiom in perl.

Note that the or logical operator has lower precedence than the traditional
|| operator. If we wanted to use the more conventional || operator, we would
have to put the arguments of open in parentheses to ensure that || applies to
the result of open rather than to its last argument:

    open (FILE, $passwd) || die "Cannot open password file: $!\n";

A lot of (older) perl code uses the || operator, but using or in the above
context seems to be increasing in popularity.

The $! special variable inside the string argument to the die function is
interpolated into an error message that explains why the most recent system
or library call failed. In the code above, $! will contain a string that
indicates why the script was not able to open the file. For example, if we
tried to open a file that we did not have read permission on, then $! would
be set to the string Permission denied. If the file did not exist, then $!
would be set to No such file or directory. The contents of the $! variable
are quite useful and should be displayed whenever one of perl's function
calls fails.

The <FILE> notation can be used to read a line from the file handle denoted
by FILE. Again, we have seen this notation before when reading lines from
STDIN. Because we are using the <FILE> notation as the condition of a while
loop, the $_ default variable is set to each line of the opened file in turn,
one line per iteration of the loop.

After we chomp the line, we skip over lines that have ppp in them. We then
increment a hash counter. The key in this hash is the shell used by the
current system user read in from the passwd file.

Note that the expression (split /:/)[-1] does two things. It first splits the
line of input using the regular expression : as a delimiter (each line of the
passwd file is delimited by colons). The result of the split function is a
list of elements. The expression then uses the indexing operator to access
the last element of this list. Using a negative number inside the square
brackets is a simple way of indexing a list starting from the back end of the
list. Note that the parentheses around the call to split are compulsory,
since the indexing operation must operate on the list that the split function
generates.

When we have finished reading, we call the close function on the filehandle
that we opened earlier. While it is not too common, the close operation can
fail, so it should be tested like the open function just to be safe, although
in practice this is rarely done.

We then open the file named by the $results scalar, which will be used to
store the results of the script. When opening a file for write access, we
give the filename the special first character >. This causes perl to open the
file for writing, creating it if necessary and overwriting its contents if it
already existed. As with opening for read, we check the result of the open
operation and die (with an appropriate error message using $!) if we were
unable to open the file for write access.

To write to the file, we use the print (or printf) function in a special
manner: the first argument to the print function is the file handle with
which we opened the file. The second argument is a list representing the
information that we want to store in the file.

    print RES "$_: $shells{$_}\n"; # NO comma following RES!!

Note that there is no comma separating the file handle from the list. This last point is very important.

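
The points above can be pulled together in a self-contained sketch that does not depend on /etc/passwd. It departs from the script in two ways, both assumptions of this example: it uses the three-argument form of open with lexical file handles (available in perl 5.6 and later) instead of bareword handles, and it works on a small sample file created in a temporary directory. The function name tally_last_field and the file names are made up.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Count the last colon-separated field of each line in $in and write the
# tallies, most frequent first, to $out. Returns the tallies as a hash.
# (tally_last_field is a hypothetical name for this sketch.)
sub tally_last_field {
    my ($in, $out) = @_;
    my %count;

    open(my $src, '<', $in) or die "Cannot open '$in' for read: $!\n";
    while (my $line = <$src>) {
        chomp $line;
        $count{ (split /:/, $line)[-1] }++;
    }
    close($src) or die "Cannot close '$in': $!\n";

    open(my $dst, '>', $out) or die "Cannot open '$out' for write: $!\n";
    for my $key (sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count) {
        print $dst "$key: $count{$key}\n";    # no comma after the file handle
    }
    close($dst) or die "Cannot close '$out': $!\n";

    return %count;
}

# Build a small passwd-style sample file in a temporary directory.
my $dir = tempdir(CLEANUP => 1);
my ($in, $out) = ("$dir/passwd.sample", "$dir/results.out");

open(my $fh, '>', $in) or die "Cannot create '$in': $!\n";
print $fh "root:x:0:0:root:/root:/bin/bash\n",
          "alice:x:1000:1000::/home/alice:/bin/bash\n",
          "daemon:x:1:1::/usr/sbin:/usr/sbin/nologin\n";
close($fh) or die "Cannot close '$in': $!\n";

my %count = tally_last_field($in, $out);
print "$_ => $count{$_}\n" for sort keys %count;
```

Passing the mode as a separate argument means a file name that happens to begin with > or < cannot be misread as a mode, which is one reason this form is often preferred in newer code.
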
Last modified: Fri Apr 4 14:37:49 2003