March 22 (Monday) March 26 (Friday)
When doing input, many Perl scripts use the diamond operator, <>, as
demonstrated by the following script that counts word occurrences in
a file:
#!/usr/bin/perl -w
use strict;
my %counter;
print "\@ARGV is (@ARGV)\n";
# Read each input line and count every whitespace-separated word.
while (<>) {
    for my $word (split) {
        $counter{$word}++;
    }
}
# Report the words, most frequent first.
for (sort { $counter{$b} <=> $counter{$a} } keys %counter) {
    print "'$_' occurred $counter{$_} time",
          $counter{$_} == 1 ? "\n" : "s\n";
}
wc.pl
The code demonstrates a few new features of Perl that we haven't seen before.
Perl sets up the @ARGV array for us (note that variable names which
are all upper case are typically `special' variables in Perl). This
array serves the same purpose as the argv parameter to the main()
function in C and C++ programs. Note that there is no need for an argc
equivalent because the length of this array is easily determined by
using @ARGV in scalar context.
Unlike argv in C, the first element of the @ARGV array is not the name
of the program being run: $ARGV[0] is actually the first argument on
the command line. The special Perl variable $0 stores the name of the
program.
We can manipulate @ARGV as we desire in our Perl scripts. For example,
we can shift values from the front of the array and/or examine the
array as we see fit. In the above program we simply display the
contents of the array inside parentheses.
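The points about @ARGV and $0 can be tried out with a short sketch
along the following lines. The argument values arg1, arg2 and arg3 are
assumptions, assigned inside the script to simulate a command line so
that the example behaves the same however it is run.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumption: simulate the command line "script arg1 arg2 arg3".
@ARGV = qw(arg1 arg2 arg3);

my $count = @ARGV;           # scalar context yields the number of elements
print "Program name: $0\n";  # $0 holds the script name, not $ARGV[0]
print "Number of arguments: $count\n";

my $first = shift @ARGV;     # remove and return the first argument
print "First argument: $first\n";
print "Remaining arguments: (@ARGV)\n";
```

In a real script the assignment to @ARGV would be omitted and the
values would come from the command line.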
If we invoke the script with arguments, for example:
$ ./wc.pl arg1 arg2 arg3 ...
then each of these arguments will be treated as a file name. The first
file will be opened and the diamond operator in the while condition
will read a line from this file and assign it to $_. The body of the
while loop will then be executed. The next line from the first file
will then be read in and the process repeated. When all the lines have
been read in from the first file, it is closed and the second file is
opened and treated the same way. This process continues until all the
lines in all the files specified on the command line have been read.
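To watch the diamond operator move from one file to the next, the
following sketch creates two small temporary files and places their
names in @ARGV, as if they had been given on the command line. The file
contents and the use of the core File::Temp module are assumptions made
only to keep the example self-contained.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create two temporary files standing in for command-line arguments.
my @files;
for my $text ("one\ntwo\n", "three\n") {
    my ($fh, $name) = tempfile(UNLINK => 1);
    print $fh $text;
    close $fh;
    push @files, $name;
}

@ARGV = @files;      # pretend these names were typed on the command line

my @lines;
while (<>) {         # reads every line of the first file, then the second
    chomp;           # strip the trailing newline
    push @lines, $_;
}
print "Read ", scalar(@lines), " lines: @lines\n";
```

Once the loop finishes, @ARGV is empty because the diamond operator
removes each file name as it opens the file.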
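If no file names are specified on the command line, the diamond
operator reads lines from standard input and assigns each line to $_
as before. The diamond operator will return undef (and cause the while
loop to terminate) when all the input lines have been read in (e.g.
when the user presses Ctrl-D).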
Note what happens if we use shell input redirection when invoking the
script:
$ ./wc.pl < arg1 arg2 arg3
The lines of arg1 would then essentially form the standard input for
the program, and the redirection would be dropped from the command line
by the shell. @ARGV would then be set to qw/arg2 arg3/. The diamond
operator would then read the lines from the arg2 and arg3 files; the
lines in arg1 would be ignored by the diamond operator. (The lines from
the arg1 file could still be read by explicitly reading from standard
input, i.e. <STDIN>.)
The body of the while loop simply splits each line of the input into
its constituent words. split returns the words as a list. The split
function typically takes two arguments: the first argument is the
pattern to be used to split the line and the second argument is the
line itself. If the second argument is not specified, then split will
use the value of the default variable $_; if the first argument is also
not specified, then split will split on whitespace characters.
Therefore, the above invocation of split is equivalent to
split(' ', $_).
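The different ways of calling split can be compared with a short
sketch. The sample strings are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

$_ = "  the quick  brown fox\n";

my @implicit = split;                      # no arguments: split $_ on whitespace
my @explicit = split(' ', $_);             # the same call written out in full
my @fields   = split(/:/, "root:x:0:0");   # explicit pattern and explicit string

print join("|", @implicit), "\n";
print join("|", @explicit), "\n";
print join("|", @fields), "\n";
```

The first two calls produce the same four words; leading whitespace and
the trailing newline do not produce empty fields.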
Now consider the expression inside the for parentheses:
sort { $counter{$b} <=> $counter{$a} } keys %counter
This line of code demonstrates two new concepts:
sort, by default, does a lexicographical ordering. We can do a numeric
sort by specifying a custom comparison function to the sort function.
We do so by directly embedding the comparison function between the sort
function name and the array to be sorted. For example:
my @array = (512, 64, 256, 16, 1024, 32, 128);
print join(",", sort { $a <=> $b } @array), "\n";
This anonymous comparison function will be called many times by the
sort function. The $a and $b variables will be set to the two values
that the sort function wishes to compare. We want our comparison
function to return:
  -1 if the first argument is less than the second
  0 if the first argument is equal to the second
  1 if the first argument is greater than the second
This is exactly what the <=> operator, which is sometimes called the
spaceship operator, returns. If we wanted to sort the numbers in
decreasing order, we simply swap the $a and $b. This is what we do in
our word count script.
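The difference between the default ordering and the two numeric
orderings can be seen side by side in the following sketch, which
reuses the array of numbers from the example above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @array = (512, 64, 256, 16, 1024, 32, 128);

my @lexical    = sort @array;                  # default: compares as strings
my @ascending  = sort { $a <=> $b } @array;    # numeric, smallest first
my @descending = sort { $b <=> $a } @array;    # numeric, largest first

print "lexical:    @lexical\n";
print "ascending:  @ascending\n";
print "descending: @descending\n";

# The spaceship operator itself returns -1, 0 or 1.
my @compared = (2 <=> 10, 10 <=> 10, 10 <=> 2);
print "comparisons: @compared\n";
```

In the lexical ordering, 1024 comes before 128 because the strings are
compared character by character.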
The sort line above also demonstrates how we can sort a hash by value
instead of by key. Inside our comparison function, $a and $b are going
to be set to keys of the hash. In the context of our script, these keys
are words that were encountered in the input. We then use these words
to look up the number of times each word occurred, by indexing our
%counter hash with $a and $b inside the comparison function.
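A small sketch of sorting a hash by value follows. The words and counts
are made up, and the counts are all different so that the resulting
order is fully determined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical word counts.
my %counter = (the => 3, cat => 1, sat => 2);

# $a and $b are keys; the comparison looks up and compares their values.
my @by_count = sort { $counter{$b} <=> $counter{$a} } keys %counter;

for my $word (@by_count) {
    print "$word: $counter{$word}\n";
}
```

When two words have the same count, this comparison function does not
say which should come first.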
Regular expressions are one of the most important features of Perl. Quite simply, a regular expression is a pattern that either matches or doesn't match a target string. Regular expressions can be used to do elementary parsing of strings and for identifying and extracting relevant information from files, among other things.
In Perl, regular expressions are typically placed between forward
slashes. By default, the regular expression is tested against the
default variable $_. Typically, regular expressions are used in a
scalar boolean context, therefore it is quite common to see them used
in an if conditional statement or as the condition in a while loop.
The characters inside a regular expression can be divided into two categories, literal characters and meta-characters. The literal characters will, of course, literally match themselves. For example, the regular expression:
/hello/
when matched against a string will return true if the string contains
the character sequence hello. We can write a simple program that will
display all lines that contain a regular expression specified on the
command line as follows:
#!/usr/bin/perl -w
use strict;
my $search = shift @ARGV;
die "No search pattern specified!\n" if ! defined $search;
print "The following lines contain the string '$search':\n";
while (<>) {
    print if /$search/;
}
regex1.pl
Note that variable interpolation takes place inside the slashes
denoting the regular expression. This enables us to use the variable
$search to represent our search expression. When we run this script
specifying the regular expression search on the command line and using
the Perl script itself as the input, we get:
$ ./regex1.pl search regex1.pl
The following lines contain the string 'search':
my $search = shift @ARGV;
die "No search pattern specified!\n" if ! defined $search;
print "The following lines contain the string '$search':\n";
    print if /$search/;
All lines containing the string search are displayed by the script.
Note that you do have to be careful with this script. If you specify an
invalid regular expression, the script will terminate with an error
when Perl tries to compile it. This program also demonstrates the use
of the die function, which takes a string argument and prints it to
standard error. It then causes the program to terminate with a non-zero
exit status (Perl programs normally terminate with a zero status,
unless told otherwise). Regular expression matching is case sensitive
by default (although there is an easy way to change this).
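One possible way to guard against an invalid pattern, offered only as a
sketch and not as part of regex1.pl, is to compile the pattern with qr//
inside an eval block and check whether that succeeded. The helper name
compile_pattern and the sample patterns are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a compiled pattern, or undef if $pattern
# is not a valid regular expression.
sub compile_pattern {
    my ($pattern) = @_;
    my $regex = eval { qr/$pattern/ };   # eval traps the compile error
    return $regex;
}

my @report;
for my $pattern ('search', 'he.lo', '(unclosed') {
    my $status = defined compile_pattern($pattern) ? "valid" : "invalid";
    push @report, "$pattern=$status";
    print "'$pattern' is $status\n";
}
```

A script could call such a helper on the command-line argument and use
die with a friendlier message when the result is undefined.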
Matching literal characters is usually not very interesting. The true
power of regular expressions lies in their ability to represent more
sophisticated patterns of characters. To do this, regular expressions
employ meta-characters which can be used to represent classes of
characters or classes of character sequences. One of the most common
meta-characters is the period, which matches any character (except
newline, \n). For example, the regular expression /he.lo/ would match
the strings hello, heLlo, and
After he looked at the Perl script, his brain imploded.
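The behaviour of the period can be checked with a sketch like the one
below. The first three strings are the ones from the text; the last two
are additional, made-up strings that should not match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @candidates = (
    "hello",
    "heLlo",
    "After he looked at the Perl script, his brain imploded",
    "helo",        # no character between "he" and "lo"
    "he\nlo",      # the period does not match a newline
);

# grep tests each candidate (in $_) against the pattern.
my @matched = grep { /he.lo/ } @candidates;

print "Matched: $_\n" for @matched;
print scalar(@matched), " of ", scalar(@candidates), " strings matched\n";
```

The long sentence matches because the period matches the space in
"he looked".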
To match arbitrary strings (instead of the default variable $_) against
regular expressions, we can use the binding operator =~ in Perl. For
example, the Perl statements:
my $string = "This string has 'hello' in it.";
print "Found the regular expression!\n" if $string =~ /.e..o/;
will cause the regular expression /.e..o/ to be matched against the
variable $string. The regular expression goes on the right hand side of
the =~ operator. Do not confuse this operator with the equality
relational operator: the two are quite different.
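The following sketch contrasts a match bound to a named variable with a
match against the default variable. The value assigned to $_ is made up
and deliberately does not match the pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $string = "This string has 'hello' in it.";
$_ = "nothing to see";   # made-up content for the default variable

my $bound   = ($string =~ /.e..o/) ? "match" : "no match";   # tests $string
my $default = (/.e..o/)            ? "match" : "no match";   # tests $_

print "\$string: $bound\n";
print "\$_:      $default\n";
```

The same pattern gives different results because only the first test is
bound to $string.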
Another popular meta-character is the backslash, which can be used to
turn a meta-character into a literal character. For example, to match a
literal forward slash, followed by a period, followed by a backslash,
we can use the regular expression /\/\.\\/. Note that because we are
using forward slashes as our delimiter, we need to escape the forward
slash inside the regular expression. We can make the regular expression
slightly more readable by using a different delimiter: m%/\.\\%.
Because we are using percent signs as delimiters rather than forward
slashes, we only have to escape the dot and the backslash in this
regular expression. However, because we are using a delimiter pair
other than forward slashes, we must use m (for match) in front of the
first percent delimiter.
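That the two spellings describe the same pattern can be checked with a
sketch such as the following. The two target strings are made up: the
first contains a slash, a period and a backslash in sequence, and the
second has an x in place of the period.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# In single-quoted strings, \\ denotes one literal backslash.
my $with_dot    = 'dir/.\\name';
my $without_dot = 'dir/x\\name';

my @results;
for my $target ($with_dot, $without_dot) {
    my $slashes  = ($target =~ /\/\.\\/) ? "match" : "no match";
    my $percents = ($target =~ m%/\.\\%) ? "match" : "no match";
    push @results, "$slashes/$percents";
    print "$target: with slashes: $slashes, with percents: $percents\n";
}
```

Both forms agree on each string, and the second string fails to match
because the escaped period only matches a literal period.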
Last modified: March 24, 2004 17:25:55 NST (Wednesday)