Computer Science 3710 -- March 24, 2004

Input/Output (S&P — Chapter 6, cont'd)

The Diamond operator and command line arguments in Perl

When doing input, many Perl scripts use the diamond operator, <>, as demonstrated by the following script that counts word occurrences in a file:

#!/usr/bin/perl -w

use strict;

my %counter;

print "\@ARGV is (@ARGV)\n";

while (<>) {
	for my $word (split) {
		$counter{$word} ++;
	}
}

for (sort { $counter{$b} <=> $counter{$a} } keys %counter) {
	print "'$_' occurred $counter{$_} time",
		$counter{$_} == 1 ? "\n" : "s\n";
}
wc.pl

The code demonstrates a few new features of Perl that we haven't seen before.

When a Perl program is started, the command line arguments are stored in the @ARGV array for us (note that variable names which are all upper case are typically 'special' variables in Perl). This array serves the same purpose as the argv parameter to the main() function in C and C++ programs. There is no need for an argc equivalent because the length of this array is easily determined by using @ARGV in scalar context.

Unlike C/C++, the first element of the @ARGV array is not the name of the program being run — $ARGV[0] is actually the first argument on the command line. The special Perl variable $0 stores the name of the program.

We are free to examine/modify @ARGV as we desire in our Perl scripts. For example, we can shift values from the front of the array and/or examine the array as we see fit. In the above program we simply display the contents of the array inside parentheses.
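The points above about @ARGV, $0 and shift can be tried out with a short sketch. The sub name describe_args is hypothetical (it is not part of wc.pl); it takes the program name and argument list as parameters so that it can also be called with made-up values.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# describe_args is a hypothetical helper: it takes a program name and a
# list of arguments and returns a one-line summary of them.
sub describe_args {
    my ($program, @args) = @_;
    my $count = @args;          # an array in scalar context gives its length
    my $first = shift @args;    # remove and return the first argument
    $first = '(none)' unless defined $first;
    return "${program}: $count argument(s), first is $first, "
         . scalar(@args) . " left";
}

# $0 is the program name; @ARGV holds only the arguments.
print describe_args($0, @ARGV), "\n";
```

Running the script with different numbers of arguments shows that $ARGV[0] is the first argument rather than the script name.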

Next, we use the diamond operator to do input. What this operator does depends upon how the Perl script was invoked:

If there were arguments specified on the command line, e.g.
```
$ ./wc.pl arg1 arg2 arg3 ...
```
then each of these arguments will be treated as a file name. The first file will be opened and the diamond operator in the while condition will read each line from this file and assign the line to $_. The body of the while loop will then be executed. The next line from the first file will then be read in and the process repeated. When all the lines have been read in from the first file, it is closed and the second file is opened and treated the same way. This process continues until all the lines in all the files specified on the command line have been read.
If there were no command line arguments, then the diamond operator would attempt to read input lines from standard input and assign each line to $_ as before. The diamond operator will return undef (and cause the while loop to terminate) when all the input lines have been read in (i.e. when the user presses Ctrl-D).
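The file-by-file behaviour described above can be observed without typing anything on the command line. The following is only a sketch: the sub count_lines_per_file and the two temporary files are invented for illustration. It relies on the special variable $ARGV (a scalar, distinct from the @ARGV array), which holds the name of the file the diamond operator is currently reading.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# count_lines_per_file is a hypothetical helper. It temporarily replaces
# @ARGV so that the diamond operator reads the files passed to the sub.
sub count_lines_per_file {
    my @files = @_;
    my %lines;
    local @ARGV = @files;
    local $_;
    while (<>) {
        $lines{$ARGV}++;    # $ARGV is the name of the file being read
    }
    return %lines;
}

# Two small temporary files stand in for arg1 and arg2.
my ($fh1, $file1) = tempfile(UNLINK => 1);
print {$fh1} "one\ntwo\n";
close $fh1;
my ($fh2, $file2) = tempfile(UNLINK => 1);
print {$fh2} "three\n";
close $fh2;

my %lines = count_lines_per_file($file1, $file2);
printf "%s: %d line(s)\n", $_, $lines{$_} for ($file1, $file2);
```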

If you were to invoke the script as:

$ ./wc.pl < arg1 arg2 arg3

Then the lines of arg1 would essentially form the standard input for the program, and the shell would remove < arg1 from the command line. @ARGV would then be set to qw/arg2 arg3/. The diamond operator would then read the lines from the arg2 and arg3 files; the lines in arg1 would be ignored by the diamond operator. (The lines from the arg1 file could still be read by explicitly reading from standard input, i.e. <STDIN>.)

The body of the while loop simply splits each line of the input into its constituent words. split returns the words as a list. The split function typically takes two arguments: the first argument is the pattern to be used to split the line and the second argument is the line itself. If the second argument is not specified, then split will use the value of the default variable $_; if the first argument is also not specified, then split will split on runs of whitespace characters, ignoring any leading whitespace. Therefore, the above invocation of split is equivalent to split(' ', $_).
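To see why the whitespace default matters, the sketch below compares a bare split with a split on the pattern / / (a single space). The sample line is made up for illustration.

```perl
use strict;
use warnings;

my $line = "  the quick  brown\tfox \n";

# Bare split is the same as split(' ', $_): leading whitespace is skipped
# and a run of whitespace counts as a single separator.
my @words;
for ($line) {               # the loop makes $_ an alias for $line
    @words = split;
}

# The pattern / / splits at every single space character instead, so
# leading and doubled spaces produce empty fields.
my @fields = split / /, $line;

print scalar(@words),  " words: @words\n";
print scalar(@fields), " fields when splitting on / /\n";
```

For counting words, the bare form is the one we want, because it never produces empty strings as hash keys.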

Finally, we display the words in decreasing order of occurrence (the most commonly occurring word will be displayed first followed by the second most commonly occurring word etc.). The important part of the loop is the contents of the for parentheses:

sort { $counter{$b} <=> $counter{$a} } keys %counter

This line of code demonstrates two new concepts:

It demonstrates how to numerically sort an array of numbers. Remember that sort, by default, does a lexicographical ordering. We can do a numeric sort by specifying a custom comparison function to the sort function. We do so by directly embedding the comparison function between the sort function name and the array to be sorted. For example:
```
my @array = (512, 64, 256, 16, 1024, 32, 128);
print join(",", sort { $a <=> $b } @array), "\n";
```
This anonymous comparison function will be called many times by the sort function. The $a and $b variables will be set to the two values that the sort function wishes to compare. We want our comparison function to return:
- -1 if the first argument is less than the second
- 0 if the first argument is equal to the second
- 1 if the first argument is greater than the second
Fortunately, Perl has an operator that works on two numbers that does exactly this — the <=> operator, which is sometimes called the spaceship operator. If we wanted to sort the numbers in decreasing order, we simply swap the $a and $b. This is what we do in our word count script.
The sort line above also demonstrates how we can sort a hash by value instead of by key. Inside our comparison function, $a and $b are going to be set to the keys of the hash. In the context of our script, these keys are words that were encountered in the input. We then use these words to determine the number of times each word occurred. We can do this by giving our %counter hash the appropriate keys which will be $a and $b in our comparison function.
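Both ideas can be checked in isolation with a small sketch. The hash contents are invented for illustration, and the alphabetical tie-break using cmp is an addition that wc.pl does not have; without it, words with equal counts come out in no particular order.

```perl
use strict;
use warnings;

my @array = (512, 64, 256, 16, 1024, 32, 128);

my @as_strings = sort @array;                 # default: compared as strings
my @ascending  = sort { $a <=> $b } @array;   # numeric, smallest first
my @descending = sort { $b <=> $a } @array;   # numeric, largest first

print join(",", @as_strings), "\n";
print join(",", @ascending),  "\n";
print join(",", @descending), "\n";

# Sorting a hash by value: $a and $b are keys, so we look up their counts.
my %counter = (the => 3, cat => 1, sat => 1, on => 2);
my @by_count = sort {
    $counter{$b} <=> $counter{$a}   # larger counts first
        or
    $a cmp $b                       # equal counts: alphabetical order
} keys %counter;

print "@by_count\n";
```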

Regular Expressions (S&P — Chapters 7/8/9)

Regular expressions are one of the most important features of Perl. Quite simply, a regular expression is a pattern that either matches or doesn't match a target string. Regular expressions can be used to do elementary parsing of strings and for identifying and extracting relevant information from files, among other things.

In Perl, regular expressions are typically placed between forward slashes. By default, the regular expression is tested against the default variable $_. Typically, regular expressions are used in a scalar boolean context, therefore it is quite common to see them used in an if conditional statement or as the condition in a while loop.

Meta-characters

The characters inside a regular expression can be divided into two categories, literal characters and meta-characters. The literal characters will, of course, literally match themselves. For example, the regular expression:

/hello/

when matched against a string will return true if the string contains the character sequence hello. We can write a simple program that will display all lines that match a regular expression specified on the command line as follows:

#!/usr/bin/perl -w

use strict;

my $search = shift @ARGV;
die "No search pattern specified!\n" if ! defined $search;

print "The following lines contain the string '$search':\n";

while (<>) {
	print if /$search/;
}
regex1.pl

Note that variable interpolation takes place inside the slashes denoting the regular expression. This enables us to use the variable $search to represent our search expression. When we run this script specifying the regular expression search on the command line and using the Perl script itself as the input, we get:

$ ./regex1.pl search regex1.pl
The following lines contain the string 'search':
my $search = shift @ARGV;
die "No search pattern specified!\n" if ! defined $search;
print "The following lines contain the string '$search':\n";
	print if /$search/;

All lines containing the string search are displayed by the script. Note that you do have to be careful with this script. If you specify an invalid regular expression, Perl will terminate when it tries to compile it. This program also demonstrates the use of the die function, which takes a string argument and displays the string on standard error. It then causes the program to terminate with a non-zero exit status (Perl programs normally terminate with a zero status, unless told otherwise). Regular expression matching is case sensitive by default (although there is an easy way to change this).
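One possible way to guard against an invalid pattern is to compile it inside an eval block before using it. This is a sketch rather than a change to regex1.pl: the sub name compile_pattern is hypothetical, and qr// and eval are features not covered so far in these notes. The sketch also shows the i modifier, which is the easy way to ignore case.

```perl
use strict;
use warnings;

# compile_pattern is a hypothetical helper: it returns a compiled regular
# expression, or undef if the text is not a valid pattern.
sub compile_pattern {
    my ($text) = @_;
    my $regex = eval { qr/$text/ };   # eval traps the error from a bad pattern
    return $regex;
}

my $ok  = compile_pattern('sea.ch');
my $bad = compile_pattern('sea(rch');   # unmatched parenthesis

print "valid pattern compiled\n"   if defined $ok;
print "invalid pattern rejected\n" if !defined $bad;

# The i modifier after the closing slash makes the match case-insensitive.
my $ignoring_case = 0;
for ("Search pattern") {
    $ignoring_case = 1 if /search/i;
}
print "matched ignoring case\n" if $ignoring_case;
```

A script could call die with its own message when the helper returns undef, instead of letting Perl stop with a compilation error.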

Matching literal characters is usually not very interesting. The true power of regular expressions lies in their ability to represent more sophisticated patterns of characters. To do this, regular expressions employ meta-characters which can be used to represent classes of characters or classes of character sequences. One of the most common meta-characters is the period, which matches any single character (except newline, \n). For example, the regular expression /he.lo/ would match the strings "hello" and "heLlo", as well as the sentence "After he looked at the Perl script, his brain imploded." (here the period matches the space in "he looked").
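The behaviour of the period can be confirmed with a short sketch. The last two candidate strings are additions chosen to show cases that do not match.

```perl
use strict;
use warnings;

my @candidates = (
    "hello",
    "heLlo",
    "After he looked at the Perl script, his brain imploded.",
    "helo",         # no character between "he" and "lo"
    "he\nlo",       # the period does not match a newline
);

my @matches;
for (@candidates) {
    push @matches, $_ if /he.lo/;   # tested against $_ by default
}

print scalar(@matches), " of ", scalar(@candidates), " strings matched\n";
```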

To match arbitrary strings (instead of the default variable $_) against regular expressions, we can use the binding operator =~ in Perl. For example, the Perl expression:

$string =~ /.e..o/

will cause the regular expression /.e..o/ to be matched against the variable $string. The regular expression goes on the right hand side of the =~ operator. Do not confuse this operator with the equality operator ==; the two are quite different.
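As a minimal sketch of the binding operator in use (the value assigned to $string is an assumption made for illustration), the following also shows its negated form, !~, which is true when the pattern does not match.

```perl
use strict;
use warnings;

my $string = "hello";

# =~ tests $string against the pattern; $_ is not involved.
my $matched = ($string =~ /.e..o/) ? 1 : 0;

# !~ is the negated form of the same test.
my $not_matched = ($string !~ /.e..o/) ? 1 : 0;

print "'$string' matches /.e..o/\n"        if $matched;
print "'$string' does not match /.e..o/\n" if $not_matched;
```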

Another popular meta-character is the backslash, which can be used to turn a meta-character into a literal character. For example, to match a literal forward slash, followed by a period, followed by a backslash, we can use the regular expression /\/\.\\/. Note that because we are using forward slashes as our delimiter, we need to escape the forward slash inside the regular expression. We can make the regular expression slightly more readable by using a different delimiter: m%/\.\\%. Because we are using percent signs as delimiters rather than forward slashes, we only have to escape the dot and the backslash in this regular expression. However, because we are using a delimiter other than forward slashes, we must use m (for match) in front of the first percent delimiter.
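The two forms of this regular expression can be compared directly. In this sketch the target strings are made up; single quotes are used so that \\ in the source stands for one backslash in the string.

```perl
use strict;
use warnings;

my $target = 'path/.\\name';    # contains the three characters / . \
my $other  = 'path/x\\name';    # an x where the literal period should be

# Same pattern, two delimiters: only the slash form must escape the slash.
my $with_slashes = ($target =~ /\/\.\\/) ? 1 : 0;
my $with_percent = ($target =~ m%/\.\\%) ? 1 : 0;

# The escaped period is literal, so it does not match the x.
my $other_matches = ($other =~ m%/\.\\%) ? 1 : 0;

print "slash delimiters matched\n"   if $with_slashes;
print "percent delimiters matched\n" if $with_percent;
print "escaped period is literal\n"  if !$other_matches;
```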
