March 24, 2004 (Wednesday)

Input/Output (S&P — Chapter 6, cont'd)

The Diamond operator and command line arguments in Perl

When doing input, many Perl scripts use the diamond operator, <>, as demonstrated by the following script that counts word occurrences in a file:

#!/usr/bin/perl -w

use strict;

my %counter;

print "\@ARGV is (@ARGV)\n";

while (<>) {
	for my $word (split) {
		$counter{$word} ++;
	}
}

for (sort { $counter{$b} <=> $counter{$a} } keys %counter) {
	print "'$_' occurred $counter{$_} time",
		$counter{$_} == 1 ? "\n" : "s\n";
}
wc.pl

The code demonstrates a few new features of Perl that we haven't seen before.
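One of those features, the diamond operator, can be tried without typing anything on the command line. The following is a minimal sketch, not part of wc.pl: it assumes a throwaway file created with the core File::Temp module and made-up contents, and assigns to @ARGV by hand to imitate a file name given on the command line.

```perl
#!/usr/bin/perl -w

use strict;
use File::Temp qw(tempfile);

# Create a small throwaway input file (the contents are made up).
my ($fh, $filename) = tempfile(UNLINK => 1);
print $fh "the cat\nthe dog\n";
close $fh;

# Pretend the file was named on the command line.
@ARGV = ($filename);

my @words;
while (<>) {              # <> reads each line of each file in @ARGV into $_
	push @words, split;   # split with no arguments splits $_ on whitespace
}

print "Read ", scalar(@words), " words: @words\n";

# <> removes each file name from @ARGV as it opens the file.
print "\@ARGV is now empty\n" if @ARGV == 0;
```

If @ARGV is empty when the loop starts, <> reads from standard input instead, which is why scripts such as wc.pl also work at the end of a pipe.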

Regular Expressions (S&P — Chapters 7/8/9)

Regular expressions are one of the most important features of Perl. Quite simply, a regular expression is a pattern that either matches or doesn't match a target string. Regular expressions can be used to do elementary parsing of strings and for identifying and extracting relevant information from files, among other things.

In Perl, regular expressions are typically placed between forward slashes. By default, the regular expression is tested against the default variable $_. Regular expressions are usually evaluated in a scalar boolean context, so they commonly appear as the condition of an if statement or a while loop.
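As a small illustration of a match used as a condition, the sketch below loops over a hypothetical list of strings (the list and the variable names are made up for this example). The for loop places each string in $_, and the pattern between the slashes is tested against it.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical sample data.
my @lines = ("hello world", "goodbye", "say hello again");

my @matched;
for (@lines) {
	# With no variable named, the pattern is tested against $_.
	if (/hello/) {
		push @matched, $_;
		print "matched: $_\n";
	}
}

print scalar(@matched), " of ", scalar(@lines), " lines matched\n";
```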

Meta-characters

The characters inside a regular expression can be divided into two categories, literal characters and meta-characters. The literal characters will, of course, literally match themselves. For example, the regular expression:

/hello/

when matched against a string will return true if the string contains the character sequence hello. We can write a simple program that displays all lines matching a regular expression specified on the command line as follows:

#!/usr/bin/perl -w

use strict;

my $search = shift @ARGV;
die "No search pattern specified!\n" if ! defined $search;

print "The following lines contain the string '$search':\n";

while (<>) {
	print if /$search/;
}
regex1.pl

Note that variable interpolation takes place inside the slashes denoting the regular expression. This enables us to use the variable $search to represent our search expression. When we run this script specifying the regular expression search on the command line and using the Perl script itself as the input, we get:

$ ./regex1.pl search regex1.pl
The following lines contain the string 'search':
my $search = shift @ARGV;
die "No search pattern specified!\n" if ! defined $search;
print "The following lines contain the string '$search':\n";
        print if /$search/;

All lines containing the string search are displayed by the script. Be careful with this script, though: if you specify an invalid regular expression, Perl will terminate when it tries to compile it. This program also demonstrates the die function, which takes a string argument and prints it to standard error, then terminates the program with a non-zero exit status (Perl programs normally terminate with a zero status unless told otherwise). By default, regular expression matching is case sensitive (although there is an easy way to change this).
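One way to guard against an invalid pattern is to compile it inside an eval block before using it. This is only a sketch of that idea, not something regex1.pl does; the helper name compile_pattern is made up for this example.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: returns a compiled pattern, or undef if $pattern
# is not a valid regular expression.
sub compile_pattern {
	my ($pattern) = @_;
	# qr// compiles the pattern; eval traps the error an invalid one raises.
	my $regex = eval { qr/$pattern/ };
	return $regex;
}

for my $pattern ('he.lo', 'he(lo') {
	if (defined compile_pattern($pattern)) {
		print "'$pattern' is a valid regular expression\n";
	} else {
		print "'$pattern' is not a valid regular expression\n";
	}
}
```

A script like regex1.pl could call such a helper right after the shift and die with its own message when the result is undefined.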

Matching literal characters is usually not very interesting. The true power of regular expressions lies in their ability to represent more sophisticated patterns of characters. To do this, regular expressions employ meta-characters, which can be used to represent classes of characters or classes of character sequences. One of the most common meta-characters is the period, which matches any character (except newline, \n). For example, the regular expression /he.lo/ would match the strings hello, heLlo, and After he looked at the Perl script, his brain imploded. The last string matches because "he looked" contains he, a space, and then lo.
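The behaviour of the period can be checked with a short script. This is a sketch using the three strings mentioned above plus two extra cases chosen for this example; the helper name matches_he_lo and the display labels are made up.

```perl
#!/usr/bin/perl -w

use strict;

# Returns 1 if the string contains "he", any one character, then "lo".
sub matches_he_lo {
	my ($string) = @_;
	return $string =~ /he.lo/ ? 1 : 0;
}

# Each entry is [printable label, actual string]; labels are for display only.
my @tests = (
	[ 'hello',     "hello" ],
	[ 'heLlo',     "heLlo" ],
	[ 'he looked', "After he looked at the Perl script, his brain imploded." ],
	[ 'helo',      "helo" ],      # nothing between "he" and "lo"
	[ 'he\nlo',    "he\nlo" ],    # the period does not match a newline
);

for my $test (@tests) {
	my ($label, $string) = @$test;
	printf "%-10s %s\n", $label,
		matches_he_lo($string) ? "matches" : "does not match";
}
```

The first three strings match; the last two do not, because the period must consume exactly one character and that character cannot be a newline.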

To match arbitrary strings (instead of the default variable $_) against regular expressions, we can use the binding operator =~ in Perl. For example, the Perl statements:

my $string = "This string has 'hello' in it.";
print "Found the regular expression!\n" if $string =~ /.e..o/;

will cause the regular expression /.e..o/ to be matched against the variable $string. The regular expression goes on the right hand side of the =~ operator. Do not confuse this operator with the equality operators (== for numbers, eq for strings); they are quite different.
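To see the difference between a bound match and a match against the default variable, the sketch below tests the same pattern both ways. The value assigned to $_ and the result variable names are made up for this example.

```perl
#!/usr/bin/perl -w

use strict;

my $string = "This string has 'hello' in it.";
$_ = "no greeting here";    # hypothetical contents for the default variable

# With =~ the pattern is tested against $string and $_ is not consulted.
my $bound_result   = ($string =~ /.e..o/) ? "matched" : "did not match";

# Without =~ the same pattern is tested against $_.
my $default_result = (/.e..o/)            ? "matched" : "did not match";

print "\$string $bound_result\n";
print "\$_ $default_result\n";
```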

Another popular meta-character is the backslash, which can be used to turn a meta-character into a literal character. For example, to match a literal forward slash, followed by a period, followed by a backslash, we can use the regular expression /\/\.\\/. Note that because we are using forward slashes as our delimiter, we need to escape the forward slash inside the regular expression. We can make the regular expression slightly more readable by using a different delimiter: m%/\.\\%. Because we are using percent signs as delimiters rather than forward slashes, we only have to escape the dot and the backslash in this regular expression. However, because we are using a delimiter other than forward slashes, we must put m (for match) in front of the first percent sign.
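The two spellings of this pattern can be compared directly. The sketch below wraps each one in a small sub (the sub names and the sample strings are made up for this example) and runs both against a string that contains the sequence and one that does not.

```perl
#!/usr/bin/perl -w

use strict;

# The same pattern written with slash delimiters and with percent delimiters.
sub with_slashes  { return $_[0] =~ /\/\.\\/ ? 1 : 0 }
sub with_percents { return $_[0] =~ m%/\.\\% ? 1 : 0 }

# In single quotes \\ is one backslash, so these strings are
# path/.\file and path/x\file respectively.
for my $target ('path/.\\file', 'path/x\\file') {
	printf "%-12s slashes: %d  percents: %d\n",
		$target, with_slashes($target), with_percents($target);
}
```

Both forms agree on each string. The second string fails to match because the escaped dot only matches a literal period, not the x.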


Last modified: March 24, 2004 17:25:55 NST (Wednesday)