Monday, March 31, 2003

Other I/O issues

There is a 'gotcha' in perl related to functions that take lists as parameters (especially the print function). If you use parentheses to enclose the arguments of a function that takes a list, make sure that the parentheses enclose all the arguments; otherwise things may not behave as you expect. For example, let's say we wanted to add two numbers, multiply the result by a third number, and display the result:

print (2+3)*4;

This code does not do what one would expect. Because parentheses are used, the print function assumes that all of its arguments are inside the parentheses. The expression 2+3 is evaluated and the result, 5, is passed to print, which displays the value 5. Because print is a function, it returns a value (normally 1, if everything was displayed okay). This return value is then multiplied by 4, and the result of the multiplication is not used anywhere (i.e. it is evaluated in a void context). To correctly print the result of the arithmetic expression, put parentheses around everything that print is to display:

print ((2+3)*4);

Again, the -w perl option is very helpful in finding these sorts of errors.
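
As a minimal sketch of two other ways to sidestep the problem (the variable name $result is only illustrative), we can either compute the value before printing it, or put a unary plus in front of the opening parenthesis so that perl does not read it as the start of print's argument list:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Compute the value first so there is no doubt about what print receives.
my $result = (2 + 3) * 4;
print "$result\n";

# A leading unary plus tells perl that the "(" does not open print's
# argument list, so the whole expression is passed to print.
print +(2 + 3) * 4, "\n";
```

Both print statements display 20.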

Regular Expressions (S&P -- Chapter 7/8/9)

Regular expressions are one of the most important concepts in perl. Quite simply, a regular expression is a pattern that either matches or doesn't match a target string. Regular expressions can be used to do elementary parsing of strings and for identifying and extracting relevant information from files, among other things.

In perl, regular expressions are placed between forward slashes (but, as with qw, any pair of matching delimiters is okay). By default, the regular expression is tested against the default variable $_. Typically, regular expressions are used in a scalar boolean context, so it is quite common to see them used in an if conditional statement or as the condition in a while loop.
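
The following short sketch shows a match against the default variable inside a conditional. The array @lines and the counter $matches are made up for the illustration; a real script would more likely read its lines from a file:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lines = ("hello world\n", "goodbye\n", "say hello\n");
my $matches = 0;

for (@lines) {          # each element becomes $_ in turn
    if (/hello/) {      # no target given, so the match is tested against $_
        $matches++;
        print;          # print also defaults to $_
    }
}
print "$matches matching lines\n";
```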

Meta-characters

The characters inside a regular expression can be divided into two categories, literal characters and meta-characters. The literal characters will, of course, literally match themselves. For example, the regular expression:

/hello/

when matched against a string will return true if the string contains the character sequence hello. We can write a simple program that will display all lines that contain a regular expression specified on the command line as follows:


#!/usr/bin/perl -w

use strict;

my $search = shift @ARGV;
die "No search pattern specified!\n" if ! defined $search;

print "The following lines contain the string '$search':\n";

while (<>) {
	print if /$search/;
}

Note that variable interpolation takes place inside the slashes denoting the regular expression. This enables us to use the variable $search to represent our search expression. When we run this script, specifying the regular expression search on the command line and using the perl script itself as the input, we get:

$ ./regex1.pl search regex1.pl
The following lines contain the string 'search':
my $search = shift @ARGV;
die "No search pattern specified!\n" if ! defined $search;
print "The following lines contain the string '$search':\n";
        print if /$search/;

All lines containing the string search are displayed by the script. Note that you do have to be careful with this script: if you specify an invalid regular expression, perl will terminate when it tries to compile it. This program also demonstrates the use of the die function, which takes a string argument and displays the string on standard error. It then causes the program to terminate with a non-zero exit status (perl programs terminate with a zero status unless told otherwise). By default, regular expression matching is case sensitive (although there is an easy way to change this).
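
One possible way to guard against an invalid pattern is sketched below. It relies on eval to trap the fatal error and on the qr operator to compile the pattern ahead of time; neither is used in the script above, and compile_pattern is a hypothetical helper name chosen only for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# compile_pattern is a hypothetical helper: it returns a compiled
# pattern, or undef if perl rejects the text as a regular expression.
sub compile_pattern {
    my ($text) = @_;
    my $re = eval { qr/$text/ };   # eval traps the fatal compile error
    return $re;
}

my $good = compile_pattern('he.lo');
my $bad  = compile_pattern('he(lo');   # unbalanced parenthesis

print "first pattern is ",  (defined $good ? "valid" : "invalid"), "\n";
print "second pattern is ", (defined $bad  ? "valid" : "invalid"), "\n";
```

A script could use the same test to print a friendly message and exit instead of letting perl abort.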

Matching literal characters is usually not very interesting. The true power of regular expressions lies in their ability to represent more sophisticated patterns of characters. To do this, regular expressions employ meta-characters, which can be used to represent classes of characters or classes of character sequences. One of the most common meta-characters is the period, which matches any character (except newline, \n). For example, the regular expression he.lo would match the strings hello, heLlo, and After he looked at the perl script, his brain imploded. (the last one because he looked contains he lo).
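
To check these claims, here is a small sketch that tests he.lo against the three strings above plus a fourth string (added for this example) in which a newline sits where the dot would have to match:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @targets = (
    "hello",
    "heLlo",
    "After he looked at the perl script, his brain imploded.",
    "he\nlo",      # the dot does not match the newline
);

# Record 1 for each string that matches and 0 for each that does not.
my @matched = map { /he.lo/ ? 1 : 0 } @targets;
print "@matched\n";
```

This should display 1 1 1 0.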

To match arbitrary strings (instead of the default variable $_) against regular expressions, we can use the binding operator =~ in perl. For example, the perl statements:

my $string = "This string has 'hello' in it.";
print "Found the regular expression!\n" if $string =~ /.e..o/;

will cause the regular expression .e..o to be matched against the variable $string. The regular expression goes on the right hand side of the =~ operator. Do not confuse this operator with the equality relational operator -- the two are quite different.

Another popular meta-character is the backslash, which can be used to turn a meta-character into a literal character. For example, to match a literal forward slash, followed by a period, followed by a backslash, we can use the regular expression /\/\.\\/. Note that because we are using forward slashes as our delimiter, we need to escape the forward slash inside the regular expression. We can make the regular expression slightly more readable by using a different delimiter: m%/\.\\%. With percent signs as the delimiter, we only have to escape the dot and the backslash. However, because we are using a delimiter other than forward slashes, we must put m (for match) in front of the first percent delimiter.
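
A brief sketch comparing the two delimiter styles follows. The target strings are invented for the example; they are single-quoted, so \\ inside them stands for one backslash:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $target = 'dir/.\\name';      # the string  dir/.\name

my $slashes = $target =~ /\/\.\\/ ? 1 : 0;   # forward slash delimiters
my $percent = $target =~ m%/\.\\% ? 1 : 0;   # percent delimiters, same pattern

# Without the backslash the dot is a meta-character and matches the x.
my $unescaped_dot = 'dir/x\\name' =~ m%/.\\%  ? 1 : 0;
my $escaped_dot   = 'dir/x\\name' =~ m%/\.\\% ? 1 : 0;

print "$slashes $percent $unescaped_dot $escaped_dot\n";
```

This should display 1 1 1 0.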

Quantifiers

Other meta-characters can be used to represent quantifiers (or repetitions) of patterns. The three most common quantifiers are

Quantifier	Meaning
*	Zero or more occurrences
+	One or more occurrences
?	Zero or one occurrence

For example, the fairly popular regular expression .* represents any number (including zero) of characters (excluding newline, which is not, by default, matched by the dot meta-character). This regular expression can be used to match an arbitrary number of characters. For example, if you wish to match lines in a file that have the character sequences hello and world on the same line (and in that order), you can use the regular expression hello.*world. This regular expression would match the strings:

hello, world
This line has "hello" and "world"
helloworld

The last string matches because .* allows for the possibility of no characters.

The + quantifier will match one or more occurrences of the preceding regular expression. For example, the regular expression hello +world would match any line that had the string hello followed by at least one space followed by world.

The ? quantifier will match exactly zero or one occurrence of the previous regular expression. For example, the regular expression hello world!? would match any string containing the phrase hello world or hello world!. The exclamation mark is optional. The ? quantifier is sometimes referred to as the optional quantifier because of this.

Again, if you wish to match one of these quantifiers literally, use a backslash before them. For example, the regular expression: \*.*\?.\+ will match any string containing an asterisk followed by any number of characters, followed by a question mark followed by exactly one character followed by a plus sign.
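
The quantifier examples in this section can be checked with a short sketch like the one below. The target strings are illustrative, and each variable holds 1 for a match and 0 otherwise:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $star_empty = "helloworld"    =~ /hello.*world/  ? 1 : 0;  # .* may match nothing
my $plus_none  = "helloworld"    =~ /hello +world/  ? 1 : 0;  # + needs at least one space
my $plus_many  = "hello   world" =~ /hello +world/  ? 1 : 0;  # several spaces are fine
my $optional   = "hello world"   =~ /hello world!?/ ? 1 : 0;  # the ! may be absent
my $literal    = "a*bc?d+"       =~ /\*.*\?.\+/     ? 1 : 0;  # escaped quantifiers

print "$star_empty $plus_none $plus_many $optional $literal\n";
```

This should display 1 0 1 1 1.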

Grouping using parentheses

We can use parentheses to group regular expressions together. This serves two purposes. The first purpose is to change precedence within regular expressions. For example, as we saw above, the regular expression hello +world would match strings containing hello followed by one or more spaces followed by world. What if we wanted to match one or more hellos followed by a space followed by world (e.g. hellohellohello world)? An incorrect way to do this would be to use the regular expression hello+ world. This is not right because the + quantifier binds to the entity immediately to the left of it. In this regular expression, that entity is the letter o. Therefore, this regular expression would match all of helloooooooo world, but in hellohello world it would match only the trailing hello world and never treat the repeated hellos as a unit, which is what we wanted. To fix this we must group the entire sequence hello inside parentheses and then use the + quantifier: (hello)+ world.

For more details on the regular expression precedence, see page 111 of S&P.

The second, and very useful, purpose of parentheses in regular expressions is to extract (or remember) interesting parts of the string that matched a regular expression. This is especially useful when we use a regular expression match in list context, as we will see in the next section.
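
Both purposes can be seen in one small sketch. Because a match may start anywhere in the target string, hello+ world still finds the trailing hello world inside hellohello world; the difference between the two patterns shows up in how much of the string is matched. An extra outer pair of parentheses is added here (an addition for this example only) to remember the whole matched text, and the match is used in list context to collect it:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $target = "hellohello world";

# + applies only to the letter o, so the match starts at the second hello.
my ($ungrouped) = $target =~ /(hello+ world)/;

# + applies to the whole group, so both hellos are part of the match.
my ($grouped) = $target =~ /((hello)+ world)/;

print "ungrouped matched: '$ungrouped'\n";
print "grouped matched:   '$grouped'\n";
```

The first line should show 'hello world' and the second 'hellohello world'.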

Alternation

Alternation allows us to specify a regular expression that can identify one of a number of possible patterns inside a string. For example, let's say we wanted to identify lines in a file that contained dates of the form:

Mon Mar 31 12:30:15 2003

We can generalize this into the following regular expression:

/(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (..) ..:..:.. (....)/

This regular expression uses the alternation symbol | in order to match the day of the week and the name of the month. Note that the parentheses around the days of the week and the month names are required because the precedence of the alternation symbol is very low. Had we omitted the parentheses, then the above regular expression would have been interpreted as:

/(Sun)|(Mon)|(Tue)|(Wed)|(Thu)|(Fri)|(Sat Jan)|(Feb)|(Mar)|(Apr)|(May)|(Jun)|(Jul)|(Aug)|(Sep)|(Oct)|(Nov)|(Dec .. ..:..:.. ....)/

This regular expression would match any string containing Sun or Mon ... or Sat Jan or Feb etc. This is clearly not what we intended.
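
The effect of the low precedence can be demonstrated with a short sketch. The target string Feb is chosen for the example because it is clearly not a date:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $target = "Feb";

# Without grouping, Feb on its own is one complete alternative.
my $ungrouped = $target =~
    /Sun|Mon|Tue|Wed|Thu|Fri|Sat Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec .. ..:..:.. ..../
    ? 1 : 0;

# With grouping, every part of the date has to be present.
my $grouped = $target =~
    /(Sun|Mon|Tue|Wed|Thu|Fri|Sat) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (..) ..:..:.. (....)/
    ? 1 : 0;

print "$ungrouped $grouped\n";
```

This should display 1 0: the ungrouped pattern accepts the bare month name, while the grouped one rejects it.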

Our original regular expression is rather long. The following perl script demonstrates how we can build up the regular expression incrementally:


#!/usr/bin/perl -w

use strict;

my @days = qw/Sun Mon Tue Wed Thu Fri Sat/;
my @months = qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;

my $day_re = join "|", @days;
my $month_re = join "|", @months;

while (<>) {
	if (/($day_re) ($month_re) (..) ..:..:.. (....)/) {
		print "Day Name: $1, Month $2, Day: $3, Year $4\n";
	}
}

This script demonstrates a simple way to build an alternation regular expression using an array and perl's join function. When the while loop starts, the value of the $day_re variable will be Sun|Mon|Tue|Wed|Thu|Fri|Sat and the value of the $month_re variable will be Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec. We can then have perl interpolate these variables directly into the regular expression.

By grouping these regular expressions, perl will remember the parts of the string that matched the parts of the regular expression that were enclosed in parentheses. For example, if the target string being matched against the regular expression was:

  Last modified: Mon Mar 31 15:10:16 2003

Then perl would remember that:

($day_re) in the regular expression matched the substring Mon in our target string
($month_re) in the regular expression matched the substring Mar in our target string
(..) in the regular expression matched the substring 31 in our target string
(....) in the regular expression matched the substring 2003 in our target string

Perl will then store these remembered substrings in the special variables $1, $2, $3 and $4 respectively, and we can then examine or use these variables inside the body of our conditional. These special variables are set only when a regular expression successfully matches; after a failed match they still hold whatever an earlier successful match left in them. Therefore, you should not use these variables unless you have actually tested that the regular expression matched first. Also, remember that because these variables will be overwritten on each successful match of a regular expression, introducing another regular expression match between your original regular expression match and your use of the special variables may cause unexpected results.
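
One way to protect the remembered substrings from a later match is to copy them into ordinary variables straight away, using the match in list context. The sketch below assumes the same day and month patterns as the script above; the variable names $day_name, $month, $day and $year are chosen only for this example:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $day_re   = join "|", qw/Sun Mon Tue Wed Thu Fri Sat/;
my $month_re = join "|", qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;

my $line = "Last modified: Mon Mar 31 15:10:16 2003";

# In list context a successful match returns the remembered substrings.
my ($day_name, $month, $day, $year) =
    $line =~ /($day_re) ($month_re) (..) ..:..:.. (....)/;

if (defined $day_name) {
    # This second match overwrites $1, but not our copies.
    print "The line contains a colon\n" if $line =~ /(:)/;
    print "Day Name: $day_name, Month $month, Day: $day, Year $year\n";
}
```

If the match fails, the list is empty and all four variables are left undefined, which is why the sketch tests $day_name before using them.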

Perl also supports the use of backreferences, which allow you to use the remembered string itself inside a regular expression. See page 110 of S&P for details.
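
As a small taste of backreferences (a sketch only; the target strings are made up), the sequence \1 inside a pattern stands for whatever the first pair of parentheses matched, so (.)\1 finds a character that is immediately repeated:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# (.) remembers one character and \1 requires the same character again.
my ($doubled)  = "hello" =~ /(.)\1/;
my $has_double = "world" =~ /(.)\1/ ? 1 : 0;

print "doubled letter in hello: $doubled\n";
print "world has a doubled letter: $has_double\n";
```

The first line should report the letter l, and the second should report 0.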

The regular expression in the above program lets some invalid dates through. For example, if a non-numeric year is specified, the regular expression will still match it, as long as a space and four characters (other than newline) follow the time. We can tighten up this regular expression to reject these non-date strings by using character classes.

Character Classes

A character class provides a way for a regular expression to match one of a collection of characters. Character classes are denoted by the meta-characters [ and ]. For example, to represent the lowercase vowels, you can use the character class [aeiou] inside a regular expression. To represent the digits, you can use the character class [0123456789]. Perl allows you to specify a range of characters using the hyphen inside a character class. For example, [A-Za-z] represents all alphabetic characters. If you wish to use a literal hyphen inside a character class, then escape it with a backslash or put the hyphen first in the character class. For example, [-az] and [a\-z] both represent the character class containing the three characters a, - and z. Note that the special meta-characters we have seen so far (except backslash) lose their meaning inside a character class. Therefore the character class [?*+.] will match any one of those characters literally.

It is possible to negate the characters in a class by using the ^ symbol as the very first character inside a character class. For example, the class [^aeiou] will match any character that is not a lowercase vowel.
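
The rules above can be tried out with a sketch such as the following. The target strings are illustrative, and each variable holds 1 for a match and 0 otherwise:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $has_vowel      = "sky" =~ /[aeiou]/  ? 1 : 0;  # no lowercase vowel in sky
my $has_non_vowel  = "sky" =~ /[^aeiou]/ ? 1 : 0;  # s is not a vowel
my $in_range       = "m"   =~ /[a-z]/    ? 1 : 0;  # hyphen forms a range
my $hyphen_first   = "m"   =~ /[-az]/    ? 1 : 0;  # only -, a and z allowed
my $hyphen_escaped = "-"   =~ /[a\-z]/   ? 1 : 0;  # escaped hyphen is literal
my $literal_plus   = "+"   =~ /[?*+.]/   ? 1 : 0;  # + is literal in a class
my $literal_dot    = "x"   =~ /[?*+.]/   ? 1 : 0;  # . no longer matches anything

print "$has_vowel $has_non_vowel $in_range $hyphen_first ",
      "$hyphen_escaped $literal_plus $literal_dot\n";
```

This should display 0 1 1 0 1 1 0.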

Perl has several predefined character classes for use. For example, the sequence \s represents the character class [ \t\r\n\f], that is, any white space character. The sequence \d represents any digit and the sequence \w represents any word character, which includes alphabetic characters (upper and lower case), digits and the underscore character. We can represent negations of these classes by capitalizing the letter after the backslash. For example, the character class \S can be used to represent any non-whitespace character. We can also use these character classes inside other character classes. For example, [\d ] will match either a digit or a space.
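
A few quick checks of the predefined classes are sketched below, again with made-up target strings and with 1 meaning a match:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $space      = "tab\there" =~ /\s/    ? 1 : 0;  # a tab is white space
my $digit      = "abc"       =~ /\d/    ? 1 : 0;  # no digits here
my $word       = "!!!"       =~ /\w/    ? 1 : 0;  # punctuation is not a word character
my $underscore = "_"         =~ /\w/    ? 1 : 0;  # underscore counts as a word character
my $non_space  = "   "       =~ /\S/    ? 1 : 0;  # only spaces, so \S fails
my $mixed      = "a b"       =~ /[\d ]/ ? 1 : 0;  # the space satisfies [\d ]

print "$space $digit $word $underscore $non_space $mixed\n";
```

This should display 1 0 0 1 0 1.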

Returning to our date parsing script, we can use the digit character class to represent days of the month and the year:


#!/usr/bin/perl -w

use strict;

my @days = qw/Sun Mon Tue Wed Thu Fri Sat/;
my @months = qw/Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec/;

my $day_re = join "|", @days;
my $month_re = join "|", @months;

while (<>) {
	if (/($day_re) ($month_re) (\d\d) ..:..:.. (\d{4})/) {
		print "Day Name: $1, Month $2, Day: $3, Year $4\n";
	}
}

This script would now reject dates unless numeric days of the month and numeric years were specified. This script also demonstrates another, less commonly used quantifier, {}. In the context of the above script, this quantifier ensures that there are four digits in the year. This quantifier can also specify a range (e.g. \w{3,10} would match the first three to ten characters in a word; it would fail to match if the word contained fewer than three characters). The quantifier can also be used with no upper limit: for example, the regular expression \d{3,} would match three or more digits in a target string. To match up to three digits, give the lower limit explicitly, as in \d{0,3}; the form \d{,3} with the lower limit omitted is only treated as a quantifier by recent versions of perl (5.34 and later), and older versions match it as literal text.
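
The different forms of the {} quantifier can be compared with a short sketch. The target strings are invented for the example, and the explicit form {0,3} is used for "up to three" because it behaves the same way on every version of perl:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ($year)      = "year 2003"        =~ /(\d{4})/;      # exactly four digits
my $too_short   = "ab"               =~ /\w{3,10}/ ? 1 : 0;  # fewer than three
my ($word)      = "abcdefghijklmnop" =~ /(\w{3,10})/;   # at most ten are taken
my ($up_to)     = "12345"            =~ /(\d{0,3})/;    # zero to three digits
my ($at_least)  = "12345"            =~ /(\d{3,})/;     # three or more digits

print "$year $too_short $word $up_to $at_least\n";
```

This should display 2003 0 abcdefghij 123 12345.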

Last modified: Tue Apr 1 00:20:12 2003