Wednesday, April 02, 2003

Regular expressions (cont'd)

Anchors

Perl regular expressions also support anchors, which allow you to match patterns that occur at certain places inside a string. The two most common anchors are ^ and $, which match the beginning and end of a line, respectively. For example, the regular expression ^hello would match any string that had hello at the very beginning of the string. Likewise, the regular expression world$ would match any string that had world at the end.

We can also represent a word boundary with the \b anchor. Therefore, the regular expression \bhello\b will match the strings hello! and hello,world, but not the string othello.
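The three patterns discussed above can be tried against a few sample strings. This is only a small sketch: the helper name anchor_report and the extra test string "say hello" are made up for illustration.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: list which of the three patterns match a string.
sub anchor_report {
	my ($str) = @_;
	my @hits;
	push @hits, '^hello'    if $str =~ /^hello/;     # hello at the start
	push @hits, 'world$'    if $str =~ /world$/;     # world at the end
	push @hits, '\bhello\b' if $str =~ /\bhello\b/;  # hello as a whole word
	return join ', ', @hits;
}

for my $str ('hello!', 'hello,world', 'othello', 'say hello') {
	printf "%-12s matches: %s\n", $str, anchor_report($str) || '(none)';
}
```

The string othello matches none of the patterns, because the hello inside it is preceded by another word character and so does not start at a word boundary.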

The following script will display the (alphabetic) words in a file that do not contain the traditional upper or lower case vowels.


#!/usr/bin/perl -w

use strict;

while (<>) {
	for (split) {
		next unless /^[a-z]+$/i;
		print "$_\n" if /^[^aeiou]+$/i;
	}
}

On each iteration through the while loop, we split the line into its constituent words. We then move immediately on to the next word if the word we are currently looking at does not consist entirely of alphabetic characters (because the regular expression is anchored at both ends, the whole word must be alphabetic for it to match). The next keyword is similar to the continue keyword in C and C++ -- it brings control immediately back to the top of the innermost loop (which is the for loop in this case) and starts the next iteration. The i after the closing forward slash is an example of a regular expression modifier in perl. The i modifier tells the regular expression to be case insensitive. Therefore, the regular expression given will match words which consist entirely of upper- and lower-case letters. The unless ... keyword is analogous to saying if !.... Therefore, saying

next unless /^[a-z]+$/i;

is identical to:

next if ! /^[a-z]+$/i;

(Note that we can do a negation on the regular expression by using the ! operator.)

Finally, in the second statement of the for loop we will display the word if none of the characters inside the word are equal to any of the (upper- or lower-case) vowels. Note that if we had just used the regular expression without the two anchors, (i.e. [^aeiou]+), then this would have matched any string that had at least one non-vowel character.
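The difference between the anchored and unanchored forms can be seen by running both against the same words. The sketch below is illustrative only; the subroutine names and the sample words are assumptions, not part of the script above.

```perl
#!/usr/bin/perl -w

use strict;

# Anchored: every character in the word must be a non-vowel.
sub all_non_vowels {
	my ($word) = @_;
	return $word =~ /^[^aeiou]+$/i ? 1 : 0;
}

# Unanchored: a single non-vowel anywhere in the word is enough.
sub any_non_vowel {
	my ($word) = @_;
	return $word =~ /[^aeiou]+/i ? 1 : 0;
}

for my $word (qw(rhythm Perl sky aeiou)) {
	printf "%-6s anchored=%d unanchored=%d\n",
		$word, all_non_vowels($word), any_non_vowel($word);
}
```

A word such as Perl fails the anchored test because of its e, but passes the unanchored one because P alone is enough to satisfy [^aeiou]+.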

Substitutions

Perl also supports substitutions with the s/// operator. The general format of this operator, as given in the perlop man page, is s/PATTERN/REPLACEMENT/. By default, the substitution will take place on the default variable $_. However, as with regular expression matching using /.../, we can also use the binding operator =~ to perform substitutions on any string.

For example, to compress all the spaces in the string denoted by variable $str, we can write $str =~ s/ +/ /. This will search the string $str looking for an occurrence of one or more contiguous spaces. It will then replace them with a single space. Unfortunately, this will only compress the first occurrence of one or more spaces in $str. To compress them all we must use the g modifier on the end of the substitution: $str =~ s/ +/ /g. The i modifier is also supported so that the PATTERN match will be case insensitive.
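The effect of the g modifier is easiest to see side by side. In this sketch the subroutine names compress_first and compress_all and the sample string are invented for the demonstration.

```perl
#!/usr/bin/perl -w

use strict;

# Without g: only the first run of spaces is compressed.
sub compress_first {
	my ($str) = @_;
	$str =~ s/ +/ /;
	return $str;
}

# With g: every run of spaces is compressed.
sub compress_all {
	my ($str) = @_;
	$str =~ s/ +/ /g;
	return $str;
}

my $str = "one  two   three    four";
print "original:  '$str'\n";
print "without g: '", compress_first($str), "'\n";
print "with g:    '", compress_all($str), "'\n";
```

Each subroutine works on its own copy of the string, so $str itself is left unchanged between the two calls.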

As another example of substitution, consider the perl script below, which reads a list of student numbers, names, and term marks as demonstrated by the following test file:


366533091 Cole Kent               68
402545697 Andrew West             99
544149893 Angela Johnston         93
642776563 Monique Epps            83
257622129 Darko Peter             100
033221495 Gregory Salutue         55
582335451 Ola Svallmark           64
211817030 Gina Simpson            97
951569403 Ela Whiteside           85
899563658 Brian Garrett           92
433097365 Georgett Lott           57
168213321 Candi Lilly             92
051534180 Linda Smith             61
715231817 Sara Rossy              64
183995480 Kimberlee Thomson       53
834110872 Nazmeen Gorzoch         61
276976781 Vic Melvin              56
017101413 Jack Snede              48
389869517 Hank Thomas             73
826916025 Andrew Harkin           50

The script processes each line of the file, making sure that it is valid. It will then change the order of the first and last names and capitalize the last name. Note that we can remember the matches in the PATTERN part of the substitution operation and refer to them in the REPLACEMENT part by using $1, $2, etc. The \U sequence will cause all letters that occur after it to be uppercased until the \E sequence is encountered.

The line is then formatted for output. However, instead of displaying the line immediately, it is stored in the @lines array. This allows us to sort the output in a variety of ways. By default, just using the sort function will sort each line lexicographically (which gives us an ordering by student number, because all the student numbers have the same number of digits). We can also sort by grade by making use of the split function inside an anonymous compare subroutine and extracting the last element from the list returned by split by using -1 as the index. Note that we must enclose the entire split operation in parentheses; otherwise, the indexing operation will attempt to take place on the $a and $b variables in the anonymous subroutine.


#!/usr/bin/perl -w

use strict;

my @lines;
while (<>) {
	die "Line $.: Invalid line\n" unless /^(\d{9})\s+(.*)\s+(\d+)$/;
	my ($num, $name, $mark) = ($1, $2, $3);
	$name =~  s/(\w+) (\w+)/\U$2\E, $1/;
	push @lines, sprintf "%09d %-25s %3d\n", $num, $name, $mark;
}

print "Students sorted by number:\n";
for (sort @lines) {
	print;
}

print "\nStudents sorted by decreasing mark:\n";
for (sort { (split ' ', $b)[-1] <=> (split ' ', $a)[-1] } @lines) {
	print;
}

The special variable $. in perl represents the current line number of the file being read by the perl script. The variable is useful when printing diagnostic information about an input file being parsed by perl.

Here is the output from the above program when run on the input data given above:


Students sorted by number:
017101413 SNEDE, Jack                48
033221495 SALUTUE, Gregory           55
051534180 SMITH, Linda               61
168213321 LILLY, Candi               92
183995480 THOMSON, Kimberlee         53
211817030 SIMPSON, Gina              97
257622129 PETER, Darko              100
276976781 MELVIN, Vic                56
366533091 KENT, Cole                 68
389869517 THOMAS, Hank               73
402545697 WEST, Andrew               99
433097365 LOTT, Georgett             57
544149893 JOHNSTON, Angela           93
582335451 SVALLMARK, Ola             64
642776563 EPPS, Monique              83
715231817 ROSSY, Sara                64
826916025 HARKIN, Andrew             50
834110872 GORZOCH, Nazmeen           61
899563658 GARRETT, Brian             92
951569403 WHITESIDE, Ela             85

Students sorted by decreasing mark:
257622129 PETER, Darko              100
402545697 WEST, Andrew               99
211817030 SIMPSON, Gina              97
544149893 JOHNSTON, Angela           93
168213321 LILLY, Candi               92
899563658 GARRETT, Brian             92
951569403 WHITESIDE, Ela             85
642776563 EPPS, Monique              83
389869517 THOMAS, Hank               73
366533091 KENT, Cole                 68
715231817 ROSSY, Sara                64
582335451 SVALLMARK, Ola             64
834110872 GORZOCH, Nazmeen           61
051534180 SMITH, Linda               61
433097365 LOTT, Georgett             57
276976781 MELVIN, Vic                56
033221495 SALUTUE, Gregory           55
183995480 THOMSON, Kimberlee         53
826916025 HARKIN, Andrew             50
017101413 SNEDE, Jack                48
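The two key techniques in the script, the \U...\E name swap and the sort on the last field, can also be tried in isolation. This is a minimal sketch; the subroutine names and the three shortened records are made up for illustration.

```perl
#!/usr/bin/perl -w

use strict;

# Swap "First Last" into "LAST, First" using captured groups and \U...\E.
sub last_name_first {
	my ($name) = @_;
	$name =~ s/(\w+) (\w+)/\U$2\E, $1/;
	return $name;
}

# Sort lines by their last whitespace-separated field, largest first.
# The parentheses around split are needed so that [-1] indexes the list.
sub sort_by_mark_desc {
	return sort { (split ' ', $b)[-1] <=> (split ' ', $a)[-1] } @_;
}

print last_name_first("Cole Kent"), "\n";

my @records = ("1 KENT, Cole 68\n", "2 PETER, Darko 100\n", "3 WEST, Andrew 99\n");
print for sort_by_mark_desc(@records);
```

Because the comparison uses the numeric <=> operator, a mark of 100 sorts ahead of 99, which would not be the case with a plain lexicographic sort.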

A Final Example

Here is a perl script that demonstrates a way to parse files which have a format similar to the following example:


# This is a test file.

[Startup]
directory = /users/cs/study	# Testing comment
printer   = linuxlj
groupid   = 9002

[Shutdown]	# Comment test.
confirm   = true
reboot    = false

(The parser below actually lets things through that it shouldn't but it's okay for demonstration purposes.)


#!/usr/bin/perl -w

use strict;

# Return a copy of the argument with leading and trailing whitespace removed.
sub trim_spaces {
	my ($str) = @_;

	$str =~ s/\s+$//;
	$str =~ s/^\s+//;
	return $str;
}


while (<>) {
	chomp;
	next if /^\s*#/;	# Ignore lines that contain only a comment.
	next if /^\s*$/;	# Ignore empty lines.
	s/#.*//;		# Remove trailing comments.
	if (/^\s*\[(.*)\]\s*$/) {
		my $sec = &trim_spaces($1);
		print "section name: '$sec'\n";
		next;
	} elsif (/^\s*(.*)\s*=\s*(.*)\s*$/) {
		my $attr = &trim_spaces($1);
		my $val = &trim_spaces($2);
		print "attribute '$attr' equals '$val'\n";
		next;
	} else {
		print "Line $. invalid: '$_'\n";
	}
}

The parser examines each line in the file and skips over lines that consist of only a comment or are empty. It then strips off comments that appear on non-empty lines. The script then tests the line against a couple of regular expressions searching for a match. When it finds a match, it strips any leading or trailing spaces from the relevant substrings that were matched by the regular expression and displays them.
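Instead of printing what it finds, a parser like this would usually store the values somewhere. The following sketch shows one possible way to do that, under several assumptions: the subroutine name parse_config, the flat "section.attribute" key format, and the decision to silently ignore invalid lines are all choices made for this illustration and are not part of the script above.

```perl
#!/usr/bin/perl -w

use strict;

# Return a copy of the argument with leading and trailing whitespace removed.
sub trim_spaces {
	my ($str) = @_;
	$str =~ s/^\s+//;
	$str =~ s/\s+$//;
	return $str;
}

# Hypothetical helper: parse a list of lines and return a flat hash whose
# keys have the (assumed) form "section.attribute".
sub parse_config {
	my @input = @_;		# Work on a copy of the lines.
	my %config;
	my $section = '';
	for (@input) {
		chomp;
		s/#.*//;		# Remove comments.
		next if /^\s*$/;	# Skip lines that are now empty.
		if (/^\s*\[(.*)\]\s*$/) {
			$section = trim_spaces($1);
		} elsif (/^([^=]*)=(.*)$/) {
			my ($attr, $val) = ($1, $2);
			$config{"$section." . trim_spaces($attr)} = trim_spaces($val);
		}
		# Anything else is silently ignored in this sketch.
	}
	return %config;
}

my %config = parse_config(
	"# This is a test file.\n",
	"[Startup]\n",
	"directory = /users/cs/study\t# Testing comment\n",
	"printer   = linuxlj\n",
	"[Shutdown]\t# Comment test.\n",
	"confirm   = true\n",
);
for (sort keys %config) {
	print "$_ = '$config{$_}'\n";
}
```

Copying $1 and $2 into named variables before calling trim_spaces keeps the code readable and avoids any doubt about what the capture variables hold once another substitution has run.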

File Handles (S&P -- Chapter 11)

Like C's fopen() and fclose() functions, perl supports a means of doing input and output to a file. To demonstrate file I/O in perl, consider the following script:


#!/usr/bin/perl -w

use strict;

my ($passwd, $results) = qw< /etc/passwd results.out >;

my %shells;
open FILE, $passwd or die "Cannot open password file: $!\n";
while (<FILE>) {
	chomp;
	next if /ppp/;
	$shells{(split /:/)[-1]} ++;
}
close FILE or die "Cannot close password file: $!\n";

open RES, "> $results" or die "Cannot open '$results' for write: $!\n";
for (sort { $shells{$b} <=> $shells{$a} } keys %shells ) {
	print RES "$_: $shells{$_}\n";
}
close RES or die "Cannot close '$results': $!\n";

There are several things to note about the above script:

We open a file using perl's open function. This function typically takes two arguments: a file handle and a scalar representing the name of the file in the file system to open. The file handle represents the connection between your perl script and the file itself and is typically written in all capital letters. We've already seen one file handle: STDIN. By default, open will open the file for read access. We'll see opening for write access later.
If the open call fails, it will return a false value. Because the open function is being called as part of an or logical operation, the expression to the right of the or operator will then have to be evaluated. This expression causes termination of the program. This is a very common idiom in perl. Note that the or logical operator is of lower precedence than the traditional || operator. If we wanted to use the more conventional || operator, we would have to put the parameters of open in parentheses in order to ensure that the precedence of the operators inside the statement makes sense.
```
open (FILE, $passwd) || die "Cannot open password file: $!\n";
```
A lot of (older) perl code uses the || operator but using or in the above context seems to be increasing in popularity.
The $! special variable used in the string argument to the die function will be interpolated into an error message that explains why the most recent function call failed. In the code above, the $! variable will contain a string that indicates why the script was not able to open the file. For example, if we tried to open a file that we did not have read permission on, then $! will be set to the string: Permission denied. If the file did not exist, then $! will be set to No such file or directory. The contents of the $! variable are quite useful and should be displayed when one of perl's function calls fails for some reason.
After we open the file, we can then start reading from it. The notation <FILE> can be used to read a line from the file handle denoted by FILE. Again, we've seen this notation before when reading lines from STDIN. Because we are using the <FILE> notation inside a while condition, the $_ default variable will be set to each line in the opened file on each iteration through the loop.
Inside the body of the loop, after we chomp the line, we skip over lines that have ppp in them. We then increment a hash counter. The key in this hash is the shell that is used by the current system user read in from the passwd file. Note that the expression (split /:/)[-1] will do two things: It will first split the line of input using the regular expression : as a delimiter (each line of the passwd file is delimited by colons). The result of the split function is a list of elements. The expression will then use the indexing operator to access the last element in this list. Using a negative number inside the square brackets is a simple way of indexing a list starting from the back end of the list. Note that the parentheses around the call to split are compulsory, since the indexing operation must operate on the list that the split function generates.
After we are finished reading the contents of the file, we should close the file using the close function on the filehandle that we opened earlier. While it is not too common, the close operation can fail so it should be tested like the open function just to be safe, although in practice this is rarely done.
Next, we open a new file denoted by the $results scalar which will be used to store the results of the script. When opening a file for write access, we give the filename the special first character of >. This will cause perl to open the filename for write, thereby overwriting the file if it existed earlier. As with opening for read, we check the result of the open operation and die (with an appropriate error message using $!) if we were unable to open the file for write access.
Finally, we sort the keys in the hash by value and store the results to our file handle. In order to write data to a file handle, we can use the print (or printf) function in a special manner: The first argument to the print function is the file handle with which we opened the file. The second argument is a list representing the information that we want to store in the file.
```
print RES "$_: $shells{$_}\n";  	# NO comma following RES!!
```
Note that there is no comma separating the file handle from the list. This last point is very important.
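The points above can be combined into a script that does not depend on /etc/passwd. This is a sketch under stated assumptions: it uses the three-argument form of open with lexical file handles ($in, $out) rather than the bareword handles shown in these notes, it relies on the standard File::Temp module to create a scratch file, and the subroutine name count_last_fields and the sample records are made up.

```perl
#!/usr/bin/perl -w

use strict;
use File::Temp qw(tempfile);

# Hypothetical helper: count how often each final colon-separated field
# occurs in the named file (the same kind of tally the script above keeps).
sub count_last_fields {
	my ($path) = @_;
	my %count;
	open my $in, '<', $path or die "Cannot open '$path' for read: $!\n";
	while (<$in>) {
		chomp;
		$count{(split /:/)[-1]}++;	# Parentheses around split are required.
	}
	close $in or die "Cannot close '$path': $!\n";
	return %count;
}

# Create a scratch file holding made-up passwd-style records.
my ($out, $path) = tempfile(UNLINK => 1);
print $out "alice:x:1000:1000::/home/alice:/bin/bash\n";	# No comma after $out.
print $out "bob:x:1001:1001::/home/bob:/bin/tcsh\n";
print $out "carol:x:1002:1002::/home/carol:/bin/bash\n";
close $out or die "Cannot close '$path': $!\n";

my %shells = count_last_fields($path);
for (sort { $shells{$b} <=> $shells{$a} } keys %shells) {
	print "$_: $shells{$_}\n";
}
```

Running the helper on a path that does not exist is a quick way to see $! in action, since die will then print the reason the open call failed.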

Last modified: Fri Apr 4 14:37:49 2003