Wednesday, April 02, 2003

Regular expressions (cont'd)

Anchors

Perl regular expressions also support anchors, which let you match a pattern only at certain positions in a string. The two most common anchors are ^ and $, which match the beginning and end of a line, respectively. For example, the regular expression ^hello matches any string that has hello at the very beginning. Likewise, the regular expression world$ matches any string that has world at the end.
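As a small illustration of the two anchors described above, the sketch below reports which of the patterns ^hello and world$ match a few sample strings. The helper name anchor_matches and the sample strings are made up for this example.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper (not from the notes): return the list of
# anchored patterns that match the given string.
sub anchor_matches {
	my ($str) = @_;
	my @hits;
	push @hits, '^hello' if $str =~ /^hello/;	# hello at the start
	push @hits, 'world$' if $str =~ /world$/;	# world at the end
	return @hits;
}

for my $str ('hello there', 'say hello', 'brave new world', 'hello world') {
	my @hits = anchor_matches($str);
	printf "%-16s matches: %s\n", $str, (@hits ? "@hits" : 'neither');
}
```

Only 'say hello' matches neither pattern: it contains hello, but not at the start of the string.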

We can also match a word boundary with the \b anchor. Therefore, the regular expression \bhello\b will match the strings hello! and hello,world, but not the string othello.
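The three strings mentioned above can be checked directly. This is only a sketch; the helper name has_word_hello is invented for the example.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: true if the string contains hello as a whole word.
sub has_word_hello {
	my ($str) = @_;
	return $str =~ /\bhello\b/ ? 1 : 0;
}

for my $str ('hello!', 'hello,world', 'othello') {
	printf "%-12s %s\n", $str,
		has_word_hello($str) ? 'matches' : 'does not match';
}
```

In othello the h is preceded by the word character t, so there is no word boundary in front of hello and the match fails.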

The following script will display the (alphabetic) words in a file that contain none of the five traditional vowels (a, e, i, o, u), in either upper or lower case.


#!/usr/bin/perl -w

use strict;

while (<>) {
	for (split) {
		next unless /^[a-z]+$/i;
		print "$_\n" if /^[^aeiou]+$/i;
	}
}


On each iteration through the while loop, we split the line into its constituent words. We then move immediately on to the next word if the word we are currently looking at does not consist entirely of alphabetic characters (the match requires this because the regular expression is anchored at both ends of the string). The next keyword is similar to the continue keyword in C and C++ -- it brings control immediately back to the top of the innermost loop (which is the for loop in this case) and starts the next iteration. The i after the closing forward slash is an example of a regular expression modifier in Perl. The i modifier tells the regular expression to be case insensitive. Therefore, the regular expression given will match words which consist entirely of upper- and lower-case letters. The unless ... keyword is analogous to saying if !.... Therefore, saying

next unless /^[a-z]+$/i;

is identical to:

next if ! /^[a-z]+$/i;

(Note that we can do a negation on the regular expression by using the ! operator.)

Finally, in the second statement of the for loop we will display the word if none of the characters inside the word are equal to any of the (upper- or lower-case) vowels. Note that if we had just used the regular expression without the two anchors, (i.e. [^aeiou]+), then this would have matched any string that had at least one non-vowel character.
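To make the effect of the anchors concrete, here is a minimal sketch comparing the anchored and unanchored versions of the pattern. The helper names no_vowels and has_non_vowel and the sample words are assumptions made for this example.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: true only if every character is a non-vowel.
sub no_vowels {
	my ($word) = @_;
	return $word =~ /^[^aeiou]+$/i ? 1 : 0;
}

# Hypothetical helper: true if at least one character is a non-vowel.
sub has_non_vowel {
	my ($word) = @_;
	return $word =~ /[^aeiou]+/i ? 1 : 0;
}

for my $word ('rhythm', 'Perl', 'AEIOU') {
	printf "%-8s anchored: %-3s unanchored: %s\n", $word,
		(no_vowels($word) ? 'yes' : 'no'),
		(has_non_vowel($word) ? 'yes' : 'no');
}
```

The word Perl shows the difference: the unanchored pattern matches because of the P, while the anchored pattern rejects the word because of the e.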

Substitutions

Perl also supports substitutions with the s/// operator. The general format of this operator, as given in the perlop man page, is s/PATTERN/REPLACEMENT/. By default, the substitution takes place on the default variable $_. However, as with regular expression matching using /.../, we can also use the binding operator =~ to perform substitutions on any string.

For example, to compress all the spaces in the string denoted by variable $str, we can write $str =~ s/ +/ /. This will search the string $str looking for an occurrence of one or more contiguous spaces. It will then replace them with a single space. Unfortunately, this will only compress the first occurrence of one or more spaces in $str. To compress them all we must use the g modifier on the end of the substitution: $str =~ s/ +/ /g. The i modifier is also supported so that the PATTERN match will be case insensitive.
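The difference the g modifier makes can be seen in a short sketch. The helper names compress_first and compress_all and the sample string are invented for this example.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: compress only the first run of spaces.
sub compress_first {
	my ($str) = @_;
	$str =~ s/ +/ /;
	return $str;
}

# Hypothetical helper: compress every run of spaces (g modifier).
sub compress_all {
	my ($str) = @_;
	$str =~ s/ +/ /g;
	return $str;
}

my $sample = "a   b    c";
printf "original:  [%s]\n", $sample;
printf "without g: [%s]\n", compress_first($sample);
printf "with g:    [%s]\n", compress_all($sample);
```

Both helpers work on a copy of their argument, so $sample itself is left unchanged.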

As another example of substitution, consider the Perl script below, which reads a list of student numbers, names, and term marks, as demonstrated by the following test file:


366533091 Cole Kent               68
402545697 Andrew West             99
544149893 Angela Johnston         93
642776563 Monique Epps            83
257622129 Darko Peter             100
033221495 Gregory Salutue         55
582335451 Ola Svallmark           64
211817030 Gina Simpson            97
951569403 Ela Whiteside           85
899563658 Brian Garrett           92
433097365 Georgett Lott           57
168213321 Candi Lilly             92
051534180 Linda Smith             61
715231817 Sara Rossy              64
183995480 Kimberlee Thomson       53
834110872 Nazmeen Gorzoch         61
276976781 Vic Melvin              56
017101413 Jack Snede              48
389869517 Hank Thomas             73
826916025 Andrew Harkin           50


The script processes each line of the file, making sure that it is valid. It then swaps the order of the first and last names and capitalizes the last name. Note that we can remember the matches in the PATTERN part of the substitution operation and refer to them in the REPLACEMENT part by using $1, $2, etc. The \U sequence will cause all letters that occur after it to be uppercased until the \E sequence is encountered.

The line is then formatted for output. However, instead of displaying the line immediately, it is stored in the @lines array. This allows us to sort the output in a variety of ways. By default, just using the sort function will sort the lines lexicographically (which gives us an ordering by student number, because all the student numbers have the same number of digits). We can also sort by grade by using the split function inside an anonymous compare subroutine and extracting the last element of the list returned by split, using -1 as the index. Note that we must enclose the entire split operation in parentheses; otherwise, the indexing operation will attempt to take place on the $a and $b variables in the anonymous subroutine.


#!/usr/bin/perl -w

use strict;

my @lines;
while (<>) {
	die "Line $.: Invalid line\n" unless /^(\d{9})\s+(.*)\s+(\d+)$/;
	my ($num, $name, $mark) = ($1, $2, $3);	# The pattern has three capture groups.
	$name =~ s/(\w+) (\w+)/\U$2\E, $1/;	# "First Last" becomes "LAST, First".
	push @lines, sprintf "%09d %-25s %3d\n", $num, $name, $mark;
}

print "Students sorted by number:\n";
for (sort @lines) {
	print;
}

print "\nStudents sorted by decreasing mark:\n";
for (sort { (split ' ', $b)[-1] <=> (split ' ', $a)[-1] } @lines) {
	print;
}


The special variable $. in Perl holds the current line number of the filehandle that was most recently read. The variable is useful when printing diagnostic information about an input file being parsed by a Perl script.

Here is the output from the above program when run on the input data given above:


Students sorted by number:
017101413 SNEDE, Jack                48
033221495 SALUTUE, Gregory           55
051534180 SMITH, Linda               61
168213321 LILLY, Candi               92
183995480 THOMSON, Kimberlee         53
211817030 SIMPSON, Gina              97
257622129 PETER, Darko              100
276976781 MELVIN, Vic                56
366533091 KENT, Cole                 68
389869517 THOMAS, Hank               73
402545697 WEST, Andrew               99
433097365 LOTT, Georgett             57
544149893 JOHNSTON, Angela           93
582335451 SVALLMARK, Ola             64
642776563 EPPS, Monique              83
715231817 ROSSY, Sara                64
826916025 HARKIN, Andrew             50
834110872 GORZOCH, Nazmeen           61
899563658 GARRETT, Brian             92
951569403 WHITESIDE, Ela             85

Students sorted by decreasing mark:
257622129 PETER, Darko              100
402545697 WEST, Andrew               99
211817030 SIMPSON, Gina              97
544149893 JOHNSTON, Angela           93
168213321 LILLY, Candi               92
899563658 GARRETT, Brian             92
951569403 WHITESIDE, Ela             85
642776563 EPPS, Monique              83
389869517 THOMAS, Hank               73
366533091 KENT, Cole                 68
715231817 ROSSY, Sara                64
582335451 SVALLMARK, Ola             64
834110872 GORZOCH, Nazmeen           61
051534180 SMITH, Linda               61
433097365 LOTT, Georgett             57
276976781 MELVIN, Vic                56
033221495 SALUTUE, Gregory           55
183995480 THOMSON, Kimberlee         53
826916025 HARKIN, Andrew             50
017101413 SNEDE, Jack                48
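The two key techniques in the script above, the name-swapping substitution and the sort on the last field, can also be tried in isolation. The following is only a sketch; the helper names swap_name and by_mark_desc and the shortened sample records are assumptions made for this example.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: turn "First Last" into "LAST, First" using
# the captures $1 and $2 and the \U ... \E uppercase sequence.
sub swap_name {
	my ($name) = @_;
	$name =~ s/(\w+) (\w+)/\U$2\E, $1/;
	return $name;
}

# Hypothetical helper: sort records by their last whitespace-separated
# field, largest first. The parentheses around split are required so
# that [-1] indexes the list returned by split. Call in list context.
sub by_mark_desc {
	my @records = @_;
	return sort { (split ' ', $b)[-1] <=> (split ' ', $a)[-1] } @records;
}

print swap_name('Cole Kent'), "\n";

# Made-up records in the same "number name mark" layout as the test file.
my @records = ("111 KENT, Cole 68\n", "222 PETER, Darko 100\n",
               "333 JOHNSTON, Angela 93\n");
print for by_mark_desc(@records);
```

The numeric comparison operator <=> matters here: a string comparison would place 100 before 68 and 93 in ascending order rather than treating it as the largest mark.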


A Final Example

Here is a perl script that demonstrates a way to parse files which have a format similar to the following example:


# This is a test file.

[Startup]
directory = /users/cs/study	# Testing comment
printer   = linuxlj
groupid   = 9002

[Shutdown]	# Comment test.
confirm   = true
reboot    = false


(The parser below actually accepts some input that it should reject, but it is adequate for demonstration purposes.)


#!/usr/bin/perl -w

use strict;

sub trim_spaces {
	my ($str) = @_;

	$str =~ s/\s+$//;	# Strip trailing whitespace.
	$str =~ s/^\s+//;	# Strip leading whitespace.
	return $str;
}


while (<>) {
	chomp;
	next if /^\s*#/;	# Ignore lines that contain only a comment.
	next if /^\s*$/;	# Ignore empty lines.
	s/#.*//;		# Remove comments
	if (/^\s*\[(.*)\]\s*$/) {
		my $sec = &trim_spaces($1);
		print "section name: '$sec'\n";
		next;
	} elsif (/^\s*(.*)\s*=\s*(.*)\s*$/) {
		my $attr = &trim_spaces($1);
		my $val = &trim_spaces($2);
		print "attribute '$attr' equals '$val'\n";
		next;
	} else {
		print "Line $. invalid: '$_'\n";
	}
}


The parser examines each line in the file and skips over lines that consist of only a comment or are empty. It then strips off comments that appear on non-empty lines. The script then tests the line against a couple of regular expressions searching for a match. When it finds a match, it strips any leading or trailing spaces from the relevant substrings that were matched by the regular expression and displays them.
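The trimming step is needed because .* is greedy: in the attribute pattern, the captured groups absorb any whitespace next to them before the following \s* gets a chance to match it. The sketch below shows the raw captures for one line; the helper names parse_attr and trim and the sample line are invented for this example.

```perl
#!/usr/bin/perl -w

use strict;

# Hypothetical helper: apply the attribute pattern from the parser and
# return the two raw captures, or an empty list if the line does not match.
sub parse_attr {
	my ($line) = @_;
	return unless $line =~ /^\s*(.*)\s*=\s*(.*)\s*$/;
	return ($1, $2);
}

# Hypothetical helper: strip leading and trailing whitespace.
sub trim {
	my ($str) = @_;
	$str =~ s/^\s+//;
	$str =~ s/\s+$//;
	return $str;
}

my ($attr, $val) = parse_attr("printer   = linuxlj  ");
printf "raw:     [%s] [%s]\n", $attr, $val;
printf "trimmed: [%s] [%s]\n", trim($attr), trim($val);
```

The raw captures keep the spaces before the equals sign and at the end of the line; only after trimming do they reduce to printer and linuxlj.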

File Handles (S&P -- Chapter 11)

Like C, with its fopen() and fclose() functions, Perl provides a means of doing input and output on files. To demonstrate file I/O in Perl, consider the following script:


#!/usr/bin/perl -w

use strict;

my ($passwd, $results) = qw< /etc/passwd results.out >;

my %shells;
open FILE, $passwd or die "Cannot open password file: $!\n";
while (<FILE>) {
	chomp;
	next if /ppp/;
	$shells{(split /:/)[-1]} ++;
}
close FILE or die "Cannot close password file: $!\n";

open RES, "> $results" or die "Cannot open '$results' for write: $!\n";
for (sort { $shells{$b} <=> $shells{$a} } keys %shells ) {
	print RES "$_: $shells{$_}\n";
}
close RES or die "Cannot close '$results': $!\n";


There are several things to note about the above script:

Last modified: Fri Apr 4 14:37:49 2003