Main

March 29, 2004 (Monday)

Regular Expressions (S&P — Chapters 7/8/9, cont'd)

Anchors

Perl regular expressions also support anchors which allow you to match regular expressions that occur at certain places inside a string. The two most common anchors are ^ and $, which match the beginning and end of a line, respectively. For example, the regular expression /^hello/ would match any string that had hello at the very beginning of the string. Likewise, the regular expression /world$/ would match any string that had world at the end.

We can also represent a word boundary with the \b anchor. Therefore, the regular expression /\bhello\b/ will match the strings hello! and hello,world, but not the string othello.

The following script will display the (alphabetic) words in a file that do not contain the upper or lower case vowels (y doesn't count as a vowel in this script).

#!/usr/bin/perl -w

use strict;

while (<>) {
	for (split) {
		next unless /^[a-z]+$/i;
		print "$_\n" if /^[^aeiou]+$/i;
	}
}
nonvowels.pl

On each iteration through the while loop, we split the line into its constituent words. We then move immediately onto the next word if the word we are currently looking at does not consist entirely of alphabetic characters (each string must consist entirely of alphabetic characters due to the anchoring of the regular expression at both ends of the string). The next key word is similar to the continue keyword in C and C++ — it brings control immediately back to the top of the innermost loop (which is the for loop in this case) and starts the next iteration.

The i after the terminating forward slash is an example of a regular expression modifier in Perl. The i modifier tells the regular expression to be case insensitive. Therefore, the regular expression given will match words which consist entirely of upper and/or lower case characters. The unless ... keyword is analogous to saying if !.... Therefore, saying

next unless /^[a-z]+$/i;

is identical to:

next if ! /^[a-z]+$/i;

Note that we can do a negation on the regular expression by using the ! operator. The resulting conditional will be true if the regular expression does not match $_.

Finally, in the second statement of the for loop we will display the word if none of the characters inside the word are equal to any of the (upper- or lower-case) vowels. Note that if we had just used the regular expression without the two anchors, (i.e. /[^aeiou]+/), then this would have matched any string that had at least one non-vowel character.

Backreferences in regular expressions

In regular expressions, it is possible to use backreferences to match a substring that was remembered (by using the parentheses) earlier in the regular expression. For example, consider the following script:

#!/usr/bin/perl -w

use strict;

my $sentence = "This is is a string of text.";

print "Repeated word!: $`->$&<-$'\n" if $sentence =~ /\b(\w+)\s+\1/;
br.pl

The regular expression attempts to find the first occurrence of a repeated word in the sentence. In the specific example above, the regular expression will match the $sentence scalar — the (\w+) will match the first is and the \1 will match the second is. The \1 refers to the string matched by the regular expression in the first pair of paretheses, which was (\w+). Note that we use a word boundary anchor, \b; otherwise, the sentence "This is a string of text." would match the regular expression because the is of This would match (\w+) and the \1 would match the word is that follows This.

The above example also demonstrates the use of the $`, $' and $& special Perl variables. After a regular expression match is performed, $` matches everything in the string that occurred before the match, $' represents everything in the string that occurred after the match and $& represents the match itself. Therefore, the above program will display:

Repeated word!: This ->is is<- a string of text.

Note that backreferences \2, \3 etc. can also be used, just as long as there are an appropriate number of sub-expressions grouped by parentheses. Attempting to use a backreference for which there isn't a corresponding grouping will cause an error.

Greedy vs Non-Greedy Regular Expression Quantifiers

Regular expressions such as /.*/ are greedy by default. That is, they try to match as many characters as possible. For example, consider the following incorrect expression which attempts to break a string of characters in two strings: the first one consisting of alphabetic characters and the second consisting of digits:

print "$1/$2\n" if "name12346789" =~ m/(.*)(\d+)/

This expression will display name12345678/9. We have two greedy regular expression in competition — the first subexpression, .*, will match nearly all the characters in the string, leaving only the last (digit) character to be matched by the second subexpression \d+. We can fix this by using a non-greedy quantifier:

print "$1/$2\n" if "name12346789" =~ m/(.*?)(\d+)/

Note the question mark in the subexpression .*? — this makes the .* regular expression non-greedy. In other words, it will match only a minimal number of characters in its attempt to satisfy the match. This will leave the digits in the string to be matched by the \d+ subexpression. As another example, consider the following Perl script that displays all the HTML tags found in an HTML file:

#!/usr/bin/perl -w

use strict;

while (<>) {
	print "$1\n" while /<(.*?)>/g;
}
tags.pl

Again, we use a non-greedy regular expression in order to match the characters in between the angle brackets that denote an HTML tag. Had we used the expression:

/<(.*)>/

then this greedy regular expression would match too much. For example, if the HTML file consisted of <html>hello</html>, then the greedy regular expression would match:

< ( .* ) >;
< html>hello</html >

That is, $1 would be html>hello</html. For the non-greedy regular expression, the inner while will run twice setting $1 to html the first time and /html the second time.

Note that the regular expression is being used in a scalar context (the regular expression is used in the conditional of the while). When used in this context, the result of the regular expression will be true if the regular expression succesfully matched, or false otherwise. With the g regular expression modifier, this regular expression will attempt to match the pattern on the remainder of the string on each subsequent evaluation. When the regular expression can no longer be matched, it returns false, thereby terminating the while loop.

Substitutions

Perl also supports substitutions with the s/// operator. The general format of this operator, as given in the perlop man page is s/PATTERN/REPLACEMENT/. By default, the substitution will take place on the default variable $_. However, as with regular expression matching using /.../, we can also use the binding operator, =~, to perform substitutions on any string.

For example, to compress all the spaces in the string denoted by the variable $str, we can write $str =~ s/ +/ /. This will search the string $str looking for an occurrence of one or more contiguous spaces. It will then replace them with a single space. Unfortunately, this will only compress the first occurrence of one or more spaces in $str. To compress them all we must use the g modifier on the end of the substitution: $str =~ s/ +/ /g. Better still, we can use the \s character class to compress all whitespace characters: $str =~ s/\s+/ /g;. The following short Perl script:

$_ = "Hello World\n";
s/(..)/<$1>/g;
print;

will display:

<He><ll><o ><Wo><rl>d

(Remember that . does not match \n.)

If we wanted to do case-insensitive substitutions, we can use the i modifier which will make the matching of PATTERN case-insensitive. For example, the Perl expression s/\bmun\b/MUN/ig; will search the string $_ and replace all words such as such as mun, Mun and MuN with MUN.

As another example of substitution, consider a Perl script which reads a list of student numbers names and term marks as demonstrated by the following test file:

366533091 Cole Kent               68
402545697 Andrew West             99
544149893 Angela Johnston         93
642776563 Monique Epps            83
257622129 Darko Peter             100
033221495 Gregory Salutue         55
582335451 Ola Svallmark           64
211817030 Gina Simpson            97
951569403 Ela Whiteside           85
899563658 Brian Garrett           92
433097365 Georgett Lott           57
168213321 Candi Lilly             92
051534180 Linda Smith             61
715231817 Sara Rossy              64
183995480 Kimberlee Thomson       53
834110872 Nazmeen Gorzoch         61
276976781 Vic Melvin              56
017101413 Jack Snede              48
389869517 Hank Thomas             73
826916025 Andrew Harkin           50
marks.txt

The below script processes each line of the file, making sure that each is valid. As part of this processing, it also extracts the ID, name and mark from each line. It will then change the order of the first and last names and capitalize the last name. Note that we can remember the matches in the PATTERN part of the substitution operation and refer to them in the REPLACEMENT part by using $1, $2 etc. The \U sequence will cause all letters that occur after it to be uppercased until the \E sequence is encountered.

The line is then formatted for output. However, instead of displaying the line immediately, it is stored in the @lines array. This allows us to sort the output in a variety of ways. By default, just using the sort function will sort each line lexicographically (which gives us an ordering by student number, because all the student numbers have the same number of digits at the beginning of each of the strings in the @lines array). We can also sort by grade as well by making use of the split function inside an anonymous compare subroutine and extracting the last element from the array returned by split by using -1 as the index variable. Note that we must enclose the entire split operation in parentheses; otherwise, the indexing operation will attempt to take place on the $a and $b variables in the anonymous subroutine.

#!/usr/bin/perl -w

use strict;

my @lines;
while (<>) {
	die "Line $.: Invalid line\n" unless /^(\d{9})\s+(.*)\s+(\d+)$/;
	my ($num, $name, $mark) = ($1, $2, $3);
	$name =~  s/(\w+) (\w+)/\U$2\E, $1/;
	push @lines, sprintf "%09d %-25s %3d\n", $num, $name, $mark;
}

print "Students sorted by number:\n";
print sort @lines;

print "\nStudents sorted by decreasing mark:\n";
print sort { (split ' ', $b)[-1] <=> (split ' ', $a)[-1] } @lines;
marks.pl

The special variable $. in Perl represents the current line number of the file being read by the Perl script. The variable is useful when printing diagnostic information about an input file being parsed by Perl.

Here is the output from the above program when run on the input data given above:

A Final Example

Here is a Perl script that demonstrates a way to parse files which have a format similar to the following example:

(The parser below actually lets things through that it shouldn't but it's okay for demonstration purposes.)

#!/usr/bin/perl -w

use strict;

sub trim_spaces {
	my ($str) = @_;

	$str =~ s/\s+$//;
	$str =~ s/^\s+//;
	return $str;
}


while (<>) {
	chomp;
	next if /^\s*#/;	# Ignore lines with comments.
	next if /^\s*$/;	# Ignore empty lines.
	s/#.*//;		# Remove comments
	if (/^\s*\[(.*)\]\s*$/) {
		my $sec = &trim_spaces($1);
		print "section name: '$sec'\n";
	} elsif (/^\s*(.*)\s*=\s*(.*)\s*$/) {
		my $attr = &trim_spaces($1);
		my $val = &trim_spaces($2);
		print "attribute '$attr' equals '$val'\n";
	} else {
		print "Line $. invalid: '$_'\n";
	}
}
parse.pl

The parser examines each line in the file and skips over lines that consist of only a comment or are empty. It then strips off comments that appear on non-empty lines. The script then tests the line against a couple of regular expressions searching for a match. When it finds a match, it strips any leading or trailing spaces from the relevant substrings that were matched by the regular expression and displays them.


Last modified: March 30, 2004 15:12:28 NST (Tuesday)