Computer Science 3710 -- April 02, 2004

File tests and Directory operations (S&P — Chapters 11 and 12)

The script below, when run on one of the Computer Science machines, will generate a file in the current directory named course.html, which contains all the 3710 online course notes for the current semester in a single HTML file. You should note the following regarding the generated file:

It is quite large (over half a megabyte at the time of writing), so it may take a web browser a while to render it; it runs to about 200 pages when printed.

Clicking on the links to the code samples will not work correctly (Not Found errors will be generated).

Despite the above shortcomings, the script demonstrates many new features in Perl that we have not seen yet. In particular, it demonstrates the use of file tests, some directory operations, extracting user information from the password file, and file globbing, and it introduces how to run external system commands (e.g. wget) from Perl and capture their output. (This last topic is expanded further in subsequent chapters.)

#!/usr/bin/perl -w

use strict;

my $output = "course.html";
die "File '$output' exists!" if -e $output;
open OUT, "> $output" or die "Cannot open '$output': $!\n";

defined(my $dir = (getpwnam "donald")[7]) or die "No such user!";

$dir .= "/.www/comp3710/diary/";

undef $/;
open STYLE, "$dir/styles.css" or die "Cannot find style file: $!";
my $style = <STYLE>;
close STYLE;

print OUT <<"END";
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
	"http://www.w3.org/TR/html4/strict.dtd">
<html>
	<head>
	<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
	<title>Course Notes &mdash; Computer Science 3710 (Winter 2004)</title>
	<style type="text/css">
$style
	</style>
	</head>
<body>
END

$dir .= "2004/";
die "No such directory '$dir'" if ! -d $dir;

chdir $dir or die "Cannot change to '$dir': $!\n";

for (glob "*/*/index.php") {
	if (! -r) {
		print "File $_ is not readable.";
		last;
	} elsif (! -s) {
		print "File $_ is empty.";
		next;
	}
	my $url = "http://www.cs.mun.ca/~donald/comp3710/diary/2004/$_";
	print "Getting $url...\n";
	my $result = `/usr/bin/wget -O- $url 2>/dev/null`;
	die "Not able to retrieve web page!" unless $result;
	my ($html) = ($result =~ m{<body>(.*)</body>}s);
	print OUT $html;
	print OUT "<hr>\n"
}
print OUT "</body>\n</html>";
close OUT;
notes.pl

One of the very first things we do is open a file, course.html, for output. This file will contain all the course notes in a single HTML file. Before the file is opened, we make sure that it doesn't already exist. To do so, we use the file test operator -e and specify the filename as the operand. This operator returns a true value if the named file exists.
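As a small illustration of the -e test, the sketch below creates a scratch file, checks that it exists, removes it, and checks again. The File::Temp module and the variable names are not part of the script above; they are used here only so that the example can run anywhere without touching real files.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file so that we have something that certainly exists.
my ($fh, $filename) = tempfile();
close $fh or die "Cannot close '$filename': $!\n";

# -e returns a true value if the named file exists.
my $before = (-e $filename) ? "exists" : "missing";

# Remove the file and apply the same test again.
unlink $filename or die "Cannot remove '$filename': $!\n";
my $after = (-e $filename) ? "exists" : "missing";

print "Before unlink: $before\n";
print "After unlink:  $after\n";
```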

After the output file has been successfully opened, the script gets my home directory by consulting the password file. The getpwnam function accepts a user name and, in list context, returns a list of ten elements which represent various user attributes (do a perldoc -f getpwnam for more information), the eighth of which is the user's home directory. So we can extract the eighth element (index 7) by enclosing the call to getpwnam in parentheses and applying a subscript to it. We then append the subdirectory of the online notes to the home directory stored in $dir.
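The lookup can be tried in isolation with a sketch like the following. The helper name home_dir_of and the two user names are made up for this example; we assume a Unix-like system where a user called root is likely to exist and the second name is not.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the home directory of $user,
# or undef if the password file has no such user.
sub home_dir_of {
    my ($user) = @_;
    # Element 7 (the eighth element) of the list is the home directory.
    my $dir = (getpwnam $user)[7];
    return $dir;
}

for my $user ("root", "no-such-user-3710") {
    my $dir = home_dir_of($user);
    if (defined $dir) {
        print "Home directory of $user is $dir\n";
    } else {
        print "No such user '$user'\n";
    }
}
```

When getpwnam finds no matching user it returns an empty list, so the slice yields undef, which is why the script above wraps the assignment in defined(...).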

We then open and grab the entire contents of the style sheet file that is used by all pages of the online notes. To make this easier, we undefine the input record separator variable (undef $/). The input record separator determines the delimiter to use when doing input using the input operator <...>. The input record separator is set to newline by default, so the input operator reads a line at a time, as we have already seen. By undefining the input record separator, we are telling Perl that the input operator should read everything in one go. This is not always a wise thing to do, especially if everything consists of a 20 gigabyte file. For more information on the input record separator (and on all of Perl's so-called special variables), check out the perlvar man page.

We store the entire contents of the style sheet file in the $style variable.
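The effect of the input record separator can be seen with a short sketch. Note that it differs from the script above in two ways, both our own choices: it uses local $/ inside a block, a common way of limiting the change to that block, and it uses lexical filehandles with three-argument open. The scratch file and its contents are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a three-line scratch file to read back.
my ($tmp, $filename) = tempfile(UNLINK => 1);
print $tmp "line one\nline two\nline three\n";
close $tmp or die "Cannot close '$filename': $!\n";

# With the default separator (newline), a list-context read
# returns one element per line.
open my $in, '<', $filename or die "Cannot open '$filename': $!\n";
my @lines = <$in>;
close $in;

# With the separator undefined, a single read returns the whole file.
my $contents;
{
    local $/;    # undefined only until the end of this block
    open my $all, '<', $filename or die "Cannot open '$filename': $!\n";
    $contents = <$all>;
    close $all;
}

print "Number of lines read: ", scalar(@lines), "\n";
print "Characters slurped:   ", length($contents), "\n";
```

After the block ends, $/ is newline again, so any later line-at-a-time input behaves as usual.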

We then write appropriate HTML header lines to our output file. To do this, we use a here document:

print OUT <<"END";
	...
END

Everything between the two END delimiters will be written to the output file verbatim. Note, however, that interpolation of the $style scalar variable still takes place inside the here document, because the END delimiter is enclosed in double quotes. Remember that there is no comma after the file handle (which is OUT in the code above).
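To see the interpolation at work, the sketch below stores two here documents in scalars instead of printing them to a file. The single-quoted form is not used in the script above; it is included only for contrast, and the style rule is a made-up placeholder.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $style = "body { color: black; }";    # placeholder style rule

# Double-quoted delimiter: $style is interpolated.
my $interpolated = <<"END";
<style type="text/css">
$style
</style>
END

# Single-quoted delimiter: the text is taken literally.
my $literal = <<'END';
<style type="text/css">
$style
</style>
END

# No comma between the filehandle and the data being printed.
print STDOUT $interpolated;
print STDOUT $literal;
```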

We then update the $dir variable to denote the base directory of the actual diary entries. We then use the -d file test on the resulting string to make sure that the string represents a valid directory on the file system. If so, we change to that directory using the chdir Perl function. If changing to the directory fails, we terminate the script with an appropriate error message (that includes the contents of the special variable $!, which indicates why the chdir function failed).
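The following sketch exercises -d and chdir on a scratch directory. The Cwd and File::Temp modules, and the name of the nonexistent subdirectory, are assumptions made for the example and do not appear in the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

my $start = getcwd();                 # remember where we began
my $dir   = tempdir(CLEANUP => 1);    # a directory that certainly exists
my $bogus = "$dir/no-such-subdirectory";

# -d is true only if the name exists and is a directory.
my $dir_ok   = (-d $dir)   ? 1 : 0;
my $bogus_ok = (-d $bogus) ? 1 : 0;

chdir $dir or die "Cannot change to '$dir': $!\n";
print "Now in ", getcwd(), "\n";

# A failed chdir returns false and sets $! to the reason.
my $error = "";
unless (chdir $bogus) {
    $error = "$!";
    print "Cannot change to '$bogus': $error\n";
}

# Go back so the scratch directory can be cleaned up at exit.
chdir $start or die "Cannot change to '$start': $!\n";
```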

If the chdir function succeeds, we then start our for loop, giving it the expression glob "*/*/index.php". The glob function will return a list of all the files that match the specified glob pattern. The two asterisks will match all the month and day subdirectories in the diary directory. The index.php, of course, will match each actual diary file itself. Do not confuse glob patterns with regular expression patterns; the two are different. Note that we could have used the glob pattern:

for (glob "$dir/*/*/index.php") {
...
}

for our for loop. This would make the call to chdir unnecessary, but at the same time would make each string in the resulting list a fair bit longer.
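The two approaches can be compared on a small scratch tree. Everything about the tree below (the month and day names, the extra notes.txt file) is invented for the example, and the absolute pattern assumes that the scratch directory name contains no whitespace or glob metacharacters.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Path qw(make_path);
use File::Temp qw(tempdir);

my $start = getcwd();
my $dir   = tempdir(CLEANUP => 1);

# Build a tiny month/day tree with an index.php in each day directory.
for my $entry ("01/05", "01/07", "02/03") {
    make_path("$dir/$entry");
    open my $fh, '>', "$dir/$entry/index.php"
        or die "Cannot create '$dir/$entry/index.php': $!\n";
    print $fh "placeholder\n";
    close $fh;
}

# A file that the pattern should not match.
open my $other, '>', "$dir/02/03/notes.txt"
    or die "Cannot create '$dir/02/03/notes.txt': $!\n";
close $other;

# Without chdir: every result carries the full directory prefix.
my @absolute = glob "$dir/*/*/index.php";

# With chdir: the results are short relative names.
chdir $dir or die "Cannot change to '$dir': $!\n";
my @relative = glob "*/*/index.php";
chdir $start or die "Cannot change to '$start': $!\n";

print "Relative: $_\n" for @relative;
print "Absolute: $_\n" for @absolute;
```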

Inside the for loop we perform a couple of file tests to make sure that the file is readable, using -r, and is of non-zero size, using ! -s. The file test operators will use the default variable $_ when no operand is given, but you have to be careful when doing this (see p. 162 of S&P). Note that Perl uses the elsif keyword and not else if; the latter actually causes a syntax error in Perl. Also note that if we encounter a file that we cannot read, we print out an error message and terminate the for loop using the last operator. This operator is similar to break in C and C++. If, however, the file is empty, the next operator is used, which will skip over the rest of the loop body and move to the next file in the globbed list.
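The control flow of these tests can be tried out on a few scratch files. The file names below are made up for the example; missing.txt is deliberately never created so that the -r test fails on it (an unreadable file would behave the same way, but is awkward to set up portably).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

my $start = getcwd();
my $dir   = tempdir(CLEANUP => 1);
chdir $dir or die "Cannot change to '$dir': $!\n";

# Three files with content and one empty file.
for my $name ("full.txt", "also-full.txt", "never-reached.txt") {
    open my $fh, '>', $name or die "Cannot create '$name': $!\n";
    print $fh "some text\n";
    close $fh;
}
open my $empty, '>', "empty.txt" or die "Cannot create 'empty.txt': $!\n";
close $empty;

my (@processed, @skipped);
for ("full.txt", "empty.txt", "also-full.txt", "missing.txt", "never-reached.txt") {
    if (! -r) {             # -r tests $_ when given no operand
        print "File $_ is not readable.\n";
        last;               # leave the loop entirely
    } elsif (! -s) {        # -s is false for a zero-length file
        push @skipped, $_;
        next;               # go on to the next file name
    }
    push @processed, $_;
}

chdir $start or die "Cannot change to '$start': $!\n";
print "Processed: @processed\n";
print "Skipped:   @skipped\n";
```

Because last ends the loop at missing.txt, the final name in the list is never examined, even though that file exists and has content.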

We then use the name of the file to construct a URL which we can use to retrieve the web page. We run the (external) wget program to acquire the web page. This can be done by including the command inside backquotes:

my $result = `/usr/bin/wget -O- $url 2>/dev/null`;

This will run the wget program and store the standard output from the program in the $result scalar. (The -O- option to wget causes it to write the retrieved HTML file to standard output rather than write it to a file.)
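Backquotes can be tried out with harmless commands instead of wget. The sketch below assumes a Unix-like system with the usual echo and ls commands; the commands and the nonexistent path are chosen only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scalar context: all of standard output comes back as one string.
my $result = `echo hello`;

# List context: one element per line of output.
my @lines = `echo one; echo two`;

# Standard error is discarded by the shell redirection, so a failing
# command simply leaves us with an empty (false) result.
my $quiet  = `ls /no/such/directory-3710 2>/dev/null`;
my $status = $? >> 8;    # exit status of the last command

print "Scalar result: $result";
print "Lines captured: ", scalar(@lines), "\n";
print "Failed command gave no output (exit status $status)\n" unless $quiet;
```

This is why the script above can test the captured output with unless $result: an empty string is false in Perl.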

If we successfully retrieved data from the web server, then we execute the somewhat cryptic statement:

my ($html) = ($result =~ m{<body>(.*)</body>}s);

We can break this statement down into a couple of steps:

We take the HTML that was retrieved by wget and match it against the regular expression m{<body>(.*)</body>}s. This will match the body of the HTML file and will remember the body's contents (without the two body tags). Note that we use the s modifier for the regular expression. This will cause the . character to match any character, including the newline character. (Remember that dot does not match newline by default.)
We then perform the outer-most assignment:
```
	my ($html) = result of match
```
Because the variable $html is enclosed in parentheses, the assignment is going to take place in list context. When interpreted in list context, the result of a match will be a list containing all the substrings that were matched by the "capturing" parentheses in the regular expression. When interpreted in scalar context, the result of the match would be simply true or false. Because we want to get the substring matched by the (.*) regular expression, we must do this assignment in list context. As a result of this assignment, $html will be equal to the text delimited by the <body>...</body> tags in the HTML file. If, instead, we had written:
```
	my $html = result of match
```
then $html would be set to 1 (assuming the match succeeds) and the generated output would not be what we expected.
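The difference between the two assignments, and the effect of the s modifier, can be checked on a small hand-written string. The HTML below is a made-up stand-in for what wget would return.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A made-up stand-in for a retrieved page.
my $result = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n";

# Parentheses around $html: the match returns the captured substring.
my ($html) = ($result =~ m{<body>(.*)</body>}s);

# No parentheses: the match returns only a true or false value.
my $flag = ($result =~ m{<body>(.*)</body>}s);

# Without the s modifier, dot cannot cross the newlines, so the
# match fails and nothing is captured.
my ($no_s) = ($result =~ m{<body>(.*)</body>});

print "Captured body: $html";
print "Scalar result: $flag\n";
print "Without s:     ", (defined $no_s ? $no_s : "no match"), "\n";
```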

At the end of the for loop body, we write the extracted body of the HTML file that was retrieved by wget to the output file.

After all the files have been processed, we output appropriate end tags to finish our generated HTML file and we close the output file. If the program died before we had a chance to close the output file (for example, if we couldn't change to the diary directory), then Perl would close the output file for us, flushing any pending output to the file. Perl will also close a file automatically if you attempt to re-open its corresponding filehandle.

Once you've generated the web page, you can load it into a web browser such as firefox or konqueror and "print" it to a PostScript file. (This may take a minute or two.) Then, you can use the psnup utility to convert the PostScript file into another PostScript file that has two pages per sheet:

$ psnup -2 -pletter course.ps > course-2up.ps
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] 
[18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] 
[33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44] [45] [46] [47] 
[48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58] [59] [60] [61] [62] 
[63] [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] 
[78] [79] [80] [81] [82] [83] [84] [85] [86] [87] [88] [89] [90] [91] [92] 
[93] Wrote 93 pages, 5128846 bytes
$

The resulting course-2up.ps file can then be printed.

Last modified: April 7, 2004 15:43:00 NDT (Wednesday)
