Friday, April 04, 2003

File tests and Directory operations (S&P -- Chapters 11 and 12)

The script below, when run on one of the CS machines, generates a file named course.html in the current directory. That file contains all the 3710 course notes in a single web page.

There are, however, a few problems with the generated file:

It is quite large (nearly 1/3 of a megabyte at the time of writing), so a web browser may take a while to render it (it runs to about 125 pages when printed).
It is not strictly HTML compliant (there is no DOCTYPE line for example).
The links to the code samples will not work correctly.
The style sheet file is not present. Therefore, amongst other problems, the code samples will be displayed underlined in most browsers.

Despite the above shortcomings, the script demonstrates many new features in Perl that we have not seen yet. In particular, it demonstrates the use of file tests, some directory operations, extracting user information from the password file, file globbing, and the special variable $/.


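#!/usr/bin/perl -w

use strict;

# Refuse to overwrite an existing output file.
my $output = "course.html";
die "File '$output' exists!" if -e $output;
open OUT, "> $output" or die "Cannot open '$output': $!\n";

print OUT "<html>\n<body>";

# Element 7 of the password entry is the user's home directory.
defined(my $dir = (getpwnam("donald"))[7]) or die "No such user!";

$dir .= "/.www/comp3710/diary/2003";
die "No such directory '$dir'" if ! -d $dir;

chdir $dir or die "Cannot change to '$dir': $!\n";

# With no input record separator, <HTML> reads a whole file at once.
undef $/;

for (glob "*/*/index.html") {
	if (! -r) {
		print "File $_ is not readable.\n";
		last;
	} elsif (! -s) {
		print "File $_ is empty.\n";
		next;
	}
	open HTML, $_ or die "Cannot open file '$_': $!\n";
	# The parentheses around $html force list context, so it gets the captured text.
	my ($html) = (<HTML> =~ m{<body>(.*)</body>}s);
	close HTML;
	# Skip files with no <body>...</body> section instead of printing undef.
	print OUT $html if defined $html;
}
print OUT "</body>\n</html>";
close OUT;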

One of the very first things we do is open a file (course.html) for output. This file will contain all the HTML course notes. Before opening the file, we make sure that it doesn't already exist. To do so, we use the file test operator -e with the filename as its operand. This operator returns a true value if the named file exists.
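As a small illustration of the -e test (a sketch only, not part of the script above), the following checks for a file before and after creating it. The scratch directory and the lexical file handle $out are choices made for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch directory, so the example never touches real files.
my $scratch = tempdir(CLEANUP => 1);
my $file    = "$scratch/course.html";

# Nothing has been created yet, so -e is false here.
my $before = (-e $file) ? "exists" : "missing";

# Opening a file for output creates it.
open my $out, '>', $file or die "Cannot open '$file': $!\n";
close $out;

# Now the same test is true.
my $after = (-e $file) ? "exists" : "missing";

print "before: $before, after: $after\n";
```

This should print "before: missing, after: exists".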
After the output file has been successfully opened, we write a couple of minimal HTML header lines to it. Remember that there is no comma after the file handle (which is OUT in the code above).
The script then gets my home directory by consulting the password file. The getpwnam function accepts a user name and returns a list of ten elements that represent various user attributes (do a perldoc -f getpwnam for more information), the eighth of which is the home directory. We can extract the eighth element by enclosing the call to getpwnam in parentheses and applying the subscript [7] to the result (subscripts start at zero).
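The slice can be tried out on its own. This sketch assumes a Unix-like machine with a root account. The user name no-such-user-3710 is made up for the example and is assumed not to exist.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A list slice: parenthesise the call, then subscript the returned list.
# Subscript 7 is the eighth element, the home directory.
my $home = (getpwnam("root"))[7];

# Several elements can be sliced at once: login name, uid, home directory, shell.
my ($name, $uid, $dir, $shell) = (getpwnam("root"))[0, 2, 7, 8];

# For an unknown user getpwnam returns an empty list, so the slice is undefined.
my $missing = (getpwnam("no-such-user-3710"))[7];

print "root's home directory is $home\n" if defined $home;
print "uid of $name is $uid\n"           if defined $name;
print "no entry for the made-up user\n"  unless defined $missing;
```

The undefined result for an unknown user is what the defined(...) test in the script relies on.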
We then append the subdirectory to my home directory to get the directory containing all the diary entries. We use the -d file test on the resulting string to make sure that it names a valid directory on the file system. If so, we change to that directory using Perl's chdir function. If changing to the directory fails, we terminate the script with an appropriate error message, which includes the contents of the special variable $! that indicates why chdir failed.
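Here is a minimal sketch of the -d test and of chdir failing and succeeding. It uses a temporary directory in place of the diary directory, and the name no-such-subdir is invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

# Hypothetical scratch directory standing in for the diary directory.
my $dir   = tempdir(CLEANUP => 1);
my $bogus = "$dir/no-such-subdir";

# -d is true only for a name that exists and is a directory.
my $dir_ok   = (-d $dir)   ? "yes" : "no";
my $bogus_ok = (-d $bogus) ? "yes" : "no";

my $start = getcwd();

# chdir returns false on failure and leaves the reason in $!.
my $reason = "";
unless (chdir $bogus) {
    $reason = "$!";
}

# A successful chdir changes the working directory of the script.
chdir $dir or die "Cannot change to '$dir': $!\n";
chdir $start or die "Cannot change back to '$start': $!\n";

print "-d on the real directory: $dir_ok\n";
print "-d on the bogus directory: $bogus_ok\n";
print "chdir to the bogus directory failed: $reason\n";
```

The exact text of $! depends on the system, so the last line of output will vary.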
If the chdir function succeeds, we then undefine the input record separator variable (undef $/). The input record separator determines the delimiter used when reading with the input operator < ... >. It is set to newline by default, so the input operator reads a line at a time, as we have already seen. By undefining the separator, we are telling perl that the input operator should read everything that remains in the file in one go. This is not always a wise thing to do, especially if everything consists of a 20 gigabyte file. However, as we will see below, it makes the extraction of the relevant HTML portions of our diary files easier.
For more information on the input record separator (and on all of perl's "special variables"), check out the perlvar man page.
We then start our for loop, giving it the expression glob "*/*/index.html". The glob function returns a list of all the files that match the specified glob pattern. The first two asterisks match all the month and day subdirectories in the diary directory. The index.html, of course, matches each actual diary file itself. Note that we could have used the glob pattern:
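To see the effect of $/ in isolation, this sketch reads the same three-line temporary file twice. Note one deliberate difference from the script: it uses local $/ inside a do block, a common variation that restores the default separator afterwards, rather than a global undef $/.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical temporary file with three lines of text.
my ($fh, $path) = tempfile(UNLINK => 1);
print $fh "line one\nline two\nline three\n";
close $fh or die "Cannot close '$path': $!\n";

# Default behaviour: $/ is a newline, so each read returns one line.
open my $in, '<', $path or die "Cannot open '$path': $!\n";
my @lines = <$in>;
close $in;

# With $/ undefined, a single read returns the whole file.
# local limits the change to this block.
open $in, '<', $path or die "Cannot open '$path': $!\n";
my $whole = do { local $/; <$in> };
close $in;

printf "%d lines read one at a time\n", scalar @lines;
printf "%d characters read in one go\n", length $whole;
```

Outside the do block the separator is a newline again, so later reads go back to working line by line.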
```
for (glob "$dir/*/*/index.html") {
```
for our for loop. This would make the call to chdir unnecessary, but at the same time would make each string in the list a fair bit longer.
Inside the for loop we perform a couple more file tests to make sure that the file is readable (-r) and is not empty (-s is true for a file of non-zero size, so ! -s is true for an empty one). The file test operators will use the default variable, but you have to be careful when doing this (see p. 162 of S&P). Note that perl uses elsif and not else if; the latter actually causes a syntax error in perl. Also note that if we encounter a file that we cannot read, we print out an error message and terminate the for loop using the last operator. This operator is similar to break in C and C++. If, however, the file is empty, the next operator is used, which skips over the rest of the loop body and moves on to the next file in the globbed list.
We then open the HTML file for reading (note that open does not use $_ by default, so we have to specify it explicitly). If the open succeeds, then we execute the somewhat cryptic command:
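The two forms of the pattern can be compared on a small made-up directory tree. In this sketch the month and day names and the empty-day directory are invented, and the temporary directory path is assumed to contain no whitespace (glob splits its pattern on whitespace).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Hypothetical diary tree: month/day/index.html.
my $root = tempdir(CLEANUP => 1);
for my $day ("03/28", "04/01", "04/04") {
    make_path("$root/$day");
    open my $fh, '>', "$root/$day/index.html"
        or die "Cannot create '$root/$day/index.html': $!\n";
    print $fh "<html><body>entry for $day</body></html>\n";
    close $fh;
}
make_path("$root/04/empty-day");    # no index.html here, so it is not matched

# Pattern with the directory included: no chdir needed, but longer names.
my @absolute = glob "$root/*/*/index.html";

# Pattern relative to the current directory: requires a chdir first.
my $start = getcwd();
chdir $root or die "Cannot change to '$root': $!\n";
my @relative = glob "*/*/index.html";
chdir $start or die "Cannot change back to '$start': $!\n";

print "$_\n" for @relative;
```

Both lists name the same three files. The relative names are simply shorter, and glob returns them in sorted order.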
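The difference between last and next can be seen in this sketch, which runs a similar loop over four made-up file names. It names the operand of each file test explicitly instead of relying on the default variable, and it uses a missing file to stand in for an unreadable one, since -r is also false for a file that does not exist.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical files: a.html has text, b.html is empty, c.html is never created.
my $dir     = tempdir(CLEANUP => 1);
my %content = ("a.html" => "first\n", "b.html" => "", "d.html" => "fourth\n");
for my $name (keys %content) {
    open my $fh, '>', "$dir/$name" or die "Cannot create '$dir/$name': $!\n";
    print $fh $content{$name};
    close $fh;
}

my (@processed, @skipped);
my $stopped_at = "";

for ("a.html", "b.html", "c.html", "d.html") {
    my $path = "$dir/$_";
    if (! -r $path) {
        $stopped_at = $_;       # cannot read it: leave the loop altogether
        last;
    } elsif (! -s $path) {
        push @skipped, $_;      # empty: move on to the next name
        next;
    }
    push @processed, $_;
}

print "processed: @processed\n";
print "skipped:   @skipped\n";
print "stopped at $stopped_at\n";
```

Here a.html is processed, b.html is skipped by next, and the loop ends at c.html because of last, so d.html is never examined.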
```
my ($html) = (<HTML> =~ m{<body>(.*)</body>}s);
```
We can break this statement down into a few steps:
1. First, we use the input operator (<HTML>) to do input from the file we just opened. Remember that because we undeffed the input record separator, this will input the entire contents of the file.
2. We then take all this input and match it against the regular expression m{<body>(.*)</body>}s. This will match the body of the HTML file and will save the body (without the two delimiters). Note that we use the s option modifier for the regular expression. This will cause the . character to match any character including the newline character. (Remember that dot does not match newline by default).
3. We then perform the outer-most assignment:
```
	my ($html) = result of match
```
  Because the variable $html is enclosed in parentheses, the assignment takes place in list context. In list context, the result of a match is a list containing all the substrings that were matched by the "capturing" parentheses in the regular expression. In scalar context, the result of the match is simply true or false. Because we want to get the substring matched by the (.*) part of the regular expression, we must do this assignment in list context. As a result of this assignment, $html will be equal to the text delimited by the <body>...</body> tags in the HTML file. If we instead had written:
```
	my $html = result of match
```
  Then $html would be set to 1 (the true value returned by a successful match) and the generated output would not be what we expected.
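The difference between the two assignments, and the role of the s modifier, can be checked with a short sketch. The string in $page is a made-up stand-in for the contents of a diary file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical page contents, as if read in one go from a file.
my $page = "<html>\n<body>\n<p>Hello</p>\n</body>\n</html>\n";

# Parentheses on the left: the match returns the captured text.
my ($body) = ($page =~ m{<body>(.*)</body>}s);

# No parentheses on the left: the match returns only true or false.
my $flag = $page =~ m{<body>(.*)</body>}s;

# Without the s modifier, . cannot cross the newlines, so nothing matches.
my ($no_s) = ($page =~ m{<body>(.*)</body>});

print "captured text: $body";
print "scalar result: $flag\n";
print "without s: no match\n" unless defined $no_s;
```

The first assignment stores the text between the tags, newlines included. The second stores 1, and the third leaves its variable undefined.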
At the end of the for loop body, we close the HTML file and print its extracted body to the output file.
After all the files have been processed, we output an appropriate HTML footer and close the output file. If the program dies before we have a chance to close the output file (for example, if we can't change to the diary directory), then perl will close the output file for us, flushing any pending output to it. Perl will also close a file automatically if you attempt to re-open its handle.

Last modified: Fri Apr 4 15:31:09 2003