The script below, when run on one of the Computer Science machines, will
generate a file named course.html in the current directory. This file
contains all the 3710 online course notes for the current semester in a
single HTML file. You should note the following regarding the generated
file:
Despite the above shortcomings, the script demonstrates many new features
in Perl that we have not seen yet. In particular, it demonstrates the
use of file tests, some directory operations, extracting user information
from the password file, and file globbing, and it introduces how we run
and store the output of external system commands (e.g. wget) in Perl.
(This last topic is expanded further in subsequent chapters.)
#!/usr/bin/perl -w
use strict;
my $output = "course.html";
die "File '$output' exists!" if -e $output;
open OUT, "> $output" or die "Cannot open '$output': $!\n";
defined(my $dir = (getpwnam "donald")[7]) or die "No such user!";
$dir .= "/.www/comp3710/diary/";
undef $/;
open STYLE, "$dir/styles.css" or die "Cannot find style file: $!";
my $style = <STYLE>;
close STYLE;
print OUT <<"END";
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
<title>Course Notes — Computer Science 3710 (Winter 2004)</title>
<style type="text/css">
$style
</style>
</head>
<body>
END
$dir .= "2004/";
die "No such directory '$dir'" if ! -d $dir;
chdir $dir or die "Cannot change to '$dir': $!\n";
for (glob "*/*/index.php") {
    if (! -r) {
        print "File $_ is not readable.\n";
        last;
    } elsif (! -s) {
        print "File $_ is empty.\n";
        next;
    }
    my $url = "http://www.cs.mun.ca/~donald/comp3710/diary/2004/$_";
    print "Getting $url...\n";
    my $result = `/usr/bin/wget -O- $url 2>/dev/null`;
    die "Not able to retrieve web page!" unless $result;
    my ($html) = ($result =~ m{<body>(.*)</body>}s);
    print OUT $html;
    print OUT "<hr>\n";
}
print OUT "</body>\n</html>";
close OUT;
notes.pl
The script first opens a file, course.html, for output. This file will
contain all the course notes in a single HTML file. Before the file is
opened, we make sure that it doesn't already exist. To do so, we use the
file test operator -e and specify the filename as the operand. This
operator returns a true value if the specified file exists.
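As a small illustration of the -e test, the sketch below checks for a file before and after creating it. The scratch directory and the name demo.html are hypothetical and exist only for this demonstration; they are not part of the course script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch directory and file name, used only for this demo.
my $tmp  = tempdir(CLEANUP => 1);
my $file = "$tmp/demo.html";

# -e is false here because nothing has created the file yet.
my $before = -e $file ? "exists" : "missing";

open my $out, '>', $file or die "Cannot open '$file': $!\n";
print $out "<p>hello</p>\n";
close $out;

# The same test is now true.
my $after = -e $file ? "exists" : "missing";

print "before: $before, after: $after\n";
```

Note that a check followed by a separate open leaves a short window in which another process could create the file; for a one-off script like notes.pl that is usually acceptable.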
The getpwnam function accepts a user name and returns a list of ten
elements which represent various user attributes (do a
perldoc -f getpwnam for more information), the eighth of which is the
user's home directory. So we can extract the eighth element (index 7) by
enclosing the call to getpwnam in parentheses and applying the array
indexing operator to it. We then append the subdirectory of the online
notes to the home directory stored in $dir.
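The following sketch shows the same list-indexing idiom in isolation. It assumes a Unix-like system with a "root" account; that user name is chosen only because it exists on most such systems and is not taken from the course script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Long form: keep the whole list, then index it.
# ("root" is an assumption; substitute any user name on your system.)
my @fields = getpwnam("root");
die "No such user!\n" unless @fields;
my $home_via_array = $fields[7];

# Short form, as in notes.pl: index the list returned by the call directly.
defined(my $home = (getpwnam "root")[7]) or die "No such user!\n";

print "home directory: $home\n";
```

The parentheses around the call are what turn the returned list into something that can be subscripted; writing getpwnam("root")[7] without them is a syntax error.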
Next, we undefine the input record separator (undef $/).
The input record separator determines the delimiter to use when doing
input using the input operator <...>.
The input record separator is set to newline by default; therefore the
input operator will do input a line at a time, as we have already seen.
By undefining the input record separator, we are telling Perl
that the input operator should read the entire file at once. This is not
always a wise thing to do, especially if the file happens to be
20 gigabytes in size. For more information on the input record separator
(and on all of Perl's so-called special variables), check out the
perlvar man page.
We store the entire contents of the style sheet file in the
$style variable.
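The script undefines $/ globally, which affects every later read. A common alternative, sketched below, is to use local so the change lasts only for one block. The slurp subroutine and the temporary file are hypothetical names introduced for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small three-line file to read back (demo data only).
my ($fh, $path) = tempfile(UNLINK => 1);
print $fh "line one\nline two\nline three\n";
close $fh;

# Hypothetical helper: read a whole file into one scalar.
sub slurp {
    my ($file) = @_;
    open my $in, '<', $file or die "Cannot open '$file': $!\n";
    local $/;            # undefined only until this subroutine returns
    my $text = <$in>;    # one read returns the entire file
    close $in;
    return $text;
}

my $all = slurp($path);

# Outside slurp, $/ is newline again, so we get one element per line.
open my $in, '<', $path or die "Cannot open '$path': $!\n";
my @lines = <$in>;
close $in;

printf "slurped %d characters; then read %d separate lines\n",
    length($all), scalar(@lines);
```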
We then write the beginning of the HTML file using a here document:
print OUT <<"END"; ... END
Everything between the two END delimiters is written to the output file
verbatim. Note, however, that interpolation of the $style
scalar variable still takes place inside the here document.
Remember that there is no comma after the file handle (which is
OUT in the code above).
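To make the interpolation rule concrete, the sketch below builds one here document with a double-quoted delimiter and one with a single-quoted delimiter. The CSS string is made-up sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $style = "body { color: black; }";   # sample data, not the real style sheet

# Double-quoted delimiter: $style is interpolated.
my $page = <<"END";
<style type="text/css">
$style
</style>
END

# Single-quoted delimiter: the text is taken literally.
my $raw = <<'END';
$style
END

# No comma between the file handle and the list being printed.
print STDOUT $page;
print STDOUT $raw;
```

The closing END must appear on a line by itself, starting in the first column, for these forms of here document.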
We then append 2004/ to the $dir variable to denote the base
directory of the actual diary entries. We use the -d
file test on the resulting string to make sure that the string represents
a valid directory on the file system. If so, we change to that
directory using Perl's chdir function. If changing
to the directory fails, we terminate the script with an
appropriate error message that includes the contents of the special
variable $!, which indicates why the chdir function failed.
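A minimal sketch of the -d test and of a failing chdir follows. The temporary directory and the deliberately missing subdirectory are assumptions made for the demonstration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

my $start   = getcwd();
my $dir     = tempdir(CLEANUP => 1);        # exists
my $missing = "$dir/no-such-directory";     # deliberately does not exist

my $is_dir      = -d $dir     ? 1 : 0;
my $missing_dir = -d $missing ? 1 : 0;

chdir $dir or die "Cannot change to '$dir': $!\n";

# chdir returns false on failure and sets $! to the reason.
my $error = chdir($missing) ? "" : "$!";

print "-d on real directory: $is_dir, on missing one: $missing_dir\n";
print "chdir to missing directory failed: $error\n";

# Go back so the temporary directory can be cleaned up.
chdir $start or die "Cannot change back to '$start': $!\n";
```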
If the chdir function succeeds, we then start
our for loop, giving it the expression glob "*/*/index.php".
The glob function returns a list
of all the files that match the specified glob pattern.
The two asterisks match all the month and day subdirectories
in the diary directory. The index.php,
of course, matches each actual diary file itself. Do not confuse
glob patterns with regular expression patterns; the
two are different. Note that we could have used the glob pattern:
for (glob "$dir/*/*/index.php") { ... }
for our for loop. This would make the call to chdir
unnecessary, but at the same time would make each string in the
returned list a fair bit longer.
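The sketch below compares the two approaches on a small directory tree built for the purpose. The month/day directory names and the extra notes.txt files are invented sample data, and the absolute pattern assumes the temporary directory path contains no whitespace, since glob splits its argument on whitespace.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Path qw(make_path);
use File::Temp qw(tempdir);

my $start = getcwd();
my $root  = tempdir(CLEANUP => 1);

# Build month/day directories, each with an index.php and one other file.
for my $day ("03/31", "04/05") {
    make_path("$root/$day");
    for my $name ("index.php", "notes.txt") {
        open my $fh, '>', "$root/$day/$name" or die "Cannot create file: $!\n";
        print $fh "placeholder\n";
        close $fh;
    }
}

# Absolute pattern: no chdir needed, but every result carries the prefix.
my @absolute = glob "$root/*/*/index.php";

# Relative pattern after chdir: short names, as in notes.pl.
chdir $root or die "Cannot change to '$root': $!\n";
my @relative = glob "*/*/index.php";
chdir $start or die "Cannot change back to '$start': $!\n";

print "$_\n" for @relative;
printf "absolute names are %d characters longer\n",
    length($absolute[0]) - length($relative[0]);
```

In a glob pattern, * matches any run of characters within one path component, whereas in a regular expression * means "zero or more of the preceding item".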
Inside the loop, we first make sure that each file is readable, using -r,
and is of non-zero size, using -s (both tests are negated with ! in the
code). The file test operators will
use the default variable, but you have to be careful when doing this
(see p. 162 of S&P). Note that Perl uses the elsif
keyword and not else if; the latter actually causes a syntax
error in Perl. Also note that if we encounter a file that we cannot
read, we print out an error message and terminate the for
loop using the last operator. This operator is similar
to break in C and C++. If, however, the file is empty,
the next operator is used, which will skip over the rest of
the loop body and move to the next file in the globbed list.
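The difference between last and next can be seen in the sketch below, which uses the same tests on the default variable $_. The file names are invented; a nonexistent file stands in for an unreadable one, because -r is also false for a file that does not exist and this works even when the script is run by root.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);

# Three files with content, one empty file; missing.txt is never created.
for my $name (qw(a.txt b.txt c.txt)) {
    open my $fh, '>', "$dir/$name" or die "Cannot create '$name': $!\n";
    print $fh "some text\n";
    close $fh;
}
open my $fh, '>', "$dir/empty.txt" or die "Cannot create 'empty.txt': $!\n";
close $fh;

my @log;
for (map { "$dir/$_" } qw(a.txt empty.txt b.txt missing.txt c.txt)) {
    if (! -r) {             # tests $_; false for the missing file
        push @log, "stop";
        last;               # leave the loop entirely, like break in C
    } elsif (! -s) {        # tests $_; true only for non-empty files
        push @log, "skip";
        next;               # go on to the next file
    }
    push @log, "ok";
}

print "@log\n";
```

Because last ends the loop at missing.txt, c.txt is never examined even though it is readable and non-empty.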
We then use the wget program to acquire the web page. This can be done
by including the command inside backquotes:
my $result = `/usr/bin/wget -O- $url 2>/dev/null`;
This will run the wget program and store the standard
output from the program in the $result scalar.
(The -O- option to wget causes it
to write the retrieved HTML file to standard output rather than
to a file.)
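The behaviour of backquotes can be tried without network access. The sketch below assumes a POSIX shell with the echo and true commands available and uses them in place of wget.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Scalar context: the command's entire standard output in one string.
my $text   = `echo hello 2>/dev/null`;
my $status = $? >> 8;          # exit status of the command just run

# List context: one element per line of output.
my @lines = `echo one; echo two`;

# A command that prints nothing yields an empty (false) string, which is
# why a test such as "die ... unless $result" catches a failed download.
my $empty = `true`;

print "text: $text";
print "exit status: $status, lines: ", scalar(@lines), "\n";
print "empty output is false\n" unless $empty;
```

The command string is interpolated like a double-quoted string before the shell sees it, so variables such as $url are expanded by Perl, not by the shell.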
my ($html) = ($result =~ m{<body>(.*)</body>}s);
We can break this statement down into a couple of steps:
First, we take the output of wget and match it against the regular
expression m{<body>(.*)</body>}s. This will match the
body of the HTML file and will remember the body's contents (without the
two body tags). Note that we use the s option
modifier for the regular expression. This will cause the .
character to match any character including the newline character.
(Remember that dot does not match newline by default.)
my ($html) = result of match
Because the variable $html is enclosed in parentheses, the
assignment is going to take place in list context (also called array
context). When interpreted in list context, the result of a match will be
a list containing all the substrings that were matched by the "capturing"
parentheses in the regular expression. When interpreted in scalar context,
the result of the match would be simply true or false. Because we want to
get the substring matched by the (.*) part of the regular expression, we
must do this assignment in list context. As a result of this
assignment, $html will be equal to the text delimited by
the <body>...</body> tags in
the HTML file. If, instead, we had written:
my $html = result of match
then $html would be set to 1 and the generated
output would not be what we expected.
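The two contexts can be compared directly. The sketch below uses a short made-up HTML string in place of real wget output.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample data standing in for the output of wget.
my $result = "<html><body>\n<p>Hi</p>\n</body></html>";

# List context: the parentheses around $html make the match return
# its captured substrings.
my ($html) = ($result =~ m{<body>(.*)</body>}s);

# Scalar context: the match only reports success or failure.
my $flag = ($result =~ m{<body>(.*)</body>}s);

# Without the s modifier, dot cannot cross the newlines, so nothing matches.
my ($no_s) = ($result =~ m{<body>(.*)</body>});

print "captured: [$html]\n";
print "scalar result: $flag\n";
print "without /s: ", (defined $no_s ? "matched" : "no match"), "\n";
```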
Finally, we print the body text extracted from the output of wget,
followed by an <hr> separator, to the output file.
Once you've generated the web page, you can load it into a web
browser such as firefox or konqueror and "print"
it to a PostScript file. (This may take a minute or two.) Then, you
can use the psnup utility to convert the PostScript file
into another PostScript file that has two pages per sheet:
$ psnup -2 -pletter course.ps > course-2up.ps
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16]
[17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30]
[31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41] [42] [43] [44]
[45] [46] [47] [48] [49] [50] [51] [52] [53] [54] [55] [56] [57] [58]
[59] [60] [61] [62] [63] [64] [65] [66] [67] [68] [69] [70] [71] [72]
[73] [74] [75] [76] [77] [78] [79] [80] [81] [82] [83] [84] [85] [86]
[87] [88] [89] [90] [91] [92] [93] Wrote 93 pages, 5128846 bytes
$
The resulting course-2up.ps file can then be printed.
Last modified: April 7, 2004 15:43:00 NDT (Wednesday)