The script below, when run on one of the CS machines, will generate a file
in the current directory named course.html, which contains
all the 3710 course notes in a single web page.
There are, however, a few problems with the generated file
(it has no DOCTYPE line, for example).
Despite the above shortcomings, the script demonstrates many new features
in Perl that we have not seen yet. In particular, it demonstrates
the use of file tests, some directory operations, extracting user
information from the password file, file globbing, and the special
variable $/.
#!/usr/bin/perl -w
use strict;

my $output = "course.html";
die "File '$output' exists!" if -e $output;
open OUT, "> $output" or die "Cannot open '$output': $!\n";
print OUT "<html>\n<body>";

defined(my $dir = (getpwnam("donald"))[7]) or die "No such user!";
$dir .= "/.www/comp3710/diary/2003";
die "No such directory '$dir'" if ! -d $dir;
chdir $dir or die "Cannot change to '$dir': $!\n";

undef $/;
for (glob "*/*/index.html") {
    if (! -r) {
        print "File $_ is not readable.";
        last;
    } elsif (! -s) {
        print "File $_ is empty.";
        next;
    }
    open HTML, $_ or die "Cannot open file '$_': $!\n";
    my ($html) = (<HTML> =~ m{<body>(.*)</body>}s);
    close HTML;
    print OUT $html;
}
print OUT "</body>\n</html>";
close OUT;
The script first names a file (course.html) for output. This file will contain all
the HTML course notes. Before the file is opened, we make sure that
it doesn't already exist. To do so, we use the file test operator -e
and specify the filename as the operand. This operator
returns a true value if the specified file exists.
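As a small illustration of the -e test, here is a minimal sketch. The scratch directory comes from the core File::Temp module and the variable names are only assumptions for this example; they are not part of the script above.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical scratch directory so the sketch does not touch real files.
my $tmp    = tempdir(CLEANUP => 1);
my $output = "$tmp/course.html";

# -e is false before the file is created ...
my $before = -e $output ? "exists" : "missing";

open my $out, '>', $output or die "Cannot open '$output': $!\n";
close $out;

# ... and true afterwards.
my $after = -e $output ? "exists" : "missing";

print "before: $before, after: $after\n";
```

The same pattern (test first, then open) is what keeps the script from overwriting an existing course.html.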
The file is then opened for writing through the filehandle OUT
in the code above.
The getpwnam function accepts a user name and returns
an array of ten elements which represent various user attributes (do a
perldoc -f getpwnam for more information), the eighth of which
is the directory. So we can extract the eighth element (index 7) by enclosing the
call to getpwnam in parentheses and applying the array
indexing operator to it.
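The indexing trick works on any parenthesised list, not only on getpwnam. The sketch below shows it on a plain list first and then on getpwnam; the user name "root" is only an assumption (it exists on most Unix systems), so substitute any local account.

```perl
use strict;
use warnings;

# A list slice: index a list directly, without storing it in an array first.
my $third = (qw(a b c d))[2];

# The same idea applied to getpwnam. "root" is an assumed user name;
# element 7 of the returned list is the home directory.
my $user = "root";
my $home = (getpwnam($user))[7];

print "third element: $third\n";
print defined($home) ? "home of $user: $home\n" : "no such user: $user\n";
```

If the user does not exist, getpwnam returns an empty list and the slice yields an undefined value, which is why the script wraps the assignment in defined(...).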
We then append the diary path to the directory name and use the -d
file test on the resulting string to make sure that the
string represents a valid directory on the file system. If so, then
we change to that directory using the chdir perl function.
If changing to the directory fails, then we terminate the script
with an appropriate error message (that includes the contents of the special
variable $!, which indicates why the chdir function failed).
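A minimal sketch of the -d test and of a failing chdir follows. The temporary directory and the name no-such-subdir are hypothetical stand-ins for the diary directory, and the exact wording of the error message depends on the system.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

my $start = getcwd();
my $dir   = tempdir(CLEANUP => 1);      # stand-in for the diary directory
my $bogus = "$dir/no-such-subdir";      # deliberately never created

my $is_dir    = -d $dir   ? 1 : 0;      # exists and is a directory
my $bogus_dir = -d $bogus ? 1 : 0;      # nothing there

# chdir returns false on failure and leaves the reason in $!.
my $error = '';
chdir $bogus or $error = "$!";

# A successful chdir, followed by a return to where we started.
chdir $dir   or die "Cannot change to '$dir': $!\n";
chdir $start or die "Cannot return to '$start': $!\n";

print "is_dir=$is_dir bogus_dir=$bogus_dir error='$error'\n";
```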
If the chdir function succeeds, we then undefine the
input record separator variable (undef $/). The input
record separator determines the delimiter to use when doing input using
the input operator < ... >. The input record separator
is set to newline by default, so the input operator reads
a line at a time, as we have already seen. By undef'ing
the separator, we are telling perl that the input operator should
read everything in one go. This is not always a wise thing to do,
especially if everything consists of a 20 gigabyte file.
However, as we will see below, it makes the extraction of the relevant
HTML portions of our diary files easier.
For more information on the input record separator (and for information
on all of perl's ``special variables''), check out the perlvar
man page.
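The effect of $/ can be seen in a short sketch that reads the same text twice, once a line at a time and once in a single read. The in-memory filehandle and the use of local (a common idiom that restores $/ when the block ends, rather than the plain undef used in the script) are choices made for this example only.

```perl
use strict;
use warnings;

my $text = "line one\nline two\nline three\n";

# Default separator (newline): each read returns one line.
open my $by_line, '<', \$text or die "Cannot open in-memory file: $!\n";
my @lines = <$by_line>;
close $by_line;

# Undefined separator: a single read returns everything. "local" restores
# the previous value of $/ when the enclosing block ends.
my $whole;
{
    local $/;
    open my $slurp, '<', \$text or die "Cannot open in-memory file: $!\n";
    $whole = <$slurp>;
    close $slurp;
}

print scalar(@lines), " lines read one at a time\n";
print length($whole), " characters read in one go\n";
```

Limiting the change to a block means any later line-by-line input in the same program still behaves as usual.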
We then start a for loop, giving it the expression
glob "*/*/index.html". The glob function will
return a list of all the files that match the specified glob pattern.
The first two asterisks will match all the month and day subdirectories
in the diary directory. The index.html,
of course, will match each actual diary file itself. Note that
we could have used the glob pattern:
for (glob "$dir/*/*/index.html") {
for our for loop. This would make the call to chdir
unnecessary, but at the same time would make each string
in the list a fair bit longer.
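The two glob patterns can be compared on a tiny directory tree. The sketch below builds a hypothetical month/day layout in a temporary directory (the names Jan/07, Jan/09 and Feb/04 are invented for the example) and runs both forms.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Temp qw(tempdir);

# Build a tiny hypothetical diary tree: <month>/<day>/index.html
my $start = getcwd();
my $dir   = tempdir(CLEANUP => 1);
for my $path ("Jan/07", "Jan/09", "Feb/04") {
    my ($month, $day) = split m{/}, $path;
    mkdir "$dir/$month" unless -d "$dir/$month";
    mkdir "$dir/$month/$day" or die "Cannot create '$path': $!\n";
    open my $fh, '>', "$dir/$path/index.html" or die "Cannot create file: $!\n";
    close $fh;
}

# Relative pattern: short names, but requires a chdir first.
chdir $dir or die "Cannot change to '$dir': $!\n";
my @relative = glob "*/*/index.html";
chdir $start or die "Cannot return to '$start': $!\n";

# Absolute pattern: no chdir needed, but every name carries the prefix.
my @absolute = glob "$dir/*/*/index.html";

print "$_\n" for @relative;
print scalar(@absolute), " files found via the absolute pattern\n";
```

One caveat with the absolute form: glob splits its argument on whitespace, so this sketch assumes the directory name contains no spaces.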
Inside the loop, we check that each file is readable (-r)
and is of non-zero size (-s); the script negates both tests
with !. The file test operators will use the default variable
when no operand is given,
but you have to be careful when doing this (see p. 162 of S&P).
Note that perl uses elsif and not else if;
the latter actually causes a syntax error in perl. Also note that if
we encounter a file that we cannot read, we print out an error message
and terminate the for loop using the last
operator. This operator is similar to break in C and C++.
If, however, the file is empty, the next operator is used,
which will skip over the rest of the loop body and move to the next file
in the globbed list.
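The control flow of the loop can be tried out in isolation. In the sketch below the file names are invented, the file tests are given an explicit operand instead of relying on the default variable, and a file that was never created stands in for an unreadable one.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);        # hypothetical scratch directory

# Create four files; b.html is left empty and missing.html is never created.
my %content = ('a.html' => "A\n", 'b.html' => "", 'c.html' => "C\n", 'd.html' => "D\n");
for my $name (keys %content) {
    open my $fh, '>', "$dir/$name" or die "Cannot create '$name': $!\n";
    print {$fh} $content{$name};
    close $fh;
}

my @kept;
for my $name (qw(a.html b.html c.html missing.html d.html)) {
    my $file = "$dir/$name";
    if (! -r $file) {
        print "File $name is not readable.\n";
        last;                           # leave the loop, like break in C
    }
    elsif (! -s $file) {
        print "File $name is empty.\n";
        next;                           # skip to the next name
    }
    push @kept, $name;
}

print "kept: @kept\n";
```

Because last ends the loop at missing.html, d.html is never examined even though it is a perfectly good file; next only skips the one empty file.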
We then open the file for reading. (Note that open does
not use $_ by default, so we have to specify it explicitly.)
If the open succeeds, then we execute the somewhat cryptic command:
my ($html) = (<HTML> =~ m{<body>(.*)</body>}s);
We can break this statement down into a few steps:
<HTML> reads from the HTML filehandle. Because we undef'ed the
input record separator, this will read the entire contents of the file.
The input is then matched against the regular expression
m{<body>(.*)</body>}s. Note the s
option modifier for the regular expression.
This will cause the . character to match any character
including the newline character. (Remember that dot does
not match newline by default.)
my ($html) = result of match
Because the variable $html is enclosed in parentheses, the
assignment is going to take place in array context. When interpreted
in array context, the result of a match will be a list containing all
the substrings that were matched by the ``capturing'' parentheses in
the regular expression. When interpreted in scalar context, the result
of the match would be simply true or false. Because we want to get
the substring matched by the (.*) regular expression, we
must do this assignment in array context. As a result of this
assignment, $html will be equal to the text delimited by
the <body> ... </body> tags in
the HTML file. If we instead had written:
my $html = result of match
then $html would be set to 1 (assuming the match succeeds) and the generated
output would not be what we expected.
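The steps above can be checked with a short sketch that runs the same pattern three ways: without the s modifier, with it in array context, and with it in scalar context. The sample page text and the variable names are made up for the example.

```perl
use strict;
use warnings;

my $page = "<html>\n<body>\n<p>Diary entry</p>\n</body>\n</html>\n";

# Without /s the dot stops at newlines, so the pattern cannot span lines.
my ($no_s) = ($page =~ m{<body>(.*)</body>});

# With /s, in array (list) context: the captured substring.
my ($in_list) = ($page =~ m{<body>(.*)</body>}s);

# With /s, in scalar context: only a success flag.
my $in_scalar = ($page =~ m{<body>(.*)</body>}s);

print "without /s: ", (defined $no_s ? "match" : "no match"), "\n";
print "list context: [$in_list]\n";
print "scalar context: [$in_scalar]\n";
```

Only the parentheses around the variable on the left-hand side differ between the last two assignments, which is why this mistake is easy to make and hard to spot.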
At the end of the for loop body, we close
the HTML file and print its extracted body to the output file.
After the for loop, we print the closing </body> and </html> tags
and close the output file.
Last modified: Fri Apr 4 15:31:09 2003