Consider the following problem: we have a collection of numerous files (they could be MP3s, spreadsheet files, telemetry data, etc.) spread across multiple nested subdirectories of some top-level directory, and we want to copy all of these files to a new directory hierarchy. The new directory hierarchy is relatively flat: there are only 26 directories, named A, B, ... Z. The directory to which a file is copied depends upon the first alphabetic letter of the file's name. Therefore, the file named some_music.mp3 would be copied to the directory named S, and the file 03_Hello.mp3 would be copied to the directory named H.
This problem can be solved with the Perl script given below. The script does some minor directory management and also invokes an external process (namely, the find command) to perform its task. It also demonstrates the use of a couple of routines for filename manipulation provided by Perl's File::Basename and File::Spec modules.
    #!/usr/bin/perl -w
    use strict;
    use File::Basename;
    use File::Spec;

    my $pattern = '\.mp3$';

    defined(my $topdir = shift @ARGV) or die "Must specify directory";
    die "Must specify absolute path!\n" if substr($topdir, 0, 1) ne "/";

    for (split /\x0/, `find $topdir -type f -print0`) {
        my $basename = basename $_;
        if ($basename !~ /$pattern/i) {
            warn "File not matched '$_'!\n";
            next;
        }
        my ($first) = ($basename =~ m/([a-z])/i);
        if (!defined $first) {
            warn "No alphabetic character to use for file '$_'!\n";
            next;
        }
        my $dir = uc $first;
        if (! -d $dir) {
            mkdir $dir, 0700 or die "Cannot make directory! '$dir': $!\n"
        }
        my $fname = File::Spec->catfile($dir, $basename);
        link $_, $fname or die "Cannot create link from '$_' to '$fname': $!\n";
    }

flatten.pl
The main points of the above script are presented below:

The script begins by importing the File::Basename and File::Spec modules. These two modules are used to manipulate file names in a portable fashion. To import a module, we write:

    use Module-Name;

We will describe how to invoke the module functions later. (As an aside, note that use strict; is not a module per se; it is a pragma, which provides additional information about how Perl should compile the script.)
The directory to be processed is taken from the command line and stored in $topdir. Note that we require that the directory be specified in absolute (as opposed to relative) terms. Therefore, invoking the script as:

    $ ./flatten.pl /home/donald/my_dir

would be okay. But invoking it as:

    $ ./flatten.pl ../../my_dir

would be detected as an error. To determine whether or not an absolute path name has been specified on the command line, we check the first character of $topdir by using the substr function. In this form of the substr invocation, the function takes a scalar, an offset and a length. The offset begins at 0 and we take a length of one character. If the character returned from the substr function is not /, then we terminate. (We are using the UNIX-specific directory separator here, so this means that our script will not be entirely portable.)
Note that there are other ways to do this test. For example, we could have written:

    die "Must specify absolute path!\n" if $topdir !~ m{^/};

But using the substr function gives me an excuse to introduce another Perl function (which is talked about in Chapter 15).
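As a further variation on this test, the File::Spec module that the script already loads offers a file_name_is_absolute method, which applies the rules of whatever operating system the script is running on. The following is only a sketch of that alternative; the helper name is_absolute is made up for illustration and is not part of flatten.pl.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of flatten.pl): true if $path is an
# absolute path according to the conventions of the current platform.
sub is_absolute {
    my ($path) = @_;
    return File::Spec->file_name_is_absolute($path) ? 1 : 0;
}

for my $path ('/home/donald/my_dir', '../../my_dir') {
    print "$path: ", (is_absolute($path) ? "absolute" : "relative"), "\n";
}
```

On a UNIX machine the first path should be reported as absolute and the second as relative.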
Next comes the for loop header. This for loop will iterate over all files in and below the directory given on the command line. To generate the listing of all the files, the Perl script relies on the UNIX find command. In order to execute this external program and grab its output, the Perl script uses the backquote operator ` (don't confuse this with the forward quote; your browser may not display the backquote correctly, so be careful). As we saw in the last class, when the invocation of an external program is enclosed in backquotes, Perl will run the program and return its standard output as the result. Therefore, the expression:

    `find $topdir -type f -print0`

will invoke the find command, whose output is returned as a single string containing the names of all the files in and below the directory given by $topdir.
The -print0 option will cause each of the file names to be terminated by a nul byte. This is important because the directory may contain files that have spaces in their names. If the find command returned a string of filenames separated by spaces and some of the filenames had spaces, then we could not use a space as a delimiter when trying to parse the file names. We use the split function with a regular expression that consists of just the nul byte \x0 to break the string up into its constituent filenames. Therefore, on each iteration of the loop, $_ will be set in turn to each (fully specified) filename in the directory.
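To see the splitting step in isolation, here is a small sketch that uses a hand-written string in place of real find output. The file names are invented for the example.

```perl
use strict;
use warnings;

# Simulated output of `find ... -print0`: every name, including the
# last one, is followed by a nul byte. (These names are made up.)
my $output = "/tmp/music/a song.mp3\x00/tmp/music/03_Hello.mp3\x00";

# split discards trailing empty fields, so the final nul byte does
# not produce an empty file name at the end of the list.
my @files = split /\x0/, $output;

print scalar(@files), " files\n";
print "[$_]\n" for @files;
```

The name containing a space survives as a single element because only the nul byte is treated as a delimiter.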
Instead of using the -print0 option to find, we could have just invoked the program in list context. For example, if we had written the above for loop as:

    chomp(my @files = `find $topdir -type f -print`);
    for (@files) { ... }

then each line of output generated by the find command would represent a file name (including any spaces in the filenames). Each of these lines would then be stored as an element in the @files array, which would then be chomped to get rid of the newlines. We can then iterate over this array. Unfortunately, if a file name happened to have a newline in it, then this approach would not work, as that file name would straddle more than one array element. Therefore, although using the split function and the -print0 option to find is more cryptic, it is also more robust, as it will successfully deal with filenames that have newlines or spaces in them.
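A different technique from the one used in flatten.pl is to avoid the external find command altogether and walk the directory tree with Perl's standard File::Find module. No delimiter is needed at all, and the directory name never passes through a shell. The sketch below assumes a UNIX-like system (it creates a file with a newline in its name); list_files is a hypothetical helper name.

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper: return the full names of all plain files at or
# below $topdir, roughly what `find $topdir -type f` would list.
sub list_files {
    my ($topdir) = @_;
    my @files;
    # Inside the callback, $_ is the entry's name within the current
    # directory and $File::Find::name is its full path.
    find(sub { push @files, $File::Find::name if -f $_ && !-l $_ }, $topdir);
    return @files;
}

# Demonstration on a scratch directory that is removed at exit.
my $dir = tempdir(CLEANUP => 1);
my $odd = "$dir/odd\nname.mp3";          # file name containing a newline
open(my $fh, '>', $odd) or die "Cannot create '$odd': $!\n";
close $fh;

my @found = list_files($dir);
print scalar(@found), " file(s) found\n";
```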
Another way to invoke external system commands is to use the system function. For example, writing:

    system("ls -l");

will run the ls -l command and display the output on STDOUT. If the command executes correctly, zero will be returned as a result. Another function called exec can also be used. The difference is that exec replaces the running Perl process with the new program. In other words, the call to exec, if successful, never returns because there is no longer anything to return to. Most often, you will want to use system or backquotes to run external processes from within Perl.
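The value returned by system is not the exit code itself: it is the wait status, with the exit code stored in the high byte. The sketch below shows one way to unpack it. So that the example does not depend on any particular external command, it runs a second copy of perl (whose path is in $^X) that simply exits with code 3.

```perl
use strict;
use warnings;

# $^X is the path of the running perl interpreter; it stands in here
# for an arbitrary external command. The list form of system bypasses
# the shell.
my $status = system($^X, '-e', 'exit 3');

# -1 means the command could not be started at all.
die "Could not run command: $!\n" if $status == -1;

# The exit code of the child is in the high byte of the status.
my $exit_code = $status >> 8;
print "exit code: $exit_code\n";
```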
Inside the for loop, we then use the basename function on the current absolute filename that we are examining. This function is exported by the File::Basename module and we can invoke it just like any other Perl function. This function will determine the filename portion of its argument. For example, if $_ is set to /home/donald/my_files/songs/a_song.mp3, then basename $_ will return a_song.mp3. Note that the function will not use $_ by default, so we have to specify its argument explicitly.
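For reference, File::Basename also exports dirname and fileparse, which are not used in flatten.pl. The following sketch applies all three to the path from the example above.

```perl
use strict;
use warnings;
use File::Basename;

my $path = '/home/donald/my_files/songs/a_song.mp3';

my $file = basename($path);   # the filename portion
my $dir  = dirname($path);    # everything before the filename

# fileparse can additionally split off a suffix that matches a pattern.
my ($name, $dirs, $suffix) = fileparse($path, qr/\.[^.]*/);

print "$file\n";
print "$dir\n";
print "$name $suffix\n";
```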
If the base name does not match the pattern stored in $pattern, we use the warn function to display an appropriate warning and then continue to the next filename. warn is similar to die in that it will display its argument to STDERR; however, unlike die, warn will not terminate the program. When testing the regular expression, we use the !~ binding operator. This operator is similar to the =~ binding operator, except that when used in scalar context, it will return true if the pattern doesn't match and false otherwise.
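The filtering step can be tried on its own with a couple of invented file names. This is only a sketch; the @skipped array is added here purely so that the effect can be inspected afterwards.

```perl
use strict;
use warnings;

my $pattern = '\.mp3$';
my @skipped;   # illustration only: remembers the names that were rejected

for my $name ('a_song.MP3', 'notes.txt') {
    # !~ is true when the pattern does NOT match; /i ignores case.
    if ($name !~ /$pattern/i) {
        warn "File not matched '$name'!\n";   # goes to STDERR, loop continues
        push @skipped, $name;
        next;
    }
    print "matched $name\n";
}
```

Because of the /i modifier, a_song.MP3 is accepted even though its extension is in upper case.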
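Next, we capture the first alphabetic character of the base name. Note that if there are no letters in the name before the extension, the match will pick up the m from the mp3 extension. Therefore, a file with the name 00101001101.mp3 would be filed in the M directory.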
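The capture can be wrapped in a small function to check how different names are filed. This is a sketch; target_dir is a hypothetical helper that mirrors the two relevant lines of flatten.pl.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring flatten.pl: return the upper-cased first
# alphabetic character of $basename, or undef if there is none.
sub target_dir {
    my ($basename) = @_;
    # In list context, the match returns the captured group.
    my ($first) = ($basename =~ m/([a-z])/i);
    return defined($first) ? uc($first) : undef;
}

for my $name ('some_music.mp3', '03_Hello.mp3', '00101001101.mp3', '12345') {
    my $dir = target_dir($name);
    printf "%-16s -> %s\n", $name, defined($dir) ? $dir : '(none)';
}
```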
The captured letter is converted to upper case using the uc function. This will be the directory name to which we must copy the file currently being examined. Before we can actually copy the file, we must make sure that the destination directory actually exists. As we saw in the last class, we can use the directory test operator -d to do this. If the directory does not exist, we can create it as follows:

    mkdir $dir, 0700 or die "Cannot make directory! '$dir': $!\n"

The mkdir function takes a directory name and (optionally) a permission mask and attempts to create the directory with the specified permission. If it fails to create the directory for some reason, we will terminate with an appropriate error message which uses the special variable $!. (Remember that this variable contains a string indicating why the previously called Perl function failed.)
Directory permissions are usually specified in octal, so we prefix the permission mask with a zero. We are setting the directory permissions to be read/write/execute for the 'user' and blank for 'group' and 'other'. If we had the permissions stored as a string of octal digits in a scalar variable, then we would have to use the oct function when specifying the permission mask. Otherwise the contents of the scalar variable would be treated as decimal.
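The difference between an octal literal and a string of octal digits can be seen in the following sketch. The variable names are chosen just for this example.

```perl
use strict;
use warnings;

my $literal = 0700;           # octal literal in the source: decimal 448
my $string  = "0700";         # e.g. read from a file or the command line
my $wrong   = $string + 0;    # numeric conversion reads the digits as decimal
my $right   = oct($string);   # oct interprets the digits as octal

printf "%d %d %d\n", $literal, $wrong, $right;   # prints "448 700 448"
printf "%04o\n", $right;                         # prints "0700"
```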
Other file/directory manipulation functions offered by Perl include chmod and chown, which change the permission (mode) of a file and the owner of a file, respectively. Of course, these functions will fail if you attempt to use them on files for which you do not have appropriate access permissions.
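As a brief sketch of chmod, the following creates a scratch file with the standard File::Temp module, restricts it to read/write for the 'user' only, and reads the mode back with stat. It assumes a UNIX-like system; on other platforms the permission bits may be reported differently.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file that is removed automatically when the script exits.
my ($fh, $file) = tempfile(UNLINK => 1);
close $fh;

# chmod returns the number of files it successfully changed.
chmod(0600, $file) == 1 or die "Cannot chmod '$file': $!\n";

# Field 2 of stat holds the mode; mask off the file type bits.
my $mode = (stat $file)[2] & 07777;
printf "%04o\n", $mode;
```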
The destination file name is then constructed using the File::Spec module. Unlike the File::Basename module, the File::Spec module is object-oriented, meaning that we must invoke its functions (or more correctly its methods) a bit differently. We use the notation File::Spec->catfile to invoke the catfile method from the File::Spec module. The catfile method will take its list of arguments and join them together using an appropriate directory separator (which is / on UNIX machines). For example, if $basename was set to song.mp3, then $fname will be set to S/song.mp3.
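A quick sketch of catfile with two and with three components follows. The directory name music is invented for the example, and the results noted in the comments are what a UNIX machine would produce.

```perl
use strict;
use warnings;
use File::Spec;

# Class-method call syntax: Module->method(arguments).
my $fname = File::Spec->catfile('S', 'song.mp3');            # S/song.mp3 on UNIX
my $deep  = File::Spec->catfile('music', 'S', 'song.mp3');   # music/S/song.mp3 on UNIX

print "$fname\n$deep\n";
```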
Finally, we 'copy' the file using the link command. This command takes two file names, the original name and the new name, and creates a new directory entry with the new name which acts as a 'copy' of the original file. Unlike a traditional file copy, however, no extra space is consumed on the filesystem (space for the new directory entry notwithstanding). Instead, the new name and original name essentially refer to the exact same file on the filesystem. The UNIX file system maintains a reference count for each file on the file system. When we create a link, the reference count is increased. When we remove a file, the reference count is decreased. The file is only removed from the file system when the reference count reaches zero. Therefore, if we were to erase the old file, the new file would still exist, but its reference count would be decreased. If we then remove the new file, then the file will actually be removed from the filesystem.
This is what's known in the UNIX world as a hard link. Hard links cannot span different file systems (partitions), so UNIX also has the concept of a soft (or symbolic) link. Unfortunately, with symbolic links, if you remove (or rename) the original file, then any symbolic links pointing to it will be invalid and essentially unusable (this is similar to dangling pointers in C and C++).
Again, if the link command fails (for example, if the destination file already exists), then we terminate the script with an appropriate error message.
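The reference counting described above can be observed directly, since the link count is field 3 of the list returned by stat. The sketch below works in a scratch directory created with File::Temp and assumes a UNIX-like file system that supports hard links; the file names are invented for the example.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);   # scratch directory, removed at exit
my $orig = File::Spec->catfile($dir, 'original.txt');
my $new  = File::Spec->catfile($dir, 'linked.txt');

open(my $fh, '>', $orig) or die "Cannot create '$orig': $!\n";
print {$fh} "hello\n";
close $fh or die "Cannot close '$orig': $!\n";

# Create the hard link; both names now refer to the same file.
link $orig, $new or die "Cannot create link from '$orig' to '$new': $!\n";
my $count_after_link = (stat $new)[3];     # expected to be 2

# Removing the original name only decreases the count.
unlink $orig or die "Cannot remove '$orig': $!\n";
my $count_after_unlink = (stat $new)[3];   # expected to be 1

# The data is still reachable through the remaining name.
open($fh, '<', $new) or die "Cannot open '$new': $!\n";
my $content = <$fh>;
close $fh;

print "$count_after_link $count_after_unlink $content";
```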
Last modified: April 5, 2004 15:53:04 NDT (Monday)