Consider the following problem: we have a large collection of files
(they could be mp3's, spreadsheet files, telemetry data etc.) spread
across multiple nested subdirectories of some top-level directory,
and we want to copy all these files to a new directory hierarchy. The new
directory hierarchy is relatively flat: there are only 26 directories,
named A, B ... Z. The directory
to which a file is copied depends upon the first alphabetic letter of
the file's name. Therefore, the file named some_music.mp3
would be copied to the directory named S, and the file
03_Hello.mp3 would be copied to the directory named H.
This problem can be solved with the perl script given below. The
script does some minor directory management and also invokes
an external process (namely, the find command) to perform
its task. It also demonstrates a couple of routines
for filename manipulation provided by perl's File::Basename and
File::Spec modules.
#!/usr/bin/perl -w
use strict;
use File::Basename;
use File::Spec;

my $pattern = '\.mp3$';
defined(my $topdir = shift @ARGV) or die "Must specify directory";
die "Must specify absolute path!\n" if substr($topdir, 0, 1) ne "/";

for (split /\x0/, `find $topdir -type f -print0`) {
    my $basename = basename $_;
    if ($basename !~ /$pattern/i) {
        warn "File not matched '$_'!\n";
        next;
    }
    my ($first) = ($basename =~ m/([a-z])/i);
    if (!defined $first) {
        warn "No alphabetic character to use for file '$_'!\n";
        next;
    }
    my $dir = uc $first;
    if (! -d $dir) {
        mkdir $dir, 0700 or die "Cannot make directory! '$dir': $!\n"
    }
    my $fname = File::Spec->catfile($dir, $basename);
    link $_, $fname or die "Cannot create link from '$_' to '$fname': $!\n";
}
The main points of the above script can be summarized as follows:
The script begins by importing the File::Basename and File::Spec modules. These
two modules are used to manipulate file names in a portable fashion.
To import a module, we write use Module-Name.
We will describe how to invoke functions in modules later.
(As an aside, note that use strict; is not an ordinary module;
it is a pragma, which provides additional information about
how perl should compile the script.)
Next, the top-level directory is taken from the command line and stored
in $topdir. If no directory is given, the script terminates with an error.
Note that we require that the directory be specified in absolute (as
opposed to relative) terms. Therefore, invoking the script as:
$ ./flatten.pl /home/donald/my_dir
would be okay. But invoking it as
$ ./flatten.pl ../../my_dir
would be detected as an error. To determine whether or not an
absolute path name has been specified on the command line, we check the
first character of $topdir by using the substr function. In this form
of the substr invocation, the function takes a scalar, an offset and
a length. Offsets begin at 0 and we take a length of one character.
If the character returned from the substr function is not /, then we
terminate.
(We are using the UNIX-specific directory separator here, so this means
that our script will not be entirely portable.)
Note that there are other ways to do this test. For example, we could have written:
die "Must specify absolute path!\n" if $topdir !~ m{^/};
But using the substr function gives me an excuse to introduce
another perl function (which is talked about in Chapter 15).
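The two tests above can be compared side by side. The following is a minimal sketch: the helper names is_absolute_substr and is_absolute_regex are made up for illustration and do not appear in the script. The call to File::Spec->file_name_is_absolute is not used by the original script either; it is shown only as one possible way around the UNIX-specific separator.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: take one character starting at offset 0.
sub is_absolute_substr {
    my ($path) = @_;
    return substr($path, 0, 1) eq '/' ? 1 : 0;
}

# Hypothetical helper: regular expression anchored at the start of the string.
sub is_absolute_regex {
    my ($path) = @_;
    return $path =~ m{^/} ? 1 : 0;
}

for my $path ('/home/donald/my_dir', '../../my_dir') {
    printf "%-22s substr=%d regex=%d File::Spec=%d\n",
        $path,
        is_absolute_substr($path),
        is_absolute_regex($path),
        File::Spec->file_name_is_absolute($path) ? 1 : 0;
}
```

On a UNIX machine all three tests should agree for these two paths.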
Next comes the for loop header.
This for loop will iterate over all files under the directory
given on the command line. To generate the listing of all the files,
the perl script relies on the UNIX find command. In order
to execute this external program and grab its output, the perl script
uses the backquote operator ` (don't confuse this with the
forward quote. Your browser may not display the backquote correctly, so
be careful). When the invocation of an external program is enclosed in
backquotes, perl will run the program and return its standard
output as the result. Therefore, the expression:
`find $topdir -type f -print0`
will invoke the find command, which will return a string
containing the names of all the files under the directory given by
$topdir (including those in nested subdirectories).
The -print0 option causes each of the file names to be
terminated by a nul byte. This is important because the directory
may contain files that have spaces in their names. If the find
command returned a string of filenames separated by spaces and some
of the filenames had spaces, then we could not use a space as a delimiter
when trying to parse the file names. We use the split
function, with a regular expression that consists of just the
nul byte \x0, to break the string up into its corresponding
file names. Therefore, on each iteration of the loop, $_
will be set to the next (fully specified) filename in the directory.
Instead of using the -print0 option to find,
we could have just invoked the program in list context. For example,
if we had written the above for loop as:
chomp(my @files = `find $topdir -type f -print`);
for (@files) { ... }
then each line of output generated by the find command
would represent a file name (including any spaces in the filenames).
Each of these lines would then be stored as an element in the
@files array, which would then be chomped
to get rid of the newlines. We can then iterate over this array.
Unfortunately, if a file name happened to have a newline in its name,
then this approach will not work, as that file name will straddle more
than one array element. Therefore, although using the split
function and the -print0 option to find is more
cryptic, it is also more robust, as it will successfully deal with a file
name that has a newline character in it.
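The difference between the two approaches can be seen without running find at all. The sketch below uses two made-up strings that stand in for what find would print with -print0 and with -print; the file names are hypothetical, and one of them deliberately contains a newline.

```perl
use strict;
use warnings;

# Stand-in for `find ... -print0` output: every name ends with a nul byte.
my $nul_output  = "/music/a song.mp3\0/music/odd\nname.mp3\0";
# Stand-in for `find ... -print` output: every name ends with a newline.
my $line_output = "/music/a song.mp3\n/music/odd\nname.mp3\n";

# Splitting on the nul byte keeps each name whole.
my @by_nul  = split /\x0/, $nul_output;
# Splitting on newlines (the effect of list context plus chomp)
# cuts the second name into two pieces.
my @by_line = split /\n/, $line_output;

printf "nul-separated:     %d names\n", scalar @by_nul;
printf "newline-separated: %d names\n", scalar @by_line;
```

There are only two files here, so the count of three from the newline-based split shows the name that straddles two array elements.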
Another way to invoke external system commands is to use the
system function. For example, system("ls -l")
will run the ls -l command and display the output
on STDOUT. If the command executes correctly, zero will be returned
as a result. Another function called exec can also
be used. The difference is that exec replaces the running perl
process with the program. In other words,
the call to exec, if successful, never returns because
there is no longer anything to return to. Most often, you will
want to use system or backquotes to run external processes
from within perl.
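As a small illustration of the return value of system, the sketch below runs the perl interpreter itself as the external program (via the special variable $^X), purely so that the example does not depend on which commands are installed. The child programs 'exit 0' and 'exit 3' are made up for the demonstration.

```perl
use strict;
use warnings;

# $^X holds the path of the running perl interpreter. Passing a list to
# system runs the program directly, without going through a shell.
my $ok_status = system($^X, '-e', 'exit 0');
print "first command succeeded\n" if $ok_status == 0;

# A non-zero result means failure; the child's exit code is stored in
# the high byte of the returned status.
my $bad_status = system($^X, '-e', 'exit 3');
my $exit_code  = $bad_status >> 8;
print "second command failed with exit code $exit_code\n";

# exec is not called here: if it succeeded, nothing after it would run.
```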
Inside the for loop, we then use the
basename function on the current absolute filename
that we are examining. This function is imported from the
File::Basename module and we can invoke it just like
any other perl function. It returns the filename
portion of its argument. For example, if $_ is set to
/home/donald/my_files/songs/a_song.mp3, then basename $_
will return a_song.mp3. Note that the function
will not use $_ by default, so we have to specify its
argument explicitly.
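For reference, here is basename applied to the path from the example above, together with two other functions that File::Basename exports, dirname and fileparse. The script itself only uses basename; the other two are shown as a sketch of what else the module offers, and the results in the comments assume a UNIX machine.

```perl
use strict;
use warnings;
use File::Basename;

my $path = '/home/donald/my_files/songs/a_song.mp3';

my $base = basename($path);   # a_song.mp3
my $dir  = dirname($path);    # /home/donald/my_files/songs

# fileparse splits the path into name, directory and suffix; the second
# argument is a pattern describing what counts as a suffix.
my ($name, $where, $suffix) = fileparse($path, qr/\.[^.]*/);

print "basename: $base\n";
print "dirname:  $dir\n";
print "name '$name', directory '$where', suffix '$suffix'\n";
```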
If the base name does not match the pattern, we use the
warn function to display an appropriate
warning and then continue to the next filename. warn
is similar to die in that it will display its argument to
STDERR; however, unlike die, warn will not terminate the program.
When testing the regular expression, we use the !~ binding
operator. This operator is similar to the =~ binding
operator, except that when used in scalar context, it will return true
if the pattern doesn't match and false otherwise.
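The filtering step can be tried on its own with a short list of names. In this sketch the list @names is hypothetical and replaces the output of find; the pattern and the warning message are the ones used in the script.

```perl
use strict;
use warnings;

my $pattern = '\.mp3$';
my @names   = ('a_song.mp3', 'notes.txt', 'LOUD.MP3');   # hypothetical names
my @matched;

for my $name (@names) {
    # !~ is true when the pattern does NOT match.
    if ($name !~ /$pattern/i) {
        warn "File not matched '$name'!\n";   # goes to STDERR, loop continues
        next;
    }
    push @matched, $name;
}

print "Matched: @matched\n";
```

Because of the /i modifier, LOUD.MP3 is accepted as well, and only notes.txt triggers the warning.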
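Next, we capture the first alphabetic character of the base name. If the
name contains no letters before the extension, the first letter found
will be the m from the mp3 extension.
Therefore, a file with the name 00101001101.mp3 would be
filed in the M directory.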
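The letter selection can be checked against the file names mentioned so far. The sketch below wraps the script's regular expression in a helper called first_letter (a name invented for this example) and converts the result to uppercase.

```perl
use strict;
use warnings;

# Hypothetical helper: return the uppercased first alphabetic character
# of a name, or undef if the name contains no letters at all.
sub first_letter {
    my ($basename) = @_;
    # In list context, a successful match returns the captured text.
    my ($first) = ($basename =~ m/([a-z])/i);
    return defined $first ? uc $first : undef;
}

for my $name ('some_music.mp3', '03_Hello.mp3', '00101001101.mp3') {
    printf "%-16s -> %s\n", $name, first_letter($name);
}
```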
We then convert this letter to uppercase using the
uc function. This will be the name of the directory
to which we must copy the file currently being examined. Before
we can actually copy the file, we must make sure that the destination
directory actually exists. As we saw in the last class, we can
use the directory test operator -d to do this. If
the directory does not exist, we can create it as follows:
mkdir $dir, 0700 or die "Cannot make directory! '$dir': $!\n"
The mkdir function takes a directory name and (optionally)
a permission mask and attempts to create the directory with the
specified permissions. If it fails to create the directory for some
reason, we terminate with an appropriate error message
(which uses the special variable $!, which contains
a string indicating why the mkdir function failed).
Directory permissions are usually specified in octal, so we prefix
the permission mask with a zero. We are setting the directory
permissions to be read/write/execute for the 'user' and blank for
'group' and 'other'. If we had the permissions stored as a string of
octal digits in a scalar variable, then we would have to use the oct
function when specifying the permission mask. Otherwise the contents
of the scalar variable would be treated as decimal.
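To make the point about oct concrete, the sketch below stores the mask as the string '0700' (as it might arrive from a configuration file, which is an assumption of this example) and creates a directory named S inside a scratch directory provided by File::Temp, so that nothing is left behind afterwards.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

my $mode_string = '0700';             # hypothetical: permissions held as text
my $mode        = oct($mode_string);  # interpret the digits as octal

printf "oct('%s') is %d in decimal, %04o in octal\n",
    $mode_string, $mode, $mode;
printf "without oct, the same string is the number %d\n", $mode_string + 0;

# Work inside a temporary directory that is removed when the script exits.
my $scratch = tempdir(CLEANUP => 1);
my $dir     = File::Spec->catdir($scratch, 'S');
if (! -d $dir) {
    mkdir $dir, $mode or die "Cannot make directory '$dir': $!\n";
}
print "created $dir\n" if -d $dir;
```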
Other file/directory manipulation functions offered by perl
include chmod and chown, which change
the permissions (mode) of a file and the owner of a file, respectively.
Of course, these functions will fail if you attempt to use them on files
for which you do not have appropriate access permissions.
The destination file name is then built using the
File::Spec module. Unlike the File::Basename module, the
File::Spec module is object-oriented, meaning that
we must invoke its functions (or more correctly, methods)
a bit differently. We use the notation File::Spec->catfile
to invoke the catfile method from the File::Spec
module. The catfile method will take its list of arguments
and join them together using an appropriate directory separator
(which is / on UNIX machines).
For example, if $basename was set to song.mp3,
then $fname will be set to S/song.mp3.
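A quick way to see catfile at work is to call it directly. In the sketch below the second call, with an extra music component, is not something the script does; it is included only to show that the method accepts any number of path components. The results in the comments assume a UNIX machine.

```perl
use strict;
use warnings;
use File::Spec;

# Class-method call: note the -> between the module name and the method.
my $fname = File::Spec->catfile('S', 'song.mp3');            # S/song.mp3
my $deep  = File::Spec->catfile('music', 'S', 'song.mp3');   # music/S/song.mp3

print "$fname\n";
print "$deep\n";
```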
Finally, we 'copy' the file using the
link function. This function takes
two file names, an old name and a new name, and creates a new entry with
the new name which is a 'copy' of the source file. Unlike a traditional file
copy, however, no extra space is consumed on the filesystem. Instead,
the new name and old name essentially refer to the exact same file on the
filesystem. The UNIX file system maintains a reference count for each
file on the file system. When we create a link, the reference count
is increased. When we remove a file, the reference count is decreased.
The file is only removed when the reference count equals zero. Therefore,
if we were to erase the old file, the new file would still exist, but
its reference count would be decreased. If we then remove the new file,
then the file will actually be removed from the filesystem.
This is what's known in the UNIX world as a hard link. Hard links cannot
span different filesystems (partitions), so UNIX also has the concept of
a soft (or symbolic) link. The drawback of symbolic links is that if you
remove (or rename) the original file, then any symbolic links pointing to
it become invalid and essentially unusable (this is similar to dangling
pointers in C and C++).
Again, if the link function fails (for example, if the
destination file already exists), then we terminate the script with an
appropriate error message.
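The reference counting described above can be observed from perl, since the fourth element of the list returned by stat is the number of hard links to a file. The sketch below works in a scratch directory from File::Temp; the file names old_name.mp3 and new_name.mp3 are made up, and it assumes a filesystem that supports hard links.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

my $scratch = tempdir(CLEANUP => 1);
my $old = File::Spec->catfile($scratch, 'old_name.mp3');
my $new = File::Spec->catfile($scratch, 'new_name.mp3');

# Create a small file to link to.
open(my $out, '>', $old) or die "Cannot create '$old': $!\n";
print {$out} "some data\n";
close($out) or die "Cannot close '$old': $!\n";

link $old, $new or die "Cannot create link from '$old' to '$new': $!\n";
my $count_after_link = (stat $new)[3];      # two names, one file

# A second attempt fails because the destination already exists.
my $second_link_ok = link($old, $new) ? 1 : 0;

unlink $old or die "Cannot remove '$old': $!\n";
my $count_after_unlink = (stat $new)[3];    # one name left

# The data is still reachable through the remaining name.
open(my $in, '<', $new) or die "Cannot open '$new': $!\n";
my $contents = <$in>;
close($in);

print "links after link: $count_after_link\n";
print "links after unlink: $count_after_unlink\n";
print "contents: $contents";
```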
Last modified: Tue Apr 8 00:10:38 2003