Consider the following problem -- we have a collection of numerous files
(they could be MP3s, spreadsheets, telemetry data, etc.) spread
across multiple nested subdirectories of some top-level directory,
and we want to copy all these files to a new directory hierarchy. The new
directory hierarchy is relatively flat -- there are only 26 directories,
named A, B ... Z. The directory
to which a file is copied depends upon the first alphabetic letter of
the file's name. Therefore, the file named some_music.mp3
would be copied to the directory named S and the file
03_Hello.mp3 would be copied to the directory named
H.
This problem can be solved with the perl script given below. The
script does some minor directory management and also invokes
an external process (namely, the find command) to perform
its task. The code also demonstrates a couple of routines
for filename manipulation provided by perl's File::Basename and
File::Spec modules.
#!/usr/bin/perl -w
use strict;
use File::Basename;
use File::Spec;
my $pattern = '\.mp3$';
defined(my $topdir = shift @ARGV) or die "Must specify directory";
die "Must specify absolute path!\n" if substr($topdir, 0, 1) ne "/";
for (split /\x0/, `find $topdir -type f -print0`) {
    my $basename = basename $_;
    if ($basename !~ /$pattern/i) {
        warn "File not matched '$_'!\n";
        next;
    }
    my ($first) = ($basename =~ m/([a-z])/i);
    if (!defined $first) {
        warn "No alphabetic character to use for file '$_'!\n";
        next;
    }
    my $dir = uc $first;
    if (! -d $dir) {
        mkdir $dir, 0700 or die "Cannot make directory! '$dir': $!\n";
    }
    my $fname = File::Spec->catfile($dir, $basename);
    link $_, $fname or die "Cannot create link from '$_' to '$fname': $!\n";
}
The main points of the above script can be summarized as follows:
We import the File::Basename and File::Spec modules. These
two modules are used to manipulate file names in a portable fashion.
To import a module, we write use Module::Name;.
We will describe how to invoke functions in modules later.
(As an aside, note that use strict; does not load an ordinary module;
strict is a pragma, which provides additional information about
how perl should compile the script.)
We take the top-level directory from the command line and store it in
$topdir.
Note that we require that the directory be specified in absolute (as
opposed to relative) terms. Therefore, invoking the script as:
$ ./flatten.pl /home/donald/my_dir
would be okay. But invoking it as
$ ./flatten.pl ../../my_dir
would be detected as an error. Note that to determine whether or not an
absolute path name has been specified on the command line, we check the
first character of $topdir by using the substr
function. In this form of the substr invocation, the function
is taking a scalar, an offset and a length. The offset begins at 0
and we take a length of one character. If the character returned from
the substr function is not /, then we terminate.
(We are using the UNIX-specific directory separator here, so this means
that our script will not be entirely portable.)
Note that there are other ways to do this test. For example, we could have written:
die "Must specify absolute path!\n" if $topdir !~ m{^/};
But using the substr function gives me an excuse to introduce
another perl function (which is talked about in Chapter 15).
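As a portable alternative to both tests above, the File::Spec module (already loaded by the script) offers a file_name_is_absolute method. The sketch below is only an illustration: the helper name is_absolute_path is hypothetical and does not appear in the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of the original script): returns 1 when
# $path is absolute according to the rules of the current platform.
sub is_absolute_path {
    my ($path) = @_;
    return File::Spec->file_name_is_absolute($path) ? 1 : 0;
}

for my $path ('/home/donald/my_dir', '../../my_dir') {
    printf "%-22s %s\n", $path,
        is_absolute_path($path) ? 'absolute' : 'relative';
}
```

On a UNIX system the first path is reported as absolute and the second as relative, which matches the behaviour of the substr test.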
Next comes the for loop header.
This for loop will iterate over all files below the directory
given on the command line. To generate the listing of all the files,
the perl script relies on the UNIX find command. In order
to execute this external program and grab its output, the perl script
uses the backquote operator ` (don't confuse this with the
single quote '. Your browser may not display the backquote correctly, so
be careful). When the invocation of an external program is enclosed in
backquotes, perl will run the program and return its standard
output as the result. Therefore, the expression:
`find $topdir -type f -print0`
will invoke the find command, and the result will be a string
containing the names of all the regular files found under $topdir,
including those in its subdirectories.
The -print0 option causes each file name to be
terminated by a nul byte. This is important because the directory
may contain files that have whitespace in their names. If the file
names were separated by spaces or newlines and some of the names
themselves contained those characters, then we could not reliably use
that separator when trying to parse the output. We use the split
function, with a regular expression that consists of just the
nul byte \x0, to break the string up into its corresponding
file names. Therefore, on each iteration of the loop, $_
will be set to the next (fully specified) file name.
Instead of using the -print0 option to find,
we could have just invoked the program in list context. For example,
if we had written the above for loop as:
chomp(my @files = `find $topdir -type f -print`);
for (@files) {
    ...
}
then each line of output generated by the find command
would represent a file name (including any spaces in the file name).
Each of these lines would then be stored as an element in the
@files array, which would then be chomped
to get rid of the newlines. We can then iterate over this array.
Unfortunately, if a file happened to have a newline in its name,
then this approach will not work, as that file name will straddle more
than one array element. Therefore, although using the split
function and the -print0 option to find is more
cryptic, it is also more robust, as it will successfully deal with a file
name that has a newline character in it.
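One thing the backquote form does not protect against is whitespace in $topdir itself, because the command string is handed to a shell. A possible variation, sketched below under the assumption of a UNIX system with find on the PATH, is to read from find through the list form of open, which bypasses the shell, and to set the input record separator $/ to the nul byte. The helper name find_files is hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the
# names of all regular files below $topdir.
sub find_files {
    my ($topdir) = @_;

    # List form of open: find is started directly, without a shell,
    # so spaces in $topdir need no quoting.
    open(my $fh, '-|', 'find', $topdir, '-type', 'f', '-print0')
        or die "Cannot run find: $!\n";

    local $/ = "\0";              # records end with a nul byte, not a newline
    chomp(my @files = <$fh>);     # chomp removes the trailing nul from each record
    close $fh or die "find reported an error\n";
    return @files;
}

# Only list files when a directory is given on the command line.
if (@ARGV) {
    print "$_\n" for find_files($ARGV[0]);
}
```

Because chomp removes whatever $/ currently holds, the same idiom used for newlines works here for nul bytes.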
Another way to invoke external system commands is to use the
system function. For example, system("ls -l")
will run the ls -l command and display the output
on STDOUT. If the command executes successfully, zero will be returned
as a result. Another function called exec can also
be used. The difference is that exec replaces the running
perl process with the new program. In other words,
the call to exec, if successful, never returns because
there is no longer anything to return to. Most often, you will
want to use system or backquotes to run external processes
from within perl.
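The return value of system can be checked as in the following sketch. To keep the example self-contained it runs the perl interpreter itself (whose path is held in the special variable $^X) as the external program; that choice is only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run a child process that exits successfully; system returns 0.
my $status = system($^X, '-e', 'exit 0');
print "first command succeeded\n" if $status == 0;

# Run a child process that fails. After system returns, the child's
# exit code is stored in the high byte of the special variable $?.
system($^X, '-e', 'exit 3');
my $exit_code = $? >> 8;
print "second command exited with code $exit_code\n";
```

Passing the program and its arguments as a list, rather than as one string, also avoids involving a shell.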
Inside the for loop, we then use the
basename function on the current absolute file name
that we are examining. This function is exported by the
File::Basename module and we can invoke it just like
any other perl function. This function will determine the file name
portion of its argument. For example, if $_ is set to
/home/donald/my_files/songs/a_song.mp3, then basename
$_ will return a_song.mp3. Note that the function
will not use $_ by default, so we have to specify its
argument explicitly.
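For reference, here is a small sketch of basename together with two other functions that File::Basename exports by default, dirname and fileparse. The path is the one used in the example above; the results in the comments assume a UNIX system.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;

my $path = '/home/donald/my_files/songs/a_song.mp3';

my $file = basename($path);    # a_song.mp3
my $dir  = dirname($path);     # /home/donald/my_files/songs

# fileparse splits a path into name, directory and (optionally) a suffix
# that matches one of the patterns given after the path.
my ($name, $parent, $suffix) = fileparse($path, qr/\.[^.]*/);

print "file:   $file\n";
print "dir:    $dir\n";
print "name:   $name\n";       # a_song
print "parent: $parent\n";     # /home/donald/my_files/songs/
print "suffix: $suffix\n";     # .mp3
```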
If the base name does not match $pattern, we use the
warn function to display an appropriate
warning and then continue to the next file name. warn
is similar to die in that it will display its argument to
STDERR; however, unlike die, warn
will not terminate the program.
When testing the regular expression, we use the !~ binding
operator. This operator is similar to the =~ binding
operator, except that when used in scalar context, it will return true
if the pattern doesn't match and false otherwise.
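The two binding operators can be compared side by side in a short sketch. The pattern is the one from the script; the list of file names is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pattern = '\.mp3$';
my @names   = ('some_music.mp3', 'notes.txt', 'LOUD.MP3');   # sample names

# !~ is true when the pattern does NOT match; =~ is true when it does.
my @skipped = grep { $_ !~ /$pattern/i } @names;
my @kept    = grep { $_ =~ /$pattern/i } @names;

warn "File not matched '$_'!\n" for @skipped;   # goes to STDERR, script continues
print "kept: @kept\n";
```

The /i modifier makes the match case-insensitive, which is why LOUD.MP3 is kept.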
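We then extract the first alphabetic character of the base name.
If the rest of the name contains no letters, this will be the
m from the mp3 extension.
Therefore, a file with the name 00101001101.mp3 would be
filed in the M directory.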
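The extraction relies on the fact that a match in list context returns its captured groups. A minimal sketch, using a hypothetical helper named first_letter that also applies the upper-casing the script performs in its next step:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): returns the
# upper-cased first letter of $name, or undef if $name has no letters.
sub first_letter {
    my ($name) = @_;
    my ($first) = ($name =~ m/([a-z])/i);   # list context yields the capture
    return defined $first ? uc($first) : undef;
}

for my $name ('some_music.mp3', '03_Hello.mp3', '00101001101.mp3') {
    print "$name -> ", first_letter($name), "\n";
}
```

The three sample names map to S, H and M respectively, in line with the examples in the text.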
We convert this character to upper case with the
uc function. This will be the name of the directory
to which we must copy the file currently being examined. Before
we can actually copy the file, we must make sure that the destination
directory actually exists. As we saw in the last class, we can
use the directory test operator -d to do this. If
the directory does not exist, we can create it as follows:
mkdir $dir, 0700 or die "Cannot make directory! '$dir': $!\n";
The mkdir function takes a directory name and (optionally)
a permission mode and attempts to create the directory with the
specified permissions (as further restricted by the process's umask).
If it fails to create the directory for some
reason, we will terminate with an appropriate error message
(which uses the special variable $!, which contains
a string indicating why the mkdir function failed).
Directory permissions are usually specified in octal, so we prefix
the mode with a zero. We are setting the directory
permissions to be read/write/execute for the 'user' and blank for
'group' and 'other'. If we had the permissions stored as an octal
string (such as "0700") in a scalar variable, then we would have to use
the oct function when specifying the mode. Otherwise the string
would be treated as decimal.
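The role of oct can be seen in the following sketch, which creates a directory inside a scratch directory provided by the core File::Temp module. The variable names and the use of a scratch directory are assumptions made for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Permissions held as a string, e.g. read from a configuration file.
my $mode_string = '0700';
my $mode        = oct($mode_string);   # 448 decimal, same value as the literal 0700

my $parent = tempdir(CLEANUP => 1);    # scratch directory, removed at exit
my $dir    = File::Spec->catdir($parent, 'S');

if (! -d $dir) {
    mkdir $dir, $mode or die "Cannot make directory '$dir': $!\n";
}

# The resulting mode is the requested mode with the umask bits removed.
printf "created %s with mode %04o\n", $dir, (stat $dir)[2] & 07777;
```

Passing $mode_string directly would make perl treat it as the decimal number 700, which is a different set of permission bits.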
Other file/directory manipulation functions offered by perl
include chmod and chown which change
the permission (mode) of a file and the owner of a file. Of course,
these functions will fail if you attempt to use them on files for
which you do not have appropriate access permissions.
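As a brief illustration of chmod, the sketch below changes the mode of a temporary file and reads it back with stat. It assumes a UNIX-style filesystem; chown is left out because changing ownership generally requires special privileges.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a scratch file that is deleted when the script exits.
my ($fh, $file) = tempfile(UNLINK => 1);
close $fh;

# chmod takes the mode first, then a list of files, and returns the
# number of files it changed successfully.
chmod 0644, $file or die "Cannot change mode of '$file': $!\n";

my $perms = (stat $file)[2] & 07777;   # keep only the permission bits
printf "%s now has mode %04o\n", $file, $perms;
```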
We build the name of the destination file using the File::Spec
module. Unlike the File::Basename module, the
File::Spec module is object-oriented, meaning that
we must invoke its functions (or more correctly, methods)
a bit differently. We use the notation File::Spec->catfile
to invoke the catfile method from the File::Spec
module. The catfile method will take its list of arguments
and join them together using an appropriate directory separator
(which is / on UNIX machines).
For example, if $basename was set to song.mp3,
then $fname will be set to S/song.mp3.
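A short sketch of catfile in isolation follows. The extra directory name music is made up for the example, and the results in the comments assume a UNIX system.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
use File::Spec;

my $basename = basename('/home/donald/my_files/songs/song.mp3');

# catfile is called as a class method, hence the -> notation.
my $fname  = File::Spec->catfile('S', $basename);            # S/song.mp3
my $deeper = File::Spec->catfile('music', 'S', $basename);   # music/S/song.mp3

print "$fname\n$deeper\n";
```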
Finally, we 'copy' the file using the link function. This function takes
two file names, an old name and a new name, and creates a new entry with
the new name which acts as a 'copy' of the source file. Unlike a
traditional file copy, however, no extra space is consumed on the
filesystem. Instead,
the new name and old name essentially refer to the exact same file on the
filesystem. The UNIX file system maintains a reference count for each
file on the file system. When we create a link, the reference count
is increased. When we remove a file, the reference count is decreased.
The file is only removed when the reference count equals zero. Therefore,
if we were to erase the old file, the new file would still exist, but
its reference count would be decreased. If we then remove the new file,
then the file will actually be removed from the filesystem.
This is what's known in the UNIX world as a hard link. Unfortunately,
hard links cannot span different filesystems (partitions), so UNIX also
has the concept of a soft (or symbolic) link. The drawback of symbolic
links is that if you remove (or rename) the original file, then any
symbolic links pointing to it become invalid and essentially unusable
(this is similar to dangling pointers in C and C++).
Again, if the link function fails (for example, if the
destination file already exists), then we terminate the script with an
appropriate error message.
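The reference counting described above can be observed with stat, whose fourth field is the number of hard links to a file. The sketch below works inside a scratch directory; the file names are made up, and it assumes a filesystem that supports hard links.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);                  # scratch directory
my $old = File::Spec->catfile($dir, 'old.mp3');   # hypothetical file names
my $new = File::Spec->catfile($dir, 'new.mp3');

open(my $out, '>', $old) or die "Cannot write '$old': $!\n";
print {$out} "some data\n";
close $out or die "Cannot close '$old': $!\n";

link $old, $new or die "Cannot create link from '$old' to '$new': $!\n";
my $count_after_link = (stat $new)[3];            # two names, one file

unlink $old or die "Cannot remove '$old': $!\n";
my $count_after_unlink = (stat $new)[3];          # one name left

# The data is still reachable through the remaining name.
open(my $in, '<', $new) or die "Cannot read '$new': $!\n";
my $content = <$in>;
close $in;

print "links after link:   $count_after_link\n";
print "links after unlink: $count_after_unlink\n";
print "content: $content";
```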
Last modified: Tue Apr 8 00:10:38 2003