Monday, April 07, 2003

Manipulating Files and Directories & Process Management (S&P -- Chapters 13 and 14)

Consider the following problem -- we have a collection of numerous files (they could be mp3's, spreadsheet files, telemetry data etc.) spread across multiple nested subdirectories of some top-level directory and we want to copy all these files to a new directory hierarchy. The new directory hierarchy is relatively flat -- there are only 26 directories named A, B ... Z. The directory to which a file is copied depends upon the first alphabetic letter of the file's name. Therefore, the file named some_music.mp3 would be copied to the directory named S and the file 03_Hello.mp3 would be copied to the directory named H.

This problem can be solved with the perl script given below. This perl script does some minor directory management and also invokes an external process (namely, the find command) to perform its task. This code also demonstrates the use of a couple of routines for filename manipulation as provided by perl's File::Basename and File::Spec modules.


```
#!/usr/bin/perl -w

use strict;

use File::Basename;
use File::Spec;

# Only files whose names match this pattern are linked.
my $pattern = '\.mp3$';

defined(my $topdir = shift @ARGV) or die "Must specify directory";
die "Must specify absolute path!\n" if substr($topdir, 0, 1) ne "/";

# find prints every regular file under $topdir, each name followed by a nul byte.
for (split /\x0/, `find $topdir -type f -print0`) {
	my $basename = basename $_;
	if ($basename !~ /$pattern/i) {
		warn "File not matched '$_'!\n";
		next;
	}
	# The first alphabetic character decides the destination directory.
	my ($first) = ($basename =~ m/([a-z])/i);
	if (!defined $first) {
		warn "No alphabetic character to use for file '$_'!\n";
		next;
	}
	my $dir = uc $first;
	# Create the destination directory (relative to the current directory) if needed.
	if (! -d $dir) {
		mkdir $dir, 0700 or die "Cannot make directory! '$dir': $!\n"
	}
	my $fname = File::Spec->catfile($dir, $basename);
	# Create a hard link rather than a real copy.
	link $_, $fname or die "Cannot create link from '$_' to '$fname': $!\n";
}
```

The main points of the above script can be summarized as follows:

One of the first things that this script does is import the File::Basename and File::Spec modules. These two modules are used to manipulate file names in a portable fashion. To import a module, we write use followed by the module name. We will describe how to invoke functions in modules later. (As an aside, note that use strict; does not load a module in the usual sense; strict is a pragma, which provides additional information about how perl should compile the script.)
We then initialize a regular expression pattern which we will test against every file name that we encounter in the original directory hierarchy. If we encounter a file name that does not match this pattern, then we will display a warning and continue on. This pattern gives us finer control over which files are copied from the original directory hierarchy to the new one.
We then grab the name of the original directory from the command line arguments and store it in the scalar variable $topdir. Note that we require that the directory be specified in absolute (as opposed to relative) terms. Therefore, invoking the script as:
```
$ ./flatten.pl /home/donald/my_dir
```
would be okay. But invoking it as
```
$ ./flatten.pl ../../my_dir
```
would be detected as an error. Note that to determine whether or not an absolute path name has been specified on the command line, we check the first character of $topdir by using the substr function. In this form of the substr invocation, the function is taking a scalar, an offset and a length. The offset begins at 0 and we take a length of one character. If the character returned from the substr function is not /, then we terminate. (We are using the UNIX-specific directory separator here, so this means that our script will not be entirely portable.)
Note that there are other ways to do this test. For example, we could have written:
```
die "Must specify absolute path!\n" if $topdir !~ m{^/};
```
But using the substr function gives me an excuse to introduce another perl function (which is talked about in Chapter 15).
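If we wanted the absolute-path test itself to be portable, one option is the file_name_is_absolute method of File::Spec, the same module the script already loads. The following is only a sketch of that idea; the helper name is_absolute is made up for illustration and is not part of the original script.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not in the original script): returns 1 if $path is
# absolute according to the rules of the current operating system, else 0.
sub is_absolute {
    my ($path) = @_;
    return File::Spec->file_name_is_absolute($path) ? 1 : 0;
}

for my $path ('/home/donald/my_dir', '../../my_dir') {
    printf "%-22s %s\n", $path, is_absolute($path) ? 'absolute' : 'relative';
}
```

On a UNIX machine the first path is reported as absolute and the second as relative, which matches the substr test used in the script.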
We then execute the rather cryptic for loop header. This for loop will iterate over all files in the directory given on the command line (including its subdirectories). To generate the listing of all the files, the perl script relies on the UNIX find command. In order to execute this external program and grab its output, the perl script uses the backquote operator ` (don't confuse this with the forward quote. Your browser may not display the backquote correctly, so be careful). When we enclose the invocation of an external program inside backquotes, perl will run the program and return its standard output as the result. Therefore, the expression:
```
`find $topdir -type f -print0`
```
will invoke the find command, which will return a string containing the names of all the files under the directory given by $topdir. The -print0 option causes each file name to be followed by a nul byte. This is important because the directory may contain files that have spaces in their names. If the find command returned a string of filenames separated by spaces and some of the filenames had spaces, then we could not use a space as a delimiter when trying to parse the file names. We use the split function, with a regular expression that consists of just the nul byte \x0, to break the string up into its corresponding file names. Therefore, on each iteration of the loop, $_ will be set to the next (fully specified) filename in the directory.
Instead of using the -print0 option to find, we could have just invoked the program in list context. For example, if we had written the above for loop as:
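To see what the split is doing without actually running find, we can feed it a hand-made string. The sketch below assumes that the output looks the way -print0 produces it (every name followed by a nul byte); the two file names are invented for the example.

```perl
use strict;
use warnings;

# Simulated output of `find ... -print0`: each name is followed by a nul byte.
my $output = "/tmp/music/some song.mp3\0/tmp/music/03_Hello.mp3\0";

# split discards trailing empty fields, so the final nul byte does not
# produce an extra, empty file name.
my @files = split /\x0/, $output;

print scalar(@files), " files\n";
print "  [$_]\n" for @files;
```

The name containing a space comes through as a single element, which is exactly why the nul byte is used as the delimiter.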
```
chomp(my @files = `find $topdir -type f -print`);
for (@files) {
...
}
```
then each line of output generated by the find command would represent a file name (including any spaces in the filenames). Each of these lines would then be stored as an element in the @files array, which would then be chomped to get rid of the newlines. We can then iterate over this array. Unfortunately, if a file name happened to have a newline in its name, then this approach will not work as that file name will straddle more than one array element. Therefore, although using the split function and the -print0 option to find is more cryptic, it is also more robust as it will successfully deal with a file name that has a newline character in it.
Another way to invoke external system commands is to use the system function. For example, system("ls -l") will run the ls -l command and display the output on STDOUT. If the command executes correctly, zero will be returned as a result. Another function called exec can also be used. The difference is that exec replaces the running perl process with the new program. In other words, the call to exec, if successful, never returns because there is no longer anything to return to. Most often, you will want to use system or backquotes to run external processes from within perl.
Inside the for loop, we then use the basename function on the current absolute filename that we are examining. This function is imported by the File::Basename module and we can invoke it just like any other perl function. This function will determine the filename portion of its argument. For example, if $_ is set to /home/donald/my_files/songs/a_song.mp3, then basename $_ will return a_song.mp3. Note that the function will not use $_ by default, so we have to specify its argument explicitly.
The next thing we do inside the for loop is test whether the filename that we extracted matches our pattern. If not, then we use the warn function to display an appropriate warning and continue to the next filename. warn is similar to die in that it will display its argument to STDERR; however, unlike die, warn will not terminate the program.
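A third possibility, not used in the script above, is to read the output of find through a pipe opened with the list form of open. This avoids the shell altogether, so a $topdir containing spaces needs no quoting, and setting $/ to the nul byte lets us read one file name per record. This is a sketch under the assumption that the local find supports -print0; the helper name find_files is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of all regular files below $topdir.
sub find_files {
    my ($topdir) = @_;
    # The list form of open runs find directly, without a shell in between.
    open(my $fh, '-|', 'find', $topdir, '-type', 'f', '-print0')
        or die "Cannot run find: $!\n";
    local $/ = "\0";            # records end with a nul byte, not a newline
    chomp(my @files = <$fh>);   # chomp strips the trailing $/ (the nul byte)
    close $fh or warn "find did not exit cleanly\n";
    return @files;
}

if (@ARGV) {
    print "[$_]\n" for find_files($ARGV[0]);
}
```

Like the split version, this copes with spaces and newlines in file names, at the cost of a few more lines of code.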
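The difference between system and backquotes can be seen in a small experiment. In this sketch the external program is perl itself (its path is available in the special variable $^X), used purely as a stand-in for any command; the exit code 3 is an arbitrary value chosen for the example.

```perl
use strict;
use warnings;

# $^X is the path of the running perl interpreter; it stands in for an
# arbitrary external program here.
my @ok_cmd   = ($^X, '-e', 'exit 0');
my @fail_cmd = ($^X, '-e', 'exit 3');

# system returns the wait status of the command; 0 means success.
my $ok_status = system(@ok_cmd);

# After a failure, the command's exit code is in the high byte of $?.
system(@fail_cmd);
my $fail_code = $? >> 8;

# Backquotes capture standard output instead of letting it pass through.
my $captured = `$^X -e "print 42"`;

print "ok status: $ok_status, failing exit code: $fail_code, captured: $captured\n";

# exec is not demonstrated: if it succeeded, nothing after it would run.
```

Passing system a list, as done here, also bypasses the shell, which is handy when arguments may contain spaces.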
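File::Basename exports two other functions alongside basename, namely dirname and fileparse. The script does not need them, but a short sketch shows how the three relate, using the same example path as above.

```perl
use strict;
use warnings;
use File::Basename;   # exports basename, dirname and fileparse by default

my $path = '/home/donald/my_files/songs/a_song.mp3';

my $file = basename($path);   # the filename portion
my $dir  = dirname($path);    # everything before the filename

# fileparse can additionally split off a suffix that matches a pattern.
my ($name, $dirpart, $suffix) = fileparse($path, qr/\.[^.]*/);

print "file=$file dir=$dir name=$name suffix=$suffix\n";
```

Note that fileparse keeps the trailing separator on the directory part, while dirname drops it.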
When testing the regular expression, we use the !~ binding operator. This operator is similar to the =~ binding operator, except that when used in scalar context, it will return true if the pattern doesn't match and false otherwise.
Next, we extract the first alphabetic character from the filename. If the filename does not have an alphabetic character, then a warning will be displayed and the file will be ignored. One important point to make here is that in the context of the above program, even a song name that consists entirely of digits, for example, will still have an alphabetic character, namely the m from the mp3 extension. Therefore, a file with the name 00101001101.mp3 would be filed in the M directory.
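The extraction step is easy to try out on its own. The sketch below wraps it in a helper, and also shows one possible variant that strips the extension first, so that an all-digit name no longer lands in M. Both helper names (bucket_for and bucket_ignoring_ext) are made up for this example, and the variant is an alternative, not what the script above does.

```perl
use strict;
use warnings;

# Hypothetical helper: the upper-cased first letter (a-z) of $name,
# or undef if the name contains no such letter.
sub bucket_for {
    my ($name) = @_;
    my ($first) = ($name =~ m/([a-z])/i);
    return defined($first) ? uc($first) : undef;
}

# Hypothetical variant: drop the extension before looking for a letter.
sub bucket_ignoring_ext {
    my ($name) = @_;
    (my $stem = $name) =~ s/\.[^.]*$//;
    return bucket_for($stem);
}

for my $name ('some_music.mp3', '03_Hello.mp3', '00101001101.mp3') {
    my $with_ext    = bucket_for($name);
    my $without_ext = bucket_ignoring_ext($name);
    printf "%-16s -> %s (ignoring extension: %s)\n", $name,
        $with_ext, defined($without_ext) ? $without_ext : 'none';
}
```

With the variant, the script would take the "No alphabetic character" branch for 00101001101.mp3 instead of filing it under M.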
After we get the first alphabetic character, we capitalize it using the uc function. This will be our directory name to which we must copy the file currently being examined. Before we can actually copy the file, we must make sure that the destination directory actually exists. As we saw in the last class, we can use the directory test operator -d to do this. If the directory does not exist, we can create it as follows:
```
	mkdir $dir, 0700 or die "Cannot make directory! '$dir': $!\n"
```
The mkdir function takes a directory name and (optionally) a permission mask and attempts to create the directory with the specified permissions (as modified by the process's umask). If it fails to create the directory for some reason, we will terminate with an appropriate error message (which uses the special variable $!, which contains a string indicating why the mkdir function failed).
Directory permissions are usually specified in octal, so we prefix the permission mask with a zero. We are setting the directory permissions to be read/write/execute for the 'user' and blank for 'group' and 'other'. If we had the permissions stored as a string of octal digits in a scalar variable (for example, a value read from the command line), then we would have to use the oct function when specifying the permission mask. Otherwise the string would be treated as decimal.
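The decimal-versus-octal pitfall is easiest to see with actual numbers. In this sketch the mode arrives as the string "0700" (as it would from a config file or the command line), and the directory is created inside a temporary directory from File::Temp so that the example cleans up after itself; the variable names are chosen just for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $mode_string = '0700';           # e.g. read from a file or from @ARGV

my $as_decimal = $mode_string + 0;  # numeric conversion ignores the leading zero
my $as_octal   = oct $mode_string;  # interprets the digits as octal

my $parent = tempdir(CLEANUP => 1);
my $dir    = "$parent/S";
mkdir $dir, $as_octal or die "Cannot make directory '$dir': $!\n";

# Field 2 of stat is the mode; the low 12 bits hold the permission bits.
my $perms = (stat $dir)[2] & 07777;
printf "decimal=%d octal value=%d permissions=%04o\n",
    $as_decimal, $as_octal, $perms;
```

Here $as_decimal is 700 while $as_octal is 448, which is the same number as the literal 0700. Passing 700 to mkdir would request a quite different set of permission bits.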
Other file/directory manipulation functions offered by perl include chmod and chown which change the permission (mode) of a file and the owner of a file. Of course, these functions will fail if you attempt to use them on files for which you do not have appropriate access permissions.
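As a small illustration of chmod (chown usually needs superuser privileges, so it is left out), the sketch below changes the mode of a temporary file and reads it back with stat. The temporary file comes from File::Temp and exists only for this example.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($fh, $file) = tempfile(UNLINK => 1);
close $fh;

# chmod takes the mode first, then a list of files, and returns the number
# of files whose mode was changed successfully.
my $changed = chmod 0600, $file;
warn "Cannot chmod '$file': $!\n" if $changed != 1;

my $perms = (stat $file)[2] & 07777;
printf "%s now has mode %04o\n", $file, $perms;
```

Because chmod returns a count rather than simply true or false, comparing the result against the number of files passed in is a reasonable way to detect partial failure.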
We then create our new destination file name by prepending the destination directory name to the basename that we determined above. We can do this portably using the File::Spec module. Unlike the File::Basename module, the File::Spec module is object-oriented, meaning that we must invoke its functions (or more correctly methods) a bit differently. We use the notation File::Spec->catfile to invoke the catfile method from the File::Spec module. The catfile method will take its list of arguments and join them together using an appropriate directory separator (which is / on UNIX machines). For example, if $basename was set to song.mp3, then $fname will be set to S/song.mp3.
Finally, we 'copy' the file from the source location to the new destination using the link function. This function takes two file names, an old name and a new name, and creates a new directory entry with the new name which acts as a 'copy' of the source file. Unlike a traditional file copy, however, no extra space is consumed on the filesystem. Instead, the new name and old name essentially refer to the exact same file on the filesystem. The UNIX file system maintains a reference count for each file on the file system. When we create a link, the reference count is increased. When we remove a file, the reference count is decreased. The file is only removed when the reference count equals zero. Therefore, if we were to erase the old file, the new file would still exist, but its reference count would be decreased. If we then remove the new file, then the file will actually be removed from the filesystem.
This is what's known in the UNIX world as a hard link. Unfortunately, hard links cannot span different filesystems (partitions), so UNIX also has the concept of a soft (or symbolic) link. Unfortunately, with symbolic links, if you remove (or rename) the original file, then any symbolic links pointing to it will be invalid and essentially unusable (this is similar to dangling pointers in C and C++).
Again, if the link function fails (for example, if the destination file already exists), then we terminate the script with an appropriate error message.
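A quick way to get a feel for catfile is to call it with a few arguments. The sketch assumes a UNIX machine, where the separator is /; the directory names music and sorted are invented for the example.

```perl
use strict;
use warnings;
use File::Spec;

my $dir      = 'S';
my $basename = 'song.mp3';

# Class-method call: the arrow passes 'File::Spec' as the invocant.
my $fname = File::Spec->catfile($dir, $basename);

# catfile accepts any number of directories followed by a file name.
my $nested = File::Spec->catfile('music', 'sorted', $dir, $basename);

print "$fname\n$nested\n";
```

On UNIX this prints S/song.mp3 and music/sorted/S/song.mp3; on another operating system File::Spec would pick that system's separator instead.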
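The behaviour of hard and symbolic links described above can be checked with a short experiment. The sketch works inside a temporary directory from File::Temp; the file names old.mp3, new.mp3 and soft.mp3 are placeholders, and the link count is read from field 3 of stat.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir = tempdir(CLEANUP => 1);
my ($old, $new, $soft) = ("$dir/old.mp3", "$dir/new.mp3", "$dir/soft.mp3");

open(my $out, '>', $old) or die "Cannot create '$old': $!\n";
print {$out} "data\n";
close $out;

# Field 3 of stat is the link (reference) count.
my $before = (stat $old)[3];
link $old, $new or die "Cannot link '$old' to '$new': $!\n";
my $after = (stat $old)[3];

symlink $old, $soft or die "Cannot symlink '$old' to '$soft': $!\n";

# Linking to a name that already exists fails instead of overwriting it.
my $relinked = link($old, $new) ? 1 : 0;

unlink $old or die "Cannot remove '$old': $!\n";
my $hard_ok   = -e $new  ? 1 : 0;  # data still reachable through the hard link
my $soft_ok   = -e $soft ? 1 : 0;  # -e follows the symlink, which now dangles
my $remaining = (stat $new)[3];

print "count before=$before after=$after remaining=$remaining\n";
print "second link succeeded=$relinked hard link ok=$hard_ok symlink ok=$soft_ok\n";
```

The count goes from 1 to 2 when the hard link is made and back to 1 when the old name is removed; the hard link keeps working, while the symbolic link is left pointing at nothing.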

Last modified: Tue Apr 8 00:10:38 2003