Assignment 4
Due: 11:00 AM on Friday, October 14, 2011
Question #1
A digram is a sequence of two non-whitespace characters. The digram count
matrix associated with a textfile is a 2-dimensional matrix DM whose two axes are
each labelled with the sorted list of the distinct non-whitespace characters in the file,
such that for non-whitespace characters x and y, DM[x][y] is
the number of occurrences of the digram xy in the given file.
Write and document a Python script digramM.py which takes as
a command-line argument a textfile and computes and prints the digram count matrix
for that file.
Your script must work on datafiles
dm1.dat,
dm2.dat,
dm3.dat, and
dm4.dat
to produce the output given in typescript-file
digramM.script.
Your code must implement the digram matrix as a list of lists.
You may assume that each given textfile has at least one word, i.e., no given
textfile is composed entirely of whitespace characters.
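The list-of-lists representation described above can be sketched as follows. This is only an illustration, not the expected solution: the function name digram_matrix is hypothetical, the sketch assumes Python 3, and it assumes that digrams are counted only inside whitespace-delimited words (so a digram never spans a space or a line break). The required output layout is defined by digramM.script and is not reproduced here.

```python
def digram_matrix(text):
    """Return (chars, dm): the sorted axis labels and the digram count matrix.

    Hypothetical helper for illustration only.
    """
    # Sorted list of the distinct non-whitespace characters; labels both axes.
    chars = sorted(set("".join(text.split())))
    n = len(chars)
    # n x n matrix of zeros, built as a list of lists.
    dm = [[0] * n for _ in range(n)]
    # Assumption: digrams are counted within each whitespace-delimited word.
    for word in text.split():
        for x, y in zip(word, word[1:]):
            dm[chars.index(x)][chars.index(y)] += 1
    return chars, dm


chars, dm = digram_matrix("ab ba\nab")
print(chars)  # ['a', 'b']
print(dm)     # [[0, 2], [1, 0]]
```

In the small example, the digram ab occurs twice and ba once, so the row for a holds the count 2 in the column for b. In a full script, the text would come from the file named in sys.argv[1].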
Question #2
Given two textfiles X and Y, let the similarity score between these
textfiles be defined as Sim(X, Y) = 1.0 - (SD(X, Y) / (nW(X) + nW(Y))), where
nW(X) and nW(Y) are the numbers of words that occur in X and
Y and SD(X, Y) is the total number of words that occur uniquely in
X or Y, i.e., (the number of words that occur in X that do
not occur in Y) + (the number of words that occur in Y that do not occur
in X). Note that nW() does not count the total number of words in
a file, but rather the number of different words that occur in a file.
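As a rough illustration of the definition above, the following sketch computes Sim(X, Y) from two lists of words. The names distinct_words and similarity are hypothetical, and the sketch assumes Python 3 and that a word is any whitespace-delimited string, compared case-sensitively; the assignment itself may intend a different notion of word.

```python
def distinct_words(words):
    """Return a list of the different words in `words`, in first-seen order."""
    result = []
    for w in words:
        if w not in result:
            result.append(w)
    return result


def similarity(words_x, words_y):
    """Sim(X, Y) = 1.0 - SD(X, Y) / (nW(X) + nW(Y)), using lists only."""
    dx = distinct_words(words_x)
    dy = distinct_words(words_y)
    # SD: words in X that are not in Y, plus words in Y that are not in X.
    sd = len([w for w in dx if w not in dy]) + len([w for w in dy if w not in dx])
    return 1.0 - float(sd) / (len(dx) + len(dy))


# X has the different words a, b, c; Y has b, c, d.
# nW(X) = 3, nW(Y) = 3, SD(X, Y) = 2, so Sim(X, Y) = 1.0 - 2/6.
print(similarity("a b c a".split(), "b c d".split()))
```

Two files with exactly the same set of words score 1.0, and two files with no words in common score 0.0.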
Write and document a Python script tcomp4.py which takes as
command-line arguments a master textfile and two or more comparison
textfiles and prints (1) the similarity of the master textfile and each comparison textfile
and (2) the name of the comparison file that is most similar to the
master textfile.
Your script must work on datafiles
tc1.dat,
tc2.dat,
tc3.dat,
tc4.dat,
tc5.dat, and
tc6.dat
to produce the output given in typescript-file
tcomp4.script.
Your code must implement the similarity computation using lists.
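Once a similarity score is known for each comparison file, choosing the most similar one is a simple scan over a list. The sketch below is a minimal illustration under stated assumptions: the function name most_similar is hypothetical, the file names and scores in the usage lines are made up, and ties are broken in favour of the earliest file, which the assignment does not specify.

```python
def most_similar(scores):
    """Given a non-empty list of (filename, similarity) pairs,
    return the filename with the highest similarity.
    """
    best_name, best_score = scores[0]
    for name, score in scores[1:]:
        # Strict comparison: on a tie, the earlier file is kept (an assumption).
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Made-up file names and scores, for illustration only.
example = [("a.txt", 0.25), ("b.txt", 0.75), ("c.txt", 0.5)]
print(most_similar(example))  # b.txt
```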
Hints
You may find the answer scripts for Assignment #3 of use.
Submission
Please hand in printed copies of all of your Python script files.
You must also submit these files electronically using the
submit-assignment command.
Note that each script file must have the following comment
block at the top, where the X's are replaced with the appropriate
information, followed by a docstring briefly describing the program in that
script. For instance, my script for Question #1 of this assignment would
begin with the following comment block:
#########################################################
## CS 2500 (Fall 2011), Assignment #4, Question #1     ##
## Script File Name: digramM.py                        ##
## Student Name: Todd Wareham                          ##
## Login Name: harold                                  ##
## MUN #: 8008765                                      ##
#########################################################
You do not have to develop your code on our CS departmental systems.
However, as your code will be run and tested on our CS departmental
systems as part of the assignment marking process,
you should ensure that your code runs correctly on at
least one of these systems.
- August 15, 12:40pm
Assignment #4 posted.
Created: August 15, 2011
Last Modified: August 15, 2011