Assignment 5
Due: 11:00 AM on Wednesday, October 28, 2009
Question #1
An author may be roughly characterized by the set of words that he or
she uses in the documents that he or she writes. Given a document X,
let Word(X) denote the set of words used in that document. The
degree to which a document X is likely to have been written by
the same author as another document Y is inversely related
to the distance d(X, Y) = |symmetric_difference(Word(X), Word(Y))| /
(|Word(X)| + |Word(Y)|), whose value is 0 if X and Y use
identical sets of words and 1 if they are maximally dissimilar, i.e., the
intersection of Word(X) and Word(Y) is empty.
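As a rough illustration of the distance defined above, the following sketch computes d(X, Y) from two Python sets. The function name distance and the handling of two empty documents are assumptions made for this example and are not part of the assignment specification.

```python
def distance(words_x, words_y):
    """Return d(X, Y) for two sets of words."""
    total = len(words_x) + len(words_y)
    if total == 0:
        # Assumption: two empty documents are treated as identical.
        return 0.0
    # The ^ operator on sets gives the symmetric difference.
    return len(words_x ^ words_y) / float(total)


same = distance({"the", "cat", "sat"}, {"the", "cat", "sat"})
disjoint = distance({"the", "cat"}, {"a", "dog"})
partial = distance({"the", "cat", "sat"}, {"the", "dog", "sat"})
```

Here same is 0, disjoint is 1, and partial is 2/6, since only "cat" and "dog" fall in the symmetric difference.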
Write and document a Python script authorship.py which takes as
command-line arguments a document-list file (one document-file name per
line) and a new document-file and prints a list of the documents in
the document-list file sorted by the value of d() relative to
the new document-file.
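The sorting step might be sketched as follows. The file names, the helper rank_documents, and the choice of increasing order (most similar document first) are assumptions made for illustration; the required order and output format are those shown in the typescript-file. Reading the command-line arguments and the files themselves is not shown.

```python
def distance(a, b):
    """d(X, Y) for two non-empty word sets."""
    return len(a ^ b) / float(len(a) + len(b))


def rank_documents(word_sets, new_words):
    """Return (name, distance) pairs sorted by increasing distance."""
    scored = [(name, distance(words, new_words))
              for name, words in word_sets.items()]
    return sorted(scored, key=lambda pair: pair[1])


# Hypothetical word sets standing in for the contents of three documents.
known = {
    "a.txt": {"the", "cat", "sat"},
    "b.txt": {"a", "dog", "ran"},
    "c.txt": {"the", "cat", "ran"},
}
ranking = rank_documents(known, {"the", "cat", "sat"})
```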
You may assume that the only non-word characters in a document are
apostrophes, quotation marks (single and double), parentheses, exclamation
marks, question marks, colons, semi-colons, commas, and periods.
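One possible way to extract Word(X) under this assumption is sketched below. Two choices here are assumptions rather than requirements: each listed character is replaced by a space (so "don't" yields "don" and "t"), and words are lower-cased before comparison. The name words_in is hypothetical.

```python
# The non-word characters listed in the assignment text.
PUNCTUATION = "'\"()!?:;,."


def words_in(text):
    """Return the set of words in text, ignoring the listed punctuation."""
    for ch in PUNCTUATION:
        # Assumption: punctuation separates words rather than joining them.
        text = text.replace(ch, " ")
    # Assumption: "Hello" and "hello" count as the same word.
    return set(text.lower().split())


sample = words_in("Hello, world! (Hello again.)")
```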
Your script must work on document-list file
textlist1.txt and document-files
text1.txt,
text2.txt,
text3.txt,
text4.txt,
text5.txt,
text6.txt, and
text7.txt
to produce the output in typescript-file
authorship.script.
You may assume that all given files are formatted correctly.
Question #2
Write and document a Python script stats3.py which takes as a
command-line argument the name of a real-valued square data matrix file
and writes the matrix of column-column Pearson correlation coefficients
(see here for
formula) to a data-matrix-specific file.
The correlation matrix for data matrix file X.dat
is written to data-matrix-specific file X.psc.
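The computation might be sketched as follows, using the standard definition of the Pearson correlation coefficient; the formula linked above may be written in a different but equivalent form. The function names are hypothetical, the matrix is assumed to be already parsed into a list of rows, and a column with zero variance (which would cause a division by zero) is not handled.

```python
import math
import os


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(rows):
    """Return the matrix of column-column correlations for a list of rows."""
    columns = list(zip(*rows))  # transpose rows into columns
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


def output_name(data_name):
    """Map X.dat to X.psc."""
    return os.path.splitext(data_name)[0] + ".psc"


# Hypothetical 3 x 3 data matrix.
matrix = correlation_matrix([[1.0, 2.0, 3.0],
                             [2.0, 4.0, 1.0],
                             [3.0, 6.0, 2.0]])
```

In this example the second column is exactly twice the first, so their correlation is 1, and the result is symmetric with 1 on the diagonal.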
Your script must work on data matrix files
data1.dat,
data2.dat, and
data3.dat
to produce correlation-matrix files
data1.psc,
data2.psc, and
data3.psc
as specified in typescript-file
stats3.script.
You may assume that all given files are formatted correctly.
Hints
You may find the various example scripts in the course notes of use.
Submission
Please hand in printed copies of all of your Python script files.
You must also submit these files electronically using the
submit-assignment command.
Note that each script file must have the following comment
block at the top, where the X's are replaced with the appropriate
information, followed by a docstring briefly describing the program in that
script. For instance, my script for Question #1 of this assignment would
begin with the following comment block:
#########################################################
## CS 2500 (Fall 2009), Assignment #5, Question #1 ##
## Script File Name: authorship.py ##
## Student Name: Todd Wareham ##
## Login Name: harold ##
## MUN #: 8008755 ##
#########################################################
You do not have to develop your code on our CS departmental systems.
However, as your code will be interpreted and tested on our CS departmental
systems as part of the assignment marking process,
you should ensure that your code interprets and runs correctly on at
least one of these systems.
- August 25, 12:50pm
Assignment #5 posted.
Created: August 25, 2009
Last Modified: October 26, 2009