Assignment 1
Due: 12:00 PM on Monday, September 22, 2014
Question #1
An n-gram, where n is greater than or equal to 2, is a sequence of n
consecutive non-whitespace characters. The n-gram frequency
matrix associated with a textfile is a matrix NM indexed by the n-grams
in the textfile such that the entry for n-gram x in this matrix is
the total number of occurrences of x in the textfile divided by
the total number of n-grams in the textfile.
Given two textfiles X and Y, let the similarity score between these
textfiles be defined as Sim(X, Y) = 1.0 - (Diff(X, Y)/2.0), where
Diff(X,Y) is the sum over all possible n-grams of the absolute values
of the differences in frequency of occurrence of each n-gram in textfiles
X and Y.
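The definitions above can be sketched in Python as follows. This is only an illustration, not the required solution. The function names ngram_frequencies and ngram_similarity are hypothetical, and the sketch assumes that n-grams are taken within each whitespace-separated word (so that no n-gram spans whitespace), which is one possible reading of the definition.

```python
def ngram_frequencies(text, n):
    """Return a dict mapping each n-gram in text to its relative frequency.

    Only n-grams that actually occur are stored, so the dict holds the
    non-zero entries of the frequency matrix.
    """
    counts = {}
    total = 0
    # Assumption: n-grams do not span whitespace, so extract them word by word.
    for word in text.split():
        for i in range(len(word) - n + 1):
            gram = word[i:i + n]
            counts[gram] = counts.get(gram, 0) + 1
            total += 1
    if total == 0:
        return {}
    return dict((gram, c / float(total)) for gram, c in counts.items())


def ngram_similarity(freq_x, freq_y):
    """Compute Sim(X, Y) = 1.0 - Diff(X, Y) / 2.0 from two frequency dicts."""
    diff = 0.0
    # n-grams missing from both dicts contribute 0, so the union suffices.
    for gram in set(freq_x) | set(freq_y):
        diff += abs(freq_x.get(gram, 0.0) - freq_y.get(gram, 0.0))
    return 1.0 - diff / 2.0


fx = ngram_frequencies("abab", 2)  # ab occurs twice, ba once
fy = ngram_frequencies("abba", 2)  # ab, bb, ba occur once each
print(ngram_similarity(fx, fy))    # roughly 0.667
```

Under this definition, identical frequency matrices give a similarity of 1.0 and textfiles with no n-grams in common give 0.0.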
Write and document a Python script tcomp1.py which takes as
command-line arguments a master textfile, a value of n, and two or more comparison
textfiles and prints (1) the similarity of the master textfile and each comparison textfile
and (2) the name of the comparison file that is most similar to the
master textfile.
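One way to handle the command-line arguments and pick the most similar file is sketched below. The helper names parse_args and most_similar are hypothetical, and the argument order (master textfile, then n, then the comparison textfiles) is an assumption based on the order in which they are listed above; check it against the typescript-file.

```python
def parse_args(argv):
    """Split an argv-style list into (master, n, comparison_files)."""
    # argv[0] is the script name, so at least five entries are needed.
    if len(argv) < 5:
        raise SystemExit("usage: tcomp1.py master n file1 file2 [file3 ...]")
    master = argv[1]
    n = int(argv[2])
    if n < 2:
        raise SystemExit("n must be greater than or equal to 2")
    return master, n, argv[3:]


def most_similar(scores):
    """Return the name with the highest score from (name, score) pairs."""
    # On ties, max() returns the first pair with the highest score.
    return max(scores, key=lambda pair: pair[1])[0]


# In a real script, sys.argv would be passed instead of this literal list.
master, n, comparisons = parse_args(
    ["tcomp1.py", "nm1.dat", "2", "nm2.dat", "nm3.dat"])
```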
Your script must work on datafiles
nm1.dat,
nm2.dat,
nm3.dat,
nm4.dat, and
nm5.dat
to produce the output given in typescript-file
tcomp1.script.
Your code must implement the similarity computation using dictionaries to encode the
non-zero entries in n-gram frequency matrices.
You may assume that each given textfile has at least one word, i.e., no given
textfile is composed entirely of whitespace characters.
Question #2
Given two textfiles X and Y, let the similarity score between these
textfiles be defined as Sim(X, Y) = 1.0 - (SD(X, Y) / (nW(X) + nW(Y))), where
nW(X) and nW(Y) are the numbers of words that occur in X and
Y and SD(X, Y) is the total number of words that occur uniquely in
X or Y, i.e., (the number of words that occur in X that do
not occur in Y) + (the number of words that occur in Y that do not occur
in X). Note that nW() does not count the total number of words in
a file, but rather the number of different words that occur in a file.
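A minimal sketch of this score using Python sets is given below. The function names word_set and word_similarity are hypothetical, and the sketch assumes that words are whitespace-separated and compared exactly as written (no case folding or punctuation stripping), which the question does not specify.

```python
def word_set(text):
    """Return the set of different words in text (split on whitespace)."""
    return set(text.split())


def word_similarity(words_x, words_y):
    """Compute Sim(X, Y) = 1.0 - SD(X, Y) / (nW(X) + nW(Y))."""
    # The symmetric difference holds the words that occur in exactly one file.
    sd = len(words_x ^ words_y)
    return 1.0 - sd / float(len(words_x) + len(words_y))


x = word_set("the cat sat on the mat")  # 5 different words
y = word_set("the dog sat on the log")  # 5 different words
print(word_similarity(x, y))            # 4 unique words out of 10: about 0.6
```

Note that the division would fail if both files had no words at all, so a full script may want to guard against that case.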
Write and document a Python script tcomp2.py which takes as
command-line arguments a master textfile and two or more comparison
textfiles and prints (1) the similarity of the master textfile and each comparison textfile
and (2) the name of the comparison file that is most similar to the
master textfile.
Your script must work on datafiles
tc1.dat,
tc2.dat,
tc3.dat,
tc4.dat,
tc5.dat, and
tc6.dat
to produce the output given in typescript-file
tcomp2.script.
Your code must implement the similarity computation using sets.
Question #3
Write and document a Python script index1.py which takes as
command-line arguments the names of an ignored-word file (one word per line),
a text-file, and an index-file, and computes and outputs to the
index-file a sorted index describing the lines on which each word in the
text-file that is not in the ignored-word file occurs in the
text-file.
You may assume that the only non-word characters in a text-file are
apostrophes, quotation marks (single and double), parentheses, exclamation
marks, question marks, colons, semi-colons, commas, and periods.
Your script must work on word-file
word1.txt and text-file
text1.txt to produce index-file
index1.txt as specified in
typescript-file
index1.script.
You may assume that all given files are formatted correctly.
Hints
In Q3, you may find it useful to store words in the index and their
associated line-occurrences as a dictionary of lists.
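The dictionary-of-lists idea might look like the sketch below. The name build_index is hypothetical, and several details are assumptions rather than requirements: non-word characters are replaced by spaces, line numbers start at 1, a line is recorded at most once per word, and words are compared exactly as written. The printed layout is also only illustrative; the required format is the one shown in the typescript-file.

```python
# The non-word characters listed in Question #3.
NON_WORD_CHARS = "'\"()!?:;,."


def build_index(lines, ignored):
    """Map each non-ignored word to the list of line numbers it occurs on."""
    index = {}
    for lineno, line in enumerate(lines, 1):
        # Assumption: replace each non-word character with a space.
        for ch in NON_WORD_CHARS:
            line = line.replace(ch, " ")
        for word in line.split():
            if word in ignored:
                continue
            occurrences = index.setdefault(word, [])
            # Lines are visited in order, so checking the last entry is
            # enough to avoid recording the same line twice.
            if not occurrences or occurrences[-1] != lineno:
                occurrences.append(lineno)
    return index


lines = ["The cat sat.", "A cat, a hat!", "The end"]
index = build_index(lines, set(["The", "A", "a"]))
for word in sorted(index):
    print("%s %s" % (word, " ".join(str(k) for k in index[word])))
```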
Submission
Please e-mail your Python script files to your instructor at harold@mun.ca.
Note that each script file must have a comment block at the top,
filled in with the appropriate information,
followed by a docstring briefly describing the program in that
script. For instance, my script for Question #1 of this assignment would
begin with the following comment block:
#########################################################
## CS 4750 (Fall 2014), Assignment #1, Question #1 ##
## Script File Name: tcomp1.py ##
## Student Name: Todd Wareham ##
## Login Name: harold ##
## MUN #: 8008765 ##
#########################################################
You do not have to develop your code on our CS departmental systems.
However, as your code will be run and tested on our CS departmental
systems as part of the assignment marking process,
you should ensure that your code runs correctly on at
least one of these systems.
- September 17, 1:35pm
Assignment #1 due date revised to noon, Monday, September 22.
- August 29, 1:45pm
Assignment #1 posted.
Created: August 29, 2014
Last Modified: September 17, 2014