Assignment 1

Due: 12:00 PM on Monday, September 22, 2014

Question #1

Sim(X, Y) = 1.0 - (Diff(X, Y)/2.0)

Diff(X,Y)

Write and document a Python script tcomp1.py which takes as command-line arguments a master textfile, a value of n, and two or more comparison textfiles and prints (1) the similarity of the master textfile and each comparison textfile and (2) the name of the comparison file that is most similar to the master textfile. Your script must work on datafiles nm1.dat, nm2.dat, nm3.dat, nm4.dat, and nm5.dat to produce the output given in typescript-file tcomp1.script. Your code must implement the similarity computation using dictionaries to encode the non-zero entries in n-gram frequency matrices. You may assume that each given textfile has at least one word, i.e., no given textfile is composed entirely of whitespace characters.
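As a rough illustration of the dictionary-based approach, the sketch below stores only the non-zero n-gram entries and computes Sim(X, Y) from them. It rests on assumptions not stated above: that Diff(X, Y) is the sum, over all n-grams, of the absolute difference between the n-gram's relative frequency in X and in Y (so that it ranges from 0 to 2), and that n-grams are taken over whitespace-separated tokens. The function names ngram_freqs and similarity are hypothetical, and this is not the reference solution.

```python
from collections import Counter


def ngram_freqs(tokens, n):
    """Map each n-gram (a tuple of n consecutive tokens) to its relative frequency.

    Only n-grams that actually occur are stored, so the dictionary holds
    just the non-zero entries of the frequency matrix.
    """
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    return {gram: count / total for gram, count in Counter(grams).items()}


def similarity(freq_x, freq_y):
    """Return 1.0 - Diff/2.0, with Diff ASSUMED to be the sum of absolute
    differences between relative frequencies (not confirmed by the handout)."""
    all_grams = set(freq_x) | set(freq_y)
    diff = sum(abs(freq_x.get(g, 0.0) - freq_y.get(g, 0.0)) for g in all_grams)
    return 1.0 - diff / 2.0


if __name__ == "__main__":
    master = ngram_freqs("a b".split(), 1)
    other = ngram_freqs("a c".split(), 1)
    print(similarity(master, other))
```

Under these assumptions, identical texts score 1.0 and texts sharing no n-grams score 0.0; the most similar comparison file is then simply the one with the largest score.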

Question #2

Sim(X, Y) = 1.0 - (SD(X, Y) / (nW(X) + nW(Y)))

nW(X)

nW(Y)

SD(X, Y)

i.e.

nW()

Write and document a Python script tcomp2.py which takes as command-line arguments a master textfile and two or more comparison textfiles and prints (1) the similarity of the master textfile and each comparison textfile and (2) the name of the comparison file that is most similar to the master textfile. Your script must work on datafiles tc1.dat, tc2.dat, tc3.dat, tc4.dat, tc5.dat, and tc6.dat to produce the output given in typescript-file tcomp2.script. Your code must implement the similarity computation using sets.
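A minimal sketch of the set-based computation follows. It assumes, without confirmation from the text above, that nW(X) is the number of distinct words in X and that SD(X, Y) is the size of the symmetric difference of the two word sets (the words that appear in exactly one of the files). The names word_set and set_similarity are hypothetical.

```python
def word_set(text):
    """Return the set of distinct whitespace-separated words in text."""
    return set(text.split())


def set_similarity(words_x, words_y):
    """Return 1.0 - SD / (nW(X) + nW(Y)).

    ASSUMPTIONS: SD is the size of the symmetric difference of the word
    sets, and nW counts distinct words. Both sets are taken to be non-empty.
    """
    sd = len(words_x ^ words_y)  # words in exactly one of the two sets
    return 1.0 - sd / (len(words_x) + len(words_y))


if __name__ == "__main__":
    x = word_set("the cat sat")
    y = word_set("the dog sat")
    print(set_similarity(x, y))
```

With this reading, two files with the same vocabulary score 1.0 and two files with no words in common score 0.0.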

Question #3

index1.py

not

Hints

In Q3, you may find it useful to store words in the index and their associated line-occurrences as a dictionary of lists.
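The dictionary-of-lists idea in this hint might look like the sketch below. It assumes that the index is built for a given list of words, that lines are numbered from 1, and that a word occurring several times on one line is recorded once for that line; none of these details is stated above, and the name build_index is hypothetical.

```python
def build_index(lines, words):
    """Map each word in words to the list of line numbers on which it occurs.

    ASSUMPTIONS: line numbers start at 1, matching is exact and
    case-sensitive, and each line is recorded at most once per word.
    """
    index = {word: [] for word in words}
    for line_number, line in enumerate(lines, start=1):
        for word in set(line.split()):  # set() avoids duplicate entries per line
            if word in index:
                index[word].append(line_number)
    return index


if __name__ == "__main__":
    text_lines = ["the cat", "a dog", "the dog"]
    print(build_index(text_lines, ["the", "dog"]))
```

Because the lines are scanned in order, each list of line numbers comes out already sorted.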

Submission

harold@mun.ca

comment block

#########################################################
##  CS 4750 (Fall 2014), Assignment #1, Question #1    ##
##   Script File Name: tcomp1.py                       ##
##       Student Name: Todd Wareham                    ##
##         Login Name: harold                          ##
##              MUN #: 8008765                         ##
#########################################################

assignment marking process

Additional Notes:

September 17, 1:35pm: Assignment #1 due date revised to noon, Monday, September 22.
August 29, 1:45pm: Assignment #1 posted.