aa 0/7   ab 2/7   ba 3/7   bb 2/7

and the 3-gram frequency vector is
aaa 0/4   aab 0/4   aba 2/4   abb 0/4   baa 0/4   bab 1/4   bba 1/4   bbb 0/4

Given two textfiles X and Y, let the similarity score between these textfiles be defined as Sim(X, Y) = 1.0 - (Diff(X, Y) / 2.0), where Diff(X, Y) is the sum, over all possible n-grams, of the absolute value of the difference between the frequency of occurrence of that n-gram in X and its frequency of occurrence in Y.
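The Sim and Diff definitions can be illustrated with a short Python sketch. This is only one possible reading, not a reference solution: the function name similarity and the second frequency table y are hypothetical, and each table is assumed to be a dictionary holding only the non-zero n-gram frequencies of one textfile.

```python
def similarity(freq_x, freq_y):
    """Return Sim(X, Y) = 1.0 - Diff(X, Y) / 2.0 for two sparse frequency dicts."""
    # An n-gram missing from both dicts has frequency 0 in both files and
    # adds nothing to Diff, so only the union of the keys needs visiting.
    ngrams = set(freq_x) | set(freq_y)
    diff = sum(abs(freq_x.get(g, 0.0) - freq_y.get(g, 0.0)) for g in ngrams)
    return 1.0 - diff / 2.0


# 2-gram frequencies from the example above (the zero entry for "aa" is omitted).
x = {"ab": 2 / 7, "ba": 3 / 7, "bb": 2 / 7}
# Hypothetical frequencies for a second file, made up for this illustration.
y = {"ab": 1 / 2, "bb": 1 / 2}

print(similarity(x, x))
print(round(similarity(x, y), 4))
```

Because each frequency vector sums to 1, Diff lies between 0 and 2, so Sim lies between 0 (no n-gram in common) and 1 (identical frequency vectors).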
Write and document a Python script tcomp1.py which takes as command-line arguments a master textfile, a value of n, and two or more comparison textfiles and prints (1) the similarity of the master textfile and each comparison textfile and (2) the name of the comparison file that is most similar to the master textfile. Your script must work on datafiles nm1.dat, nm2.dat, nm3.dat, nm4.dat, and nm5.dat to produce the output given in typescript-file tcomp1.script. Your code must implement the similarity computation using dictionaries to encode the non-zero entries in n-gram frequency matrices. You may assume that each given textfile has at least one word, i.e., no given textfile is composed entirely of whitespace characters.
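As a rough illustration of the dictionary encoding mentioned above, the sketch below builds a table of non-zero n-gram frequencies from a string. It rests on an assumption not stated in this section: that an n-gram is a run of n consecutive characters inside a single whitespace-delimited word. The function name ngram_frequencies and the sample text are hypothetical; the sample was chosen because, under this assumption, it gives the same non-zero 2-gram and 3-gram frequencies as the example vectors.

```python
from collections import Counter


def ngram_frequencies(text, n):
    """Map each n-gram that occurs in `text` to its relative frequency."""
    counts = Counter()
    # ASSUMPTION: n-grams are taken within each word, never across whitespace.
    for word in text.split():
        # Words shorter than n give an empty range and contribute nothing.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    total = sum(counts.values())
    # If every word is shorter than n there are no n-grams at all.
    if total == 0:
        return {}
    # Only n-grams that actually occur are stored; absent keys mean frequency 0.
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical sample text, not one of the assignment datafiles.
sample = "ababa bba bb"
print(ngram_frequencies(sample, 2))
print(ngram_frequencies(sample, 3))
```

Storing only the n-grams that occur keeps the table small even when the number of possible n-grams is large, since looking up an absent key can simply be treated as frequency 0.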
Write and document a Python script tcomp2.py which takes as command-line arguments a master textfile and two or more comparison textfiles and prints (1) the similarity of the master textfile and each comparison textfile and (2) the name of the comparison file that is most similar to the master textfile. Your script must work on datafiles tc1.dat, tc2.dat, tc3.dat, tc4.dat, tc5.dat, and tc6.dat to produce the output given in typescript-file tcomp2.script. Your code must implement the similarity computation using sets.
#########################################################
## CS 4750 (Fall 2024), Assignment #1, Question #1     ##
## Script File Name: tcomp1.py                         ##
## Student Name: Todd Wareham                          ##
## Login Name: harold                                  ##
## MUN #: 8008765                                      ##
#########################################################

You do not have to develop your code on our CS departmental systems. However, as your code will be compiled and tested on our CS departmental systems as part of the assignment marking process, you should ensure that your code compiles and runs correctly on at least one of these systems.
Created: June 30, 2024
Last Modified: June 30, 2024