Assignment 2
Due: 11:59 PM via Brightspace on Tuesday, October 25, 2022
Program and test a Python-based system for creating and using finite-state
transducers by performing the following tasks:
- Write the following Python functions:
- readFST(filename): Read in a description of a FST stored
in file filename and return a stored version of that
FST.
The format of a FST description file is as follows: The first
line gives the number of states in the FST and a string
containing all characters in the lower and upper strings
encoded in the FST. The special symbol "-" will be used to
denote the symbol epsilon. This is followed by one or more
groups of lines, one per state, which encode the transitions going
outwards from that state. In each such group, the first line
gives the state number and an character indicating whether
that state is ("F") or is not ("N") final and each subsequent line
specifies a transition by the lower and upper symbols in that
transition and the number of the state towards which that
transition is directed. The states in an n-state FST will
be numbered from 1 to n and state 1 is assumed to be the
start state. Two example FST descriptions for FST over the
alphabet {a,b,c} that (1) transform all b's following the leftmost a
into c's and (2) delete all b's except the leftmost one are given in
files test1.fst and
test2.fst, respectively.
- composeFST(F1,F2): Given two stored versions F1 and
F2, create and return the FST that is F1
composed with F2.
- reconstructUpper(l, F): Print the set of upper strings
associated with lower string l by FST F.
- reconstructLower(u, F): Print the set of lower strings
associated with upper string u by FST F.
- Using the functions described above, write and document a Python script
reconstruct.py which performs different functions depending
on the following command-line argument-sets:
- reconstruct.py surface wlf F1 F2 ... Fn: Print the
surface (upper) forms associated with each of the lexical (lower) forms
in lexical-form file wlf relative to the FST
created by composing the FST described in files F1
through Fn in that order. It is possible that only
a single FST may be specified.
- reconstruct.py lexical wsf F1 F2 ... Fn: Print the
lexical (lower) forms associated with each of the surface (upper) forms
in surface-form file wsf relative to the FST
created by composing the FST described in files F1
through Fn in that order. It is possible that only
a single FST may be specified.
Prior to printing any reconstructed word-forms, each of these
commands must print the number of states and the number of transitions
in the composed FST (which may or may not include a composed lexicon
FST, depending on the command executed) used to reconstruct word-forms.
Moreover, for each given form in a lexical- or surface-form file, that
form must be printed immediately before the list of reconstructed forms
associated with that form.
- Create description files for the following FST:
- vcePlu.fst: Change a plural morpheme [s] to [z] if the
final sound in a word is voiced, e.g., [podPs] becomes
[podPz].
- addVowPlu.fst Add a vowel [e] immediately before the
final plural morpheme if the final sound in the word is
a fricative, e.g., [diSPs] becomes [diSPes].
These FST, when combined, simulate the FST in Example #4 in Lecture #10;
note, however, that the two FST above (unlike the FST in Example #4 in
Lecture #10) pass on lexical (lower) forms that do not contain the plural marker P
and do not add or delete plural markers.
- Using the Python script reconstruct.py described above, perform
the following tests relative to your FST description files
vcePlu.fst and addVowPlu.fst and given lexical-form
file word.lex, surface-form file
word.srf, and FST description
files delPlu1.fst and
delPlu2.fst:
- reconstruct.py surface word.lex vcePlu.fst
- reconstruct.py lexical word.srf vcePlu.fst
- reconstruct.py surface word.lex addVowPlu.fst
- reconstruct.py lexical word.srf addVowPlu.fst
- reconstruct.py surface word.lex vcePlu.fst addVowPlu.fst
- reconstruct.py lexical word.srf vcePlu.fst addVowPlu.fst
- reconstruct.py surface word.lex vcePlu.fst addVowPlu.fst delPlu1.fst
- reconstruct.py lexical word.srf vcePlu.fst addVowPlu.fst delPlu1.fst
- reconstruct.py surface word.lex vcePlu.fst addVowPlu.fst delPlu2.fst
- reconstruct.py lexical word.srf vcePlu.fst addVowPlu.fst delPlu2.fst
Your script must work on the requested tests and provided datafiles
to produce the output given in typescript-file
reconstruct.script.
Your Python code is only allowed to use language-internal features and
the language-standard packages "io" and "sys". Any use of packages such
as "nltk" or "openFST" written in Python or any other language that
gives NLP, FSA, Trie, and/or FST functionality will result in an
assignment mark of zero.
Hints
You may find it useful to store an FST as a dictionary of dictionaries,
where each sub-dictionary encodes the transitions outward from a
particular state. In the sub-dictionaries, as there may be multiple
transitions outwards from a state labelled with the same l/u symbol-pair,
consider having lists of state-designators rather than single state-designators
as the values associated with an l/u symbol-pair key.
You may find it useful to implement the required functions composeFST,
reconstructUpper, and reconstructLower
using algorithms given in Lectures #12 and 13.
Relative to the symbols present in the word-forms in files word.lex and
word.srf, symbol P is the plural marker, symbols [p], [t], [s], and [S]
are voiceless sounds (with [S] corresponding to [sh] in the ARPAbet), symbols
[b], [d], [e], [o], [z], and [Z] are voiced sounds (with [Z] corresponding to
[zh] in the ARPAbet), and symbols [s], [z], [S], and [Z] are fricatives.
Submission
Please submit copies of all of your Python code electronically (including
your FST description files vcePlu.fst and addVowPlu.fst) along
with a textscript file (created using Linux command "script") giving the
results of the ten requested tests in
one zip file through the course D2L / Brightspace
shell by 11:59pm on the assignment due date.
Submission of a script file
whose results cannot be replicated by the associated submitted Python
code will result in an assignment mark of zero.
Note that each Python script file must have the following
comment block at the top, where the X's are replaced
with the appropriate information, followed by a docstring briefly describing
the program in that script. For instance, my reconstruct.py script for
this assignment would begin with the following comment block:
#########################################################
## CS 4750 (Fall 2022), Assignment #2 ##
## Script File Name: reconstruct.py ##
## Student Name: Todd Wareham ##
## Login Name: harold ##
## MUN #: 8008765 ##
#########################################################
You do not have to develop your code on our CS departmental systems.
However, as your code will be compiled and tested on our CS departmental
systems as part of the assignment marking process,
you should ensure that your code compiles and runs correctly on at
least one of these systems.
- October 18, 1:00pm
Assignment #2 due date shifted to Tuesday, October 25.
- October 5, 10:30am
Assignment #2 due date shifted to Thursday, October 20.
- August 1, 1:30pm
Assignment #2 posted.
Created: August 1, 2022
Last Modified: October 19, 2022