Assignment 2
Due: 12:00 PM on Friday, November 7, 2014
Program and test a Python-based system for creating and using lexical
transducers by performing the following tasks:
- Write the following Python functions:
- readFST(filename): Read in a description of a FST stored
in file filename and return a stored version of that
FST.
The format of a FST description file is as follows: The first
line gives the number of states in the FST and a string
containing all characters in the lower and upper strings
encoded in the FST. The special symbol "-" will be used to
denote the symbol epsilon. This is followed by one or more
groups of lines, one per state, which encode the transitions going
outwards from that state. In each such group, the first line
gives the state number and an character indicating whether
that state is ("F") or is not ("N") final and each subsequent line
specifies a transition by the lower and upper symbols in that
transition and the number of the state towards which that
transition is directed. The states in an n-state FST will
be numbered from 1 to n and state 1 is assumed to be the
start state. Two example FST descriptions for FST over the
alphabet {a,b,c} that (1) transform all b's following the leftmost a
into c's and (2) delete all b's except the leftmost one are given in
files test1.fst and
test2.fst, respectively.
- composeFST(F1,F2): Given two stored versions F1 and
F2, create and return the FST that is F1
composed with F2.
- createTrieFST(filename): Given a list of lexical forms
stored in file filename, create and return a stored
version of the lexicon FST encoding this list of words (recall
that a lexicon FST is simply the identity FST associated with
a lexicon DFA).
The format of a lexical-form file is as follows: The first line
is a string containing all of the symbols used in the lexical forms
in the file, followed by the lexical forms (one per line). A sample
lexical-form file is given in file word.lex.
- reconstructUpper(l, F): Print the set of upper strings
associated with lower string l by FST F.
- reconstructLower(u, F): Print the set of lower strings
associated with upper string u by FST F.
- Using the functions described above, write and document a Python script
reconstruct.py which performs different functions depending
on the following command-line argument-sets:
- reconstruct.py surface wlf F1 F2 ... Fn: Print the
surface forms associated with each of the lexical forms
in lexical-form file wlf relative to the FST
created by composing the FST described in files F1
through Fn in that order. It is possible that only
a single FST may be specified.
- reconstruct.py lexical wsf F1 F2 ... Fn: Print the
lexical forms associated with each of the surface forms
in surface-form file wsf relative to the FST
created by composing the FST described in files F1
through Fn in that order. It is possible that only
a single FST may be specified.
- reconstruct.py surface wlf1 lex wlf2 F1 F2 ... Fn: Print the
surface forms associated with each of the lexical forms in
lexical-form file wlf1 relative to the lexical transducer
created by composing the trie-FST encoding the lexical forms in
lexical-form file wlf2 with the FST described in files
F1 through Fn in that order. It is possible that only
a single FST may be specified.
- reconstruct.py lexical wsf lex wlf F1 F2 ... Fn: Print the
lexical forms associated with each of the surface forms in
surface-form file wsf relative to the lexical transducer
created by composing the trie-FST encoding the lexical forms in
lexical-form file wlf with the FST described in files
F1 through Fn in that order. It is possible that only
a single FST may be specified.
Prior to printing any reconstructed word-forms, each of these
commands must print the number of states and the number of transitions
in the composed FST (which may or may not include a composed lexicon
FST, depending on the command executed) used to reconstruct word-forms.
Moreover, for each given form in a lexical- or surface-form file, that
form must be printed immediately before the list of reconstructed forms
associated with that form.
- Create FST description files vcePlu.fst and
addVowPlu.fst encoding modified versions of the FST
described in Example #3 of Lecture #15 which additionally
pass on lexical forms that do not contain the plural marker P.
Note that neither of these FST can add or delete plural markers.
- Using the Python script reconstruct.py described above, perform
the following tests relative to your FST description files
vcePlu.fst and addVowPlu.fst and given lexical-form
file word.lex, surface-form file
word.srf, and FST description
files delPlu1.fst and
delPlu2.fst:
- reconstruct.py surface word.lex vcePlu.fst
- reconstruct.py lexical word.srf vcePlu.fst
- reconstruct.py surface word.lex addVowPlu.fst
- reconstruct.py lexical word.srf addVowPlu.fst
- reconstruct.py surface word.lex vcePlu.fst addVowPlu.fst
- reconstruct.py lexical word.srf vcePlu.fst addVowPlu.fst
- reconstruct.py surface word.lex vcePlu.fst addVowPlu.fst delPlu1.fst
- reconstruct.py lexical word.srf vcePlu.fst addVowPlu.fst delPlu1.fst
- reconstruct.py surface word.lex vcePlu.fst addVowPlu.fst delPlu2.fst
- reconstruct.py lexical word.srf vcePlu.fst addVowPlu.fst delPlu2.fst
- reconstruct.py surface word.lex lex word.lex vcePlu.fst addVowPlu.fst delPlu1.fst
- reconstruct.py lexical word.srf lex word.lex vcePlu.fst addVowPlu.fst delPlu1.fst
- reconstruct.py surface word.lex lex word.lex vcePlu.fst addVowPlu.fst delPlu2.fst
- reconstruct.py lexical word.srf lex word.lex vcePlu.fst addVowPlu.fst delPlu2.fst
Your Python code is only allowed to use language-internal features and
the language-standard packages "io" and "sys". Any use of packages such
as "nltk" or "openFST" written in Python or any other language that
gives NLP, FSA, Trie, and/or FST functionality will result in an
assignment mark of zero.
Hints
You may find it useful to store an FST as a dictionary of dictionaries,
where each sub-dictionary encodes the transitions outward from a
particular state.
You may find it useful to implement the required functions composeFST,
createTrieFST, reconstructUpper, and reconstructLower
using algorithms given in Lectures #13 and 14.
Relative to the symbols present in the word-forms in files word.lex and
word.srf, symbol P is the plural marker, symbols [p], [t], [s], and [S]
are voiceless sounds (with [S] corresponding to [sh] in the ARPAbet) and symbols
[b], [d], [e], [o], [z], and [Z] are voiced sounds (with [Z] corresponding to [zh]
in the ARPAbet),
Submission
Please submit copies of all of your Python code electronically, along
with a textscript file (created using Linux command "script") giving the
results of the twelve requested tests, by e-mailing these files (preferavbly in one e-mail message)
to your instructor at harold@mun.ca.
Submission of a script file
whose results cannot be replicated by the associated submitted Python
code will result in an assignment mark of zero.
Note that each Python script file must have the following
comment block at the top, where the X's are replaced
with the appropriate information, followed by a docstring briefly describing
the program in that script. For instance, my reconstruct.py script for
this assignment would begin with the following comment block:
#########################################################
## CS 4750 (Fall 2014), Assignment #2 ##
## Script File Name: reconstruct.py ##
## Student Name: Todd Wareham ##
## Login Name: harold ##
## MUN #: 8008765 ##
#########################################################
You do not have to develop your code on our CS departmental systems.
However, as your code will be compiled and tested on our CS departmental
systems as part of the assignment marking process,
you should ensure that your code compiles and runs correctly on at
least one of these systems.
- November 6, 9:45am
Fixed new (and hopefully final) bug in Assignment #2 in delPLu2.fst.
- November 3, 9:55am
Fixed new bugs in Assignment #2 in delPlu1.fst, delPLu2.fst, and word.lex.
- October 24, 10:45am
Fixed bugs in Assignment #2 in original given files delPlu1.fst and delPLu2.fst; added
complete instructions for electronic submission of Assignment #2.
- October 22, 1:15am
Assignment #2 due date changed to Wednesday, November 5.
- October 6, 10:10am
Assignment #2 posted.
Created: October 6, 2014
Last Modified: November 6, 2014