Artificial Intelligence 6001, Winter '26
Course Diary (Wareham)
Copyright 2026 by H.T. Wareham
All rights reserved
Wednesday, January 28 (Lecture #1)
(FS)
[Class Notes]
- Introduction: What is Natural Language?
- A natural language is any language (be it spoken,
signed, or written) used by human beings to communicate
with each other, cf. artificial languages used to program
computers.
- There are ~7000 natural languages, with ~20 of these languages,
e.g., Mandarin, Spanish, English, Hindi, and Arabic, being
spoken by more than 1% of the world's population.
- Example: The natural languages spoken by students in this course.
- Human language allows a complexity of expression and meaning
that has to date not been seen in linguistic behaviour
in animals, e.g., bird calls, primate vocalizations, bee
dances, which are more akin to lists of codewords
with specific meanings that are not combined, e.g.,
"danger", "(this way to) food".
- Gorilla ASL experiments suggest that it is not just
the lack of an appropriate vocal apparatus -- the
ability to combine words to form arbitrarily complex
expressions is simply not present in animals.
- Talking animals are for now firmly in the
domain of fiction (and comedy).
- Example: Not the Nine O'Clock News (1980): Gerald the Gorilla
(Link)
- The ability to acquire and effectively communicate with a natural language
is one of the major hallmarks of human intelligence and sentience;
those individuals who have for whatever reasons failed to develop
this ability are often considered subhuman, e.g.,
feral children (the Wild Boy of Aveyron), the deaf (Children of a
Lesser God, Helen Keller).
- Example: The Miracle Worker (1961) [excerpt]
- The ability to acquire and effectively communicate with a natural
language is thus one of the major hallmarks of true artificial intelligence.
- Introduction: What is Natural Language Processing (NLP)?
- NLP is the subfield of Artificial Intelligence (AI)
concerned with the computations underlying
the processing (recognition, generation, and
acquisition) of natural human language (be it
spoken, signed, or written).
- NLP is distinguished from (though closely related to)
the processing of artificial languages, e.g.,
computer languages, formal languages (regular,
context-free, context-sensitive, etc).
- NLP emerged in the early 1950's with the first
commercial computers; originally focused on
machine translation, but subsequently broadened
to include all natural-language related tasks
(J&M, Section 1.6; Kay (2003); see also BKL,
Section 1.5 and Afterword).
- Two flavors of NLP: strong and narrow
- The focus of Strong NLP is on discovering
and implementing human-level language abilities
using approximately
the same mechanisms as human beings when
they communicate; as such, it is more
closely allied with classical linguistics
(and hence often goes by the name
"computational linguistics").
- The focus of Narrow NLP is on giving computers
human-level language abilities by whatever
means possible; as such, it is more closely
allied with AI (and is often referred to
by either "NLP" or more specific terms like "speech
processing").
- Narrow NLP is nonetheless influenced by Strong
NLP, and vice versa (to the extent that the
mechanisms proposed in Narrow NLP to get systems
up and running help to revise and make more
plausible the theories underlying Strong NLP).
- In this module, we will look at both types of
NLP, and distinguish them as necessary.
- Potential exam questions
- The Characteristics of Natural Language: Why Bother?
- Natural language is the object of study of linguistics.
Looking at what linguists have to say about the
characteristics of natural language is very valuable in
NLP for two reasons:
- Linguists have studied many languages and hence have
described and proposed mechanisms for handling
phenomena that may lie outside those handled by existing
NLP systems (many of which have been developed
to only handle English).
- Linguists have studied the full range of natural language
processes, from sound perception to meaning, in
some cases for centuries (~2500 and 1300 years,
respectively, in the
case of Sanskrit and Arabic linguists), and have had
insights that may be invaluable in mitigating known
problems with existing NLP systems, e.g., Marcus and
Davis (2019, 2021).
- The Characteristics of Natural Language: Overview
- General model of language processing (adapted from
BKL, Figure 1.5):
- "A LangAct" is an utterance in spoken or
signed communication or a sentence in
written communication.
- The topmost left-to-right tier encodes
language recognition and the right-to-left
tier immediately below encodes language
production.
- The knowledge required to perform each step
(all of which must be acquired by a human
individual to use language) is the third tier,
and the lowest is the area
or areas of linguistics devoted to each
type of knowledge.
- Three language-related processes are thus encoded
in this diagram: recognition and production
(explicitly) and acquisition (implicitly).
- Whether one is modeling actual human language or implementing
language abilities in machines, the mechanisms underlying
each of language recognition, production, and acquisition
must be able to both (1) handle actual human language inputs and
outputs and (2) operate in a computationally efficient manner.
- As we will see over the course of this module, the
need to satisfy both of these criteria is one of the
major explanations for how research has proceeded
(and diverged) in Strong and Narrow NLP.
- The Characteristics of Natural Language: Phonetics (J&M, Section 7)
- Phonetics is the study of the actual sounds produced in
human languages (phones), both from the
perspectives of
how they are produced by the human vocal
apparatus (articulatory phonetics) and
how they are characterized as sound waves
(acoustic phonetics).
- Acoustic phonetics is particularly important
in artificial speech analysis and synthesis.
- In many languages, the way words are written does
not correspond to the way sounds are pronounced
in those words, e.g.,
variation in pronouncing [t] (J&M, Table 7.9):
- aspirated (in initial position),
e.g., "toucan"
- unaspirated (after [s]),
e.g., "starfish"
- glottal stop (after vowel and before [n]),
e.g., "kitten"
- tap (between vowels),
e.g., "butter"
- The most common way of representing phones is to use
special sound-alphabets,
e.g., International Phonetic Alphabet
(IPA) [charts],
ARPAbet [Wikipedia]
(J&M, Table 7.1).
- Each language has its own characteristic repertoire of
sounds, and no natural language (let alone English)
uses the full range of possible sounds.
- Example: Xhosa tongue twisters (clicks)
- Even within a language, sounds can vary depending on their context, due to
constraints on the way articulators can move in
the vocal tract when progressing between adjacent
sounds (both within and between words) in an utterance.
- word-final [t]/[d] palatalization (J&M, Table 7.10),
e.g., "set your", "not yet", "did you"
- Word-final [t]/[d] deletion (J&M, Table 7.10)
e.g., "find him", "and we", "draft the"
- Variation in sounds can also depend on a variety of
other factors, e.g., rate of speech, word
frequency, speaker's state of mind / gender /
class / geographical location (J&M, Section 7.3.3).
- As if all this wasn't bad enough, it is often very
difficult to isolate individual sounds and words
from acoustic speech signals.
Friday, January 30 (Lecture #2)
(FS)
[Class Notes]
- The Characteristics of Natural Language: Phonology (Bird (2003))
(Youtube)
- Phonology is the study of the systematic and
allowable ways in which sounds are realized
and can occur together in human languages.
- Each language has its own set of
semantically-indistinguishable
sound-variants, e.g., [pat] and [p^hat] are the
same word in English but different words in Hindi.
These semantically-indistinguishable variants of a sound
in a language are grouped together as phonemes.
- Variation in how phonemes are realized
as phones is often systematic, e.g., formation
of plurals in English:
| mop | mops | [s] |
| pot | pots | [s] |
| pick | picks | [s] |
| kiss | kisses | [(e)s] |
| mob | mobs | [z] |
| pod | pods | [z] |
| pig | pigs | [z] |
| pita | pitas | [z] |
| razz | razzes | [(e)z] |
Note that the phonetic form of the plural-morpheme /s/ is
a function of the last sound (and in particular, the
voicing of the last sound) in the word being pluralized.
A similar voicing of /s/ often (but not always) occurs
between vowels, e.g., "Stasi" vs. "Streisand".
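- To make the voicing-based systematicity above concrete, here is a
minimal Python sketch of the plural rule; the phone classes
(SIBILANTS, VOICED) are simplified assumptions, not a full
phonological analysis.

      # Surface form of the plural morpheme /s/ as a function of the
      # final phone of the noun (following the table above).
      SIBILANTS = {"s", "z", "sh", "zh", "ch", "j"}
      VOICED = {"b", "d", "g", "v", "z", "zh", "j", "m", "n", "ng",
                "l", "r", "w", "y", "a", "e", "i", "o", "u"}

      def plural_allomorph(final_phone):
          voiced = final_phone in VOICED
          if final_phone in SIBILANTS:
              return "(e)z" if voiced else "(e)s"   # razzes vs. kisses
          return "z" if voiced else "s"             # mobs vs. mops

      print(plural_allomorph("p"))   # s    (mop -> mops)
      print(plural_allomorph("b"))   # z    (mob -> mobs)
      print(plural_allomorph("s"))   # (e)s (kiss -> kisses)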
- Such systematicity may involve non-adjacent sounds,
e.g., vowel harmony in the formation of plurals in Turkish (Kenstowicz (1994),
p. 25):
| dal | dallar | "branch" |
| kol | kollar | "arm" |
| kul | kullar | "slave" |
| yel | yeller | "wind" |
| dis | disler | "tooth" |
| gül | güller | "rose" |
The form of the vowel in the plural-morpheme /lar/ is
a function of the vowel in the word being pluralized.
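- A minimal Python sketch of this vowel-harmony alternation; the
two-way back/front classification of the vowels below is a
simplifying assumption.

      # The plural suffix vowel harmonizes with the last vowel of the
      # stem: -lar after a back vowel, -ler after a front vowel.
      BACK_VOWELS = set("aıou")
      FRONT_VOWELS = set("eiöü")

      def pluralize(stem):
          last_vowel = next(c for c in reversed(stem)
                            if c in BACK_VOWELS | FRONT_VOWELS)
          return stem + ("lar" if last_vowel in BACK_VOWELS else "ler")

      for stem in ["dal", "kol", "kul", "yel", "dis", "gül"]:
          print(stem, "->", pluralize(stem))
      # dal -> dallar, ..., yel -> yeller, dis -> disler, gül -> güller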
- It is tempting to think that such variation is encoded
in the lexicon, such that each possible form of a word
and its pronunciation are stored. However, such
variation is typically productive wrt new words (e.g.,
nonsense words, loanwords), which suggests that variation
is instead the result of processes that transform abstract
underlying lexical forms to concrete surface
pronounced forms.
- Plural of nonsense words in English, e.g.,
blicket / blickets, farg / fargs, klis / klisses.
- Vowel harmony over unbounded morpheme concatenation
in Turkish (Sproat (1992), p. 44):
c"opl"uklerimizdekilerdenmiydi =>
c"op + l"uk + ler + imiz + de + ki
+ ler + den + mi + y + di
``was it from those that were in our garbage cans?''
In Turkish, complete utterances can consist of a single
word in which the subject of the utterance is
a root-morpheme (in this case, [çöp], "garbage")
and all other (usually syntactic, in languages
like English) relations are indicated by
suffix-morphemes. As noted above, the vowels
in the suffix morphemes are all subject to
vowel harmony. Given the in-principle unbounded
number of possible word-utterances in Turkish,
it is impossible to store them (let alone their
versions as modified by vowel harmony) in a
lexicon.
- Variation in English [r] according to
Cantonese phonology in Chinenglish
(Demers and Farmer (1991), Exercise 3.10):
- [r] deletion (before consonant or final
sound in word), e.g., "tart", "party",
"guitar", "bar", "sergeant" (see also Boston English)
- [r] pronounced as [l] (before a vowel), e.g.,
"strawberry", "brandy", "aspirin", "Listerine",
"curry".
These rules can also hold when a Cantonese speaker
writes English text, e.g., "Phone Let Kloss."
- Modification of English loanwords in Japanese
(Lovins (1973)):
| dosutoru | "duster" |
| sutoroberri | "strawberry" |
| kurippaa | "clippers" |
| sutoraiki | "strike" |
| katsuretsu | "cutlet" |
| parusu | "pulse" |
| gurafu | "graph" |
Japanese has a single phoneme /r/ for the
liquid-phones [l] and [r]; moreover, it also
allows only very restricted types of
multi-consonant clusters. Hence, when words that
violate these constraints are borrowed from another
language, those words are changed by modifying
[l] to [r] (or deleting it) and by inserting
vowels to break up invalid multi-consonant clusters.
- Each language has its own phonology, which consists of
the phonemes of that language, constraints on
allowable sequences of phonemes (phonotactics),
and descriptions of how phonemes are instantiated as
phones in particular contexts.
The systematicities
within a language's phonology must be consistent with
(but are by no means limited to those dictated solely
by) vocal-tract articulator physics, e.g.,
consonant clusters in Japanese.
- Courtesy of underlying phonological representations
and processes, one is never sure if an observed
phonetic representation corresponds directly to the
underlying representation or is some process-mediated
modification of that representation, e.g.,
we are never sure if an observed phone corresponds to
an "identical" phoneme in the underlying representation.
This introduces one type of ambiguity into natural language
processing.
- The Characteristics of Natural Language: Morphology
(Trost (2003); Sproat (1992), Chapter 2)
- Morphology is the study of the systematic and
allowable ways in which symbol-string/meaning pieces
are organized into words in human languages.
- A symbol-string/meaning-piece is
called a morpheme.
- Two broad classes of morphemes: roots and affixes.
- A root is a morpheme that can exist alone as word or
is the meaning-core of a word, e.g., tree (noun),
walk (verb), bright (adjective).
- An affix is a morpheme that cannot exist independently
as a word, and only appears in language as part of a word,
e.g., -s (plural; 3rd-person singular), -ed (past tense),
-ness (nominalizer).
- A word is essentially a root combined with zero or more affixes.
Depending on the type of root, the affixes perform particular
functions, e.g., affixes mark plurals in nouns and
subject number and tense in verbs in English.
- Morphemes are language-specific and are stored in a language's
lexicon. The morphology of a language consists
of a lexicon and a specification of how morphemes are combined
to form words (morphotactics).
- Morpheme order typically matters, e.g.,
uncommonly, commonunly*, unlycommon* (English)
- There are a number of ways in which roots and affixes can be
combined in human languages (Trost (2003), Sections 2.4.2
and 2.4.3):
- Prefix: An affix attached to the front of the
root, e.g., negative markers such as un-, in-, and
im- for adjectives in English (uncommon, infeasible,
immature).
- Suffix: An affix attached to the back of the
root, e.g., the plural marker -s for
nouns in English (pots, pods, dishes).
- Circumfix: A prefix-suffix pair that must both
attach to the root, e.g., the past participle
markers ge-/-t and ge-/-en for verbs in German
(gesagt "said", gelaufen "run").
- Infix: An affix inserted at a specific position in
a root, e.g., the -um- verbalizer for nouns
and adjectives in Bontoc (Philippines):
| /fikas/ | "strong" | /fumikas/ | "to be strong" |
| /kilad/ | "red" | /kumilad/ | "to be red" |
| /fusl/ | "enemy" | /fumusl/ | "to be an enemy" |
- Template Infix: An affix consisting of a sequence
of elements that are inserted at specific positions
into a root (root-and-template morphology),
e.g., active and passive markers -a-a- and
-u-i- for the root ktb ("write") in Arabic:
| katab | kutib | "to write" |
| kattab | kuttib | "cause to write" |
| ka:tab | ku:tib | "correspond" |
| taka:tab | tuku:tib | "write each other" |
- Reduplication: An affix consisting of a whole or
partial copy of the root that can be prefix, infix, or
suffix to the root, e.g., formation of the
habitual-repetitive in Javanese:
| /adus/ | "take a bath" | /odasadus/ |
| /bali/ | "return" | /bolabali/ |
| /bozen/ | "tired of" | /bozanbozen/ |
| /dolan/ | "recreate" | /dolandolen/ |
- As with phonological variation, there are several lines of
evidence which suggest that morphological variation is not
purely stored in the lexicon but rather the result of processes
operating on underlying forms:
- Productive morphological combination creating complete
utterances as single words, e.g., the Turkish
example above.
- Morphology operating over new words in a language,
e.g., blicket -> blickets, television -> televise
-> televised / televising, barbacoa (Spanish) ->
barbecue (English) -> barbecues.
- As if all this didn't make things difficult enough, different
morphemes need not have different surface forms,
e.g., variants of "book":
"Where is my book?" (noun)
"I will book our vacation." (verb: to arrange)
"He doth book no argument in this matter, Milord." (verb: to tolerate)
- Courtesy of phonological transformations operating both
within and between morphemes and the non-uniqueness
of surface forms noted above, one is never sure if an observed
surface representation corresponds directly to the
underlying representation or is a modification of that
representation, e.g., is [blints] the plural of
"blint" or does it refer to a traditional Jewish cheese-stuffed
pancake? This introduces another type of ambiguity into natural
language processing.
- The Characteristics of Natural Language: Syntax (BKL, Section 8;
J&M, Chapter 12; Kaplan (2003))
Monday, February 2 (Lecture #3)
(FS)
[Class Notes]
- The Characteristics of Natural Language: Semantics, Discourse, and
Pragmatics (Lappin (2003); Leech and Weisser (2003); Ramsay (2003))
- Semantics is the study of the manner in which meaning is
associated with utterances in human language;
discourse and pragmatics focus, respectively, on how meaning
is maintained and modified over the course of multi-person
dialogues and how these people choose different
individual-utterance and dialogue styles to communicate
effectively.
- Meaning seems to be very closely related to syntactic
structure in individual utterances; however, the meaning
of an utterance can vary dramatically depending on the
spatio-temporal nature of the discourse and the goals
of the communicators, e.g., "It's cold outside."
(statement of fact spoken in Hawaii; statement of fact
spoken on the International Space Station; implicit order
to close window).
- Various mechanisms are used to maintain and direct focus within
an ongoing discourse (Ramsay (2003), Section 6.4):
- Different syntactic variants with subtly different
meanings, e.g., "Ralph stole my bike" vs
"My bike was stolen by Ralph".
- Different utterance intonation-emphasis, e.g.,
"I didn't steal your bike" vs "I didn't steal your BIKE"
vs "I didn't steal YOUR bike" vs "I didn't STEAL your bike".
- Syntactic variants which presuppose a particular
premise, e.g., "How long have you been beating
your wife?"
- Syntactic variants which imply something by not
explicitly mentioning it, e.g.,
"Some people left the party at midnight" (-> and
some of them didn't), "I believe that she loves me"
(-> but I'm not sure that she does).
- Another mechanism for structuring discourse is to use
references (anaphora) to previously discussed
entities (Mitkov (2003a)).
- There are many kinds of anaphora (Mitkov (2003a),
Section 14.1.2):
- Pronominal anaphora, e.g.,
"A knee jerked between Ralph's legs and he fell
sideways busying himself with his pain as the
fight rolled over him."
- Adverb anaphora, e.g., "We shall
go to McDonald's and meet you there."
- Zero anaphora, e.g., "Amy looked
at her test score but was disappointed with the
results."
- Though a convenient conversational shorthand, anaphora can
be (if not carefully used) ambiguous, e.g.,
"The man stared at the male wolf. He salivated at the
thought of his next meal."
"Place part A on Assembly B. Slide it to the right."
"Put the doohickey by the whatchamacallit over there."
- As demonstrated above, utterance meaning depends not only on the
individual utterances, but also on the context in which those
utterances occur (including
knowledge of both the past utterances in a dialogue and
the possibly unknown and dynamic goals and knowledge of all
participants in the discourse), which adds yet another
layer of ambiguity into natural language processing ...
- It seems then that natural language, by virtue of its structure
and use, encodes both a lot of ambiguity and variation, as well
as a wide variety of structures at all levels.
- The amount and type of ambiguity and variation that must be
accommodated by
mechanisms in a natural language processing system varies
depending on the goals of the system:
- If the system must operate fluently on unstructured
input relative to unstructured conversational domains,
every form of ambiguity and variation mentioned above must
be dealt with.
- If the system must operate only relative to a restricted
set of conversational forms and topics, e.g.,
a telephone-accessed airline flight status information
system (especially one that asks questions in a structured
manner to control user inputs), the amount of ambiguity
(and to a degree variation) at
the discourse, syntax, and morphological levels can be
dramatically restricted.
- If the system in addition operates relative to a single
known user on which that system can be trained,
e.g., an on-line personal assistant, the amount
of ambiguity at the phonological and phonetic levels
(as well as variation at all levels) can also be
dramatically restricted.
- In any case, there is a final set of computational characteristics
of natural language that must be either accommodated (in the
case of strong NLP systems) or circumvented by clever means (in
the case of narrow NLP systems).
- The Characteristics of Natural Language: Natural Language
Operation and Acquisition
- Human beings (both adults and children) use language effectively
with the aid of finite and
ultimately limited computational devices, e.g.,
our brains.
- Utterances are generated and comprehended relatively quickly
in most everyday situations, cf. bad telephone
connections, active construction sites.
- Generation and comprehension performance degrade more or
less gracefully with the increasing complexity and length
of utterances.
- Example: Prototypical natural language acquisition device
- Given exposure to a human language, any child can acquire
the phonological, morphological, and semantic (if not total
lexical) knowledge of that language within four to seven years.
- This exposure does not need to be simplified, cf.
Motherese.
- This exposure does not need to be comprehensive or
exhaustive, with many examples of each possible
phenomenon in the language.
- Individual utterances in this exposure need not be
either grammatical or have their grammaticality-status
stated.
- Produced utterances by the child need not be assessed
for grammaticality or corrected.
- Given exposure to any human language, any human being can
acquire that language, and children under the age of 10 years
are superbly good at this. In addition:
- If children are exposed to two distinct
languages simultaneously, they will acquire both
languages, i.e., they will become bilingual.
- If children are exposed to a pidgin created by
combining elements of two or more languages, they
will regularize the unsystematic parts of the pidgin
to create a proper hybrid language (creole).
- Given all the characteristics of natural language discussed in the
previous lectures and their explicit and implied
constraints on what an NLP system must do, what then are appropriate
computational mechanisms for implementing NLP systems? It is
appropriate to consider first what linguists, the folk who have been
studying natural language for the longest time, have to say on this
matter.
- NLP Mechanisms: The View from Linguistics
- Three basic types of mechanisms are required to implement
NLP:
- Sequences (to represent linguistic signals at
various levels of processing).
- Sets (to represent sets of linguistic entities,
e.g., lexicons)
- Processes (to implement proposed linguistic
processes operating both between and within
levels of processing).
- Given that linguistic signals are expressed as
temporal (acoustic and signed speech) and spatial
(written text) sequences of elements, there must be
a way of representing such sequences.
- Sequence elements can be atomic (e.g., symbols)
or have their own internal structure (e.g.,
feature matrices (phones / phonemes), form-meaning bundles
(morphemes)); for simplicity, assume
for now that elements are symbols.
- There are at least two types of such sequences
representing underlying and surface forms.
- Where necessary, hierarchical levels of
structure such as syntactic parse trees can be encoded
as sequences by using appropriate interpolated and nested
brackets, e.g., "[[the quick brown fox] [[jumped
over] [the lazy brown dog]]]"
- The existence of lexicons implies mechanisms for
representing sets of element-sequences, as well
as accessing and modifying the members of those sets.
- The various processes operating between underlying and
surface forms presuppose mechanisms implementing those
processes.
- The most popular implementation of processes is as
rules that specify transformations of one form
to another form, e.g., add voice
to the noun-final plural morpheme /s/ if
the last sound in the noun is voiced.
- Rules are applied in a specified
order to transform an underlying form to its
associated surface form for utterance production,
and in reverse fashion to transform a surface form to
its associated underlying form(s) for utterance
comprehension (ambiguity creating
the possibility of multiple such associated forms).
- Rules can be viewed as functions. Given the various
types of ambiguity we have seen, these functions are
at least one-many.
- At each linguistic level, the functions corresponding
to individual rules can be composed in rule-application
order to create "master" functions that transform
between representations on adjacent linguistic levels.
In the accompanying diagram, the inverse of function X() is
written as X-().
- The above corresponds to classical approaches to NLP;
alternatively, natural language can be seen as a single
function that is the composition of all of these
individual level-functions or their inverses, e.g.,
for meaning m and utterance u, utterance
generation is u = G(m) = P2(P1(M(S(m))))
and utterance comprehension is
m = C(u) = S-(M-(P2-(P1-(u)))) (see the sketch below).
- This alternative is that taken by many neural
network implementations of NLP (which will be
discussed later in this module).
- Some such systems like ChatGPT go further by
going directly from utterances to replies, e.g.,
u' = R(u) = P2(P1(M(S(S-(M-(P2-(P1-(u)))))))).
This raises the question of whether such systems ever
actually compute (even internally) meanings associated with
utterances at all, cf. Hinton's "Thought Vectors"
(see notes on Encoders/Decoders below).
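- A minimal Python sketch of this composed-function view; the level
functions below are placeholder stubs (their names follow the
notation above, but their bodies are invented for illustration).

      # Placeholder level-functions: each just wraps its input with a tag.
      def S(m):  return ("sem", m)    # meaning       -> semantic form
      def M(x):  return ("mor", x)    # semantic      -> morphological form
      def P1(x): return ("pho", x)    # morphological -> phonological form
      def P2(x): return ("phn", x)    # phonological  -> phonetic form

      # Stub inverses: each strips one layer of wrapping.
      def S_(x):  return x[1]
      def M_(x):  return x[1]
      def P1_(x): return x[1]
      def P2_(x): return x[1]

      def G(m): return P2(P1(M(S(m))))        # utterance generation
      def C(u): return S_(M_(P2_(P1_(u))))    # utterance comprehension

      m = "GREETING(speaker, hearer)"
      u = G(m)
      print(u)
      print(C(u) == m)   # True for these stubs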
- Great care must be taken in establishing what parameters
are necessary as input to these functions and that such
parameters are available. For example, a syntax
function can get by with a morphological analysis
of an utterance, but a semantics function would seem
to require as input not only possible syntactic
analyses of an utterance but also discourse context
and models of discourse participant knowledge and
intentions.
- Following practice in linguistics, we can assume that all
natural language processes operate in a deterministic
fashion. However, given the complexity of proposed processes
and how they may interact, it may be useful as an
intermediate stage in analyzing linguistic complexity to
allow mechanisms that operate probabilistically,
e.g., rules that have a specified probability of
application, composed systems of rules that rank one-many
outputs by their probability of occurrence.
- This is analogous to the use of statistical mechanics to
summarize and analyze the overall behavior of large
groups of particles in physics.
- Such mechanisms may be analysis end-points
in themselves if it turns out that either the
interaction of deterministic linguistic processes is
too complex to untangle or there are genuinely
probabilistic processes
underlying natural language processing in humans.
- Beware of over-simplification for the sake of
engineering convenience and equating the model with
the phenomena, e.g., complex deterministic one-many ->
simplified probabilistic one-many -> simplified
probabilistic one-one / most probable (the last of
which is what neural-network-based NLP systems do).
- Potential exam questions
Wednesday, February 4 (Lecture #4)
(FS)
[Class Notes]
- NLP Mechanisms: Finite-state automata (FSA) (Martin-Vide (2003), Section 8.3.1;
Nederhof (1996); Roche and Schabes (1997a), Section 1.2)
- Originated in the Engineering community; inspired by
the discrete finite neuron model proposed by
McCulloch and Pitts in 1943.
- Basic Terminology: (start / finish) state,
(symbol-labeled) transition, acceptance.
- Operation: generation / recognition of strings
- Generation builds a string from left to right, adding
symbols to the right-hand end of the string as one
progresses along a transition-path from the start
state to a final state.
- Recognition steps through a string from left to right,
deleting symbols from the left-hand end of the string
as one progresses along a transition-path from the
start state to a final state.
- Note that one need only keep track of one FSA-state
during both operations described above, namely, the
last state visited.
- Example #1: A DFA for the set
of all strings over the alphabet {a,b} that
consist of one or more concatenated copies of ab.
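- A minimal Python sketch of this DFA as a transition table; the
state names (q0, q1, q2) are assumptions, as the original example
was given as a state diagram.

      # DFA for the language (ab)+ over the alphabet {a,b}.
      START, ACCEPT = "q0", {"q2"}
      DELTA = {
          ("q0", "a"): "q1",   # leading a
          ("q1", "b"): "q2",   # ...followed by b (accepting state)
          ("q2", "a"): "q1",   # another copy of ab may follow
      }

      def dfa_accepts(s):
          """Recognition: step through s left to right, tracking only
          the last state visited; reject on a missing transition."""
          state = START
          for symbol in s:
              if (state, symbol) not in DELTA:
                  return False
              state = DELTA[(state, symbol)]
          return state in ACCEPT

      for s in ["ab", "abab", "", "aba", "ba"]:
          print(repr(s), dfa_accepts(s))
      # 'ab' True, 'abab' True, '' False, 'aba' False, 'ba' False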
- Types of FSA:
- Deterministic (DFA): At each state and for
each symbol in the alphabet, there is at most
one transition from that state labeled with
that symbol.
- Non-Deterministic (NFA): At each state, there
may be more than one transition from that
state labeled with a particular symbol and/or
there may be transitions labeled with special
symbol epsilon (denoted in our diagrams by
symbol E).
- Revised notion of string acceptance: see
if there is any path through the
NFA that accepts the input string.
- Example #2: An NFA for the set
of all strings over the alphabet {a,b} that
start with an a and end with a b (uses
multiple same-label outward transitions).
- DFA can recognize strings in time linear in the
length of the input string, but may not be
compact; NFA are compact but may require
time exponential in the length of a string to
recognize that string (need to follow all
possible computational paths to check string
acceptance).
- Example #3: An NFA for the set
of all strings over the alphabet {a,b} that
start with an a and end with a b (uses
epsilon-transitions).
- Oddly enough, deterministic and non-deterministic
FSA are equivalent in recognition power, i.e.,
there is a DFA for a particular set of strings iff
there is an NFA for that set of strings.
- Example #4: A DFA for the set
of all strings over the alphabet {a,b} that
start with an a and end with a b.
- Application: Building lexicons
- The simplest types of lexicons are lists of
morphemes which, for each morpheme, store the
surface form and syntactic / semantic
information associated with that morpheme.
- A list of surface forms can be stored as FSA in two ways:
- DFA encoding of a trie (= retrieval tree)
- Deterministic acyclic finite automaton (DAFA)
One creates a trie-DFA by merging the common prefixes in a
word-list, and a DAFA by further compacting (merging common
suffixes in) the trie-DFA for that word-list.
- Example #5: A trie-DFA
encoding the lexicon {"in", "inline", "input", "our",
"out", "outline", "output"}.
- Example #6: A DAFA
encoding the lexicon {"in", "inline", "input", "our",
"out", "outline", "output"}.
- Application: Implementing morphotactics
- More complex lexicons need to take into account how
morphemes are combined to form words.
- Example #7: A DFA implementing the
constraint that the voicing of the
noun-final plural morpheme must match the
voicing of the final sound in the noun being
pluralized.
- The simplest types of purely concatenative morphotactics
can be encoded in a finite-state manner by indicating
for each morpheme-type sublexicon which types of
morphemes can precede and follow morphemes in
that sublexicon, e.g., nouns precede
pluralization-morphemes in English.
- More complex morphotactics, e.g., circumfixes,
infixes, reduplication, can be layered on top of
lexicon FSA using additional mechanisms
(Beesley and Karttunen (2000); Beesley and Karttunen
(2003), Chapter 8; Kiraz (2000)).
- NLP Mechanisms: Finite-state transducers (FST) (J&M, Section 3.4, Mohri (1997);
Roche and Schabes (1997a), Section 1.3)
- Generalization of FSA in which each transition is
labeled with a symbol-pair (either or both of
which may be the special symbol epsilon).
- The first symbol in each pair is called the lower
symbol and the second is called the upper symbol.
- The alphabet of all lower-symbols in an FST need
not overlap with the alphabet of all upper-symbols
in that FST.
- Example #1: An FST for transforming
between the set
of all strings over the alphabet {a,b} that consist of
n, n >= 0, concatenated copies of ab and the set
of all strings over the alphabet {c,d} that
consist of n, n >= 0, concatenated copies of cd.
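- A minimal Python sketch of this FST, with each transition labeled
by a (lower, upper) symbol pair; the state names and the
deterministic lower-side lookup are simplifying assumptions.

      # FST relating (ab)^n (lower) to (cd)^n (upper), n >= 0.
      START, ACCEPT = "q0", {"q0"}
      TRANS = {   # (state, lower symbol) -> (next state, upper symbol)
          ("q0", "a"): ("q1", "c"),
          ("q1", "b"): ("q0", "d"),
      }

      def transduce_up(lower):
          """Reconstruct the upper string paired with a given lower
          string, or return None if there is no accepting path."""
          state, upper = START, []
          for symbol in lower:
              if (state, symbol) not in TRANS:
                  return None
              state, out = TRANS[(state, symbol)]
              upper.append(out)
          return "".join(upper) if state in ACCEPT else None

      print(transduce_up("ababab"))   # cdcdcd
      print(transduce_up("aba"))      # None (no accepting path)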
- Operation: generation / recognition of string-pairs,
reconstruction of upper (lower) string associated
with given lower (upper) string.
- Generation of a string-pair builds that pair from left
to right, adding symbol-pairs to the right-hand end of
the string-pair as one progresses along a
transition-path from the start state to a
final state.
- Recognition of a string-pair proceeds by stepping
through that pair from left to right, deleting
symbol-pairs from the left-hand end of the string-pair
as one progresses along a transition-path from the
start state to a final state.
- Reconstruction of a string-pair associated with a given
string builds the missing string of the pair from left
to right, adding missing-string symbols to the
right-hand end of the missing string as one progresses
along a transition-path from the start state to a
final state in accordance with the given string.
- As the given string may be either the lower or
upper string, there
are two types of string-pair reconstructions.
- Each reconstructed string-pair corresponds to a
particular
path through the FST guided by the given string.
- Depending on the structure of the FST, there may
be more than one path through the FST consistent
with the given string, and hence more than
one string-pair reconstruction.
- Note that one need only keep track of one FST-state
during all operations described above, namely, the
last state visited.
- Unlike FSA, there are many types of FST and many types
of FST (non-)determinism.
- Example #2: An FST for transforming
between the set
of all strings over the alphabet {a,b} that consist of
n, n >= 0, concatenated copies of ab and the set
of all strings over the alphabet {a,b} that
consist of n, n >= 0, concatenated copies of aba.
- Example #3: An FST implementing the
rule that voice be added to the noun-final plural morpheme
/s/ if the last sound in the noun is voiced.
- Application: Implementing rule-based systems
- Many types of linguistic rules can
be implemented as FST, and the sequential application
of rules can be simulated using the FST created by the
sequential composition (in the specified order) of the FST
corresponding to the rules.
- Application: Implementing lexical transducers (Beesley and
Karttunen (2003), Section 1.8)
- A lexicon FSA can be converted into an FST by
replacing each transition-label x with x:x,
e.g., the identity FST associated with an FSA.
This lexicon FST can then be composed with
the underlying / surface transformation FST
encoding a constraint- or rule-based
morphonological system to create a lexical
transducer.
- Note that as FST are bidirectional, so are lexical
transducers; hence, one can reconstruct surface or
lexical forms from given lexical or surface
forms with equal ease.
Friday, February 6 (Lecture #5)
(FS)
- Example #3: A CFG for an infinite-size subset of
all garden-path utterances, e.g., "The cat that
chased the dog that bit the rat died.":
S -> Np Sp "died" | Np "died"
Np -> "the cat" | "the dog" | "the rat" | "the elephant"
Sp -> "that" Vp Sp | "that" Vp
Vp -> V Np
V -> "admired" | "bit" | "chased"
- Example #4: A CFG for an infinite-size subset of
the set of all recursively-embedded utterances.
S -> Np Sp "likes tuna fish" | Np "likes tuna fish"
Np -> "the cat" | "the dog" | "the rat" | "the elephant"
Sp -> Np Sp V | Np V
V -> "admired" | "bit" | "chased"
- Example #5: A CFG for an infinite-size subset of
the set of all utterances, e.g., "the dog saw a man in the park"
(BKL, pp. 298-299).
S -> Np Vp
Np -> "John" | "Mary" | "Bob" | Det N | Det N Pp
Det -> "a" | "the" | "my"
N -> "man" | "dog" | "cat" | "telescope" | "park"
Pp -> P Np
P -> "in" | "on" | "by" | "with"
V -> "saw" | "walked"
Vp -> V Np | V Np Pp
Unlike the other sample grammars above, this grammar
is structurally ambiguous: there may be more than
one parse for certain utterances involving prepositional
phrases (Pp), as it is not obvious whether a Pp is
attached to the preceding noun phrase (Np) or to the
verb phrase (Vp), e.g., in "the dog saw the man in
the park", is the man or the dog in the park?
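- A minimal sketch of how this grammar and its structural ambiguity
can be explored with NLTK (the toolkit accompanying BKL); recasting
the grammar in NLTK notation with one word per terminal is an
assumption of this sketch, not part of the original example.

      import nltk

      grammar = nltk.CFG.fromstring("""
        S  -> Np Vp
        Np -> 'John' | 'Mary' | 'Bob' | Det N | Det N Pp
        Det -> 'a' | 'the' | 'my'
        N  -> 'man' | 'dog' | 'cat' | 'telescope' | 'park'
        Pp -> P Np
        P  -> 'in' | 'on' | 'by' | 'with'
        V  -> 'saw' | 'walked'
        Vp -> V Np | V Np Pp
      """)

      parser = nltk.ChartParser(grammar)
      sentence = "the dog saw the man in the park".split()
      for tree in parser.parse(sentence):
          print(tree)   # two parses: Pp attached to the object Np vs. to the Vp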
- Note that the parse trees encode grammatical relations between
entities in the utterances, and that these relations have
associated semantics; hence, one can use parse trees as encodings
of basic utterance meaning!
- As shown in Example #5 above, parse trees via structural
ambiguity can nicely encode semantic ambiguity.
- Context-free mechanisms can handle many complex linguistic
phenomena like unbounded numbers of recursively-embedded
long-distance dependencies. This is done by parsing, which
both recognizes a sentence and recovers its internal
hierarchical constituent phrase-structure.
- Parsing a sentence S with n words relative to a grammar G can be
seen as a
search over all possible parse trees that can be generated
by grammar G, with the goal of finding all parse trees
whose leaves are labelled with exactly the words in S in an
order that (depending on the language) is either exactly
or consistent with the word order in S.
- Though some algorithms generate one
valid parse for a given sentence much more quickly than
others, the time needed to generate all valid parse trees
(which is required in the simplest human-guided schemes
for resolving sentence ambiguities) is, for every parsing
algorithm, in the worst case at least proportional to the
number of valid parse trees for the given sentence S relative
to the given grammar G.
- There are natural grammars for which this
quantity is exponential in the number of words
in the given sentence (BKL, p. 317; Carpenter (2003),
Section 9.4.2).
- Dealing with very large numbers of output
derivation-trees is problematic. What can be done? One
can deal with this by using probability -- that is, by only
computing the derivation-tree with the highest probability for
the given utterance!
- This is a general way to handle the many-results problem
with ambiguous linguistic processes (see the sketch below).
There are many
models of probabilistic computing with context-free
grammars and finite-state automata and transducers,
and many of these were state-of-the-art for NLP for
several decades. However, as Li (2022) points out, the
more purely mathematical probabilistic modeling of
natural language that started over a century ago with
Markov models is actually the research tradition that
leads to the final NLP implementation mechanism we shall
examine -- namely, (artificial) neural networks.
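- A minimal sketch of the most-probable-parse idea using NLTK's
probabilistic CFG support; the rule probabilities below are
invented for illustration and are not estimated from any corpus.

      import nltk

      pcfg = nltk.PCFG.fromstring("""
        S  -> Np Vp        [1.0]
        Np -> Det N        [0.7]
        Np -> Det N Pp     [0.3]
        Det -> 'the'       [1.0]
        N  -> 'dog' [0.4] | 'man' [0.4] | 'park' [0.2]
        Pp -> P Np         [1.0]
        P  -> 'in'         [1.0]
        Vp -> V Np         [0.6]
        Vp -> V Np Pp      [0.4]
        V  -> 'saw'        [1.0]
      """)

      parser = nltk.ViterbiParser(pcfg)
      sentence = "the dog saw the man in the park".split()
      for tree in parser.parse(sentence):   # yields only the most probable tree
          print(tree, tree.prob())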
- NLP Mechanisms: Neural Networks (Kedia and Rasu (2020), Chapters 8-11)
- The finite-state and context-free NLP mechanisms we have studied thus
far in the module are all process-oriented, as they all implement
rules (explicitly in the case of grammars, implicitly via the
state-transitions in automata and transducers) that correspond to
processes postulated by linguists on the basis of observed
human utterances.
- Neural network (NN) NLP mechanisms, in contrast, are
function-oriented, in that the components of these mechanisms typically
do not have linguistically-postulated correlates and the focus is instead
on inferring (with the assistance of massive amounts of observed
human utterances) various functions associated with NLP.
- These functions map linguistic sequences onto categories, e.g.,
spam e-mail detection, sentiment analysis, or other linguistic
sequences, e.g., Part-of-Speech (PoS) tagging, chatbot
responses, machine translation between human languages.
- In many cases, these linguistic sequences are recodings of
human utterances in terms of multi-dimensional numerical
vectors or matrices. There are a variety of methods for creating
these vectors and matrices, e.g., bag of words, n-grams,
word / sentence / document embeddings (Kedia and Rasu (2020),
Chapters 4-6; Vajjala et al (2020), Chapter 3).
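- A minimal sketch of one such recoding, a bag-of-words vector built
in plain Python; the toy utterances and the word-level tokenization
are assumptions for illustration only.

      docs = ["the dog saw the man",
              "the man saw the park"]

      # Vocabulary: one vector dimension per distinct word, in sorted order.
      vocab = sorted({w for d in docs for w in d.split()})

      def bag_of_words(text):
          counts = {w: 0 for w in vocab}
          for w in text.split():
              if w in counts:
                  counts[w] += 1
          return [counts[w] for w in vocab]

      print(vocab)                    # ['dog', 'man', 'park', 'saw', 'the']
      for d in docs:
          print(bag_of_words(d))      # e.g., [1, 1, 0, 1, 2]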
- Neural network research to date can be characterized in terms of three waves:
- First Wave (1943-1969): McCulloch and Pitts propose abstract
neurons in 1943. Starting in the late 1950s, Rosenblatt explores the
possibilities for representing and learning functions relative
to single abstract neurons (which he calls perceptrons).
The mathematical principles underlying the
back propagation procedure for training neural networks
are developed in the early 1960's (Section 5.5,
Schmidhuber (2015)). Perceptron research is killed off by the
publication in 1969 of Minsky
and Papert's monograph Perceptrons, in which they show
by rigorous mathematical proof that perceptrons are incapable
of representing many basic mathematical functions such as
Exclusive OR (XOR).
- Second Wave (1980-1990): Rumelhart, McClelland, and colleagues
propose and explore the possibilities for multi-level feed-forward
neural networks incorporating hidden layers of artificial
neurons. This is aided immensely by the (re)discovery
of the
backpropagation procedure, which allows efficient learning of
arbitrary functions by multi-layer neural networks. Though it is
shown that these networks are powerful (Universal Approximation
Theorems state that any well-behaved function can be approximated
to an arbitrarily close degree by a neural network with only one
hidden layer), research remains academic as backpropagation on
even small to moderate-size networks is too data- and
computation-intensive.
- It is during this period that NN-based NLP research flourishes,
taking up over a third of the second volume of the
1987 summary work Parallel Distributed Processing (PDP).
Alternatives to feed-forward NN (Recurrent Neural Networks
(see below)) for NLP are also explored by Elman starting
in 1990.
- Third Wave (2000-now): With the availability of massive amounts
of data and computational power courtesy of the Internet and
Moore's Law (as instantiated in special-purpose processors like
GPUs), neural network research re-ignites in a number of
areas, starting with image processing and computer vision. This
is aided by the development of neural networks incorporating
special structures inspired by structures in the human
brain, e.g., Long
Short-Term Memory cells (see below).
Starting around 2010, this wave reaches NLP; the results are
so spectacular that by 2018, NN-based NLP techniques are state of
the art in many applications.
- This is in large part because the more complex
types of neural networks have enabled the creation of
pre-trained NLP models that can subsequently be
customized with relatively little training data for
particular applications, e.g., question answering
systems (Li (2022)).
- Let us now explore this research specific to NLP by examining the
various types of neural networks that have emerged during these
three waves.
- Feed-forward Multi-layer Neural Networks (FF-NN) (Kedia and
Rasu (2020), Chapter 8)
- Convolutional Neural Networks (CNN) (Kedia and Rasu (2020), Chapter 9)
- First proposed for image processing and computer vision in
the late 1980's, by analogy with known neural structure
in the human visual cortex. CNN are designed to detect
and compute over spatial patterns formed by input elements
that are close together, e.g., lines / line
crossings / corners within images.
- A CNN consists of a series of feature detectors over a 2D
input field, followed by "pooling" layers that summarize
detected features and an FF-NN that takes the outputs of the
pooling layers as inputs.
- Patterns are encoded as small square filter matrices called
kernels, and pattern detection is implemented by
sliding each kernel over a version of the 2D
input matrix that is padded with zeroes on the right
and lower edges (so that the kernel is applied
to all elements of the input matrix) and computing
the sum of the element-wise products at each position.
- Pooling of matrices produced by feature detection
summarizes non-overlapping blocks of values by
various techniques, e.g., block maximum /
average / sum; the resulting downsampling not only
reduces the amount of data input to the FF-NN but also
helps prevent overfitting.
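- A minimal NumPy sketch of the kernel-and-pooling computation just
described; the toy 6x6 input, the 2x2 kernel values, and the
padding scheme are assumptions for illustration only.

      import numpy as np

      def convolve2d(x, kernel):
          """Slide the kernel over x (zero-padded on the right and lower
          edges so it is applied at every input element) and return the
          map of element-wise product-sums."""
          kh, kw = kernel.shape
          padded = np.pad(x, ((0, kh - 1), (0, kw - 1)))
          out = np.zeros_like(x, dtype=float)
          for i in range(x.shape[0]):
              for j in range(x.shape[1]):
                  out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
          return out

      def max_pool(x, size=2):
          """Summarize non-overlapping size-by-size blocks by their maximum."""
          h, w = x.shape[0] // size, x.shape[1] // size
          return x[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

      x = np.arange(36, dtype=float).reshape(6, 6)    # toy 6x6 "image"
      kernel = np.array([[1.0, -1.0], [1.0, -1.0]])   # vertical-edge detector
      features = convolve2d(x, kernel)
      print(max_pool(features))       # 3x3 downsampled feature map for the FF-NN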
- Though they have been successful in certain applications,
e.g., image classification, CNN are not good
at detecting long-distance sequence or temporal
relationships between input elements.
References
- BBC News (2014) "Translation tech helps firms talk business round the world."
(URL: Retrieved November 14, 2014)
- Baker, J.M., Deng, L., Khudanpur, S., Lee, C.-H., Glass, J., Morgan, N.,
and O'Shaughnessy, D. (2009b) "Research Developments and Directions in
Speech Recognition and Understanding, Part 2." IEEE Signal
Processing Magazine, 26(6), 78-85.
- Beesley, K.R. and Karttunen, L. (2000) "Finite-state
Non-concatenative Morphotactics." In SIGPHON 2000. 1-12.
[PDF]
- Beesley, K.R. and Karttunen, L. (2003) Finite-State
Morphology. CSLI Publications.
- Bender, E. and Shah, C. (2022) "All-knowing machines are a fantasy."
IAI News. (Text)
- Bird, S. (2003) "Phonology." In R. Mitkov (ed.) (2003), pp. 3-24.
- Bird, S., Klein, E., and Loper, E. (2009) Natural Language Processing
with Python. O'Reilly Media. [Abbreviated above as BKL]
- Carpenter, B. (2003) "Complexity." In R. Mitkov (ed.) (2003), pp. 178-200.
- Colby, K.M. (1981) "Modeling a paranoid mind." Behavioral and
Brain Sciences, 4(4), 515-534.
- Dutoit, T. and Stylianou, Y. (2003) "Text-to-Speech Synthesis." In
R. Mitkov (ed.) (2003), pp. 323-338.
- Du, M., He, F., Zou, N., Tao, D., and Hu, X. (2024) "Shortcut Learning
of Large Language Models in Natural Language Understanding."
Communications of the ACM, 67(1), 110-19.
[Text]
- Dutta, S. and Chakraborty, T. (2023) "Thus Spake ChatGPT: On the
Reliability of AI-based Chatbots for Science Communication."
Communications of the ACM, 66(12), 16-19.
[Text]
- Edwards, C. (2021) "The best of NLP." Communications of the
ACM, 64(4), 9-11. [Text]
- Epstein, R. (2007) "From Russia With Love: How I got fooled (and somewhat humiliated)
by a computer." Scientific American Mind, October, 6-17.
- Gorvett, Z. (2023) "The AI emotions dreamed up by ChatGPT."
BBC Future. Accessed February 28, 2023.
[Text]
- Greengard, S. (2023) "Computational Linguistics Finds Its Voice." Communications of the
ACM, 66(2), 18-20. [Text]
- Grishman, R. (2003) "Information Extraction." In R. Mitkov (ed.) (2003), pp. 545-559.
- Haigh, T. (2023a) "Conjoined Twins: Artificial Intelligence and the
Invention of Computer Science." Communications of the ACM,
66(6), 33-37. [HTML]
- Haigh, T. (2023b) "There was No 'First' AI Winter." Communications
of the ACM, 66(12), 35-39. [HTML]
- Haigh, T. (2024a) "How the AI Boom Went Bust." Communications
of the ACM, 67(2), 22-26. [HTML]
- Haigh, T. (2024b) "Between the Booms: AI in Winter." Communications
of the ACM, 67(11), 19-23. [HTML]
- Haigh, T. (2025) "Artificial Intelligence: Then and Now." Communications
of the ACM, 68(2), 24-29. [HTML]
- Harabagiu, S. and Moldovan, D. (2003) "Question Answering." In R. Mitkov (ed.) (2003), pp. 560-582.
- Hutchins, J. (2003) "Machine Translation: General Overview." In
R. Mitkov (ed.) (2003), pp. 501-511.
- Jurafsky, D. and Martin, J.H. (2008) Speech and Language
Processing (2nd Edition). Prentice-Hall. [Abbreviated above as
J&M]
- Jurafsky, D. and Martin, J.H. (2022) Speech and Language
Processing (3rd Edition).
(Book Website)
[Abbreviated above as J&M2]
- Kaplan, R. (2003) "Syntax." In R. Mitkov (ed.) (2003), pp. 70-90.
- Kay, M. (2003) "Introduction." In R. Mitkov (ed.) (2003), pp. xvii-xx.
- Kedia, A. and Rasu, M. (2020) Hands-On Python Natural Language
Processing. Packt Publishing; Birmingham, UK.
- Kelly, S.M. (2023) "The dark side of Bing's new AI chatbot."
CNN Business. Accessed February 17, 2023.
[Text]
- Kenstowicz, M.J. (1994) Phonology in Generative Grammar.
Basil Blackwell.
- Kiraz, G.A. (2000) "Multi-tiered nonlinear morphology using
multitape finite automata: A case study on Syriac and Arabic."
Computational Linguistics, 26(1), 77-105.
[PDF]
- Lamel, L. and Gauvain, J.-L. (2003) "Speech Recognition." In R. Mitkov
(ed.) (2003), pp. 305-322.
- Lappin, S. (2003) "Semantics." In R. Mitkov (ed.) (2003), pp. 91-111.
- Leech, G. and Weisser, M. (2003) "Pragmatics and Dialogue." In
R. Mitkov (ed.) (2003), pp. 136-156.
- Li, H. (2022) "Language models: past, present, and future."
Communications of the ACM, 65(7), 56-63.
(HTML)
- Lovins, J.B. (1973) Loanwords and the Phonological Structure of
Japanese. PhD thesis, University of Chicago.
- Marcus, G. and Davis, E. (2019) Rebooting AI: Building Artificial
Intelligence We Can Trust. Pantheon Books; New York.
- Marcus, G. and Davis, E. (2021) "Insights for AI from the Human Mind",
Communications of the ACM, 64(1), 38-41. [Text]
- Martin-Vide, C. (2003) "Formal Grammars and Languages." In R. Mitkov (ed.)
(2003), pp. 157-177.
- Mitkov, R. (ed.) (2003) The Oxford Handbook of Computational
Linguistics. Oxford University Press.
- Mitkov, R. (2003a) "Anaphora Resolution." In R. Mitkov (ed.) (2003),
pp. 266-283.
- Mohri, M. (1997) "Finite-state transducers in language and speech
processing." Computational Linguistics, 23(2), 269-311.
[PDF]
- Nederhof, M.-J. (1996) "Introduction to Finite-State Techniques."
Lecture notes.
[PDF]
- Ramsay, A. (2003) "Discourse." In R. Mitkov (ed.) (2003), pp. 112-135.
- Reis, E. S. D., Costa, C. A. D., Silveira, D. E. D., Bavaresco, R. S.,
Righi, R. D. R., Barbosa, J. L. V., and Federizzi, G. (2021)
"Transformers aftermath: current research and rising trends."
Communications of the ACM, 64(4), 154-163.
[Text]
- Roche, E. and Schabes, Y. (eds.) (1997) Finite-state Natural Language
Processing. The MIT Press.
- Roche, E. and Schabes, Y. (1997a) "Introduction." In E. Roche and
Y. Schabes (Eds.) (1997), pp. 1-66.
- Schmidhuber, J. (2015) "Deep learning in neural networks: An
overview." Neural Networks, 61, 85-117.
- Somers, H. (2003) "Machine Translation: Latest Developments." In
R. Mitkov (ed.) (2003), pp. 512-528.
- Sproat, R. (1992) Morphology and Computation. The MIT Press.
- Tidy, J. (2024) "Character.ai: Young people turning to AI therapist
bots." BBC Technology. Accessed January 8, 2024.
[Text]
- Trost, H. (2003) "Morphology." In R. Mitkov (ed.) (2003), pp. 25-47.
- Tzoukermann, E., Klavans, J.L., and Strazalkowski, T. (2003) "Information
Retrieval." In R. Mitkov (ed.) (2003), pp. 529-544.
- Vajjala, S., Majumder, B., Gupta, A., and Surana, H. (2020)
Practical Natural Language Processing: A Comprehensive Guide to
Building Real-world NLP Systems. O'Reilly; Boston, MA.
- Videla, A. (2023) "Echoes of Intelligence: Textual Interpretation and
Large Language Models." Communications of the
ACM, 66(11), 38-43. [Text]
- Weizenbaum, J. (1966) "ELIZA - A computer program for the study of
natural language communication between man and machine."
Communications of the ACM, 9(1), 36-45.
- Weizenbaum, J. (1967) "Contextual understanding by computers."
Communications of the ACM, 10(8), 474-480.
- Weizenbaum, J. (1979) Computer power and human reason. W.H. Freeman.
Created: January 26, 2026
Last Modified: January 30, 2026