This research presents a computer system for visual speech recognition. In the first phase of the system's operation, time-varying visual speech patterns are obtained from a sequence of images by fitting a template to the lips. An energy function has been designed to measure how well the template's geometric primitives match the outlines of the lips. Optimizing this function with a numerical optimization technique yields a good solution at considerably lower computational and storage cost. A recurrent neural network architecture is proposed to classify the spatio-temporal patterns obtained in the first phase. In this network, recurrent connections between the hidden layer and the state layer allow a context to be combined with the input patterns, which are fed to the network one at a time. The recurrent network is trained by training the feed-forward network embedded in the recurrent architecture. To derive static training samples for the feed-forward network, a certain behavior is specified for the network when it is presented with sample sequences.
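
The role of the state layer can be illustrated with a minimal sketch. It is not the system's actual implementation: the class name SimpleRecurrentNet, the layer sizes, the tanh and logistic activations, and the random initial weights are all assumptions made for illustration. The sketch only shows how the hidden activations are copied into a state layer and combined with the next input pattern, so that each step is an ordinary feed-forward pass over the pattern plus its context. Training is not shown.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random weights; this initialisation is an assumption."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SimpleRecurrentNet:
    """Feed-forward net whose input is [pattern, state]; state = previous hidden."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.n_hidden = n_hidden
        self.w_hidden = make_matrix(n_hidden, n_in + n_hidden, rng)
        self.w_out = make_matrix(n_out, n_hidden, rng)
        self.reset()

    def reset(self):
        # Clear the context before each new sequence.
        self.state = [0.0] * self.n_hidden

    def step(self, pattern):
        # The embedded feed-forward pass sees the pattern plus the context.
        joined = list(pattern) + self.state
        hidden = [math.tanh(a) for a in matvec(self.w_hidden, joined)]
        output = [1.0 / (1.0 + math.exp(-a)) for a in matvec(self.w_out, hidden)]
        # Recurrent connection: copy the hidden activations into the state layer.
        self.state = hidden
        return output

    def classify(self, sequence):
        # Feed the patterns one at a time; the last output scores the sequence.
        self.reset()
        output = None
        for pattern in sequence:
            output = self.step(pattern)
        return output


net = SimpleRecurrentNet(n_in=4, n_hidden=6, n_out=3)
sequence = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.1, 0.4, 0.3]]
scores = net.classify(sequence)
```

Because the state layer carries the previous hidden activations, presenting the same pattern twice in a row generally produces two different outputs, which is what lets the network respond to the order of the patterns rather than to each pattern in isolation.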

Revised: 99.10.25